Fish Speech V1.5

Property	Value
Developer	Fish Audio
License	CC-BY-NC-SA-4.0
Paper	arXiv:2411.01156
Model URL	HuggingFace

What is fish-speech-1.5?

Fish Speech V1.5 is a multilingual text-to-speech (TTS) model that uses a large language model to generate speech in 13 languages. It was trained on more than 1 million hours of audio, with the largest share of that data in English and Chinese.

Implementation Details

The model applies LLM-based techniques to speech synthesis, and the amount of training data varies by language. English and Chinese each have over 300,000 hours of training data, followed by Japanese with over 100,000 hours.

Extensive language support with tiered training data volumes
Architecture documented in the accompanying paper (arXiv:2411.01156)
Optimized for high-quality multilingual synthesis

Core Capabilities

Support for 13 distinct languages with varying levels of proficiency
Primary languages (English, Chinese): 300,000+ hours of training data each
Secondary language (Japanese): 100,000+ hours
Tertiary languages (German, French, Spanish, Korean, Arabic, Russian): ~20,000 hours each
Additional languages (Dutch, Italian, Polish, Portuguese): <10,000 hours each
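
The tiers above can be captured in a small lookup table, which is handy when an application needs to decide how much to rely on a given language. The following is a minimal sketch, not part of the Fish Speech codebase: the names `LANGUAGE_TIERS`, `tier_of`, and `supported_languages` are hypothetical, and the hour figures are simply the approximate labels from the list, kept as strings because they are bounds rather than exact counts.

```python
from typing import List, Optional

# Tier table transcribed from the list above (hypothetical structure).
# "hours_each" holds the approximate label quoted in the model card.
LANGUAGE_TIERS = {
    "primary": {
        "hours_each": "300,000+",
        "languages": ["English", "Chinese"],
    },
    "secondary": {
        "hours_each": "100,000+",
        "languages": ["Japanese"],
    },
    "tertiary": {
        "hours_each": "~20,000",
        "languages": ["German", "French", "Spanish", "Korean", "Arabic", "Russian"],
    },
    "additional": {
        "hours_each": "<10,000",
        "languages": ["Dutch", "Italian", "Polish", "Portuguese"],
    },
}


def supported_languages() -> List[str]:
    """Return every language named in the tier table, in tier order."""
    return [
        language
        for tier in LANGUAGE_TIERS.values()
        for language in tier["languages"]
    ]


def tier_of(language: str) -> Optional[str]:
    """Return the tier name for a language (case-insensitive), or None if unlisted."""
    wanted = language.casefold()
    for tier_name, tier in LANGUAGE_TIERS.items():
        if any(wanted == name.casefold() for name in tier["languages"]):
            return tier_name
    return None


if __name__ == "__main__":
    print(len(supported_languages()))  # number of languages in the table
    print(tier_of("Korean"))
```

For example, `tier_of("Korean")` returns `"tertiary"`, and a language that is not in the list, such as `"Hindi"`, returns `None`.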

Frequently Asked Questions

Q: What makes this model unique?

Fish Speech V1.5 stands out for the scale of its training data (more than 1 million hours) and its coverage of 13 languages. Because the training data is tiered, the languages with the most data (English, Chinese, and Japanese) can be expected to perform best, while the model still supports languages with far less data.

Q: What are the recommended use cases?

The model is particularly well-suited for applications requiring high-quality multilingual TTS capabilities, especially those focusing on English, Chinese, or Japanese content. It's ideal for educational platforms, content localization, accessibility tools, and international business applications requiring natural-sounding speech synthesis.
