# Llama-OuteTTS-1.0-1B-GGUF
| Property | Value |
|---|---|
| Model Size | 1B parameters |
| Context Length | 8,192 tokens |
| Training Data | ~60k hours of audio |
| License | CC-BY-NC-SA-4.0 / Llama 3.2 Community License |
| Author | OuteAI |
## What is Llama-OuteTTS-1.0-1B-GGUF?
Llama-OuteTTS is a text-to-speech model built on the Llama 3.2 architecture. It advances multilingual speech synthesis by offering one-shot voice cloning from roughly 10 seconds of reference audio and by processing text directly in multiple languages, without romanization or other pre-processing.
## Implementation Details
The model uses IBM Research's DAC audio codec, which employs two codebooks for high-quality audio reconstruction. It has a maximum context window of 8,192 tokens, and the recommended sampling parameters are a temperature of 0.4 and a repetition penalty of 1.1 (demonstrated in the sketch after the list below). Key implementation features:
- Automatic word alignment handled internally, with no external aligner required
- Native multilingual text support without romanization
- Enhanced metadata integration for improved speaker flow
- DAC encoder integration for high-quality audio reconstruction
- Audio token rate of ~150 tokens per second of generated audio
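These settings can be wired up through the `outetts` Python package, which wraps the GGUF weights via a llama.cpp backend. The following is a minimal sketch based on the package's documented interface; the exact enum and field names (`Models.VERSION_1_0_SIZE_1B`, `Backend.LLAMACPP`, `SamplerConfig`, `repetition_penalty`) are assumptions to verify against the installed release, as they may vary between package versions.

```python
# pip install outetts  (the llama.cpp backend uses llama-cpp-python under the hood)
import outetts

# Download and load the 1B model through the llama.cpp (GGUF) backend.
interface = outetts.Interface(
    config=outetts.ModelConfig.auto_config(
        model=outetts.Models.VERSION_1_0_SIZE_1B,
        backend=outetts.Backend.LLAMACPP,
        quantization=outetts.LlamaCppQuantization.FP16,
    )
)

# Start from one of the bundled speaker profiles.
speaker = interface.load_default_speaker("EN-FEMALE-1-NEUTRAL")

output = interface.generate(
    config=outetts.GenerationConfig(
        text="Hello! This is a quick synthesis test.",
        speaker=speaker,
        sampler_config=outetts.SamplerConfig(
            temperature=0.4,         # recommended setting from the model card
            repetition_penalty=1.1,  # recommended setting from the model card
        ),
    )
)

output.save("output.wav")
```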
## Core Capabilities
- Support for 12+ high-proficiency languages including English, Arabic, Chinese, and Japanese
- One-shot voice cloning from ~10 seconds of reference audio (see the cloning sketch after this list)
- Automatic text alignment for languages without clear word boundaries
- Direct numerical input support across supported languages
- Maximum audio generation of ~42 seconds in a single run, since the 8,192-token context must hold the text prompt, the speaker profile, and the audio tokens (42 s × ~150 tokens/s ≈ 6,300 audio tokens)
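One-shot cloning uses the same interface as the sketch above. The example below assumes a local `reference.wav` of roughly 10 seconds of clean speech; the file paths are placeholders, and the `create_speaker`/`save_speaker`/`load_speaker` names follow the outetts package docs but should be checked against the installed version.

```python
import outetts

interface = outetts.Interface(
    config=outetts.ModelConfig.auto_config(
        model=outetts.Models.VERSION_1_0_SIZE_1B,
        backend=outetts.Backend.LLAMACPP,
        quantization=outetts.LlamaCppQuantization.FP16,
    )
)

# Build a speaker profile from ~10 s of reference audio (one-shot cloning).
speaker = interface.create_speaker("reference.wav")

# Profiles can be saved once and reloaded instantly on later runs.
interface.save_speaker(speaker, "my_speaker.json")
speaker = interface.load_speaker("my_speaker.json")

output = interface.generate(
    config=outetts.GenerationConfig(
        text="This sentence should come out in the cloned voice.",
        speaker=speaker,
    )
)
output.save("cloned.wav")
```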
## Frequently Asked Questions
**Q: What makes this model unique?**
The model stands out for its ability to perform one-shot voice cloning with minimal reference audio, combined with native multilingual support and automatic word alignment capabilities. The integration of the DAC encoder ensures high-quality audio output while maintaining a relatively compact model size.
**Q: What are the recommended use cases?**
The model is ideal for applications requiring high-quality text-to-speech conversion, voice cloning for personalized audio content, multilingual audio generation, and scenarios where accurate pronunciation and natural speech flow are crucial. It's particularly suited for projects requiring support across multiple languages without the need for text pre-processing.