Llama-OuteTTS-1.0-1B

Maintained by: OuteAI

| Property | Value |
|---|---|
| Model Size | 1B parameters |
| Training Data | ~60k hours of audio |
| Context Length | 8,192 tokens |
| License | CC-BY-NC-SA-4.0 (fine-tuning), Llama 3.2 Community License (base) |
| Author | OuteAI |

What is Llama-OuteTTS-1.0-1B?

Llama-OuteTTS-1.0-1B is a text-to-speech model built on the Llama-3.2-1B architecture, designed for multilingual speech synthesis and one-shot voice cloning. The model incorporates a DAC audio encoder and supports automatic word alignment, making it effective at generating natural-sounding speech across multiple languages without pre-processing such as romanization.

Implementation Details

The model uses IBM Research's DAC encoder with two codebooks for high-quality audio reconstruction. It has an 8,192-token context window and operates with carefully tuned sampling parameters: a recommended temperature of 0.4 and a repetition penalty of 1.1 applied over a sliding 64-token window.

  • Integrated DAC audio encoder for improved fidelity
  • Automatic word alignment system
  • Native multilingual text support
  • Enhanced metadata integration for improved speaker flow
  • One-shot voice cloning capability
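The windowed repetition penalty mentioned above (1.1 over the most recent 64 tokens) can be sketched in plain Python. This is an illustrative implementation of a CTRL-style penalty restricted to a recent window, not code from the OuteTTS library; the function name and logit values are hypothetical:

```python
def apply_windowed_repetition_penalty(logits, recent_ids, penalty=1.1, window=64):
    """Penalize tokens that appeared within the last `window` generated ids.

    Follows the common CTRL-style convention: positive logits are divided
    by the penalty, negative logits are multiplied, so repeated tokens
    always become less likely.
    """
    penalized = list(logits)
    for tok in set(recent_ids[-window:]):          # only the recent window counts
        if penalized[tok] > 0:
            penalized[tok] /= penalty
        else:
            penalized[tok] *= penalty
    return penalized

# Toy vocabulary of 3 tokens; tokens 1 and 2 were generated recently.
out = apply_windowed_repetition_penalty([2.0, -1.0, 0.5], recent_ids=[1, 2])
print(out)  # token 0 untouched; tokens 1 and 2 pushed down
```

Restricting the penalty to a 64-token window discourages immediate repetition (stutters, duplicated words) without permanently suppressing tokens that legitimately recur later in long audio sequences.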

Core Capabilities

  • Supports 12+ languages with high proficiency (including English, Arabic, Chinese, Japanese)
  • One-shot voice cloning with just 10 seconds of reference audio
  • Direct numerical input support across languages
  • Automatic text alignment for languages without clear word boundaries
  • Generation of approximately 42 seconds of audio in a single run

Frequently Asked Questions

Q: What makes this model unique?

The model stands out for its ability to perform one-shot voice cloning with minimal reference audio, automatic word alignment, and support for native text across multiple languages without requiring romanization. It also integrates advanced metadata handling for improved synthesis quality.

Q: What are the recommended use cases?

The model is ideal for applications requiring multilingual text-to-speech conversion, voice cloning for personalized audio content, and scenarios where high-quality speech synthesis is needed across different languages. It's particularly suitable for applications requiring natural-sounding speech with accurate pronunciation and emotional expression.
