Llama-OuteTTS-1.0-1B

Maintained by: OuteAI

| Property | Value |
|---|---|
| Model Size | 1B parameters |
| Training Data | ~60k hours of audio |
| Context Length | 8,192 tokens |
| License | CC-BY-NC-SA-4.0 (fine-tuning), Llama 3.2 Community License (base) |
| Author | OuteAI |

What is Llama-OuteTTS-1.0-1B?

Llama-OuteTTS-1.0-1B is a text-to-speech model built on the Llama-3.2-1B architecture, designed for multilingual speech synthesis and one-shot voice cloning. The model incorporates a DAC audio encoder and supports automatic word alignment, making it effective at generating natural-sounding speech across multiple languages without pre-processing such as romanization.

Implementation Details

The model uses IBM Research's DAC encoder with two codebooks for high-quality audio reconstruction. It has an 8,192-token context window and operates with carefully tuned sampling parameters: a recommended temperature of 0.4 and a repetition penalty of 1.1 applied over a sliding 64-token window.

  • Integrated DAC audio encoder for improved fidelity
  • Automatic word alignment system
  • Native multilingual text support
  • Enhanced metadata integration for improved speaker flow
  • One-shot voice cloning capability
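The windowed repetition penalty mentioned above (1.1 over the most recent 64 tokens) can be sketched in plain Python. This is an illustrative implementation of a CTRL-style penalty restricted to a recent window, not code from the OuteTTS library; the function name and logit values are hypothetical:

```python
def apply_windowed_repetition_penalty(logits, recent_ids, penalty=1.1, window=64):
    """Penalize tokens that appeared within the last `window` generated ids.

    Follows the common CTRL-style convention: positive logits are divided
    by the penalty, negative logits are multiplied, so repeated tokens
    always become less likely.
    """
    penalized = list(logits)
    for tok in set(recent_ids[-window:]):          # only the recent window counts
        if penalized[tok] > 0:
            penalized[tok] /= penalty
        else:
            penalized[tok] *= penalty
    return penalized

# Toy vocabulary of 3 tokens; tokens 1 and 2 were generated recently.
out = apply_windowed_repetition_penalty([2.0, -1.0, 0.5], recent_ids=[1, 2])
print(out)  # token 0 untouched; tokens 1 and 2 pushed down
```

Restricting the penalty to a 64-token window discourages immediate repetition (stutters, duplicated words) without permanently suppressing tokens that legitimately recur later in long audio sequences.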

Core Capabilities

  • Supports 12+ languages with high proficiency (including English, Arabic, Chinese, Japanese)
  • One-shot voice cloning with just 10 seconds of reference audio
  • Direct numerical input support across languages
  • Automatic text alignment for languages without clear word boundaries
  • Generation of approximately 42 seconds of audio in a single run

Frequently Asked Questions

Q: What makes this model unique?

The model stands out for its ability to perform one-shot voice cloning with minimal reference audio, automatic word alignment, and support for native text across multiple languages without requiring romanization. It also integrates advanced metadata handling for improved synthesis quality.

Q: What are the recommended use cases?

The model is ideal for applications requiring multilingual text-to-speech conversion, voice cloning for personalized audio content, and scenarios where high-quality speech synthesis is needed across different languages. It's particularly suitable for applications requiring natural-sounding speech with accurate pronunciation and emotional expression.
