MARS5-TTS
Property | Value |
---|---|
License | AGPL-3.0 |
Model Type | Text-to-Speech |
Architecture | Two-stage AR-NAR pipeline |
Parameters | ~1.2B (AR: 750M, NAR: 450M) |
What is MARS5-TTS?
MARS5-TTS is a text-to-speech model developed by CAMB.AI. It uses a two-stage AR-NAR pipeline: an autoregressive (AR) stage followed by a non-autoregressive (NAR) stage, with the NAR stage being the novel part of the design. The model can generate natural-sounding speech with controllable prosody from as little as 5 seconds of reference audio.
Implementation Details
The architecture has two main components: an AR model with 750M parameters and a NAR model with 450M parameters, for roughly 1.2B parameters in total. Both text and EnCodec audio codes are tokenized with byte-pair encoding (BPE). Inference requires at least 20 GB of GPU VRAM. The system operates at 24 kHz and supports two voice-cloning modes, shallow and deep.
- Two-stage processing pipeline with AR and NAR components
- Byte-pair encoding tokenization for both text and EnCodec audio codes
- Support for both fast (shallow) and high-quality (deep) cloning
- Prosody control through punctuation and capitalization
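As a small worked example of the figures above, the sketch below converts the 5-second reference length into a sample count at the 24 kHz operating rate. The function and constant names are hypothetical and are not part of the MARS5-TTS codebase, and treating 5 seconds as a hard lower bound is an assumption made for illustration only.

```python
# Illustrative arithmetic only; none of these names come from MARS5-TTS itself.
SAMPLE_RATE_HZ = 24_000        # operating sample rate stated in this card
MIN_REFERENCE_SECONDS = 5.0    # quoted reference length, used here as a floor (assumption)


def seconds_to_samples(seconds: float, sample_rate: int = SAMPLE_RATE_HZ) -> int:
    """Return the number of mono samples in a clip of the given duration."""
    return int(round(seconds * sample_rate))


def is_reference_long_enough(
    num_samples: int,
    sample_rate: int = SAMPLE_RATE_HZ,
    min_seconds: float = MIN_REFERENCE_SECONDS,
) -> bool:
    """Check whether a mono clip meets the assumed minimum reference duration."""
    return num_samples >= seconds_to_samples(min_seconds, sample_rate)
```

Under these assumptions, a 5-second reference clip corresponds to 120,000 samples, so a loader could reject shorter clips before they reach the model.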
Core Capabilities
- Voice cloning from just 5 seconds of reference audio
- Natural prosody generation for diverse scenarios including sports commentary and anime
- Controllable speech characteristics through text formatting
- Support for reference transcripts in deep cloning mode
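The relationship between the two cloning modes and the reference transcript can be sketched as a small request object. This is a minimal illustration, not the MARS5-TTS API: `CloneRequest` and its fields are hypothetical names, and the rule that deep cloning requires a transcript (rather than merely accepting one) is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CloneRequest:
    """Hypothetical container for one voice-cloning request (not the real API)."""

    text: str                                  # text to synthesize
    reference_wav_path: str                    # path to the reference clip
    deep_clone: bool = False                   # False = shallow (fast), True = deep (higher quality)
    reference_transcript: Optional[str] = None # transcript of the reference clip

    def validate(self) -> None:
        """Raise ValueError if the request is inconsistent."""
        if not self.text.strip():
            raise ValueError("text must not be empty")
        # Assumption: deep cloning needs the transcript of the reference audio.
        if self.deep_clone and not (self.reference_transcript or "").strip():
            raise ValueError("deep cloning needs a transcript of the reference audio")


# Prosody is steered through the text itself (punctuation and capitalization).
shallow = CloneRequest(text="And he SCORES! What a finish.", reference_wav_path="ref.wav")
shallow.validate()  # passes: shallow cloning needs no transcript in this sketch
```

Validating the request up front keeps the mode choice explicit: a caller that switches `deep_clone` on without supplying a transcript fails early instead of partway through inference.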
Frequently Asked Questions
Q: What makes this model unique?
MARS5-TTS stands out for generating high-quality speech with natural prosody from about 5 seconds of reference audio. Its novel NAR component and the ability to control speech characteristics through punctuation and capitalization make it particularly versatile.
Q: What are the recommended use cases?
The model is well-suited for applications requiring high-quality voice cloning, including dubbing, content creation, and personalized speech synthesis. It's particularly effective for scenarios requiring dynamic prosody like sports commentary or character voice acting.