MetaVoice-1B-v0.1
| Property | Value |
|---|---|
| Model Size | 1.2B parameters |
| License | Apache 2.0 |
| Language | English |
| Training Data | 100K hours of speech |
What is MetaVoice-1B-v0.1?
MetaVoice-1B-v0.1 is a text-to-speech (TTS) model designed to generate natural, emotionally expressive speech. Built with 1.2 billion parameters and trained on 100,000 hours of speech data, it offers voice cloning with minimal reference audio, alongside zero-shot and long-form synthesis.
Implementation Details
The model employs a multi-stage architecture: a causal GPT model predicts the first two hierarchies of EnCodec tokens, a non-causal transformer predicts the remaining hierarchies, and multi-band diffusion generates the waveform from the full token set. It uses Flash Decoding for KV-cached inference and supports efficient batching of inputs with varying lengths.
- Custom BPE tokenizer with 512 tokens
- Two-hierarchy prediction system with flattened interleaving
- Condition-free sampling for enhanced cloning
- DeepFilterNet for artifact cleanup
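The "flattened interleaving" mentioned above can be sketched in a few lines: the two token hierarchies predicted per audio frame are woven into a single flat stream that the causal GPT can model autoregressively. Token values and helper names here are illustrative, not the model's actual implementation:

```python
# Toy sketch of two-hierarchy flattened interleaving: per-frame tokens from
# hierarchy 1 and hierarchy 2 become one flat autoregressive stream.
# Token values are made up for illustration.

def flatten_interleave(hier1, hier2):
    """Interleave two per-frame token hierarchies into one flat stream."""
    assert len(hier1) == len(hier2)
    flat = []
    for t1, t2 in zip(hier1, hier2):
        flat.extend([t1, t2])
    return flat

def unflatten(flat):
    """Recover the two hierarchies from the interleaved stream."""
    return flat[0::2], flat[1::2]

hier1 = [101, 102, 103]  # hierarchy-1 tokens, one per audio frame
hier2 = [201, 202, 203]  # hierarchy-2 tokens, one per audio frame
flat = flatten_interleave(hier1, hier2)
print(flat)  # [101, 201, 102, 202, 103, 203]
assert unflatten(flat) == (hier1, hier2)
```

Interleaving keeps the two hierarchies time-aligned in a single sequence, so a standard causal decoder can predict both without architectural changes.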
Core Capabilities
- Emotional speech synthesis with natural rhythm and tone
- Voice cloning with as little as 1 minute of training data
- Zero-shot cloning for American & British voices (30s reference)
- Long-form synthesis support
- Batched processing of inputs with varying text lengths
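These capabilities rest on the staged pipeline described under Implementation Details. The following sketch illustrates the data flow with placeholder stage functions; all names, the hierarchy count of 8, and the 320-sample frame hop are assumptions for illustration, not the real MetaVoice API:

```python
# Illustrative data-flow sketch of the inference pipeline. Every function
# body is a stand-in; shapes and constants are assumptions, not real values.

def gpt_predict_coarse(text, speaker_ref):
    # Stage 1: causal GPT predicts the first two EnCodec token hierarchies,
    # conditioned on text and a speaker reference (e.g. a 30s clip).
    return [[0, 0] for _ in range(len(text))]  # dummy: 2 tokens per frame

def transformer_predict_fine(coarse):
    # Stage 2: non-causal transformer fills in the remaining hierarchies.
    return [frame + [0] * 6 for frame in coarse]  # dummy: 8 hierarchies total

def diffusion_to_waveform(tokens):
    # Stage 3: multi-band diffusion turns tokens into audio samples.
    return [0.0] * (len(tokens) * 320)  # dummy: assumed 320-sample hop

def denoise(waveform):
    # Stage 4: DeepFilterNet-style cleanup of diffusion artifacts.
    return waveform

coarse = gpt_predict_coarse("Hello world", speaker_ref="reference.wav")
fine = transformer_predict_fine(coarse)
audio = denoise(diffusion_to_waveform(fine))
```

The split between a causal coarse stage and a non-causal fine stage is what allows long-form, batched generation: only the coarse tokens are produced autoregressively.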
Frequently Asked Questions
Q: What makes this model unique?
The model's ability to achieve high-quality voice cloning with minimal training data (1 minute) and zero-shot capabilities for specific accents sets it apart. It also maintains emotional fidelity without hallucinations, making it particularly reliable for production use.
Q: What are the recommended use cases?
The model is ideal for applications requiring emotional text-to-speech conversion, voice cloning services, and long-form content generation. It's particularly suitable for projects needing quick voice adaptation with minimal training data.