MetaVoice-1B-v0.1
| Property | Value |
|---|---|
| Model Size | 1.2B parameters |
| License | Apache 2.0 |
| Language | English |
| Training Data | 100K hours of speech |
What is MetaVoice-1B-v0.1?
MetaVoice-1B-v0.1 is a text-to-speech (TTS) model designed to generate natural, emotionally expressive speech. Built with 1.2 billion parameters and trained on 100,000 hours of speech data, it offers voice cloning with minimal reference audio, alongside zero-shot and long-form synthesis.
Implementation Details
The model employs a multi-stage architecture: a causal GPT model predicts the first two hierarchies of EnCodec tokens, a non-causal transformer predicts the remaining hierarchies, and multi-band diffusion generates the waveform from the full token set. It uses Flash Decoding for KV-cached inference and supports efficient batching of inputs with varying lengths.
- Custom BPE tokenizer with 512 tokens
- Two-hierarchy prediction system with flattened interleaving
- Condition-free sampling for enhanced cloning
- DeepFilterNet for artifact cleanup
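The "flattened interleaving" mentioned above can be sketched in a few lines: the two token hierarchies predicted per audio frame are woven into a single flat stream that the causal GPT can model autoregressively. Token values and helper names here are illustrative, not the model's actual implementation:

```python
# Toy sketch of two-hierarchy flattened interleaving: per-frame tokens from
# hierarchy 1 and hierarchy 2 become one flat autoregressive stream.
# Token values are made up for illustration.

def flatten_interleave(hier1, hier2):
    """Interleave two per-frame token hierarchies into one flat stream."""
    assert len(hier1) == len(hier2)
    flat = []
    for t1, t2 in zip(hier1, hier2):
        flat.extend([t1, t2])
    return flat

def unflatten(flat):
    """Recover the two hierarchies from the interleaved stream."""
    return flat[0::2], flat[1::2]

hier1 = [101, 102, 103]  # hierarchy-1 tokens, one per audio frame
hier2 = [201, 202, 203]  # hierarchy-2 tokens, one per audio frame
flat = flatten_interleave(hier1, hier2)
print(flat)  # [101, 201, 102, 202, 103, 203]
assert unflatten(flat) == (hier1, hier2)
```

Interleaving keeps the two hierarchies time-aligned in a single sequence, so a standard causal decoder can predict both without architectural changes.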
Core Capabilities
- Emotional speech synthesis with natural rhythm and tone
- Voice cloning with as little as 1 minute of training data
- Zero-shot cloning for American & British voices (30s reference)
- Long-form synthesis support
- Batched processing of inputs with varying text lengths
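These capabilities rest on the staged pipeline described under Implementation Details. The following sketch illustrates the data flow with placeholder stage functions; all names, the hierarchy count of 8, and the 320-sample frame hop are assumptions for illustration, not the real MetaVoice API:

```python
# Illustrative data-flow sketch of the inference pipeline. Every function
# body is a stand-in; shapes and constants are assumptions, not real values.

def gpt_predict_coarse(text, speaker_ref):
    # Stage 1: causal GPT predicts the first two EnCodec token hierarchies,
    # conditioned on text and a speaker reference (e.g. a 30s clip).
    return [[0, 0] for _ in range(len(text))]  # dummy: 2 tokens per frame

def transformer_predict_fine(coarse):
    # Stage 2: non-causal transformer fills in the remaining hierarchies.
    return [frame + [0] * 6 for frame in coarse]  # dummy: 8 hierarchies total

def diffusion_to_waveform(tokens):
    # Stage 3: multi-band diffusion turns tokens into audio samples.
    return [0.0] * (len(tokens) * 320)  # dummy: assumed 320-sample hop

def denoise(waveform):
    # Stage 4: DeepFilterNet-style cleanup of diffusion artifacts.
    return waveform

coarse = gpt_predict_coarse("Hello world", speaker_ref="reference.wav")
fine = transformer_predict_fine(coarse)
audio = denoise(diffusion_to_waveform(fine))
```

The split between a causal coarse stage and a non-causal fine stage is what allows long-form, batched generation: only the coarse tokens are produced autoregressively.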
Frequently Asked Questions
Q: What makes this model unique?
The model's ability to achieve high-quality voice cloning with minimal training data (1 minute) and zero-shot capabilities for specific accents sets it apart. It also maintains emotional fidelity without hallucinations, making it particularly reliable for production use.
Q: What are the recommended use cases?
The model is ideal for applications requiring emotional text-to-speech conversion, voice cloning services, and long-form content generation. It's particularly suitable for projects needing quick voice adaptation with minimal training data.