Ultravox v0.4 Llama 3.1 70B
Property | Value |
---|---|
Parameter Count | 50.3M (adapter) |
License | MIT |
Supported Languages | 9 (en, ar, de, es, fr, it, ja, pt, ru) |
Architecture | Llama 3.1 70B + Whisper-medium |
Training Hardware | 8x H100 GPUs |
What is ultravox-v0_4-llama-3_1-70b?
Ultravox is a multimodal speech language model that combines Llama 3.1-70B-Instruct with the Whisper-medium encoder to accept both speech and text as input. Because it understands audio directly alongside text, it is suited to voice-based applications and multilingual speech processing.
Implementation Details
Audio enters the model through a special <|audio|> pseudo-token in the text prompt, which is replaced with embeddings derived from the input audio. The model was trained with BF16 mixed precision. On 4x H100 SXM GPUs it reaches a time-to-first-token of approximately 400 ms and a generation speed of 50-100 tokens per second.
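The pseudo-token substitution described above can be illustrated with a minimal sketch. It is not the Ultravox implementation: the function name `splice_audio_embeddings`, the toy two-dimensional vectors, and the assumption of exactly one placeholder per prompt are all hypothetical and chosen only to show the idea of swapping one placeholder position for a run of audio-derived embeddings.

```python
AUDIO_TOKEN = "<|audio|>"


def splice_audio_embeddings(tokens, text_embeddings, audio_embeddings):
    """Replace the single audio placeholder with the audio embedding frames.

    tokens: list of token strings for the prompt
    text_embeddings: one vector per token (same length as tokens)
    audio_embeddings: vectors derived from the audio (any length)
    """
    if len(tokens) != len(text_embeddings):
        raise ValueError("tokens and text_embeddings must have the same length")
    # Assumption for this sketch: exactly one placeholder per prompt.
    if tokens.count(AUDIO_TOKEN) != 1:
        raise ValueError("expected exactly one audio placeholder")
    pos = tokens.index(AUDIO_TOKEN)
    # Keep everything before and after the placeholder, drop the placeholder's
    # own embedding, and insert the audio frames in its place.
    return text_embeddings[:pos] + list(audio_embeddings) + text_embeddings[pos + 1:]


# Toy data: 4 prompt tokens, 3 audio frames, 2-dimensional embeddings.
tokens = ["Transcribe", ":", AUDIO_TOKEN, "please"]
text_embeddings = [[0.1, 0.2], [0.3, 0.4], [0.0, 0.0], [0.5, 0.6]]
audio_embeddings = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]

merged = splice_audio_embeddings(tokens, text_embeddings, audio_embeddings)
print(len(merged))  # 4 tokens - 1 placeholder + 3 audio frames = 6
```

The merged sequence is what a language model backbone would consume in place of the original token embeddings. In a real system the audio frames would come from the speech encoder and adapter rather than being hard-coded.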
- Built on Llama 3.1-70B-Instruct backbone
- Incorporates Whisper-medium encoder
- Trained using knowledge-distillation loss
- Supports 9 different languages
- Achieves a BLEU score of 30.30 on en_de translation
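The list above mentions a knowledge-distillation loss without giving its formula. One common formulation, shown here purely as an illustrative assumption, is the KL divergence between a teacher's and a student's next-token distributions. The helper names `log_softmax` and `kd_loss` and the toy logits are hypothetical. The source does not state which formulation was actually used.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def kd_loss(teacher_logits, student_logits):
    """KL(teacher || student) for a single token position.

    Assumed formulation: the student is pushed to match the teacher's
    output distribution. The value is 0 when the distributions agree.
    """
    log_p = log_softmax(teacher_logits)
    log_q = log_softmax(student_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


# Toy logits over a 3-token vocabulary.
teacher = [2.0, 0.5, -1.0]
matching_student = [2.0, 0.5, -1.0]
diverging_student = [-1.0, 0.5, 2.0]

print(kd_loss(teacher, matching_student))       # zero: distributions agree
print(kd_loss(teacher, diverging_student) > 0)  # True: distributions differ
```

In practice this per-position loss would be averaged over all output positions and computed with a tensor library. The pure-Python version only shows the arithmetic.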
Core Capabilities
- Speech-to-text processing
- Multilingual support
- Voice agent functionality
- Speech-to-speech translation
- Spoken audio analysis
- Low latency response generation
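To put the low-latency claim in concrete terms, the time-to-first-token and generation-speed figures quoted under Implementation Details can be combined into a rough estimate of response time. The sketch below assumes a simple model (time-to-first-token plus tokens divided by generation speed). The function name `estimate_response_seconds` is hypothetical, and real latency also depends on factors such as network overhead and audio length that this ignores.

```python
def estimate_response_seconds(num_tokens, ttft_s=0.4, tokens_per_s=50.0):
    """Rough response time: time-to-first-token plus generation time.

    Defaults use the approximate figures reported for 4x H100 SXM GPUs
    (about 400 ms time-to-first-token, 50-100 tokens per second).
    """
    if num_tokens < 0:
        raise ValueError("num_tokens must be non-negative")
    if tokens_per_s <= 0:
        raise ValueError("tokens_per_s must be positive")
    return ttft_s + num_tokens / tokens_per_s


# A 100-token reply at the slow and fast ends of the reported range.
slow = estimate_response_seconds(100, tokens_per_s=50.0)
fast = estimate_response_seconds(100, tokens_per_s=100.0)
print(f"{slow:.1f} s")  # 0.4 + 100/50  = 2.4 s
print(f"{fast:.1f} s")  # 0.4 + 100/100 = 1.4 s
```

For a voice agent, the first figure matters most: a streaming client can start speaking the reply once the first tokens arrive, so the perceived delay is closer to the time-to-first-token than to the full generation time.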
Frequently Asked Questions
Q: What makes this model unique?
The model accepts speech and text in the same prompt and supports nine languages. Its adapter-based architecture keeps both base models (Llama 3.1-70B-Instruct and Whisper-medium) frozen and trains only the multimodal adapter, which is the component the 50.3M parameter count refers to.
Q: What are the recommended use cases?
The model is suited to voice agents, speech-to-speech translation, spoken audio analysis, and other applications that need multilingual speech understanding. Its time-to-first-token of roughly 400 ms makes it a good fit for interactive voice applications that require quick responses.