Ultravox v0.4 Llama 3.1 70B

Property	Value
Parameter Count	50.3M (adapter)
License	MIT
Supported Languages	9 (en, ar, de, es, fr, it, ja, pt, ru)
Architecture	Llama 3.1 70B + Whisper-medium
Training Hardware	8x H100 GPUs

What is ultravox-v0_4-llama-3_1-70b?

Ultravox is a multimodal speech language model that pairs a Llama 3.1-70B-Instruct backbone with a Whisper-medium encoder. It accepts both speech and text as input, which suits it to voice-based applications and multilingual speech processing.

Implementation Details

Audio enters the prompt through a special <|audio|> pseudo-token, which is replaced with embeddings derived from the input audio. Training used BF16 mixed precision. On 4x H100 SXM GPUs, the model reports a time-to-first-token of approximately 400 ms and a generation speed of 50-100 tokens per second.
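The pseudo-token substitution can be pictured with a small sketch. This is not the Ultravox implementation: the function names, the toy embedding function, and the two-dimensional vectors are all hypothetical, chosen only to show one placeholder expanding into several audio-derived vectors.

```python
from typing import Callable, List, Sequence

AUDIO_TOKEN = "<|audio|>"  # placeholder string named in the model description


def splice_audio_embeddings(
    tokens: Sequence[str],
    text_embed: Callable[[str], List[float]],
    audio_embeds: Sequence[List[float]],
) -> List[List[float]]:
    """Embed text tokens and expand the single audio placeholder.

    The placeholder occupies one position in the prompt but is replaced
    by every audio-derived vector, so the output can be longer than the input.
    """
    if list(tokens).count(AUDIO_TOKEN) != 1:
        raise ValueError("expected exactly one audio placeholder")
    sequence: List[List[float]] = []
    for tok in tokens:
        if tok == AUDIO_TOKEN:
            sequence.extend(audio_embeds)  # one token becomes many vectors
        else:
            sequence.append(text_embed(tok))
    return sequence


def toy_text_embed(tok: str) -> List[float]:
    # Hypothetical stand-in for the language model's embedding table.
    return [float(len(tok)), 0.0]


tokens = ["Transcribe", ":", AUDIO_TOKEN]
audio_frames = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]  # made-up audio vectors
sequence = splice_audio_embeddings(tokens, toy_text_embed, audio_frames)
print(len(sequence))
```

In this toy case, two text tokens plus three audio vectors give a sequence of five embeddings that a language model could consume like any other input.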

Built on Llama 3.1-70B-Instruct backbone
Incorporates Whisper-medium encoder
Trained using knowledge-distillation loss
Supports 9 different languages
Achieves 30.30 BLEU score for en_de translation
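
The knowledge-distillation loss mentioned above is not spelled out here, so the following is only a sketch of one common formulation: the mean KL divergence between a teacher's and a student's next-token distributions. The assumption that the teacher is the text-only backbone and the student is the audio-conditioned model, along with the function names and toy logits, is illustrative rather than taken from the training code.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one vocabulary row."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def kd_loss(
    teacher_logits: Sequence[Sequence[float]],
    student_logits: Sequence[Sequence[float]],
) -> float:
    """Mean KL(teacher || student) across token positions."""
    if not teacher_logits or len(teacher_logits) != len(student_logits):
        raise ValueError("need equal, non-zero numbers of positions")
    total = 0.0
    for t_row, s_row in zip(teacher_logits, student_logits):
        log_p = log_softmax(t_row)  # teacher distribution (assumed: text input)
        log_q = log_softmax(s_row)  # student distribution (assumed: audio input)
        total += sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
    return total / len(teacher_logits)


# Toy logits: 2 positions, vocabulary of 3 tokens.
teacher = [[2.0, 0.5, -1.0], [0.0, 1.0, 0.0]]
student = [[1.5, 0.7, -0.5], [0.2, 0.8, 0.1]]
print(kd_loss(teacher, student))
```

The loss is zero when the student reproduces the teacher's distribution exactly and grows as the two diverge.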

Core Capabilities

Speech-to-text processing
Multilingual support
Voice agent functionality
Speech-to-speech translation
Spoken audio analysis
Low latency response generation
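
To put the latency claim in concrete terms, the sketch below estimates end-to-end response time from the figures quoted under Implementation Details (about 400 ms to first token, 50-100 tokens per second). The 100-token reply length and the simple additive model are assumptions for illustration; real latency also depends on audio length, batching, and network overhead.

```python
def response_time_s(
    num_tokens: int,
    ttft_s: float = 0.4,          # time-to-first-token quoted for 4x H100 SXM
    tokens_per_s: float = 50.0,   # lower end of the quoted 50-100 range
) -> float:
    """Rough estimate: first-token delay plus steady-state generation time."""
    if num_tokens < 0 or tokens_per_s <= 0:
        raise ValueError("num_tokens must be >= 0 and tokens_per_s > 0")
    return ttft_s + num_tokens / tokens_per_s


# Hypothetical 100-token reply at both ends of the quoted speed range.
slow = response_time_s(100, tokens_per_s=50.0)
fast = response_time_s(100, tokens_per_s=100.0)
print(f"{fast:.1f} s to {slow:.1f} s")
```

Under these assumptions a 100-token reply completes in roughly 1.4 to 2.4 seconds, with the first token arriving after about 0.4 seconds.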

Frequently Asked Questions

Q: What makes this model unique?

The model accepts speech and text in the same prompt, supports nine languages, and reports a time-to-first-token of roughly 400 ms on 4x H100 SXM GPUs. Its adapter-based design is also notable: the base models stay frozen, and only the multimodal adapter is trained.
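
A quick calculation shows how small the trained portion is. The sketch below uses the 50.3M adapter figure from the property table and a nominal 70B for the Llama backbone (taken from the model name, not an exact count). The Whisper-medium encoder is left out because its size is not given here, so the true fraction would be slightly smaller still. The `Component` class is hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Component:
    name: str
    params: float     # parameter count
    trainable: bool   # False means the weights are kept frozen


def trainable_fraction(components: Sequence[Component]) -> float:
    """Share of all listed parameters that receive gradient updates."""
    total = sum(c.params for c in components)
    trained = sum(c.params for c in components if c.trainable)
    return trained / total


components = [
    Component("Llama 3.1 70B backbone (nominal size)", 70e9, trainable=False),
    Component("multimodal adapter", 50.3e6, trainable=True),
]
fraction = trainable_fraction(components)
print(f"{fraction:.4%} of parameters are trained")
```

With these numbers, well under 0.1% of the parameters are updated during training.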

Q: What are the recommended use cases?

The model suits voice agents, speech-to-speech translation, audio analysis, and other applications that require multilingual speech understanding. It is particularly well suited to interactive voice applications that need quick response times.