Ultravox v0.3
| Property | Value |
|---|---|
| Parameter Count | 8.06B |
| Model Type | Multimodal Speech LLM |
| License | MIT |
| Tensor Type | BF16 |
| Repository | https://ultravox.ai |
What is ultravox-v0_3?
Ultravox v0.3 is a multimodal Speech Language Model that combines a Llama3.1-8B-Instruct backbone with a Whisper-small audio encoder. It is designed to accept both speech and text inputs seamlessly, making it a versatile tool for voice-based applications and natural language processing tasks.
Implementation Details
The model uses a frozen Llama3.1-8B-Instruct backbone and a Whisper-small encoder; only the multimodal adapter is trained. Audio input is represented by a special <|audio|> pseudo-token in the prompt, which is replaced with audio-derived embeddings before the language model runs. Training was conducted in BF16 mixed precision on 8x H100 GPUs. At inference, the model achieves a time-to-first-token of about 200ms and 50-100 tokens per second on an A100-40GB GPU.
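The pseudo-token mechanism described above can be sketched in isolation: the text is embedded normally, and the single slot occupied by <|audio|> is spliced out and replaced with the (variable-length) sequence of audio embeddings produced by the encoder and adapter. This is a minimal NumPy illustration, not Ultravox's actual code; the function name and toy shapes are invented for the example.

```python
import numpy as np

def splice_audio_embeddings(text_embeds, audio_embeds, audio_pos):
    """Replace the single <|audio|> slot at index `audio_pos` in the text
    embedding sequence with the audio-derived embedding sequence.
    (Illustrative sketch; not the actual Ultravox implementation.)"""
    return np.concatenate(
        [text_embeds[:audio_pos], audio_embeds, text_embeds[audio_pos + 1:]],
        axis=0,
    )

# Toy shapes: 5 text-token embeddings, 3 audio-frame embeddings, hidden size 4.
text = np.arange(20, dtype=np.float32).reshape(5, 4)
audio = np.ones((3, 4), dtype=np.float32)  # stands in for adapter output
merged = splice_audio_embeddings(text, audio, audio_pos=2)
print(merged.shape)  # (7, 4): the one pseudo-token became three audio frames
```

The language model then attends over this merged sequence exactly as it would over ordinary token embeddings, which is why the backbone can stay frozen.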
- Built on a Llama3.1-8B-Instruct backbone with a Whisper-small encoder
- Knowledge-distillation training approach
- Multimodal processing capabilities
- High-performance metrics (BLEU scores: 22.68 for en_de, 24.10 for es_en)
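The card notes a knowledge-distillation training approach but does not spell out the loss. A common formulation, shown here only as an illustrative sketch, is the KL divergence between teacher and student distributions over temperature-softened logits, scaled by T²; the function names and toy logits are invented for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def kd_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions,
    scaled by T^2 as in standard knowledge distillation.
    (Generic formulation; not necessarily Ultravox's exact loss.)"""
    t = softmax(teacher_logits / temperature)
    s = softmax(student_logits / temperature)
    kl = (t * (np.log(t) - np.log(s))).sum(axis=-1).mean()
    return float(kl * temperature ** 2)

teacher = np.array([[2.0, 1.0, 0.1]])  # toy next-token logits
student = np.array([[1.5, 1.2, 0.3]])
print(kd_loss(student, teacher))  # small positive value; 0 when distributions match
```

Because only the adapter is trained, a distillation-style objective lets the small trainable component learn to produce embeddings whose downstream predictions match a reference distribution, without updating the frozen backbone.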
Core Capabilities
- Speech and text input processing
- Voice agent functionality
- Speech-to-speech translation
- Spoken audio analysis
- Low latency response generation
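The latency figures quoted earlier (roughly 200ms time-to-first-token, 50-100 tokens per second on an A100-40GB) allow a back-of-envelope estimate of end-to-end response time for an interactive voice turn. The helper below is a hypothetical calculation based on those published numbers, not a benchmark.

```python
def response_latency_ms(num_tokens, ttft_ms=200.0, tokens_per_s=75.0):
    """Rough generation latency: time-to-first-token plus decode time for
    the remaining tokens at a steady-state rate (75 tok/s is the midpoint
    of the card's 50-100 tok/s range; figures are illustrative)."""
    return ttft_ms + (num_tokens - 1) / tokens_per_s * 1000.0

# A one-token reply is bounded by TTFT alone.
print(response_latency_ms(1))    # 200.0 ms
# A ~76-token spoken reply lands around 1.2 s before TTS even starts.
print(response_latency_ms(76))   # 1200.0 ms
```

Estimates like this are why the card highlights the model for interactive voice agents: short conversational replies stay well under typical turn-taking latency budgets.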
Frequently Asked Questions
Q: What makes this model unique?
The model's ability to process both speech and text inputs through a unified architecture, combined with its impressive performance metrics and relatively small footprint for its capabilities, makes it stand out in the field of multimodal AI models.
Q: What are the recommended use cases?
Ultravox v0.3 is ideal for voice agent applications, speech-to-speech translation, audio analysis, and any scenario requiring both speech and text processing capabilities. It's particularly effective for interactive voice applications requiring quick response times.