Ultravox-v0_4_1-mistral-nemo
| Property | Value |
|---|---|
| Parameter Count | 52.4M |
| License | MIT |
| Supported Languages | 15, including English, Arabic, and Chinese |
| Training Hardware | 8x H100 GPUs |
| Precision | BF16 mixed precision |
What is ultravox-v0_4_1-mistral-nemo?
Ultravox is a multimodal speech LLM built on the Mistral-Nemo-Instruct-2407 and whisper-large-v3-turbo backbones. It is designed to accept both speech and text within a single prompt, making it a versatile foundation for voice-based applications and spoken-audio analysis.
Implementation Details
The model uses an architecture in which speech input is represented by a special `<|audio|>` pseudo-token that is replaced with audio-derived embeddings. The Whisper encoder and Mistral backbone are kept frozen; only the multimodal adapter is trained, using a knowledge-distillation loss that matches the logits produced by the text-based Mistral backbone.
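The pseudo-token mechanism can be pictured as a splice: the placeholder's position in the text embedding sequence is replaced by the adapter's audio embeddings. The sketch below is illustrative only, not the actual Ultravox implementation; the function name, toy 1-d embeddings, and list-based representation are all assumptions made for clarity.

```python
# Illustrative sketch (NOT the real Ultravox code): splicing audio-derived
# embeddings into a text embedding sequence at the pseudo-token position.
AUDIO_TOKEN = "<|audio|>"

def splice_audio_embeddings(tokens, text_embeds, audio_embeds):
    """Replace the pseudo-token's embedding with the audio embeddings.

    tokens:       list of token strings containing AUDIO_TOKEN once
    text_embeds:  one embedding vector (list of floats) per token
    audio_embeds: adapter output, a list of embedding vectors
    """
    idx = tokens.index(AUDIO_TOKEN)
    # Keep the text embeddings on either side; drop the placeholder itself.
    return text_embeds[:idx] + audio_embeds + text_embeds[idx + 1:]

tokens = ["<s>", "Transcribe:", "<|audio|>", "</s>"]
text_embeds = [[0.0], [0.1], [0.0], [0.2]]   # toy 1-d embeddings
audio_embeds = [[0.9], [0.8], [0.7]]         # three audio frames
merged = splice_audio_embeddings(tokens, text_embeds, audio_embeds)
print(len(merged))  # 3 remaining text embeddings + 3 audio embeddings = 6
```

The backbone then attends over the merged sequence exactly as it would over ordinary token embeddings, which is why the rest of the model can stay frozen.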
- Time-to-first-token (TTFT): ~150ms
- Processing speed: 50-100 tokens/second on A100-40GB GPU
- Training dataset: Mix of ASR datasets with Mistral Nemo-generated continuations
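The knowledge-distillation objective mentioned above can be sketched as a KL divergence between the teacher's and student's next-token distributions: the speech-input student is pushed to match the logits the text-only backbone produces for the same content. This is a pure-Python toy with made-up logit values, not the training code.

```python
# Toy sketch of a distillation loss: KL(teacher || student) over one
# vocabulary distribution. All logit values below are invented.
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kl_divergence(teacher_logits, student_logits):
    """KL(teacher || student) for a single next-token prediction."""
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

teacher = [2.0, 0.5, -1.0]   # text-based backbone logits (toy)
student = [1.8, 0.7, -0.9]   # speech-input student logits (toy)
loss = kl_divergence(teacher, student)
print(loss >= 0.0)  # KL divergence is always non-negative → True
```

Minimizing this loss drives the adapter to make speech input "look like" the equivalent text to the frozen backbone.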
Core Capabilities
- Multimodal processing of speech and text inputs
- Support for 15 different languages
- Speech-to-speech translation
- Voice agent functionality
- Spoken audio analysis
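For voice-agent use, input is typically supplied as a chat-style list of turns, with the audio placeholder marking where the user's speech is injected. The structure below is a hypothetical sketch following common chat-template conventions; the field names and helper are assumptions, not the model's documented API.

```python
# Hypothetical turn structure for a voice agent (field names are
# assumptions based on common chat-template conventions).
def build_turns(system_prompt, user_content):
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_content},
    ]

turns = build_turns(
    "You are a friendly voice assistant.",
    "<|audio|>",  # replaced with audio embeddings by the model
)
print(turns[1]["content"])
```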
Frequently Asked Questions
Q: What makes this model unique?
A: Its ability to accept both speech and text in a single prompt, combined with an efficient architecture that pairs Mistral and Whisper backbones. The use of knowledge distillation during training and support for 15 languages further set it apart.
Q: What are the recommended use cases?
A: The model is well suited to voice-agent applications, speech-to-speech translation, audio analysis, and any scenario requiring combined speech and text processing. Its broad language support makes it particularly useful for multilingual applications.