Ultravox-v0_4_1-mistral-nemo
| Property | Value |
|---|---|
| Parameter Count | 52.4M |
| License | MIT |
| Supported Languages | 15, including English, Arabic, and Chinese |
| Training Hardware | 8x H100 GPUs |
| Precision | BF16 mixed precision |
What is ultravox-v0_4_1-mistral-nemo?
Ultravox is a multimodal speech LLM built on the Mistral-Nemo-Instruct-2407 and whisper-large-v3-turbo backbones. It is designed to accept both speech and text within a single prompt, making it a versatile foundation for voice-based applications and spoken-audio analysis.
Implementation Details
The model uses an architecture in which speech input is represented by a special `<|audio|>` pseudo-token that is replaced with audio-derived embeddings. The Whisper encoder and Mistral backbone are kept frozen; only the multimodal adapter is trained, using a knowledge-distillation loss that matches the logits produced by the text-based Mistral backbone.
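The pseudo-token mechanism can be pictured as a splice: the placeholder's position in the text embedding sequence is replaced by the adapter's audio embeddings. The sketch below is illustrative only, not the actual Ultravox implementation; the function name, toy 1-d embeddings, and list-based representation are all assumptions made for clarity.

```python
# Illustrative sketch (NOT the real Ultravox code): splicing audio-derived
# embeddings into a text embedding sequence at the pseudo-token position.
AUDIO_TOKEN = "<|audio|>"

def splice_audio_embeddings(tokens, text_embeds, audio_embeds):
    """Replace the pseudo-token's embedding with the audio embeddings.

    tokens:       list of token strings containing AUDIO_TOKEN once
    text_embeds:  one embedding vector (list of floats) per token
    audio_embeds: adapter output, a list of embedding vectors
    """
    idx = tokens.index(AUDIO_TOKEN)
    # Keep the text embeddings on either side; drop the placeholder itself.
    return text_embeds[:idx] + audio_embeds + text_embeds[idx + 1:]

tokens = ["<s>", "Transcribe:", "<|audio|>", "</s>"]
text_embeds = [[0.0], [0.1], [0.0], [0.2]]   # toy 1-d embeddings
audio_embeds = [[0.9], [0.8], [0.7]]         # three audio frames
merged = splice_audio_embeddings(tokens, text_embeds, audio_embeds)
print(len(merged))  # 3 remaining text embeddings + 3 audio embeddings = 6
```

The backbone then attends over the merged sequence exactly as it would over ordinary token embeddings, which is why the rest of the model can stay frozen.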
- Time-to-first-token (TTFT): ~150ms
- Processing speed: 50-100 tokens/second on A100-40GB GPU
- Training dataset: Mix of ASR datasets with Mistral Nemo-generated continuations
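The knowledge-distillation objective mentioned above can be sketched as a KL divergence between the teacher's and student's next-token distributions: the speech-input student is pushed to match the logits the text-only backbone produces for the same content. This is a pure-Python toy with made-up logit values, not the training code.

```python
# Toy sketch of a distillation loss: KL(teacher || student) over one
# vocabulary distribution. All logit values below are invented.
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kl_divergence(teacher_logits, student_logits):
    """KL(teacher || student) for a single next-token prediction."""
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

teacher = [2.0, 0.5, -1.0]   # text-based backbone logits (toy)
student = [1.8, 0.7, -0.9]   # speech-input student logits (toy)
loss = kl_divergence(teacher, student)
print(loss >= 0.0)  # KL divergence is always non-negative → True
```

Minimizing this loss drives the adapter to make speech input "look like" the equivalent text to the frozen backbone.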
Core Capabilities
- Multimodal processing of speech and text inputs
- Support for 15 different languages
- Speech-to-speech translation
- Voice agent functionality
- Spoken audio analysis
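For voice-agent use, input is typically supplied as a chat-style list of turns, with the audio placeholder marking where the user's speech is injected. The structure below is a hypothetical sketch following common chat-template conventions; the field names and helper are assumptions, not the model's documented API.

```python
# Hypothetical turn structure for a voice agent (field names are
# assumptions based on common chat-template conventions).
def build_turns(system_prompt, user_content):
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_content},
    ]

turns = build_turns(
    "You are a friendly voice assistant.",
    "<|audio|>",  # replaced with audio embeddings by the model
)
print(turns[1]["content"])
```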
Frequently Asked Questions
Q: What makes this model unique?
A: Its ability to accept both speech and text in a single prompt, combined with an efficient architecture that pairs Mistral and Whisper backbones. The use of knowledge distillation during training and support for 15 languages further set it apart.
Q: What are the recommended use cases?
A: The model is well suited to voice-agent applications, speech-to-speech translation, audio analysis, and any scenario requiring combined speech and text processing. Its broad language support makes it particularly useful for multilingual applications.