ultravox-v0_4_1-mistral-nemo

Maintained by: fixie-ai


Property              Value
Parameter Count       52.4M
License               MIT
Supported Languages   15 languages including English, Arabic, Chinese, etc.
Training Hardware     8x H100 GPUs
Precision             BF16 mixed precision

What is ultravox-v0_4_1-mistral-nemo?

Ultravox is a multimodal speech LLM built on the Mistral-Nemo-Instruct-2407 and whisper-large-v3-turbo backbones. It accepts speech and text as input, including both in the same prompt, which makes it suitable for voice-based applications and speech analysis.

Implementation Details

Speech input is represented in the prompt by a special <|audio|> pseudo-token, which is replaced with embeddings derived from the audio. The Whisper encoder and the Mistral backbone are kept frozen and only the multimodal adapter is trained, using a knowledge-distillation loss that pushes the model's outputs toward the logits of the text-based Mistral backbone.
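The pseudo-token substitution described above can be illustrated with a toy example. This is a minimal sketch in plain Python, not the Ultravox implementation: the function name `splice_audio_embeddings`, the placeholder id `AUDIO_TOKEN_ID`, and the two-dimensional embeddings are all hypothetical and chosen only to show the idea of swapping one placeholder position for a run of audio-derived vectors.

```python
from typing import List, Sequence

Vector = List[float]

# Hypothetical id standing in for the <|audio|> pseudo-token (assumption).
AUDIO_TOKEN_ID = -1


def splice_audio_embeddings(
    token_ids: Sequence[int],
    text_embeddings: Sequence[Vector],
    audio_embeddings: Sequence[Vector],
    audio_token_id: int = AUDIO_TOKEN_ID,
) -> List[Vector]:
    """Replace the single audio placeholder with the audio-derived embeddings."""
    if len(token_ids) != len(text_embeddings):
        raise ValueError("expected one embedding per token id")
    if list(token_ids).count(audio_token_id) != 1:
        raise ValueError("expected exactly one audio placeholder")

    merged: List[Vector] = []
    for token_id, embedding in zip(token_ids, text_embeddings):
        if token_id == audio_token_id:
            # The placeholder contributes no embedding of its own; the
            # audio frames take over its position in the sequence.
            merged.extend(audio_embeddings)
        else:
            merged.append(embedding)
    return merged


# Toy prompt: two text tokens, the placeholder, then one more text token.
token_ids = [11, 12, AUDIO_TOKEN_ID, 13]
text_embeddings = [[0.1, 0.1], [0.2, 0.2], [0.0, 0.0], [0.3, 0.3]]
audio_embeddings = [[0.9, 0.8], [0.7, 0.6], [0.5, 0.4]]  # three audio frames

merged = splice_audio_embeddings(token_ids, text_embeddings, audio_embeddings)
print(len(merged))  # 3 text positions + 3 audio frames = 6
```

In this sketch the language model would see a sequence of six embeddings, with the three audio frames sitting where the placeholder used to be. How the real adapter produces those frames is not specified here.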

  • Time-to-first-token (TTFT): ~150ms
  • Generation speed: 50-100 tokens/second on an A100-40GB GPU
  • Training dataset: Mix of ASR datasets with Mistral Nemo-generated continuations
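The latency figures above can be turned into a rough end-to-end estimate. The sketch below assumes a deliberately simple model: the first token arrives after the TTFT and every later token arrives at a constant rate. The helper name `estimate_response_seconds` is hypothetical, and the inputs are the approximate figures quoted above for an A100-40GB GPU, so the results are ballpark values rather than measurements.

```python
def estimate_response_seconds(
    num_tokens: int, ttft_s: float, tokens_per_s: float
) -> float:
    """Estimate total generation time under a constant-rate assumption.

    The first token is assumed to arrive after `ttft_s` seconds and the
    remaining tokens at a steady `tokens_per_s`.
    """
    if tokens_per_s <= 0:
        raise ValueError("tokens_per_s must be positive")
    if num_tokens <= 0:
        return 0.0
    return ttft_s + (num_tokens - 1) / tokens_per_s


# Bracket a 100-token reply using the quoted 50-100 tokens/second range.
for rate in (50.0, 100.0):
    seconds = estimate_response_seconds(100, ttft_s=0.15, tokens_per_s=rate)
    print(f"{rate:.0f} tok/s: {seconds:.2f} s")
```

Under these assumptions a 100-token reply takes roughly 1.14 to 2.13 seconds, with the TTFT accounting for only a small share of the total.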

Core Capabilities

  • Multimodal processing of speech and text inputs
  • Support for 15 different languages
  • Speech-to-speech translation
  • Voice agent functionality
  • Spoken audio analysis

Frequently Asked Questions

Q: What makes this model unique?

The model stands out for accepting speech and text in the same prompt while reusing existing Mistral and Whisper backbones. Its knowledge-distillation training and its support for 15 languages further set it apart.
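To make the knowledge-distillation idea concrete, here is a minimal sketch of one common form of such a loss: the KL divergence between a teacher's and a student's next-token distributions. This page only says the loss matches the logits of the text-based Mistral backbone; the choice of KL divergence, the absence of a temperature, and the names `log_softmax` and `distillation_loss` are assumptions made for illustration.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-probabilities for one position."""
    peak = max(logits)
    log_total = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_total for x in logits]


def distillation_loss(
    student_logits: Sequence[float], teacher_logits: Sequence[float]
) -> float:
    """KL(teacher || student) over one next-token distribution (assumed form)."""
    teacher_logp = log_softmax(teacher_logits)
    student_logp = log_softmax(student_logits)
    return sum(
        math.exp(t) * (t - s) for t, s in zip(teacher_logp, student_logp)
    )


# Toy three-word vocabulary. The teacher stands in for the text-based
# backbone; the students stand in for the model reading speech input.
teacher = [2.0, 0.5, -1.0]
student_close = [1.8, 0.6, -0.9]
student_far = [-1.0, 0.5, 2.0]

loss_close = distillation_loss(student_close, teacher)
loss_far = distillation_loss(student_far, teacher)
print(loss_close < loss_far)  # the closer student is penalized less
```

The loss is zero when the student reproduces the teacher's distribution exactly and grows as the two diverge, which is the signal that would drive adapter training in a setup like the one described.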

Q: What are the recommended use cases?

The model is suited to voice agents, speech-to-speech translation, spoken audio analysis, and other applications that combine speech and text input. Its support for 15 languages makes it particularly useful for multilingual applications.
