ultravox-v0_4-llama-3_1-70b

Maintained By
fixie-ai

Ultravox v0.4 Llama 3.1 70B

Parameter Count: 50.3M (adapter)
License: MIT
Supported Languages: 9 (en, ar, de, es, fr, it, ja, pt, ru)
Architecture: Llama 3.1 70B + Whisper-medium
Training Hardware: 8x H100 GPUs

What is ultravox-v0_4-llama-3_1-70b?

Ultravox is a multimodal speech language model that pairs Llama 3.1-70B-Instruct with a Whisper-medium audio encoder to process both speech and text inputs. Because it consumes audio directly alongside text, it is well suited to voice-based applications and multilingual speech processing.

Implementation Details

The model uses a special <|audio|> pseudo-token in the text prompt: at inference time, this token is replaced with embeddings derived from the input audio. Training uses BF16 mixed precision, and the model reports a time-to-first-token of approximately 400 ms and a generation speed of 50-100 tokens per second on 4x H100 SXM GPUs.

  • Built on Llama 3.1-70B-Instruct backbone
  • Incorporates Whisper-medium encoder
  • Trained using knowledge-distillation loss
  • Supports 9 different languages
  • Achieves 30.30 BLEU score for en_de translation
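The <|audio|> pseudo-token mechanism described above can be sketched in a few lines. This is an illustrative toy, not the actual Ultravox implementation; the function name and the use of label strings in place of real embedding vectors are assumptions made for clarity.

```python
# Toy sketch of replacing an <|audio|> pseudo-token with audio-derived
# embeddings. Illustrative only; not the actual Ultravox code.

AUDIO_TOKEN = "<|audio|>"

def splice_audio_embeddings(prompt_tokens, token_embeddings, audio_embeddings):
    """Replace the single <|audio|> position in the token-embedding
    sequence with the (variable-length) audio embedding sequence."""
    idx = prompt_tokens.index(AUDIO_TOKEN)  # position of the pseudo-token
    return token_embeddings[:idx] + audio_embeddings + token_embeddings[idx + 1:]

# Labels stand in for embeddings; real embeddings are vectors.
tokens = ["<s>", "Transcribe:", "<|audio|>", "</s>"]
text_emb = ["e(<s>)", "e(Transcribe:)", "e(<|audio|>)", "e(</s>)"]
audio_emb = ["a0", "a1", "a2"]  # produced by the Whisper encoder + adapter

spliced = splice_audio_embeddings(tokens, text_emb, audio_emb)
# spliced -> ["e(<s>)", "e(Transcribe:)", "a0", "a1", "a2", "e(</s>)"]
```

Because the audio embedding sequence can be any length, the rest of the prompt is unchanged and the language model sees a single contiguous embedding sequence.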

Core Capabilities

  • Speech-to-text processing
  • Multilingual support
  • Voice agent functionality
  • Speech-to-speech translation
  • Spoken audio analysis
  • Low latency response generation
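A minimal sketch of how a speech-language model like this is typically driven through the Hugging Face `pipeline` API. The `audio`/`turns`/`sampling_rate` request dict and the system prompt below are assumptions, not taken from this card; check the model repository for the exact interface.

```python
# Hypothetical request format for a speech-language model served via the
# Hugging Face pipeline API. The dict layout is an assumption.
def build_ultravox_input(audio, sampling_rate, system_prompt):
    """Assemble a chat-style request that attaches one audio clip."""
    turns = [{"role": "system", "content": system_prompt}]
    return {"audio": audio, "turns": turns, "sampling_rate": sampling_rate}

RUN_MODEL = False  # enable only on hardware able to load 70B weights
if RUN_MODEL:
    import numpy as np
    import transformers

    pipe = transformers.pipeline(
        model="fixie-ai/ultravox-v0_4-llama-3_1-70b",
        trust_remote_code=True,  # the repo ships custom model code
    )
    audio = np.zeros(16000, dtype=np.float32)  # 1 s of silence at 16 kHz
    request = build_ultravox_input(audio, 16000, "You are a helpful assistant.")
    print(pipe(request, max_new_tokens=30))
```

Audio is usually resampled to 16 kHz mono before being passed to a Whisper-based encoder, which is why the sampling rate travels with the request.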

Frequently Asked Questions

Q: What makes this model unique?

The model's ability to process both speech and text inputs simultaneously, combined with its multilingual capabilities and high-performance metrics, makes it stand out. It's particularly notable for its efficient adapter-based architecture that keeps the base models frozen while training only the multimodal adapter.
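The frozen-backbone, trainable-adapter setup can be sketched as follows. The modules here are tiny stand-ins with made-up dimensions, not the real Llama or Whisper components, and the learning rate is illustrative.

```python
import torch
from torch import nn

# Toy sketch of adapter-only training: freeze the backbone and the audio
# encoder, train only a small projection adapter. Tiny stand-in modules,
# not the real Ultravox components.
backbone = nn.Linear(64, 64)  # stand-in for frozen Llama 3.1 70B
encoder = nn.Linear(32, 32)   # stand-in for frozen Whisper-medium
adapter = nn.Linear(32, 64)   # the only trainable component

for module in (backbone, encoder):
    for p in module.parameters():
        p.requires_grad_(False)  # gradients never flow into frozen parts

trainable = [p for p in adapter.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=2e-5)  # lr is illustrative
```

Keeping the base models frozen is what keeps the trainable footprint down to the 50.3M adapter parameters listed above, rather than the full 70B.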

Q: What are the recommended use cases?

The model is ideal for voice agents, speech-to-speech translation, audio analysis, and any application requiring multilingual speech understanding. It's particularly well-suited for interactive voice applications requiring quick response times.
