ultravox-v0_3

Maintained By
fixie-ai

Ultravox v0.3

PropertyValue
Parameter Count8.06B
Model TypeMultimodal Speech LLM
LicenseMIT
Tensor TypeBF16
Repositoryhttps://ultravox.ai

What is ultravox-v0_3?

Ultravox v0.3 is an advanced multimodal Speech Language Model that combines the power of Llama3.1-8B-Instruct and Whisper-small architectures. It's designed to process both speech and text inputs seamlessly, making it a versatile tool for voice-based applications and natural language processing tasks.

Implementation Details

The model utilizes a frozen Llama3.1-8B-Instruct backbone and Whisper-small encoder, with only the multi-modal adapter being trained. It processes input through a special <|audio|> pseudo-token that gets replaced with audio-derived embeddings. Training was conducted using BF16 mixed precision on 8x H100 GPUs, achieving impressive performance metrics including a 200ms time-to-first-token and 50-100 tokens per second on an A100-40GB GPU.

  • Built on Llama3.1-8B-Instruct and Whisper-small backbone
  • Knowledge-distillation training approach
  • Multimodal processing capabilities
  • High-performance metrics (BLEU scores: 22.68 for en_de, 24.10 for es_en)

Core Capabilities

  • Speech and text input processing
  • Voice agent functionality
  • Speech-to-speech translation
  • Spoken audio analysis
  • Low latency response generation

Frequently Asked Questions

Q: What makes this model unique?

The model's ability to process both speech and text inputs through a unified architecture, combined with its impressive performance metrics and relatively small footprint for its capabilities, makes it stand out in the field of multimodal AI models.

Q: What are the recommended use cases?

Ultravox v0.3 is ideal for voice agent applications, speech-to-speech translation, audio analysis, and any scenario requiring both speech and text processing capabilities. It's particularly effective for interactive voice applications requiring quick response times.

🍰 Interesting in building your own agents?
PromptLayer provides Huggingface integration tools to manage and monitor prompts with your whole team. Get started here.