gemma-3-4b-it-speech

Maintained By
junnei


  • Base Model: google/gemma-3-4b-it
  • Parameters: 4B + 596M Speech LoRA adapter
  • License: Gemma
  • Author: junnei
  • Context Length: 128K tokens

What is gemma-3-4b-it-speech?

Gemma-3-4b-it-speech is a multimodal extension of the Gemma-3 model family that handles text, vision, and speech inputs. It adds a Speech Adapter to the original Gemma architecture, enabling speech recognition and speech translation while retaining the base model's language and vision capabilities.

Implementation Details

The model builds upon the google/gemma-3-4b-it base model by adding a 596M-parameter Speech LoRA adapter. Training covered ASR (automatic speech recognition) and AST (automatic speech translation) tasks on a single A100 GPU for one epoch (about 12 hours), using English and Korean audio clips under 30 seconds from the CoVoST2 dataset.
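LoRA keeps the base model frozen and trains only a pair of low-rank matrices per adapted layer, which is why the speech adapter is so much smaller than the 4B base model. A minimal sketch of the parameter accounting (the dimensions and rank below are hypothetical, not the actual shapes used by this adapter):

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """A LoRA adapter replaces a full d_in x d_out weight update with two
    low-rank factors: A (d_in x rank) and B (rank x d_out)."""
    return d_in * rank + rank * d_out

# Hypothetical example: adapting a 2560 x 2560 projection with rank 16.
full_update = 2560 * 2560                   # trainable params for a full update
lora_update = lora_params(2560, 2560, 16)   # trainable params with LoRA
print(full_update, lora_update)             # LoRA is ~80x smaller here
```

Summed over all adapted layers (attention projections, audio-encoder bridge, etc.), these factors make up the adapter's total parameter count.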

  • Architecture: Multimodal Language Model with Speech Processing capabilities
  • Training Data: CoVoST2 dataset (English and Korean)
  • Performance Metrics: ASR (English) - BLEU: 85.95, CER: 4.47, WER: 8.49
  • AST Performance: English-Korean translation BLEU score of 29.83
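For context on the CER and WER figures above: both are edit-distance-based error rates, computed at the character and word level respectively. A small self-contained sketch (my own reference implementation, not the evaluation code used for this model):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insertions,
    deletions, and substitutions each cost 1)."""
    m, n = len(ref), len(hyp)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[m][n]

def wer(ref: str, hyp: str) -> float:
    """Word Error Rate: word-level edit distance over reference length."""
    return edit_distance(ref.split(), hyp.split()) / len(ref.split())

def cer(ref: str, hyp: str) -> float:
    """Character Error Rate: character-level edit distance over length."""
    return edit_distance(list(ref), list(hyp)) / len(ref)

print(wer("the cat sat", "the cat sit"))  # one wrong word out of three
```

A WER of 8.49 therefore means roughly 8.5 word-level edits per 100 reference words; CER is the analogous character-level figure.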

Core Capabilities

  • Automatic Speech Recognition (ASR) with high accuracy
  • Automatic Speech Translation (AST)
  • Vision-Language Processing
  • Multilingual Support (English and Korean)
  • 128K token context window

Frequently Asked Questions

Q: What makes this model unique?

The model uniquely combines Gemma's language and vision capabilities with speech processing, offering a comprehensive multimodal solution. It's one of the few open models that can handle text, vision, and speech inputs within a single architecture.

Q: What are the recommended use cases?

The model is best suited for experimental and research purposes, particularly for tasks involving speech recognition and translation of short audio clips (under 30 seconds). It's specifically optimized for English ASR and English-to-Korean translation tasks.
