gemma-3-4b-it-speech

Maintained By
junnei


  • Base Model: google/gemma-3-4b-it
  • Parameters: 4B + 596M Speech LoRA adapter
  • License: Gemma
  • Author: junnei
  • Context Length: 128K tokens

What is gemma-3-4b-it-speech?

Gemma-3-4b-it-speech is a multimodal extension of the Gemma-3 model family that handles text, vision, and speech inputs. It adds a Speech Adapter to the original Gemma architecture, enabling speech recognition and speech translation while retaining the base model's language and vision capabilities.

Implementation Details

The model builds upon the google/gemma-3-4b-it base model by adding a 596M-parameter Speech LoRA adapter. Training covered ASR (automatic speech recognition) and AST (automatic speech translation) tasks on a single A100 GPU for one epoch (about 12 hours), using English and Korean audio clips under 30 seconds from the CoVoST2 dataset.
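LoRA keeps the base model frozen and trains only a pair of low-rank matrices per adapted layer, which is why the speech adapter is so much smaller than the 4B base model. A minimal sketch of the parameter accounting (the dimensions and rank below are hypothetical, not the actual shapes used by this adapter):

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """A LoRA adapter replaces a full d_in x d_out weight update with two
    low-rank factors: A (d_in x rank) and B (rank x d_out)."""
    return d_in * rank + rank * d_out

# Hypothetical example: adapting a 2560 x 2560 projection with rank 16.
full_update = 2560 * 2560                   # trainable params for a full update
lora_update = lora_params(2560, 2560, 16)   # trainable params with LoRA
print(full_update, lora_update)             # LoRA is ~80x smaller here
```

Summed over all adapted layers (attention projections, audio-encoder bridge, etc.), these factors make up the adapter's total parameter count.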

  • Architecture: Multimodal Language Model with Speech Processing capabilities
  • Training Data: CoVoST2 dataset (English and Korean)
  • Performance Metrics: ASR (English) - BLEU: 85.95, CER: 4.47, WER: 8.49
  • AST Performance: English-Korean translation BLEU score of 29.83
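For context on the CER and WER figures above: both are edit-distance-based error rates, computed at the character and word level respectively. A small self-contained sketch (my own reference implementation, not the evaluation code used for this model):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insertions,
    deletions, and substitutions each cost 1)."""
    m, n = len(ref), len(hyp)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[m][n]

def wer(ref: str, hyp: str) -> float:
    """Word Error Rate: word-level edit distance over reference length."""
    return edit_distance(ref.split(), hyp.split()) / len(ref.split())

def cer(ref: str, hyp: str) -> float:
    """Character Error Rate: character-level edit distance over length."""
    return edit_distance(list(ref), list(hyp)) / len(ref)

print(wer("the cat sat", "the cat sit"))  # one wrong word out of three
```

A WER of 8.49 therefore means roughly 8.5 word-level edits per 100 reference words; CER is the analogous character-level figure.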

Core Capabilities

  • Automatic Speech Recognition (ASR) with high accuracy
  • Automatic Speech Translation (AST)
  • Vision-Language Processing
  • Multilingual Support (English and Korean)
  • 128K token context window

Frequently Asked Questions

Q: What makes this model unique?

The model uniquely combines Gemma's language and vision capabilities with speech processing, offering a comprehensive multimodal solution. It's one of the few open models that can handle text, vision, and speech inputs within a single architecture.

Q: What are the recommended use cases?

The model is best suited for experimental and research purposes, particularly for tasks involving speech recognition and translation of short audio clips (under 30 seconds). It's specifically optimized for English ASR and English-to-Korean translation tasks.
