Granite Speech 3.2-8B

Property	Value
Developer	IBM
Release Date	April 2nd, 2025
License	Apache 2.0
Primary Tasks	ASR and AST
Model Size	8B parameters
Training Infrastructure	32 NVIDIA H100 GPUs

What is granite-speech-3.2-8b?

Granite-speech-3.2-8b is IBM's speech language model for automatic speech recognition (ASR) and automatic speech translation (AST). It is built on granite-3.2-8b-instruct and adapted to speech input through modality alignment training on diverse public corpora.

Implementation Details

The model has three main components: a speech encoder with 10 conformer blocks, a speech-text modality adapter, and the base granite-3.2-8b-instruct language model. The speech encoder is trained with connectionist temporal classification (CTC) and uses a block-attention mechanism, while the modality adapter uses a 2-layer window query transformer to downsample the encoder output in time.

Speech encoder with 1024 hidden dimensions and 8 attention heads
Temporal downsampling by a factor of 10 for efficient processing
LoRA adapters with rank 64 on the query and value projections
128k-token context length
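
The figures above can be turned into some quick arithmetic. The following is a minimal sketch, not IBM's implementation: the class name SpeechModelSpec and its methods are hypothetical, the input rate of 100 feature frames per second is an assumption (a common 10 ms hop, not stated in this section), the even split of the hidden dimension across heads is the usual multi-head attention convention, and the 4096 x 4096 projection shape is an illustrative placeholder rather than a confirmed dimension of the base model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpeechModelSpec:
    """Figures quoted in this section; the class itself is hypothetical."""

    conformer_blocks: int = 10
    encoder_hidden_dim: int = 1024
    encoder_attention_heads: int = 8
    adapter_layers: int = 2
    temporal_downsampling: int = 10
    lora_rank: int = 64

    def encoder_head_dim(self) -> int:
        # Standard multi-head attention: each head gets an equal slice
        # of the hidden dimension.
        return self.encoder_hidden_dim // self.encoder_attention_heads

    def embeddings_for_audio(
        self, seconds: float, input_frames_per_second: int = 100
    ) -> int:
        # ASSUMPTION: 100 input feature frames per second (10 ms hop).
        # The section only states the 10x downsampling factor.
        frames = int(seconds * input_frames_per_second)
        return frames // self.temporal_downsampling

    def lora_params_per_projection(self, d_in: int, d_out: int) -> int:
        # LoRA adds two low-rank matrices per adapted projection:
        # A with shape (rank, d_in) and B with shape (d_out, rank).
        return self.lora_rank * (d_in + d_out)


spec = SpeechModelSpec()
print(spec.encoder_head_dim())
print(spec.embeddings_for_audio(30.0))
# 4096 x 4096 is a placeholder shape, not a confirmed model dimension.
print(spec.lora_params_per_projection(4096, 4096))
```

Under these assumptions, 30 seconds of audio reaches the language model as 300 acoustic embeddings instead of 3,000 feature frames, which is why the downsampling step matters for keeping long recordings within the context window.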

Core Capabilities

English speech recognition with state-of-the-art accuracy
Speech translation to French, Spanish, Italian, German, Portuguese, Japanese, and Mandarin
Trained on over 60,000 hours of diverse speech data
Optimized for enterprise applications

Frequently Asked Questions

Q: What makes this model unique?

The model combines speech and language processing in a relatively compact 8B-parameter model. It keeps processing efficient by downsampling the speech encoder output in time by a factor of 10, and it connects speech to the language model through a modality adapter and rank-64 LoRA adapters rather than a separate, larger system.

Q: What are the recommended use cases?

The model is designed for enterprise applications that require speech processing, particularly transcribing English speech to text and translating speech into the languages listed under Core Capabilities. It is not recommended for text-only tasks, where the standard Granite language models are more appropriate.