Whisper Medium
| Property | Value |
|---|---|
| Parameter Count | 769M |
| Model Type | Encoder-Decoder Transformer |
| License | Apache 2.0 |
| Paper | Robust Speech Recognition via Large-Scale Weak Supervision |
| Languages Supported | 99 |
What is whisper-medium?
Whisper-medium is a state-of-the-art automatic speech recognition (ASR) model developed by OpenAI. It is the medium-sized member of the Whisper model family, with 769M parameters, trained on 680,000 hours of multilingual audio data. The model demonstrates robust capabilities in both transcription and translation across 99 languages.
Implementation Details
The model employs a Transformer-based encoder-decoder architecture, specifically designed for sequence-to-sequence tasks. It can process audio inputs of up to 30 seconds natively, with support for longer recordings through a chunking mechanism. The model operates on log-Mel spectrograms as input and generates text outputs with optional timestamp predictions.
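The log-Mel spectrogram front end mentioned above can be sketched in plain NumPy. This is an illustrative approximation using Whisper's published input settings (16 kHz audio, 25 ms windows, 10 ms hop, 80 mel bins), not OpenAI's exact implementation; the function name `log_mel_spectrogram` is this sketch's own.

```python
import numpy as np

def log_mel_spectrogram(audio, sr=16000, n_fft=400, hop=160, n_mels=80):
    """Minimal log-Mel front end in the spirit of Whisper's input pipeline.

    A sketch: 25 ms Hann windows, 10 ms hop, 80 triangular mel filters.
    """
    # Short-time Fourier transform via framed FFT
    window = np.hanning(n_fft)
    frames = [audio[i:i + n_fft] * window
              for i in range(0, len(audio) - n_fft + 1, hop)]
    power = np.abs(np.fft.rfft(np.stack(frames), axis=1)) ** 2

    # Triangular mel filterbank spanning 0 Hz .. Nyquist
    def hz_to_mel(f):
        return 2595.0 * np.log10(1.0 + f / 700.0)

    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    hz_pts = 700.0 * (10.0 ** (mel_pts / 2595.0) - 1.0)
    bins = np.floor((n_fft + 1) * hz_pts / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        if center > left:
            fbank[m - 1, left:center] = (np.arange(left, center) - left) / (center - left)
        if right > center:
            fbank[m - 1, center:right] = (right - np.arange(center, right)) / (right - center)

    mel = fbank @ power.T  # (n_mels, n_frames)
    return np.log10(np.maximum(mel, 1e-10))

# One second of a 440 Hz tone yields an (80, 98) log-mel matrix
t = np.arange(16000) / 16000
feats = log_mel_spectrogram(np.sin(2 * np.pi * 440 * t))
```

A 30-second input at these settings produces roughly 3,000 frames, which the encoder consumes as a fixed-length sequence.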
- Trained on 438,000 hours of English data and 243,000 hours of multilingual content
- Supports both transcription and translation tasks
- Achieves 2.9% WER on LibriSpeech test-clean
- Implements efficient chunking for long-form audio processing
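The chunking mechanism for long-form audio can be sketched as overlapping fixed-length windows; downstream decoding then stitches the per-chunk transcripts together. The window and stride values below are illustrative, not the exact defaults of any particular library.

```python
import numpy as np

def chunk_audio(audio, sr=16000, chunk_s=30, stride_s=5):
    """Split audio into overlapping <=30 s windows for long-form inference.

    A sketch of the chunk-with-stride idea: adjacent chunks overlap by
    2 * stride samples so boundary words appear in two chunks and can be
    reconciled when transcripts are merged.
    """
    chunk, stride = chunk_s * sr, stride_s * sr
    step = chunk - 2 * stride
    chunks, start = [], 0
    while start < len(audio):
        chunks.append(audio[start:start + chunk])
        if start + chunk >= len(audio):
            break  # this chunk already covers the tail of the recording
        start += step
    return chunks

# 75 s of silence at 16 kHz -> four chunks of at most 30 s each
audio = np.zeros(75 * 16000)
parts = chunk_audio(audio)
```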
Core Capabilities
- Multilingual ASR with support for 99 languages
- Speech-to-text transcription in the source language
- Speech translation to English
- Timestamp prediction for aligning transcript segments with the audio
- Batch processing support for efficient inference
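The capabilities above can be exercised through the Hugging Face `transformers` ASR pipeline. The following is a sketch assuming that API and the published `openai/whisper-medium` checkpoint ID; the `transcribe` helper and its parameters are this example's own, and running it downloads the full model weights.

```python
def transcribe(audio_path, translate=False):
    """Transcribe an audio file, or translate its speech to English.

    Sketch built on the `transformers` automatic-speech-recognition
    pipeline; `chunk_length_s=30` enables the chunked long-form path and
    `return_timestamps=True` adds (start, end) segment timestamps.
    """
    from transformers import pipeline  # deferred so defining this stays cheap

    asr = pipeline(
        "automatic-speech-recognition",
        model="openai/whisper-medium",
        chunk_length_s=30,
    )
    generate_kwargs = {"task": "translate"} if translate else {}
    return asr(audio_path,
               return_timestamps=True,
               generate_kwargs=generate_kwargs)
```

Called as `transcribe("speech.wav")` it transcribes in the source language; `transcribe("speech.wav", translate=True)` instead emits an English translation.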
Frequently Asked Questions
Q: What makes this model unique?
Whisper-medium stands out for its robust performance across different accents, background noise conditions, and technical language without requiring fine-tuning. It offers a strong balance between model size and performance, making it suitable for production deployments.
Q: What are the recommended use cases?
The model is ideal for general-purpose speech recognition, content transcription, subtitle generation, and cross-lingual translation. It's particularly effective for English ASR tasks, achieving a 2.9% WER on LibriSpeech test-clean, and shows strong performance in multilingual scenarios.