Whisper Medium
| Property | Value |
|---|---|
| Parameter Count | 769M |
| Model Type | Encoder-Decoder Transformer |
| License | Apache 2.0 |
| Paper | Robust Speech Recognition via Large-Scale Weak Supervision |
| Languages Supported | 99 |
What is whisper-medium?
Whisper-medium is a state-of-the-art automatic speech recognition (ASR) model developed by OpenAI. It is the medium-sized member of the Whisper model family, with 769M parameters, trained on 680,000 hours of multilingual audio data. The model demonstrates robust capabilities in both transcription and translation across 99 languages.
Implementation Details
The model employs a Transformer-based encoder-decoder architecture, specifically designed for sequence-to-sequence tasks. It can process audio inputs of up to 30 seconds natively, with support for longer recordings through a chunking mechanism. The model operates on log-Mel spectrograms as input and generates text outputs with optional timestamp predictions.
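The log-Mel spectrogram front end mentioned above can be sketched in plain NumPy. This is an illustrative approximation using Whisper's published input settings (16 kHz audio, 25 ms windows, 10 ms hop, 80 mel bins), not OpenAI's exact implementation; the function name `log_mel_spectrogram` is this sketch's own.

```python
import numpy as np

def log_mel_spectrogram(audio, sr=16000, n_fft=400, hop=160, n_mels=80):
    """Minimal log-Mel front end in the spirit of Whisper's input pipeline.

    A sketch: 25 ms Hann windows, 10 ms hop, 80 triangular mel filters.
    """
    # Short-time Fourier transform via framed FFT
    window = np.hanning(n_fft)
    frames = [audio[i:i + n_fft] * window
              for i in range(0, len(audio) - n_fft + 1, hop)]
    power = np.abs(np.fft.rfft(np.stack(frames), axis=1)) ** 2

    # Triangular mel filterbank spanning 0 Hz .. Nyquist
    def hz_to_mel(f):
        return 2595.0 * np.log10(1.0 + f / 700.0)

    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    hz_pts = 700.0 * (10.0 ** (mel_pts / 2595.0) - 1.0)
    bins = np.floor((n_fft + 1) * hz_pts / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        if center > left:
            fbank[m - 1, left:center] = (np.arange(left, center) - left) / (center - left)
        if right > center:
            fbank[m - 1, center:right] = (right - np.arange(center, right)) / (right - center)

    mel = fbank @ power.T  # (n_mels, n_frames)
    return np.log10(np.maximum(mel, 1e-10))

# One second of a 440 Hz tone yields an (80, 98) log-mel matrix
t = np.arange(16000) / 16000
feats = log_mel_spectrogram(np.sin(2 * np.pi * 440 * t))
```

A 30-second input at these settings produces roughly 3,000 frames, which the encoder consumes as a fixed-length sequence.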
- Trained on 438,000 hours of English data and 243,000 hours of multilingual content
- Supports both transcription and translation tasks
- Achieves 2.9% WER on LibriSpeech test-clean
- Implements efficient chunking for long-form audio processing
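The chunking mechanism for long-form audio can be sketched as overlapping fixed-length windows; downstream decoding then stitches the per-chunk transcripts together. The window and stride values below are illustrative, not the exact defaults of any particular library.

```python
import numpy as np

def chunk_audio(audio, sr=16000, chunk_s=30, stride_s=5):
    """Split audio into overlapping <=30 s windows for long-form inference.

    A sketch of the chunk-with-stride idea: adjacent chunks overlap by
    2 * stride samples so boundary words appear in two chunks and can be
    reconciled when transcripts are merged.
    """
    chunk, stride = chunk_s * sr, stride_s * sr
    step = chunk - 2 * stride
    chunks, start = [], 0
    while start < len(audio):
        chunks.append(audio[start:start + chunk])
        if start + chunk >= len(audio):
            break  # this chunk already covers the tail of the recording
        start += step
    return chunks

# 75 s of silence at 16 kHz -> four chunks of at most 30 s each
audio = np.zeros(75 * 16000)
parts = chunk_audio(audio)
```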
Core Capabilities
- Multilingual ASR with support for 99 languages
- Speech-to-text transcription in the source language
- Speech translation to English
- Timestamp prediction for aligning transcript segments with the audio
- Batch processing support for efficient inference
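The capabilities above can be exercised through the Hugging Face `transformers` ASR pipeline. The following is a sketch assuming that API and the published `openai/whisper-medium` checkpoint ID; the `transcribe` helper and its parameters are this example's own, and running it downloads the full model weights.

```python
def transcribe(audio_path, translate=False):
    """Transcribe an audio file, or translate its speech to English.

    Sketch built on the `transformers` automatic-speech-recognition
    pipeline; `chunk_length_s=30` enables the chunked long-form path and
    `return_timestamps=True` adds (start, end) segment timestamps.
    """
    from transformers import pipeline  # deferred so defining this stays cheap

    asr = pipeline(
        "automatic-speech-recognition",
        model="openai/whisper-medium",
        chunk_length_s=30,
    )
    generate_kwargs = {"task": "translate"} if translate else {}
    return asr(audio_path,
               return_timestamps=True,
               generate_kwargs=generate_kwargs)
```

Called as `transcribe("speech.wav")` it transcribes in the source language; `transcribe("speech.wav", translate=True)` instead emits an English translation.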
Frequently Asked Questions
Q: What makes this model unique?
Whisper-medium stands out for its robust performance across different accents, background noise conditions, and technical language without requiring fine-tuning. It offers a strong balance between model size and performance, making it suitable for production deployments.
Q: What are the recommended use cases?
The model is ideal for general-purpose speech recognition, content transcription, subtitle generation, and cross-lingual translation. It's particularly effective for English ASR tasks, achieving a 2.9% WER on LibriSpeech test-clean, and shows strong performance in multilingual scenarios.