Distil-Whisper Large V3
| Property | Value |
|---|---|
| Parameter Count | 756M |
| Model Type | Speech Recognition |
| License | MIT |
| Paper | Distil-Whisper Paper |
| Relative Speed | 6.3x faster than Whisper large-v3 |
What is distil-large-v3?
Distil-large-v3 is a knowledge-distilled version of OpenAI's Whisper large-v3 model, designed specifically for English speech recognition. It stays within 1% WER of the original while running about 6.3x faster, making it well suited to production environments. The model was trained on 22,000 hours of diverse audio drawn from nine open-source datasets, giving it robustness across different domains and speaking styles.
Implementation Details
The model uses an encoder-decoder architecture: the encoder is retained in full from the original Whisper large-v3, while the decoder is reduced to two layers for faster generation. It supports both sequential and chunked long-form transcription and, like Whisper, operates on 30-second audio context windows.
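For illustration, here is a minimal sketch of chunked long-form transcription with the Hugging Face Transformers pipeline; the audio file names, chunk length, and batch size are placeholder assumptions rather than values prescribed by this card.

```python
import torch
from transformers import pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

pipe = pipeline(
    "automatic-speech-recognition",
    model="distil-whisper/distil-large-v3",
    torch_dtype=torch_dtype,
    device=device,
)

# Short-form audio (<= 30 s) is transcribed in a single forward pass.
print(pipe("short_clip.wav")["text"])

# Long-form audio: split into overlapping windows and batch them (chunked mode).
print(pipe("meeting.wav", chunk_length_s=25, batch_size=8)["text"])
```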
- Optimized for both short-form and long-form audio transcription
- Supports multiple inference backends, including Flash Attention 2 and PyTorch SDPA (see the loading sketch after this list)
- Compatible with popular frameworks like Whisper.cpp, Faster-Whisper, and Transformers.js
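As a hedged sketch of the backend choice mentioned above, the attention implementation can be selected when loading the model with Transformers; this assumes PyTorch 2.1+ for SDPA and the optional flash-attn package plus a supported GPU for Flash Attention 2.

```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor

model_id = "distil-whisper/distil-large-v3"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    attn_implementation="sdpa",  # or "flash_attention_2" if flash-attn is installed
)
processor = AutoProcessor.from_pretrained(model_id)
```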
Core Capabilities
- Achieves WER within 1% of Whisper large-v3
- 6.3x faster inference speed compared to the original model
- Robust performance across different audio domains
- Supports speculative decoding (as a draft model for Whisper large-v3) for roughly a 2x speed improvement; see the sketch after this list
- Optimized for both CPU and GPU inference
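Below is a sketch of that speculative-decoding setup, with distil-large-v3 acting as the draft (assistant) model for Whisper large-v3. It assumes a recent Transformers release that supports `assistant_model` during generation; the audio file name is a placeholder.

```python
import torch
from transformers import AutoModelForCausalLM, AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Draft model: distil-large-v3 shares its encoder with large-v3, so only its
# small two-layer decoder is needed to propose candidate tokens.
assistant_model = AutoModelForCausalLM.from_pretrained(
    "distil-whisper/distil-large-v3", torch_dtype=torch_dtype, low_cpu_mem_usage=True
).to(device)

# Main model: the full Whisper large-v3 checkpoint validates the proposals.
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    "openai/whisper-large-v3", torch_dtype=torch_dtype, low_cpu_mem_usage=True
).to(device)
processor = AutoProcessor.from_pretrained("openai/whisper-large-v3")

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    generate_kwargs={"assistant_model": assistant_model},
    torch_dtype=torch_dtype,
    device=device,
)

print(pipe("sample.wav")["text"])
```

Because the main model verifies every token the draft model proposes, this setup produces the same greedy-decoded transcriptions as running large-v3 on its own, only faster.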
Frequently Asked Questions
Q: What makes this model unique?
The model's key innovation is its ability to maintain Whisper's accuracy while significantly reducing computational requirements through targeted knowledge distillation and architecture optimization.
Q: What are the recommended use cases?
It's well suited to production environments that need fast, accurate English speech recognition, covering both short-form and long-form audio transcription, and particularly to applications where computational efficiency is crucial.