Wav2Vec2-Large-960h-Lv60 + Self-Training
| Property | Value |
|---|---|
| License | Apache 2.0 |
| Paper | View Paper |
| Test WER (Clean) | 1.9% |
| Test WER (Other) | 3.9% |
What is wav2vec2-large-960h-lv60-self?
Wav2Vec2-Large-960h-Lv60-self is Facebook's speech recognition model that combines wav2vec 2.0 pretraining with a self-training objective to achieve state-of-the-art automatic speech recognition (ASR) performance. The model is pretrained and fine-tuned on 960 hours of Libri-Light and LibriSpeech data and expects 16 kHz sampled speech audio.
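As a quick illustration of typical usage, here is a minimal transcription sketch using the Hugging Face transformers pipeline. The file name `sample.wav` is a hypothetical placeholder for a 16 kHz mono English recording, and decoding local audio files requires ffmpeg to be available.

```python
from transformers import pipeline

# Load the model into a ready-made automatic speech recognition pipeline.
asr = pipeline(
    "automatic-speech-recognition",
    model="facebook/wav2vec2-large-960h-lv60-self",
)

# "sample.wav" is a hypothetical placeholder; the pipeline decodes the file
# and returns a dict such as {"text": "HELLO WORLD"} (output is upper-case).
result = asr("sample.wav")
print(result["text"])
```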
Implementation Details
During pretraining, the model masks the speech input in the latent space and solves a contrastive task defined over quantized latent representations. This approach lets it learn effectively from limited labeled data while maintaining high accuracy.
- Pretrained and fine-tuned on 960 hours of speech data
- Operates on 16 kHz sampled audio input
- Trained with a self-training (pseudo-labeling) objective for improved performance
- Utilizes CTC loss for sequence prediction; see the decoding sketch after this list
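The following is a minimal sketch of the lower-level API, assuming a 16 kHz waveform has already been loaded into a 1-D float array named `speech` (for example with soundfile or torchaudio); that variable name is an assumption for illustration, not part of the library.

```python
import torch
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-large-960h-lv60-self")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-large-960h-lv60-self")

# Normalize/pad the raw waveform and tag it with the expected 16 kHz rate.
# `speech` is an assumed, pre-loaded 1-D array of samples.
inputs = processor(speech, sampling_rate=16_000, return_tensors="pt", padding=True)

# The forward pass yields per-frame logits over the character vocabulary.
with torch.no_grad():
    logits = model(inputs.input_values).logits

# Greedy CTC decoding: pick the most likely token per frame; batch_decode
# collapses repeats and removes blank tokens.
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)
print(transcription[0])
```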
Core Capabilities
- Achieves 1.9% WER on the LibriSpeech test-clean set
- Achieves 3.9% WER on the LibriSpeech test-other set (an evaluation sketch follows this list)
- Efficient performance with limited labeled data
- Direct audio-to-text transcription
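A sketch of how the WER figures above could be reproduced on LibriSpeech test-clean is shown below, assuming the `datasets` and `jiwer` packages are installed. The dataset download is large and inference over the full split is slow on CPU, so treat this as illustrative rather than something to run casually.

```python
import torch
from datasets import load_dataset
from jiwer import wer
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-large-960h-lv60-self")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-large-960h-lv60-self").eval()

# LibriSpeech test-clean; audio is already 16 kHz, matching the model's input.
dataset = load_dataset("librispeech_asr", "clean", split="test")

def transcribe(batch):
    inputs = processor(batch["audio"]["array"], sampling_rate=16_000, return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs.input_values).logits
    batch["prediction"] = processor.batch_decode(torch.argmax(logits, dim=-1))[0]
    return batch

dataset = dataset.map(transcribe)

# LibriSpeech references are upper-case, as is the model output, so no extra
# text normalization is needed before scoring.
print("WER:", wer(dataset["text"], dataset["prediction"]))
```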
Frequently Asked Questions
Q: What makes this model unique?
This model's strength lies in its self-training approach, which lets it reach strong results with significantly less labeled data than traditional ASR systems. The underlying wav2vec 2.0 method maintains high performance even when fine-tuned on as little as one hour of labeled data.
Q: What are the recommended use cases?
The model is ideal for English speech recognition tasks, particularly in scenarios requiring high-accuracy transcription of clean speech, and the underlying approach is especially valuable when labeled data is scarce.
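Because the model expects 16 kHz input, recordings at other sample rates should be resampled before transcription. Below is a hedged sketch using torchaudio; `meeting.wav` is a hypothetical file name used only for illustration.

```python
import torch
import torchaudio
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-large-960h-lv60-self")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-large-960h-lv60-self")

# "meeting.wav" is a hypothetical file; torchaudio returns (channels, samples).
waveform, sample_rate = torchaudio.load("meeting.wav")
if sample_rate != 16_000:
    # Resample to the 16 kHz rate the model was trained on.
    waveform = torchaudio.functional.resample(waveform, sample_rate, 16_000)

# Convert to mono, then run the same greedy CTC decoding as before.
speech = waveform.mean(dim=0).numpy()
inputs = processor(speech, sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    logits = model(inputs.input_values).logits
print(processor.batch_decode(torch.argmax(logits, dim=-1))[0])
```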