Wav2Vec2-Large-960h-Lv60 + Self-Training
| Property | Value |
|---|---|
| License | Apache 2.0 |
| Paper | View Paper |
| Test WER (Clean) | 1.9% |
| Test WER (Other) | 3.9% |
What is wav2vec2-large-960h-lv60-self?
Wav2Vec2-Large-960h-Lv60-self is Facebook's speech recognition model that combines wav2vec 2.0 pretraining with a self-training objective to achieve state-of-the-art automatic speech recognition (ASR) performance. The model is pretrained and fine-tuned on 960 hours of Libri-Light and LibriSpeech data and expects 16 kHz sampled speech audio.
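As a quick illustration of typical usage, here is a minimal transcription sketch using the Hugging Face transformers pipeline. The file name `sample.wav` is a hypothetical placeholder for a 16 kHz mono English recording, and decoding local audio files requires ffmpeg to be available.

```python
from transformers import pipeline

# Load the model into a ready-made automatic speech recognition pipeline.
asr = pipeline(
    "automatic-speech-recognition",
    model="facebook/wav2vec2-large-960h-lv60-self",
)

# "sample.wav" is a hypothetical placeholder; the pipeline decodes the file
# and returns a dict such as {"text": "HELLO WORLD"} (output is upper-case).
result = asr("sample.wav")
print(result["text"])
```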
Implementation Details
During pretraining, the model masks the speech input in the latent space and solves a contrastive task defined over quantized latent representations. This approach lets it learn effectively from limited labeled data while maintaining high accuracy.
- Pretrained and fine-tuned on 960 hours of speech data
- Operates on 16 kHz sampled audio input
- Trained with a self-training (pseudo-labeling) objective for improved performance
- Utilizes CTC loss for sequence prediction; see the decoding sketch after this list
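The following is a minimal sketch of the lower-level API, assuming a 16 kHz waveform has already been loaded into a 1-D float array named `speech` (for example with soundfile or torchaudio); that variable name is an assumption for illustration, not part of the library.

```python
import torch
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-large-960h-lv60-self")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-large-960h-lv60-self")

# Normalize/pad the raw waveform and tag it with the expected 16 kHz rate.
# `speech` is an assumed, pre-loaded 1-D array of samples.
inputs = processor(speech, sampling_rate=16_000, return_tensors="pt", padding=True)

# The forward pass yields per-frame logits over the character vocabulary.
with torch.no_grad():
    logits = model(inputs.input_values).logits

# Greedy CTC decoding: pick the most likely token per frame; batch_decode
# collapses repeats and removes blank tokens.
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)
print(transcription[0])
```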
Core Capabilities
- Achieves 1.9% WER on the LibriSpeech test-clean set
- Achieves 3.9% WER on the LibriSpeech test-other set (an evaluation sketch follows this list)
- Efficient performance with limited labeled data
- Direct audio-to-text transcription
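A sketch of how the WER figures above could be reproduced on LibriSpeech test-clean is shown below, assuming the `datasets` and `jiwer` packages are installed. The dataset download is large and inference over the full split is slow on CPU, so treat this as illustrative rather than something to run casually.

```python
import torch
from datasets import load_dataset
from jiwer import wer
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-large-960h-lv60-self")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-large-960h-lv60-self").eval()

# LibriSpeech test-clean; audio is already 16 kHz, matching the model's input.
dataset = load_dataset("librispeech_asr", "clean", split="test")

def transcribe(batch):
    inputs = processor(batch["audio"]["array"], sampling_rate=16_000, return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs.input_values).logits
    batch["prediction"] = processor.batch_decode(torch.argmax(logits, dim=-1))[0]
    return batch

dataset = dataset.map(transcribe)

# LibriSpeech references are upper-case, as is the model output, so no extra
# text normalization is needed before scoring.
print("WER:", wer(dataset["text"], dataset["prediction"]))
```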
Frequently Asked Questions
Q: What makes this model unique?
This model's strength lies in its self-training approach, which lets it reach strong results with significantly less labeled data than traditional ASR systems. The underlying wav2vec 2.0 method maintains high performance even when fine-tuned on as little as one hour of labeled data.
Q: What are the recommended use cases?
The model is ideal for English speech recognition tasks, particularly in scenarios requiring high-accuracy transcription of clean speech, and the underlying approach is especially valuable when labeled data is scarce.
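Because the model expects 16 kHz input, recordings at other sample rates should be resampled before transcription. Below is a hedged sketch using torchaudio; `meeting.wav` is a hypothetical file name used only for illustration.

```python
import torch
import torchaudio
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-large-960h-lv60-self")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-large-960h-lv60-self")

# "meeting.wav" is a hypothetical file; torchaudio returns (channels, samples).
waveform, sample_rate = torchaudio.load("meeting.wav")
if sample_rate != 16_000:
    # Resample to the 16 kHz rate the model was trained on.
    waveform = torchaudio.functional.resample(waveform, sample_rate, 16_000)

# Convert to mono, then run the same greedy CTC decoding as before.
speech = waveform.mean(dim=0).numpy()
inputs = processor(speech, sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    logits = model(inputs.input_values).logits
print(processor.batch_decode(torch.argmax(logits, dim=-1))[0])
```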