HuBERT Large LS960-FT
| Property | Value |
|---|---|
| License | Apache 2.0 |
| Paper | Research Paper |
| Developer | Facebook |
| Task | Automatic Speech Recognition |
What is hubert-large-ls960-ft?
HuBERT (Hidden-Unit BERT) Large LS960-FT is a state-of-the-art speech recognition model developed by Facebook. It's fine-tuned on 960 hours of LibriSpeech data and designed for 16kHz sampled speech audio. The model achieves a 1.9 WER (Word Error Rate) on the LibriSpeech test-clean set, representing a significant advancement in speech recognition technology.
Implementation Details
The model employs a self-supervised learning approach with a BERT-like architecture, utilizing an offline clustering step to provide aligned target labels. It's fine-tuned from the pre-trained hubert-large-ll60k checkpoint and incorporates innovative techniques for handling continuous speech input.
- Applies the masked prediction loss only over masked regions
- Derives target labels via k-means clustering (100 clusters in the first pre-training iteration)
- Combines acoustic and language modeling capabilities
- Processes 16kHz audio input
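The offline clustering step above can be sketched as follows. This is a hypothetical, simplified illustration using scikit-learn's `KMeans` over random stand-in features, not the actual training pipeline (which clusters MFCCs in the first iteration and intermediate transformer features in later ones):

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical stand-in for real MFCC frames: 2000 frames x 39 dims.
rng = np.random.default_rng(0)
features = rng.normal(size=(2000, 39)).astype(np.float32)

# Offline clustering step: assign each frame a discrete pseudo-label.
# The paper's first iteration uses 100 clusters; fewer are used here
# purely to keep the illustration fast.
n_clusters = 20
kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(features)
targets = kmeans.labels_  # one pseudo-label per frame

# These discrete targets play the role of BERT's tokens: during
# pre-training, the model predicts them at masked frame positions.
```

The pseudo-labels are computed offline, before pre-training, which is what lets a BERT-style masked prediction loss be applied to continuous audio.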
Core Capabilities
- High-accuracy speech recognition: 1.9 WER on the LibriSpeech test-clean set
- Robust performance on challenging audio conditions
- Efficient processing of continuous speech input
- Integration with Hugging Face's transformers library
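Inference via transformers can be sketched as below. `Wav2Vec2Processor` and `HubertForCTC` are the library's classes for this checkpoint; the `transcribe` helper is our own illustrative wrapper, not part of the library:

```python
import torch
from transformers import Wav2Vec2Processor, HubertForCTC

MODEL_ID = "facebook/hubert-large-ls960-ft"

def transcribe(waveform, sampling_rate: int = 16_000) -> str:
    """Transcribe a mono 16 kHz waveform (1-D float array) to text."""
    # Loaded lazily here so importing this snippet stays cheap;
    # in a real application, load once and reuse.
    processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
    model = HubertForCTC.from_pretrained(MODEL_ID)

    inputs = processor(waveform, sampling_rate=sampling_rate, return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs.input_values).logits
    # Greedy CTC decoding: argmax per frame, then collapse with the tokenizer.
    predicted_ids = torch.argmax(logits, dim=-1)
    return processor.batch_decode(predicted_ids)[0]
```

Greedy argmax decoding is the simplest option; pairing the logits with an external language model can further reduce WER.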
Frequently Asked Questions
Q: What makes this model unique?
HuBERT addresses three challenges of self-supervised speech learning: each utterance contains multiple sound units, there is no lexicon of input sound units during pre-training, and sound units have variable lengths with no explicit segmentation. Its combination of offline clustering and masked prediction sets it apart, showing up to 19% relative WER reduction on the more challenging evaluation subsets.
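For reference, WER is the word-level edit distance (substitutions, insertions, and deletions) divided by the number of reference words, and a 19% relative reduction means the new WER is 0.81 times the baseline. A small illustrative implementation (the helper names are our own):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Classic dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[-1][-1] / len(ref)

def relative_reduction(baseline: float, new: float) -> float:
    """Fraction by which `new` improves on `baseline` (e.g. 0.19 = 19%)."""
    return (baseline - new) / baseline
```

For example, `wer("the cat sat", "the hat sat")` is 1/3 (one substitution across three reference words).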
Q: What are the recommended use cases?
This model is ideal for high-quality speech recognition tasks, particularly when working with 16kHz audio. It's especially effective for clean speech recognition scenarios and can be integrated into various applications using the Hugging Face transformers library.
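Because the model expects 16 kHz mono input, recordings at other rates must be resampled first. One way to do this, sketched with SciPy's polyphase resampler (the 44.1 kHz source rate here is just an example):

```python
import numpy as np
from scipy.signal import resample_poly

SRC_RATE, TARGET_RATE = 44_100, 16_000  # example source rate; model needs 16 kHz

# One second of a 440 Hz test tone standing in for a real recording.
t = np.arange(SRC_RATE) / SRC_RATE
audio = np.sin(2 * np.pi * 440 * t).astype(np.float32)

# Polyphase resampling; resample_poly reduces the up/down factors
# (16000/44100 -> 160/441) by their gcd internally.
audio_16k = resample_poly(audio, TARGET_RATE, SRC_RATE)
```

Libraries such as torchaudio or librosa offer equivalent resampling; what matters is that the waveform handed to the processor is genuinely 16 kHz.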