HuBERT Large LS960-FT
| Property | Value |
|---|---|
| License | Apache 2.0 |
| Paper | Research Paper |
| Developer | Facebook |
| Task | Automatic Speech Recognition |
What is hubert-large-ls960-ft?
HuBERT (Hidden-Unit BERT) Large LS960-FT is a state-of-the-art speech recognition model developed by Facebook. It's fine-tuned on 960 hours of LibriSpeech data and designed for 16kHz sampled speech audio. The model achieves a 1.9 WER (Word Error Rate) on the LibriSpeech test-clean set, representing a significant advancement in speech recognition technology.
Implementation Details
The model employs a self-supervised learning approach with a BERT-like architecture, utilizing an offline clustering step to provide aligned target labels. It's fine-tuned from the pre-trained hubert-large-ll60k checkpoint and incorporates innovative techniques for handling continuous speech input.
- Applies the masked prediction loss only over masked regions
- Derives target labels via k-means clustering (100 clusters in the first pre-training iteration)
- Combines acoustic and language modeling capabilities
- Processes 16kHz audio input
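The offline clustering step above can be sketched as follows. This is a hypothetical, simplified illustration using scikit-learn's `KMeans` over random stand-in features, not the actual training pipeline (which clusters MFCCs in the first iteration and intermediate transformer features in later ones):

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical stand-in for real MFCC frames: 2000 frames x 39 dims.
rng = np.random.default_rng(0)
features = rng.normal(size=(2000, 39)).astype(np.float32)

# Offline clustering step: assign each frame a discrete pseudo-label.
# The paper's first iteration uses 100 clusters; fewer are used here
# purely to keep the illustration fast.
n_clusters = 20
kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(features)
targets = kmeans.labels_  # one pseudo-label per frame

# These discrete targets play the role of BERT's tokens: during
# pre-training, the model predicts them at masked frame positions.
```

The pseudo-labels are computed offline, before pre-training, which is what lets a BERT-style masked prediction loss be applied to continuous audio.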
Core Capabilities
- High-accuracy speech recognition: 1.9 WER on the LibriSpeech test-clean set
- Robust performance on challenging audio conditions
- Efficient processing of continuous speech input
- Integration with Hugging Face's transformers library
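Inference via transformers can be sketched as below. `Wav2Vec2Processor` and `HubertForCTC` are the library's classes for this checkpoint; the `transcribe` helper is our own illustrative wrapper, not part of the library:

```python
import torch
from transformers import Wav2Vec2Processor, HubertForCTC

MODEL_ID = "facebook/hubert-large-ls960-ft"

def transcribe(waveform, sampling_rate: int = 16_000) -> str:
    """Transcribe a mono 16 kHz waveform (1-D float array) to text."""
    # Loaded lazily here so importing this snippet stays cheap;
    # in a real application, load once and reuse.
    processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
    model = HubertForCTC.from_pretrained(MODEL_ID)

    inputs = processor(waveform, sampling_rate=sampling_rate, return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs.input_values).logits
    # Greedy CTC decoding: argmax per frame, then collapse with the tokenizer.
    predicted_ids = torch.argmax(logits, dim=-1)
    return processor.batch_decode(predicted_ids)[0]
```

Greedy argmax decoding is the simplest option; pairing the logits with an external language model can further reduce WER.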
Frequently Asked Questions
Q: What makes this model unique?
HuBERT addresses three challenges of self-supervised speech learning: each utterance contains multiple sound units, there is no lexicon of input sound units during pre-training, and sound units have variable lengths with no explicit segmentation. Its combination of offline clustering and masked prediction sets it apart, showing up to 19% relative WER reduction on the more challenging evaluation subsets.
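For reference, WER is the word-level edit distance (substitutions, insertions, and deletions) divided by the number of reference words, and a 19% relative reduction means the new WER is 0.81 times the baseline. A small illustrative implementation (the helper names are our own):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Classic dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[-1][-1] / len(ref)

def relative_reduction(baseline: float, new: float) -> float:
    """Fraction by which `new` improves on `baseline` (e.g. 0.19 = 19%)."""
    return (baseline - new) / baseline
```

For example, `wer("the cat sat", "the hat sat")` is 1/3 (one substitution across three reference words).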
Q: What are the recommended use cases?
This model is ideal for high-quality speech recognition tasks, particularly when working with 16kHz audio. It's especially effective for clean speech recognition scenarios and can be integrated into various applications using the Hugging Face transformers library.
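Because the model expects 16 kHz mono input, recordings at other rates must be resampled first. One way to do this, sketched with SciPy's polyphase resampler (the 44.1 kHz source rate here is just an example):

```python
import numpy as np
from scipy.signal import resample_poly

SRC_RATE, TARGET_RATE = 44_100, 16_000  # example source rate; model needs 16 kHz

# One second of a 440 Hz test tone standing in for a real recording.
t = np.arange(SRC_RATE) / SRC_RATE
audio = np.sin(2 * np.pi * 440 * t).astype(np.float32)

# Polyphase resampling; resample_poly reduces the up/down factors
# (16000/44100 -> 160/441) by their gcd internally.
audio_16k = resample_poly(audio, TARGET_RATE, SRC_RATE)
```

Libraries such as torchaudio or librosa offer equivalent resampling; what matters is that the waveform handed to the processor is genuinely 16 kHz.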