# s2t-small-librispeech-asr
| Property | Value |
|---|---|
| Parameter Count | 29.5M |
| License | MIT |
| Paper | [fairseq S2T: Fast Speech-to-Text Modeling with fairseq](https://arxiv.org/abs/2010.05171) |
| WER (Clean/Other) | 4.3% / 9.0% |
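The WER figures above are word error rates: the word-level edit distance (substitutions + insertions + deletions) divided by the number of reference words. A minimal sketch of the metric, for illustration:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # DP table: d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting all i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)
```

For example, one substituted word in a three-word reference gives a WER of 1/3. Published benchmarks normalize text (casing, punctuation) before scoring, which this sketch omits.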
## What is s2t-small-librispeech-asr?
s2t-small-librispeech-asr is a Speech to Text Transformer (S2T) model developed by Facebook for automatic speech recognition tasks. This compact model represents an efficient implementation of the sequence-to-sequence transformer architecture, specifically trained on the LibriSpeech ASR corpus.
## Implementation Details
The model uses an end-to-end sequence-to-sequence transformer trained with an autoregressive cross-entropy loss. Input audio is converted to 80-channel log mel-filterbank features with utterance-level CMVN (cepstral mean and variance normalization) applied as preprocessing, and text is tokenized with a 10,000-token SentencePiece vocabulary.
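Utterance-level CMVN normalizes each filterbank channel to zero mean and unit variance, using statistics computed over the frames of a single utterance. A minimal numpy sketch of that preprocessing step (not the library's exact implementation):

```python
import numpy as np

def utterance_cmvn(features: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Utterance-level CMVN over a (num_frames, num_mel_bins) feature matrix,
    e.g. (T, 80) log mel-filterbank features for this model.

    Each channel is shifted to zero mean and scaled to unit variance, with
    statistics taken over this utterance's frames only (no global stats).
    """
    mean = features.mean(axis=0)
    std = features.std(axis=0)
    return (features - mean) / (std + eps)
```

Because the statistics are per-utterance, the normalization adapts to each recording's channel and loudness conditions without any external state.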
- Trained on LibriSpeech ASR Corpus (1000 hours of 16kHz English speech)
- Implements SpecAugment for improved robustness
- Supports 16kHz audio input
- Uses torchaudio for feature extraction
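SpecAugment, mentioned above, regularizes training by zeroing out random bands of frequency channels and time steps in the log-mel spectrogram. A simplified numpy sketch; the mask counts and widths here are illustrative, not this model's exact training recipe:

```python
import numpy as np

def spec_augment(spec: np.ndarray,
                 num_freq_masks: int = 2, freq_mask_width: int = 27,
                 num_time_masks: int = 2, time_mask_width: int = 100,
                 rng=None) -> np.ndarray:
    """Apply SpecAugment-style frequency and time masking to a (T, F)
    log-mel spectrogram. Returns a masked copy; the input is untouched."""
    rng = rng or np.random.default_rng()
    spec = spec.copy()
    T, F = spec.shape
    for _ in range(num_freq_masks):
        w = int(rng.integers(0, freq_mask_width + 1))   # random mask width
        f0 = int(rng.integers(0, max(1, F - w + 1)))    # random start channel
        spec[:, f0:f0 + w] = 0.0
    for _ in range(num_time_masks):
        w = int(rng.integers(0, min(time_mask_width, T) + 1))
        t0 = int(rng.integers(0, max(1, T - w + 1)))    # random start frame
        spec[t0:t0 + w, :] = 0.0
    return spec
```

Masking is applied only during training; inference sees the unmodified features.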
## Core Capabilities
- End-to-end speech recognition for English language
- Achieves 4.3% WER on clean test data
- Supports batched inference for efficient transcription
- Integrates seamlessly with Hugging Face's transformers library
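The transformers integration follows the standard processor-plus-model pattern: the processor extracts filterbank features from raw 16 kHz audio, and `generate` decodes token ids that the processor turns back into text. A sketch of that flow (the silent one-second waveform stands in for a real recording; loading the checkpoint downloads it on first use):

```python
import torch
from transformers import Speech2TextForConditionalGeneration, Speech2TextProcessor

# Load the pretrained checkpoint and its paired feature extractor/tokenizer
model = Speech2TextForConditionalGeneration.from_pretrained(
    "facebook/s2t-small-librispeech-asr")
processor = Speech2TextProcessor.from_pretrained(
    "facebook/s2t-small-librispeech-asr")

# `waveform` is any 16 kHz mono audio as a 1-D float array;
# a one-second silent clip is used here as a placeholder.
waveform = torch.zeros(16000).numpy()

inputs = processor(waveform, sampling_rate=16000, return_tensors="pt")
generated_ids = model.generate(inputs["input_features"],
                               attention_mask=inputs["attention_mask"])
transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)
print(transcription)
```

Passing `sampling_rate` explicitly lets the processor reject audio that was not resampled to the 16 kHz rate the model expects.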
## Frequently Asked Questions
**Q: What makes this model unique?**
This model stands out for its efficient architecture and strong performance-to-size ratio, achieving competitive WER rates with only 29.5M parameters. It's particularly notable for its clean integration with the Hugging Face ecosystem and straightforward deployment process.
**Q: What are the recommended use cases?**
The model is best suited for English speech recognition tasks in relatively clean audio conditions. It's particularly effective for applications requiring transcription of clear speech, such as audiobook processing, meeting transcription, or voice command systems.