# s2t-small-librispeech-asr
| Property | Value |
|---|---|
| Parameter Count | 29.5M |
| License | MIT |
| Paper | [fairseq S2T: Fast Speech-to-Text Modeling with fairseq](https://arxiv.org/abs/2010.05171) |
| WER (Clean/Other) | 4.3% / 9.0% |
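The WER figures above are word error rates: the word-level edit distance (substitutions + insertions + deletions) divided by the number of reference words. A minimal sketch of the metric, for illustration:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # DP table: d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting all i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)
```

For example, one substituted word in a three-word reference gives a WER of 1/3. Published benchmarks normalize text (casing, punctuation) before scoring, which this sketch omits.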
## What is s2t-small-librispeech-asr?
s2t-small-librispeech-asr is a Speech to Text Transformer (S2T) model developed by Facebook for automatic speech recognition tasks. This compact model represents an efficient implementation of the sequence-to-sequence transformer architecture, specifically trained on the LibriSpeech ASR corpus.
## Implementation Details
The model uses an end-to-end sequence-to-sequence transformer trained with an autoregressive cross-entropy loss. Input audio is converted to 80-channel log mel-filterbank features with utterance-level CMVN (cepstral mean and variance normalization) applied as preprocessing, and text is tokenized with a 10,000-token SentencePiece vocabulary.
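Utterance-level CMVN normalizes each filterbank channel to zero mean and unit variance, using statistics computed over the frames of a single utterance. A minimal numpy sketch of that preprocessing step (not the library's exact implementation):

```python
import numpy as np

def utterance_cmvn(features: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Utterance-level CMVN over a (num_frames, num_mel_bins) feature matrix,
    e.g. (T, 80) log mel-filterbank features for this model.

    Each channel is shifted to zero mean and scaled to unit variance, with
    statistics taken over this utterance's frames only (no global stats).
    """
    mean = features.mean(axis=0)
    std = features.std(axis=0)
    return (features - mean) / (std + eps)
```

Because the statistics are per-utterance, the normalization adapts to each recording's channel and loudness conditions without any external state.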
- Trained on LibriSpeech ASR Corpus (1000 hours of 16kHz English speech)
- Implements SpecAugment for improved robustness
- Supports 16kHz audio input
- Uses torchaudio for feature extraction
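SpecAugment, mentioned above, regularizes training by zeroing out random bands of frequency channels and time steps in the log-mel spectrogram. A simplified numpy sketch; the mask counts and widths here are illustrative, not this model's exact training recipe:

```python
import numpy as np

def spec_augment(spec: np.ndarray,
                 num_freq_masks: int = 2, freq_mask_width: int = 27,
                 num_time_masks: int = 2, time_mask_width: int = 100,
                 rng=None) -> np.ndarray:
    """Apply SpecAugment-style frequency and time masking to a (T, F)
    log-mel spectrogram. Returns a masked copy; the input is untouched."""
    rng = rng or np.random.default_rng()
    spec = spec.copy()
    T, F = spec.shape
    for _ in range(num_freq_masks):
        w = int(rng.integers(0, freq_mask_width + 1))   # random mask width
        f0 = int(rng.integers(0, max(1, F - w + 1)))    # random start channel
        spec[:, f0:f0 + w] = 0.0
    for _ in range(num_time_masks):
        w = int(rng.integers(0, min(time_mask_width, T) + 1))
        t0 = int(rng.integers(0, max(1, T - w + 1)))    # random start frame
        spec[t0:t0 + w, :] = 0.0
    return spec
```

Masking is applied only during training; inference sees the unmodified features.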
## Core Capabilities
- End-to-end speech recognition for English language
- Achieves 4.3% WER on clean test data
- Supports batched inference for efficient transcription
- Integrates seamlessly with Hugging Face's transformers library
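The transformers integration follows the standard processor-plus-model pattern: the processor extracts filterbank features from raw 16 kHz audio, and `generate` decodes token ids that the processor turns back into text. A sketch of that flow (the silent one-second waveform stands in for a real recording; loading the checkpoint downloads it on first use):

```python
import torch
from transformers import Speech2TextForConditionalGeneration, Speech2TextProcessor

# Load the pretrained checkpoint and its paired feature extractor/tokenizer
model = Speech2TextForConditionalGeneration.from_pretrained(
    "facebook/s2t-small-librispeech-asr")
processor = Speech2TextProcessor.from_pretrained(
    "facebook/s2t-small-librispeech-asr")

# `waveform` is any 16 kHz mono audio as a 1-D float array;
# a one-second silent clip is used here as a placeholder.
waveform = torch.zeros(16000).numpy()

inputs = processor(waveform, sampling_rate=16000, return_tensors="pt")
generated_ids = model.generate(inputs["input_features"],
                               attention_mask=inputs["attention_mask"])
transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)
print(transcription)
```

Passing `sampling_rate` explicitly lets the processor reject audio that was not resampled to the 16 kHz rate the model expects.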
## Frequently Asked Questions
**Q: What makes this model unique?**
This model stands out for its efficient architecture and strong performance-to-size ratio, achieving competitive WER rates with only 29.5M parameters. It's particularly notable for its clean integration with the Hugging Face ecosystem and straightforward deployment process.
**Q: What are the recommended use cases?**
The model is best suited for English speech recognition tasks in relatively clean audio conditions. It's particularly effective for applications requiring transcription of clear speech, such as audiobook processing, meeting transcription, or voice command systems.