s2t-small-librispeech-asr

Maintained By: facebook

Parameter Count: 29.5M
License: MIT
Paper: Research Paper
WER (Clean/Other): 4.3% / 9.0%

What is s2t-small-librispeech-asr?

s2t-small-librispeech-asr is a Speech to Text Transformer (S2T) model developed by Facebook for automatic speech recognition tasks. This compact model represents an efficient implementation of the sequence-to-sequence transformer architecture, specifically trained on the LibriSpeech ASR corpus.

Implementation Details

The model uses an end-to-end sequence-to-sequence transformer architecture trained with autoregressive cross-entropy loss. Audio input is converted to 80-channel log mel-filterbank features with utterance-level CMVN (cepstral mean and variance normalization) applied during preprocessing, and transcriptions are tokenized with a SentencePiece model using a 10,000-token vocabulary. A usage sketch follows the list below.

  • Trained on LibriSpeech ASR Corpus (1000 hours of 16kHz English speech)
  • Implements SpecAugment for improved robustness
  • Supports 16kHz audio input
  • Uses torchaudio for feature extraction
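As referenced above, here is a minimal transcription sketch using the transformers classes this checkpoint is designed for (Speech2TextProcessor and Speech2TextForConditionalGeneration); the small LibriSpeech demo split is used purely for illustration, and torchaudio plus sentencepiece must be installed:

```python
from transformers import Speech2TextProcessor, Speech2TextForConditionalGeneration
from datasets import load_dataset

# Load the checkpoint and its paired processor
# (feature extractor + SentencePiece tokenizer).
model = Speech2TextForConditionalGeneration.from_pretrained("facebook/s2t-small-librispeech-asr")
processor = Speech2TextProcessor.from_pretrained("facebook/s2t-small-librispeech-asr")

# A short LibriSpeech sample; any 16 kHz mono waveform as a float array works here.
ds = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation")
audio = ds[0]["audio"]["array"]

# The processor computes the 80-channel log mel-filterbank features described above.
inputs = processor(audio, sampling_rate=16_000, return_tensors="pt")

# Autoregressive decoding with the sequence-to-sequence transformer.
generated_ids = model.generate(inputs["input_features"], attention_mask=inputs["attention_mask"])
transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(transcription)
```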

Core Capabilities

  • End-to-end speech recognition for English language
  • Achieves 4.3% WER on LibriSpeech test-clean and 9.0% on test-other
  • Supports batched inference for transcribing multiple utterances at once (see the sketch after this list)
  • Integrates seamlessly with Hugging Face's transformers library
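A sketch of batched inference, assuming all inputs are 16 kHz float arrays (the random waveforms below are placeholders for real audio):

```python
import torch
from transformers import Speech2TextProcessor, Speech2TextForConditionalGeneration

model = Speech2TextForConditionalGeneration.from_pretrained("facebook/s2t-small-librispeech-asr")
processor = Speech2TextProcessor.from_pretrained("facebook/s2t-small-librispeech-asr")

# A list of 1-D float arrays of different lengths, all sampled at 16 kHz
# (placeholder data here; substitute real waveforms).
waveforms = [torch.randn(16_000).numpy(), torch.randn(24_000).numpy()]

# padding=True pads the filterbank features to a common length and returns
# an attention mask so the model ignores the padded frames.
inputs = processor(waveforms, sampling_rate=16_000, padding=True, return_tensors="pt")
generated_ids = model.generate(inputs["input_features"], attention_mask=inputs["attention_mask"])
print(processor.batch_decode(generated_ids, skip_special_tokens=True))
```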

Frequently Asked Questions

Q: What makes this model unique?

This model stands out for its efficient architecture and strong performance-to-size ratio, achieving competitive word error rates with only 29.5M parameters. It's particularly notable for its clean integration with the Hugging Face ecosystem and straightforward deployment process.

Q: What are the recommended use cases?

The model is best suited for English speech recognition tasks in relatively clean audio conditions. It's particularly effective for applications requiring transcription of clear speech, such as audiobook processing, meeting transcription, or voice command systems.
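Real-world recordings such as meetings are often captured at 44.1 or 48 kHz, while the model expects 16 kHz mono input. A minimal resampling sketch with torchaudio (the file name is a placeholder):

```python
import torchaudio

# Placeholder path; torchaudio.load returns (waveform, sample_rate).
waveform, sr = torchaudio.load("meeting_recording.wav")

# Downmix to mono, then resample to the 16 kHz rate the model was trained on.
waveform = waveform.mean(dim=0)
if sr != 16_000:
    waveform = torchaudio.functional.resample(waveform, orig_freq=sr, new_freq=16_000)

# waveform.numpy() can now be passed to the processor as in the examples above.
```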
