wav2vec2-large-xls-r-300m-Urdu

Property	Value
Parameter Count	315M parameters
License	Apache 2.0
Base Model	facebook/wav2vec2-xls-r-300m
Test WER	39.89% (with LM)

What is wav2vec2-large-xls-r-300m-Urdu?

This is a speech recognition model for the Urdu language, fine-tuned from Facebook's wav2vec2-xls-r-300m checkpoint. On the Common Voice 8.0 test set it reaches a Word Error Rate (WER) of 39.89% with language model (LM) integration and a Character Error Rate (CER) of 16.7%.
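To make the two metrics concrete, here is a minimal sketch of how WER and CER are conventionally computed: the edit distance between reference and hypothesis, divided by the reference length. The function names are illustrative, and the sketch assumes whitespace tokenization and counts spaces as characters for CER; the source does not say which normalization or tooling produced the reported numbers.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate (assumption: spaces count as characters)."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # Illustrative sentence pair, not taken from the test set:
    # the hypothesis drops the final word of the reference.
    reference = "یہ ایک کتاب ہے"
    hypothesis = "یہ ایک کتاب"
    print(f"WER: {wer(reference, hypothesis):.2%}")
    print(f"CER: {cer(reference, hypothesis):.2%}")
```

CER is usually lower than WER for the same output, because a single wrong character makes the whole word count as an error at the word level.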

Implementation Details

The model was trained with a learning rate of 0.0001, a batch size of 64, and a linear learning-rate scheduler with 1000 warmup steps. Training ran for 200 epochs using the Adam optimizer.
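The schedule described above can be sketched in plain Python. This assumes the common convention for a linear scheduler: the rate rises linearly from zero to the peak during warmup, then decays linearly to zero by the last step. The `total_steps` value is a hypothetical placeholder, since the source gives the number of epochs but not the number of optimizer steps.

```python
def linear_warmup_lr(step, peak_lr=1e-4, warmup_steps=1000, total_steps=20_000):
    """Learning rate at a given optimizer step.

    peak_lr and warmup_steps follow the values stated in the text;
    total_steps is a hypothetical placeholder, not from the source.
    """
    if total_steps <= warmup_steps:
        raise ValueError("total_steps must exceed warmup_steps")
    if step < warmup_steps:
        # Warmup phase: ramp linearly from 0 up to peak_lr.
        return peak_lr * step / warmup_steps
    # Decay phase: ramp linearly from peak_lr down to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / (total_steps - warmup_steps)


if __name__ == "__main__":
    for step in (0, 500, 1000, 10_500, 20_000):
        print(step, linear_warmup_lr(step))
```

Warmup keeps early updates small while the newly initialized output layer is still far from a useful solution, which is a common reason to pair it with fine-tuning of a pretrained encoder.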

Trained on Mozilla Common Voice 8.0 Urdu dataset
Implements wav2vec2 architecture with 315M parameters
Uses PyTorch framework with Transformers library
Supports 16kHz audio input

Core Capabilities

Direct Urdu speech-to-text transcription
Supports streaming inference with configurable chunk sizes
Achieves 52.03% WER without LM, improving to 39.89% with LM
Optimized for production deployment via Inference Endpoints
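As an illustration of chunked inference on 16kHz audio, the sketch below computes overlapping chunk boundaries in samples. It is not the model's own streaming code, which the source does not describe. The function name and default values are hypothetical; the chunk-plus-stride idea mirrors the `chunk_length_s` and `stride_length_s` options of the Transformers automatic speech recognition pipeline, where the overlap on each side of a chunk is discarded when outputs are merged.

```python
SAMPLE_RATE = 16_000  # the model expects 16kHz input


def chunk_bounds(num_samples, chunk_s=10.0, stride_s=2.0, sample_rate=SAMPLE_RATE):
    """Return (start, end) sample indices of overlapping chunks.

    chunk_s is the chunk length in seconds; stride_s is the overlap kept
    on each side of a chunk. Both defaults are illustrative assumptions.
    """
    if num_samples <= 0:
        return []
    chunk = int(chunk_s * sample_rate)
    stride = int(stride_s * sample_rate)
    step = chunk - 2 * stride  # how far the window advances each time
    if step <= 0:
        raise ValueError("chunk_s must be greater than 2 * stride_s")
    bounds = []
    start = 0
    while True:
        end = min(start + chunk, num_samples)
        bounds.append((start, end))
        if end >= num_samples:
            break
        start += step
    return bounds


if __name__ == "__main__":
    # A 25-second clip split into 10-second chunks with 2-second strides.
    for start, end in chunk_bounds(25 * SAMPLE_RATE):
        print(f"{start / SAMPLE_RATE:5.1f}s to {end / SAMPLE_RATE:5.1f}s")
```

Smaller chunks reduce latency and memory use but give the model less context per chunk, so the chunk size is a trade-off to tune per application.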

Frequently Asked Questions

Q: What makes this model unique?

This model adapts the XLS-R architecture specifically to Urdu and reports results for both configurations: 52.03% WER without a language model and 39.89% WER with one. It can also be deployed in production via Inference Endpoints.

Q: What are the recommended use cases?

The model is suited to Urdu speech transcription applications, including media transcription, voice assistants, and automated subtitling. Using it with the language model is recommended, since this lowers WER from 52.03% to 39.89%.