wav2vec2-large-voxrex-swedish

Property	Value
Parameter Count	315M
License	CC0 1.0
Paper	arXiv:2205.03026
WER (Common Voice)	8.49%

What is wav2vec2-large-voxrex-swedish?

This is a Swedish automatic speech recognition (ASR) model developed by KBLab. It is fine-tuned from the VoxRex large model, which uses the Wav2Vec 2.0 architecture, on Swedish radio broadcasts, NST, and Common Voice data.

Implementation Details

The model was fine-tuned for 120,000 updates on the combined NST and Common Voice datasets. It expects input audio sampled at 16 kHz and uses CTC (Connectionist Temporal Classification): the network emits one label per audio frame, and repeated labels and blanks are collapsed to produce the final text.
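The CTC collapsing step can be illustrated with a minimal sketch. The blank symbol `<pad>` and the word delimiter `|` are assumptions based on common Wav2Vec 2.0 tokenizer defaults, and the frame labels are made up for illustration; none of this is taken from the model's own configuration.

```python
from itertools import groupby

BLANK = "<pad>"        # assumed CTC blank symbol
WORD_DELIMITER = "|"   # assumed word-boundary symbol


def ctc_greedy_decode(frame_labels, blank=BLANK, word_delimiter=WORD_DELIMITER):
    """Collapse consecutive repeats, drop blanks, and turn delimiters into spaces."""
    # Step 1: merge runs of identical labels into a single label.
    collapsed = [label for label, _ in groupby(frame_labels)]
    # Step 2: remove blank labels (they only separate genuine repeats).
    chars = [label for label in collapsed if label != blank]
    # Step 3: join characters and map the word delimiter to a space.
    return "".join(chars).replace(word_delimiter, " ").strip()


# Hypothetical per-frame argmax labels for a short utterance.
frames = ["<pad>", "h", "h", "<pad>", "e", "j", "j", "|", "|", "d", "<pad>", "å", "å"]
print(ctc_greedy_decode(frames))  # hej då
```

A blank between two identical labels keeps them distinct, which is how a doubled letter survives the collapse: `["a", "<pad>", "a"]` decodes to `aa`, while `["a", "a"]` decodes to `a`.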

Architecture: Wav2Vec 2.0 large architecture with 315M parameters
Performance: 8.49% WER on the Common Voice test set without a language model (7.37% with a 4-gram LM)
Combined test set: 2.5% WER on the NST + Common Voice test set
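For readers unfamiliar with the metric, WER (word error rate) is the word-level edit distance between the reference and the hypothesis, divided by the number of reference words. The sketch below is a plain implementation of that definition, not the evaluation script used for the figures above. The example sentences are invented.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One dropped word out of five reference words.
print(word_error_rate("det är en fin dag", "det är fin dag"))  # 0.2

# Relative WER reduction from adding the 4-gram LM, using the figures above.
relative_reduction = (8.49 - 7.37) / 8.49
print(round(relative_reduction * 100, 1))  # 13.2
```

On the reported numbers, the 4-gram language model cuts the Common Voice WER by about 13% relative (1.12 points absolute).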

Core Capabilities

Swedish speech recognition with 8.49% WER on the Common Voice test set
Fine-tuned on varied speech sources: radio broadcasts, NST, and Common Voice
Direct integration with the HuggingFace Transformers library
Usable on its own or combined with a 4-gram language model for lower WER
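The Transformers integration might be used roughly as follows. This is a sketch under stated assumptions: the Hub identifier `KBLab/wav2vec2-large-voxrex-swedish` is inferred from the model name and developer, `torch`, `torchaudio`, and `transformers` are assumed to be installed, and the audio path in the usage note is hypothetical. Decoding here is greedy, without a language model.

```python
MODEL_ID = "KBLab/wav2vec2-large-voxrex-swedish"  # assumed Hub identifier
TARGET_SAMPLE_RATE = 16_000  # the model expects 16 kHz input


def transcribe(path: str) -> str:
    """Transcribe one audio file with greedy CTC decoding (no language model)."""
    # Imported lazily so this module loads even without the heavy dependencies.
    import torch
    import torchaudio
    from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

    processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
    model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)
    model.eval()

    # Load audio as (channels, samples), downmix to mono, resample if needed.
    waveform, sample_rate = torchaudio.load(path)
    waveform = waveform.mean(dim=0)
    if sample_rate != TARGET_SAMPLE_RATE:
        waveform = torchaudio.functional.resample(
            waveform, sample_rate, TARGET_SAMPLE_RATE
        )

    inputs = processor(
        waveform.numpy(), sampling_rate=TARGET_SAMPLE_RATE, return_tensors="pt"
    )
    with torch.no_grad():
        logits = model(**inputs).logits

    # Pick the most likely label per frame; the processor collapses them to text.
    predicted_ids = torch.argmax(logits, dim=-1)
    return processor.batch_decode(predicted_ids)[0]
```

Calling `transcribe("clip.wav")` on a local recording (a hypothetical file name) would download the model on first use and return the transcript as a string. For repeated calls it would be more efficient to load the processor and model once and reuse them.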

Frequently Asked Questions

Q: What makes this model unique?

The model is specialized for Swedish and reports a 2.5% WER on the combined NST + Common Voice test set. It is fine-tuned from the VoxRex large model on Swedish radio broadcast, NST, and Common Voice data.

Q: What are the recommended use cases?

This model is suited to Swedish speech recognition, particularly where accurate transcription of clear speech is needed. Typical applications include media transcription and voice command systems.