wav2vec2-large-xlsr-japanese-hiragana

Property	Value
Base Model	facebook/wav2vec2-large-xlsr-53
Task	Japanese Speech Recognition
Author	vumichien
Model Hub	HuggingFace

What is wav2vec2-large-xlsr-japanese-hiragana?

This is a speech recognition model fine-tuned from Facebook's wav2vec2-large-xlsr-53 for Japanese. It is trained to output Japanese text in hiragana, using both the Common Voice dataset and the Japanese speech corpus of Saruwatari-lab, University of Tokyo (JSUT).

Implementation Details

The model expects audio sampled at 16 kHz. Japanese text is tokenized with MeCab, converted with PyKakasi, and cleaned with regex patterns. Inference runs on CPU, but a CUDA-capable GPU is recommended for speed. On the test set, the model reports a Word Error Rate (WER) of 24.74% and a Character Error Rate (CER) of 10.99%.
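To make the WER and CER figures concrete, here is a minimal sketch of how such metrics are commonly computed: an edit distance between reference and hypothesis, divided by the reference length. It assumes transcripts are already split into words by spaces (as a word segmenter such as MeCab would produce) and that CER ignores spaces. Both are assumptions for illustration, not a description of the evaluation script behind the reported numbers.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: minimum substitutions, insertions, and deletions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion from the reference
                curr[j - 1] + 1,     # insertion into the reference
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate over whitespace-separated tokens."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate; spaces are dropped first (an assumption of this sketch)."""
    ref_chars = reference.replace(" ", "")
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    return edit_distance(ref_chars, hypothesis.replace(" ", "")) / len(ref_chars)
```

For the reference "きょう は いい てんき です" and the hypothesis "きょう は いい でんき です", one of five words is wrong, so `wer` returns 0.2, while only one of eleven characters differs, so `cer` is about 0.09. This is why CER is typically much lower than WER for the same output.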

Requires 16kHz audio sampling rate
Integrates MeCab and PyKakasi for Japanese text processing
Supports batch processing for efficient inference
Implements custom preprocessing pipeline for Japanese text
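The text-cleaning step described above can be sketched in plain Python. This is a simplified stand-in, not the model's actual pipeline: the punctuation pattern is hypothetical, and katakana is mapped to hiragana by Unicode offset instead of through PyKakasi. Converting kanji to hiragana needs dictionary-based readings, which is what tools such as MeCab and PyKakasi provide, so kanji passes through this sketch unchanged.

```python
import re

# Hypothetical punctuation set for illustration; the real pattern may differ.
PUNCT_PATTERN = re.compile(r"[、。，．,.!?！？「」『』（）()・…\-]")

# Katakana ァ..ヶ sits exactly 0x60 code points above hiragana ぁ..ゖ.
KATAKANA_START, KATAKANA_END = 0x30A1, 0x30F6
KANA_OFFSET = 0x60


def katakana_to_hiragana(text: str) -> str:
    """Shift katakana letters into the hiragana block; leave everything else alone."""
    return "".join(
        chr(ord(ch) - KANA_OFFSET) if KATAKANA_START <= ord(ch) <= KATAKANA_END else ch
        for ch in text
    )


def normalize_transcript(text: str) -> str:
    """Strip punctuation, convert katakana to hiragana, and collapse whitespace."""
    text = PUNCT_PATTERN.sub("", text)
    text = katakana_to_hiragana(text)
    return re.sub(r"\s+", " ", text).strip()
```

For example, `normalize_transcript("コンニチハ、 セカイ！")` yields "こんにちは せかい". The long-vowel mark ー is kept as-is, since it has no separate hiragana form.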

Core Capabilities

Direct speech-to-text conversion without language model
Automatic hiragana output generation
Batch processing support for multiple audio files
Handling of Japanese-specific characters and pronunciations
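A batch inference loop for these capabilities might look like the sketch below, which uses the Hugging Face transformers classes for wav2vec2 CTC models together with torchaudio. The Hub ID is assumed from the author and model name above, and `batched` and `transcribe` are hypothetical helper names. Decoding is a plain argmax over the CTC logits, with no language model involved.

```python
from typing import Iterator, List, Sequence

# Assumed Hub ID, inferred from the author and model name; verify before use.
MODEL_ID = "vumichien/wav2vec2-large-xlsr-japanese-hiragana"
TARGET_SAMPLE_RATE = 16_000  # the model expects 16 kHz audio


def batched(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive chunks of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def transcribe(paths: Sequence[str], batch_size: int = 4) -> List[str]:
    """Transcribe audio files to hiragana with greedy CTC decoding (no language model)."""
    # Heavy dependencies are imported lazily so `batched` works without them.
    import torch
    import torchaudio
    from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

    device = "cuda" if torch.cuda.is_available() else "cpu"
    processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
    model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID).to(device).eval()

    results: List[str] = []
    for batch in batched(paths, batch_size):
        waves = []
        for path in batch:
            wave, rate = torchaudio.load(path)  # shape: (channels, frames)
            wave = wave.mean(dim=0)             # downmix to mono
            if rate != TARGET_SAMPLE_RATE:
                wave = torchaudio.functional.resample(wave, rate, TARGET_SAMPLE_RATE)
            waves.append(wave.numpy())

        # Pad clips in the batch to a common length.
        inputs = processor(waves, sampling_rate=TARGET_SAMPLE_RATE,
                           return_tensors="pt", padding=True)
        mask = inputs.get("attention_mask")
        with torch.no_grad():
            logits = model(
                inputs.input_values.to(device),
                attention_mask=mask.to(device) if mask is not None else None,
            ).logits

        # Greedy decoding: take the most likely token at every frame.
        predicted_ids = torch.argmax(logits, dim=-1)
        results.extend(processor.batch_decode(predicted_ids))
    return results
```

Calling `transcribe(["a.wav", "b.wav", "c.wav"], batch_size=2)` would process the files in two batches. The file names are placeholders, and the first call downloads the model weights.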

Frequently Asked Questions

Q: What makes this model unique?

This model targets Japanese speech recognition with direct hiragana output. It builds on the multilingual wav2vec2-large-xlsr-53 checkpoint and pairs it with MeCab and PyKakasi for Japanese text processing.

Q: What are the recommended use cases?

The model suits Japanese speech transcription tasks where hiragana output is desired. Typical applications include voice assistants, transcription services, and Japanese language learning tools that need a phonetic representation of speech.