# whisper-ja-anime-v0.1
| Property | Value |
|---|---|
| Author | efwkjn |
| Training Duration | ~160 hours on an NVIDIA RTX 3060 |
| Model URL | HuggingFace Repository |
| Architecture | Modified Whisper: frozen turbo encoder + 2 decoder layers |
## What is whisper-ja-anime-v0.1?
whisper-ja-anime-v0.1 is a speech recognition model specialized for Japanese transcription, with a focus on anime-related content. It was trained for 2^19 (524,288) steps at a batch size of 8 on a mix of datasets including OOPPEENN, Reazon, and Common Voice 19.
## Implementation Details
The model pairs a frozen turbo encoder with two decoder layers, trained using a mixed strategy: 50% of examples with prompts and 25% without timestamps. The author notes it may be undertrained, but it shows promising performance on domain-specific tasks.
- Optimized for Japanese transcription in anime-adjacent domains
- Reduced hallucination compared to baseline models
- Support for both beam search and no-timestamp operations
- Validated across multiple anime and general Japanese speech datasets
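Basic usage can be sketched with the Hugging Face `transformers` ASR pipeline. Note that the repo id below is an assumption inferred from the model name and author; verify it against the actual model page before use.

```python
# Minimal usage sketch with the transformers library.
# MODEL_ID is an assumed repo id (author namespace + model name), not confirmed.
MODEL_ID = "efwkjn/whisper-ja-anime-v0.1"

def build_asr(model_id: str = MODEL_ID):
    """Create a Japanese automatic-speech-recognition pipeline.

    transformers is imported lazily so the heavy dependency is only
    required when a transcription is actually run.
    """
    from transformers import pipeline  # requires `pip install transformers`
    return pipeline(
        "automatic-speech-recognition",
        model=model_id,
        generate_kwargs={"language": "ja", "task": "transcribe"},
    )
```

Calling `build_asr()("clip.wav")["text"]` would then return the transcription for a local audio file; the pipeline handles decoding and resampling of common audio formats internally.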
## Core Capabilities
- Domain-specific transcription with improved accuracy for anime content
- Flexible beam size options (1 and 5) for different accuracy requirements
- Competitive performance against turbo models in domain-specific tasks
- Effective handling of various Japanese speech patterns and accents
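The two beam sizes named above map to standard Whisper-style decoding options: `num_beams=1` is greedy decoding (fastest), while `num_beams=5` explores more hypotheses for higher accuracy. A small, generic options builder (not code from the model repository) makes the trade-off explicit:

```python
def decoding_options(beam_size: int = 5, timestamps: bool = True) -> dict:
    """Build generate kwargs for a Whisper-style model.

    beam_size=1 -> greedy decoding, lowest latency.
    beam_size=5 -> beam search, better accuracy at higher compute cost.
    timestamps=False selects the model's no-timestamp decoding mode.
    """
    if beam_size < 1:
        raise ValueError("beam_size must be >= 1")
    return {
        "language": "ja",
        "task": "transcribe",
        "num_beams": beam_size,
        "return_timestamps": timestamps,
    }
```

The returned dict can be passed as `generate_kwargs` when constructing a transformers ASR pipeline.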
## Frequently Asked Questions
**Q: What makes this model unique?**
The model specializes in anime-domain Japanese transcription while maintaining low hallucination rates, making it particularly useful for anime-related content transcription tasks. It achieves this while remaining a drop-in replacement for standard Whisper models.
**Q: What are the recommended use cases?**
This model is best suited for transcribing Japanese audio from anime and related content, showing particularly strong performance on domain-specific datasets such as Blue Archive, Genshin Impact, and various anime series. It can also handle general Japanese speech, but may show reduced performance on long-form content.
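One common mitigation for the long-form weakness is chunked inference, which the transformers ASR pipeline supports via `chunk_length_s`: the audio is split into overlapping windows that fit Whisper's 30-second receptive field. A hedged sketch (the repo id is again an assumption inferred from the model name):

```python
# Chunked long-form transcription sketch; MODEL_ID is an assumed repo id.
MODEL_ID = "efwkjn/whisper-ja-anime-v0.1"

def transcribe_long(path: str, chunk_s: int = 30, model_id: str = MODEL_ID) -> str:
    """Transcribe long-form Japanese audio in fixed-length chunks.

    chunk_length_s makes the pipeline process the file in windows,
    stitching the per-chunk transcriptions back together.
    """
    from transformers import pipeline  # requires `pip install transformers`
    asr = pipeline(
        "automatic-speech-recognition",
        model=model_id,
        chunk_length_s=chunk_s,
        generate_kwargs={"language": "ja", "task": "transcribe"},
    )
    return asr(path)["text"]
```

Chunking trades some cross-chunk context for bounded memory and more stable decoding on long recordings.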