Phi-4-multimodal-finetune-ko-speech
| Property | Value |
|---|---|
| Author | daekeun-ml |
| Base Model | microsoft/Phi-4-multimodal-instruct |
| Training Data | 35K Korean speech samples |
| Model URL | HuggingFace |
What is Phi-4-multimodal-finetune-ko-speech?
This is a specialized fine-tuned version of Microsoft's Phi-4 multimodal model, optimized for Korean automatic speech recognition (ASR) and speech translation. The model was trained on a diverse dataset of 35,000 Korean speech samples, including data from zeroth_korean, Common Voice, MINDS14, and custom technical content, all sampled at 16kHz.
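Because all training audio was sampled at 16kHz, input at other rates should be resampled before inference. In practice this is usually done with librosa or torchaudio; as a minimal, dependency-free sketch, a linear-interpolation resampler might look like:

```python
def resample_linear(samples, src_rate, dst_rate):
    """Resample a mono waveform to a new rate via linear interpolation.

    A minimal illustration only -- production pipelines should use a
    proper resampler (librosa.resample, torchaudio.transforms.Resample).
    """
    if src_rate == dst_rate:
        return list(samples)
    n_out = int(len(samples) * dst_rate / src_rate)
    out = []
    for i in range(n_out):
        # Map output index i back to a fractional position in the input.
        pos = i * (len(samples) - 1) / (n_out - 1) if n_out > 1 else 0.0
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out

# e.g. bring 8 kHz telephone audio up to the 16 kHz the model expects
audio_16k = resample_linear([0.0, 0.5, 1.0, 0.5] * 2000, 8000, 16000)
```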
Implementation Details
The model was trained on a single A100 80GB GPU for 1 epoch with a batch size of 16. It demonstrates a significant improvement in ASR performance, achieving a Character Error Rate (CER) of 3.80% on the zeroth-test dataset, versus 198.32% for the original model (CER can exceed 100% when the hypothesis contains many insertions relative to the reference).
- Trained on multiple high-quality Korean speech datasets
- Incorporates audio augmentation techniques
- Optimized for both ASR and speech translation tasks
- Uses Flash Attention 2 for efficient processing
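The CER figures above are character-level edit distance divided by reference length. Libraries such as jiwer are normally used to compute this over a whole test set; a minimal per-utterance sketch of the metric:

```python
def levenshtein(ref, hyp):
    """Character-level edit distance via dynamic programming."""
    m, n = len(ref), len(hyp)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cur[j] = min(
                prev[j] + 1,                              # deletion
                cur[j - 1] + 1,                           # insertion
                prev[j - 1] + (ref[i - 1] != hyp[j - 1]), # substitution
            )
        prev = cur
    return prev[n]

def cer(ref, hyp):
    """Character Error Rate: edits divided by reference length."""
    return levenshtein(ref, hyp) / max(len(ref), 1)
```

Note that a hypothesis much longer than the reference drives the insertion count up, which is how the base model's 198.32% CER arises.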
Core Capabilities
- Automatic Speech Recognition (ASR) for Korean
- Korean-to-English speech translation
- English-to-Korean speech translation
- Chain-of-thought processing for translation tasks
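Which capability runs is selected by the instruction in the prompt. The sketch below assumes the base Phi-4-multimodal chat format, with `<|user|>`/`<|assistant|>`/`<|end|>` markers and an `<|audio_1|>` placeholder for the audio input; the instruction wordings are illustrative, not taken from this model card:

```python
# Phi-4-multimodal chat-format markers (from the base model's conventions).
USER, ASSISTANT, END = "<|user|>", "<|assistant|>", "<|end|>"

def build_prompt(instruction):
    """Wrap a task instruction around a single audio placeholder."""
    return f"{USER}<|audio_1|>{instruction}{END}{ASSISTANT}"

# Hypothetical instructions for the tasks listed above.
asr_prompt = build_prompt("Transcribe the audio clip into text.")
ko2en_prompt = build_prompt("Translate the audio to English.")
```

The resulting string is passed to the model's processor together with the raw 16kHz waveform.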
Frequently Asked Questions
Q: What makes this model unique?
This model stands out for its specialized focus on Korean speech processing and its large improvement in ASR performance over the base model. Its 3.80% CER on the zeroth-test dataset makes it well suited for Korean speech recognition tasks.
Q: What are the recommended use cases?
The model is best suited for proof-of-concept and experimental applications in Korean speech recognition and translation. While it shows promising results, it's not recommended for production use without further optimization and testing. It's particularly useful for researchers and developers working on Korean language processing tasks.