Phi-4-multimodal-finetune-ko-speech

Property	Value
Author	daekeun-ml
Base Model	microsoft/Phi-4-multimodal-instruct
Training Data	35K Korean speech samples
Model URL	HuggingFace

What is Phi-4-multimodal-finetune-ko-speech?

This is a fine-tuned version of Microsoft's Phi-4-multimodal-instruct model, optimized for Korean automatic speech recognition (ASR) and speech translation. It was trained on 35,000 Korean speech samples drawn from zeroth_korean, Common Voice, MINDS14, and custom technical content, all sampled at 16 kHz.
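Because the training audio is described as 16 kHz, it can be useful to confirm that an input clip uses the same rate before passing it to the model. The following is a minimal sketch using only Python's standard `wave` module. The function names (`wav_sample_rate`, `matches_training_rate`, `write_silent_wav`) are hypothetical helpers introduced for illustration and are not part of the model's own code. The sketch handles uncompressed WAV files only.

```python
import wave

# Sample rate stated for the training data; used here as the expected input rate.
EXPECTED_RATE_HZ = 16_000


def wav_sample_rate(path: str) -> int:
    """Return the sample rate (in Hz) recorded in a WAV file's header."""
    with wave.open(path, "rb") as wav:
        return wav.getframerate()


def matches_training_rate(path: str, expected: int = EXPECTED_RATE_HZ) -> bool:
    """True if the WAV file's sample rate equals the expected rate."""
    return wav_sample_rate(path) == expected


def write_silent_wav(path: str, rate: int, seconds: float = 0.1) -> None:
    """Write a short mono, 16-bit WAV of silence (handy for trying the check)."""
    n_frames = int(rate * seconds)
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)   # mono
        wav.setsampwidth(2)   # 16-bit samples
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * n_frames)
```

A clip that fails the check would need to be resampled to 16 kHz with an audio library of your choice before inference.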

Implementation Details

The model was trained for 1 epoch on a single A100 80GB GPU with a batch size of 16. On the zeroth-test dataset it achieves a Character Error Rate (CER) of 3.80%, compared with 198.32% for the original model. (CER can exceed 100% when the output contains more character errors, typically insertions, than the reference has characters.)
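To make the CER figures concrete, here is a minimal sketch of how a character error rate is commonly computed: the character-level edit (Levenshtein) distance between the reference and the hypothesis, divided by the reference length. This is not the evaluation script used for the reported numbers. The function names are illustrative, and no text normalization is applied, whereas real evaluation pipelines usually normalize punctuation and whitespace first.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance between ref and hyp."""
    # prev[j] holds the distance between the ref prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning ref[:i] into an empty string takes i deletions
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # delete r from the reference
                curr[j - 1] + 1,         # insert h into the hypothesis
                prev[j - 1] + (r != h),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate in percent, relative to the reference length."""
    if not ref:
        raise ValueError("reference must contain at least one character")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


print(cer("안녕하세요", "안녕하세요"))  # 0.0   (exact match)
print(cer("안녕하세요", "안넝하세요"))  # 20.0  (1 substitution over 5 characters)
print(cer("네", "네 알겠습니다"))      # 600.0 (6 insertions over 1 character)
```

The last call shows why a model that produces long or off-task output can score well above 100%: every extra character counts as an insertion, while the denominator stays fixed at the reference length.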

Trained on multiple high-quality Korean speech datasets
Incorporates audio augmentation techniques
Optimized for both ASR and speech translation tasks
Uses Flash Attention 2 for efficient processing

Core Capabilities

Automatic Speech Recognition (ASR) for Korean
Korean-to-English speech translation
English-to-Korean speech translation
Chain-of-thought processing for translation tasks

Frequently Asked Questions

Q: What makes this model unique?

This model stands out for its focus on Korean speech processing. Fine-tuning lowers the CER on the zeroth-test dataset from the original model's 198.32% to 3.80%, which makes it effective for Korean speech recognition tasks.

Q: What are the recommended use cases?

The model is best suited for proof-of-concept and experimental applications in Korean speech recognition and translation. Although its results are promising, it is not recommended for production use without further optimization and testing. It is most useful to researchers and developers working on Korean language processing tasks.