# R1-Omni-0.5B
| Property | Value |
|---|---|
| Author | StarJiaxing |
| Model Size | 0.5B parameters |
| Paper | arXiv:2503.05379 |
| Model Hub | HuggingFace |
## What is R1-Omni-0.5B?
R1-Omni-0.5B is the industry's first application of Reinforcement Learning with Verifiable Reward (RLVR) to an omni-multimodal large language model focused on emotion recognition. This innovative model processes both visual and audio inputs to perform advanced emotion recognition tasks, demonstrating superior performance compared to traditional approaches.
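The defining feature of RLVR is that the reward is computed by a deterministic rule rather than a learned reward model. Below is a minimal sketch of such a reward for emotion recognition; the `<think>`/`<answer>` template and the equal weighting of the accuracy and format terms are illustrative assumptions in the style of R1-family training, not the model's confirmed recipe.

```python
import re

def verifiable_reward(output: str, gold_label: str) -> float:
    """Rule-based reward in the RLVR spirit: no learned reward model.

    Assumes (illustratively) the policy is prompted to emit its
    reasoning in <think>...</think> and its emotion label in
    <answer>...</answer>.
    """
    # Format reward: 1 if the output follows the required template.
    format_ok = bool(re.fullmatch(
        r"\s*<think>.*</think>\s*<answer>.*</answer>\s*", output, re.DOTALL))
    r_format = 1.0 if format_ok else 0.0

    # Accuracy reward: 1 if the extracted answer matches the gold emotion.
    match = re.search(r"<answer>(.*?)</answer>", output, re.DOTALL)
    pred = match.group(1).strip().lower() if match else ""
    r_acc = 1.0 if pred == gold_label.strip().lower() else 0.0

    return r_acc + r_format

out = "<think>Tears and a trembling voice suggest sadness.</think><answer>sad</answer>"
# verifiable_reward(out, "sad") -> 2.0 (correct label, correct format)
```

Because the reward is fully checkable, the policy can be optimized with standard RL without reward-model drift.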
## Implementation Details
The model pairs two key encoders: siglip-224 for visual processing and whisper-large-v3 for audio analysis. Training began with a cold-start phase on 580 samples drawn from combined datasets, followed by RLVR training on more than 15,000 video samples from the MAFW and DFEW datasets.
- Advanced multimodal processing architecture
- Reinforcement Learning with Verifiable Reward implementation
- Comprehensive emotion recognition capabilities across video and audio inputs
- Improved generalization for out-of-distribution scenarios
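The multimodal architecture described above can be sketched conceptually: each encoder's features are projected into the language model's embedding space and concatenated into one token sequence. All dimensions below are assumptions for illustration (768 for SigLIP-224 features, 1280 for the Whisper-large-v3 encoder, 896 for a 0.5B-scale LLM width); this is not the model's actual projector code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dims (assumptions, not the released checkpoint's):
# SigLIP-224 visual features, Whisper-large-v3 audio features,
# and a 0.5B-scale LLM hidden size.
D_VIS, D_AUD, D_LLM = 768, 1280, 896
W_vis = rng.standard_normal((D_VIS, D_LLM)) * 0.02  # visual projector
W_aud = rng.standard_normal((D_AUD, D_LLM)) * 0.02  # audio projector

def fuse(vis_feats, aud_feats, text_embeds):
    """Project each modality into the LLM embedding space and
    concatenate along the sequence axis."""
    return np.concatenate(
        [vis_feats @ W_vis, aud_feats @ W_aud, text_embeds], axis=0)

seq = fuse(rng.standard_normal((16, D_VIS)),   # 16 video-frame tokens
           rng.standard_normal((50, D_AUD)),   # 50 audio tokens
           rng.standard_normal((8, D_LLM)))    # 8 text-prompt tokens
# seq has shape (74, 896): one fused sequence for the language model
```

In the real model the projectors are trained jointly with the LLM, so emotion cues from face and voice land in a shared representation space.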
## Core Capabilities
- Superior emotion recognition accuracy (65.83% WAR on the DFEW dataset)
- Enhanced reasoning abilities for multimodal inputs
- Robust performance on both in-distribution and out-of-distribution data
- Explainable emotion recognition outputs with detailed reasoning
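For context on the WAR figure above: WAR (Weighted Average Recall) and UAR (Unweighted Average Recall) are the standard DFEW metrics. WAR weights each class by its sample count, which makes it equal to overall accuracy, while UAR averages per-class recall evenly. A small self-contained computation:

```python
from collections import defaultdict

def war_uar(y_true, y_pred):
    """Weighted Average Recall (WAR) and Unweighted Average Recall (UAR).

    WAR = total correct / total samples (overall accuracy);
    UAR = mean of per-class recalls, each class counted equally.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += int(t == p)
    war = sum(correct.values()) / len(y_true)
    uar = sum(correct[c] / total[c] for c in total) / len(total)
    return war, uar

war, uar = war_uar(
    ["happy", "happy", "happy", "sad"],
    ["happy", "happy", "sad", "sad"])
# war = 3/4 = 0.75; uar = (2/3 + 1/1) / 2 ≈ 0.833
```

On class-imbalanced benchmarks like DFEW, a gap between WAR and UAR reveals whether a model is leaning on the majority emotion classes.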
## Frequently Asked Questions
Q: What makes this model unique?
R1-Omni-0.5B is the first model to combine RLVR with omni-multimodal architecture for emotion recognition, offering superior performance and explainability in its predictions.
Q: What are the recommended use cases?
The model is ideal for emotion recognition tasks requiring both visual and audio analysis, particularly in scenarios where explanation of the emotional assessment is needed. It's especially effective for applications requiring robust performance across different types of input data.