emotion2vec_plus_large
| Property | Value |
|---|---|
| Model Size | ~300M parameters |
| Training Data | 42,526 hours |
| Paper | emotion2vec: Self-Supervised Pre-Training for Speech Emotion Representation |
| Author | emotion2vec team |
What is emotion2vec_plus_large?
emotion2vec_plus_large is a state-of-the-art foundation model for speech emotion recognition (SER) that aims to be the "Whisper" of emotion detection: a universal, robust recognizer that generalizes across languages and recording environments through data-driven methods. It is the largest variant in the emotion2vec+ series, fine-tuned on an extensive dataset of 42,526 hours of speech.
Implementation Details
The model processes 16 kHz audio input and can operate in two granularity modes: "utterance" for whole-speech analysis and "frame" for frame-level feature extraction at 50 Hz. It supports both embedding extraction and direct classification into 9 distinct emotional states (angry, disgusted, fearful, happy, neutral, other, sad, surprised, and unknown).
- Extensive training on 42,526 hours of filtered pseudo-labeled data
- Large-scale architecture with approximately 300M parameters
- Supports both whole-utterance and frame-level analysis
- Easy integration through the ModelScope and FunASR frameworks (see the usage sketch below)
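As a sketch of that integration, the following FunASR snippet runs whole-utterance classification. It assumes the `funasr` package is installed and uses the ModelScope model id `iic/emotion2vec_plus_large`; `example.wav` stands in for any local 16 kHz mono recording, and the `labels`/`scores` result keys are assumed from the output format described in the emotion2vec model card.

```python
from funasr import AutoModel

# Load the checkpoint from ModelScope via FunASR (downloaded on first use).
model = AutoModel(model="iic/emotion2vec_plus_large")

# Whole-utterance classification over the 9 emotion categories.
# Input audio is expected to be 16 kHz mono.
res = model.generate(
    "example.wav",            # placeholder path to a local 16 kHz WAV file
    granularity="utterance",  # "utterance" or "frame"
    extract_embedding=False,  # set True to also return the embedding
)

# Each result entry carries candidate labels and their scores
# (key names assumed from the emotion2vec model card).
print(res[0]["labels"])
print(res[0]["scores"])
```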
Core Capabilities
- Robust emotion classification across 9 categories
- Language-agnostic emotion recognition
- Feature extraction at both utterance and frame level (a frame-level sketch follows this list)
- Strong performance, surpassing other open-source SER models available on Hugging Face
- Flexible deployment options through multiple frameworks
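For the frame-level path, a comparable sketch through the ModelScope pipeline might look like the following. Again `example.wav` is a placeholder, and `Tasks.emotion_recognition` plus the `granularity`/`extract_embedding` arguments follow the usage shown in the emotion2vec documentation; treat the exact call signature as an assumption to verify against your installed versions.

```python
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

# Build a ModelScope pipeline around the same checkpoint.
ser = pipeline(task=Tasks.emotion_recognition, model="iic/emotion2vec_plus_large")

# Frame-level mode yields one feature vector per frame at 50 Hz;
# "example.wav" is a placeholder for any local 16 kHz mono recording.
rec_result = ser("example.wav", granularity="frame", extract_embedding=True)
print(rec_result)
```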
Frequently Asked Questions
Q: What makes this model unique?
emotion2vec_plus_large stands out due to its massive training dataset of over 42K hours and its universal approach to emotion recognition that works across different languages and recording conditions. It's designed to be robust and accurate, similar to how Whisper revolutionized speech recognition.
Q: What are the recommended use cases?
The model is ideal for applications requiring sophisticated emotion analysis from speech, such as call-center analytics, mental health monitoring, human-computer interaction, and social robotics. It can be used both for real-time emotion classification and for extracting emotional features for downstream tasks; a sketch of the latter follows below.
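As an illustration of the downstream-task workflow, this hedged sketch extracts utterance-level embeddings and fits a small scikit-learn classifier on top of them. The file names and labels are hypothetical placeholders, and the `feats` output key is assumed from FunASR's documented result format.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from funasr import AutoModel

model = AutoModel(model="iic/emotion2vec_plus_large")

def embed(path: str) -> np.ndarray:
    """Return the utterance-level embedding for one 16 kHz WAV file."""
    res = model.generate(path, granularity="utterance", extract_embedding=True)
    # The "feats" key is assumed from FunASR's documented result format.
    return np.asarray(res[0]["feats"])

# Placeholder clips and task labels for illustration only.
files = ["clip_01.wav", "clip_02.wav", "clip_03.wav", "clip_04.wav"]
labels = [0, 1, 0, 1]

X = np.stack([embed(f) for f in files])
clf = LogisticRegression(max_iter=1000).fit(X, labels)
print(clf.predict(X))
```

Freezing the foundation model and training only a lightweight head like this is a common way to adapt general-purpose emotion embeddings to a task-specific label set without fine-tuning the 300M-parameter backbone.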