emotion2vec_plus_large
| Property | Value |
|---|---|
| Model Size | ~300M parameters |
| Training Data | 42,526 hours |
| Paper | emotion2vec: Self-Supervised Pre-Training for Speech Emotion Representation |
| Author | emotion2vec team |
What is emotion2vec_plus_large?
emotion2vec_plus_large is a state-of-the-art foundation model for speech emotion recognition (SER) that aims to be the "Whisper" of emotion detection: a universal, robust recognizer that generalizes across languages and recording environments through data-driven methods. It is the largest variant in the emotion2vec+ series, fine-tuned on an extensive dataset of 42,526 hours of speech.
Implementation Details
The model processes 16 kHz audio input and can operate in two granularity modes: "utterance" for whole-speech analysis and "frame" for frame-level feature extraction at 50 Hz. It supports both embedding extraction and direct classification into 9 distinct emotional states (angry, disgusted, fearful, happy, neutral, other, sad, surprised, and unknown).
- Extensive training on 42,526 hours of filtered pseudo-labeled data
- Large-scale architecture with approximately 300M parameters
- Supports both whole-utterance and frame-level analysis
- Easy integration through the ModelScope and FunASR frameworks (see the usage sketch below)
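As a sketch of that integration, the following FunASR snippet runs whole-utterance classification. It assumes the `funasr` package is installed and uses the ModelScope model id `iic/emotion2vec_plus_large`; `example.wav` stands in for any local 16 kHz mono recording, and the `labels`/`scores` result keys are assumed from the output format described in the emotion2vec model card.

```python
from funasr import AutoModel

# Load the checkpoint from ModelScope via FunASR (downloaded on first use).
model = AutoModel(model="iic/emotion2vec_plus_large")

# Whole-utterance classification over the 9 emotion categories.
# Input audio is expected to be 16 kHz mono.
res = model.generate(
    "example.wav",            # placeholder path to a local 16 kHz WAV file
    granularity="utterance",  # "utterance" or "frame"
    extract_embedding=False,  # set True to also return the embedding
)

# Each result entry carries candidate labels and their scores
# (key names assumed from the emotion2vec model card).
print(res[0]["labels"])
print(res[0]["scores"])
```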
Core Capabilities
- Robust emotion classification across 9 categories
- Language-agnostic emotion recognition
- Feature extraction at both utterance and frame level (a frame-level sketch follows this list)
- Strong performance, surpassing other open-source SER models available on Hugging Face
- Flexible deployment options through multiple frameworks
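For the frame-level path, a comparable sketch through the ModelScope pipeline might look like the following. Again `example.wav` is a placeholder, and `Tasks.emotion_recognition` plus the `granularity`/`extract_embedding` arguments follow the usage shown in the emotion2vec documentation; treat the exact call signature as an assumption to verify against your installed versions.

```python
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

# Build a ModelScope pipeline around the same checkpoint.
ser = pipeline(task=Tasks.emotion_recognition, model="iic/emotion2vec_plus_large")

# Frame-level mode yields one feature vector per frame at 50 Hz;
# "example.wav" is a placeholder for any local 16 kHz mono recording.
rec_result = ser("example.wav", granularity="frame", extract_embedding=True)
print(rec_result)
```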
Frequently Asked Questions
Q: What makes this model unique?
emotion2vec_plus_large stands out due to its massive training dataset of over 42K hours and its universal approach to emotion recognition that works across different languages and recording conditions. It's designed to be robust and accurate, similar to how Whisper revolutionized speech recognition.
Q: What are the recommended use cases?
The model is ideal for applications requiring sophisticated emotion analysis from speech, such as call-center analytics, mental health monitoring, human-computer interaction, and social robotics. It can be used both for real-time emotion classification and for extracting emotional features for downstream tasks; a sketch of the latter follows below.
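As an illustration of the downstream-task workflow, this hedged sketch extracts utterance-level embeddings and fits a small scikit-learn classifier on top of them. The file names and labels are hypothetical placeholders, and the `feats` output key is assumed from FunASR's documented result format.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from funasr import AutoModel

model = AutoModel(model="iic/emotion2vec_plus_large")

def embed(path: str) -> np.ndarray:
    """Return the utterance-level embedding for one 16 kHz WAV file."""
    res = model.generate(path, granularity="utterance", extract_embedding=True)
    # The "feats" key is assumed from FunASR's documented result format.
    return np.asarray(res[0]["feats"])

# Placeholder clips and task labels for illustration only.
files = ["clip_01.wav", "clip_02.wav", "clip_03.wav", "clip_04.wav"]
labels = [0, 1, 0, 1]

X = np.stack([embed(f) for f in files])
clf = LogisticRegression(max_iter=1000).fit(X, labels)
print(clf.predict(X))
```

Freezing the foundation model and training only a lightweight head like this is a common way to adapt general-purpose emotion embeddings to a task-specific label set without fine-tuning the 300M-parameter backbone.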