emotion2vec_plus_large

Maintained by: emotion2vec

  • Model Size: ~300M parameters
  • Training Data: 42,526 hours
  • Paper: emotion2vec: Self-Supervised Pre-Training for Speech Emotion Recognition
  • Author: emotion2vec team

What is emotion2vec_plus_large?

emotion2vec_plus_large is a foundation model for speech emotion recognition (SER) that aims to be the "Whisper" of emotion detection. It is designed to provide universal, robust emotion recognition across languages and recording environments through data-driven methods. The model is the largest variant in the emotion2vec+ series, fine-tuned on an extensive dataset of 42,526 hours of speech.

Implementation Details

The model processes 16 kHz audio input and operates in two granularity modes: "utterance" for whole-utterance analysis and "frame" for frame-level feature extraction at 50 Hz. It supports both embedding extraction and direct classification over 9 emotional states (angry, disgusted, fearful, happy, neutral, other, sad, surprised, and unknown); a usage sketch follows the list below.

  • Extensive training on 42,526 hours of filtered pseudo-labeled data
  • Large-scale architecture with approximately 300M parameters
  • Supports both whole-utterance and frame-level analysis
  • Easy integration through ModelScope and FunASR frameworks
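
As a quick illustration, the snippet below sketches utterance-level classification through FunASR's AutoModel interface. The model ID iic/emotion2vec_plus_large, the generate() arguments, and the result keys follow the published emotion2vec usage examples and may differ across FunASR versions, so treat this as a sketch to verify against the official model card rather than a definitive implementation.

```python
from funasr import AutoModel

# Load the emotion2vec+ large checkpoint through FunASR.
# Model ID assumed from the emotion2vec usage examples; verify for your setup.
model = AutoModel(model="iic/emotion2vec_plus_large")

# Classify a 16 kHz mono WAV file at utterance granularity.
res = model.generate(
    "speech.wav",              # path to your own audio file
    granularity="utterance",
    extract_embedding=False,   # True also returns the utterance embedding
)

# Each input yields one result dict; 'labels' and 'scores' hold the
# 9-way emotion distribution (key names assumed from published examples).
print(res[0]["labels"])
print(res[0]["scores"])
```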

Core Capabilities

  • Robust emotion classification across 9 categories
  • Language-agnostic emotion recognition
  • Feature extraction at both utterance and frame level
  • Reported to outperform other open-source SER models available on Hugging Face
  • Flexible deployment options through multiple frameworks
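
For downstream feature use, the same interface can reportedly return frame-level representations at roughly 50 Hz. The sketch below assumes granularity="frame" and a 'feats' result key, both taken from the emotion2vec examples rather than verified against every FunASR release.

```python
from funasr import AutoModel

model = AutoModel(model="iic/emotion2vec_plus_large")

# Extract frame-level features (~50 Hz) instead of a single utterance label.
res = model.generate(
    "speech.wav",
    granularity="frame",
    extract_embedding=True,
)

# 'feats' is assumed to be a (num_frames, feature_dim) array that can feed
# a downstream classifier, regressor, or temporal model.
feats = res[0]["feats"]
print(feats.shape)
```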

Frequently Asked Questions

Q: What makes this model unique?

emotion2vec_plus_large stands out for its massive training dataset of over 42K hours and its universal approach to emotion recognition across languages and recording conditions. It aims to do for emotion recognition what Whisper did for speech recognition: provide a single robust, accurate model that works out of the box.

Q: What are the recommended use cases?

The model is ideal for applications requiring sophisticated emotion analysis from speech, such as call center analytics, mental health monitoring, human-computer interaction, and social robotics. It can be used for both real-time emotion classification and detailed emotional feature extraction for downstream tasks.
