# MERT-v1-95M
| Property | Value |
|---|---|
| Parameter Count | 95M |
| Architecture | Transformer (12 layers, 768 dimensions) |
| License | CC-BY-NC-4.0 |
| Paper | arXiv:2306.00107 |
| Sample Rate | 24 kHz |
| Feature Rate | 75 Hz |
## What is MERT-v1-95M?
MERT-v1-95M is a state-of-the-art music understanding model pre-trained with a masked language modeling (MLM) objective on 20,000 hours of audio data. It represents a significant advance in the m-a-p model family, raising the input sample rate to 24 kHz and the output feature rate to 75 Hz for richer feature extraction.
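As a rough orientation, the model can presumably be loaded through the standard Hugging Face `transformers` API. The sketch below assumes the Hub repository id `m-a-p/MERT-v1-95M` and that the checkpoint ships a custom model class requiring `trust_remote_code=True`; verify both against the model repository.

```python
from transformers import AutoModel, Wav2Vec2FeatureExtractor

# trust_remote_code=True is assumed to be required because the MERT repo
# ships a custom model class (assumption, not stated in this card).
model = AutoModel.from_pretrained("m-a-p/MERT-v1-95M", trust_remote_code=True)
processor = Wav2Vec2FeatureExtractor.from_pretrained("m-a-p/MERT-v1-95M", trust_remote_code=True)
```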
## Implementation Details
The model employs a transformer encoder with 12 layers and 768-dimensional hidden states. It consumes audio at 24 kHz and outputs features at 75 Hz, a significant upgrade in resolution over previous versions. For pre-training targets it uses 8 codebooks from the EnCodec neural codec as pseudo-labels, and it performs MLM prediction with in-batch noise mixture (a feature-extraction sketch follows the list below).
- Transformer-based architecture with 95M parameters
- Pre-trained on 20,000 hours of music data
- Uses a 5-second context window during pre-training
- Implements an advanced MLM paradigm with in-batch noise mixture
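To make the 24 kHz input / 75 Hz output numbers concrete, here is a minimal extraction sketch, continuing from the loading example above. It assumes torchaudio is available for resampling; the file path and layer shapes are illustrative.

```python
import torch
import torchaudio

# Load a clip and resample to the 24 kHz the model expects
# ("example_clip.wav" is a placeholder path).
waveform, sr = torchaudio.load("example_clip.wav")
waveform = waveform.mean(dim=0)  # mix stereo down to mono
if sr != 24000:
    waveform = torchaudio.functional.resample(waveform, sr, 24000)

inputs = processor(waveform.numpy(), sampling_rate=24000, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# At a 75 Hz feature rate, 5 seconds of audio yields roughly 375 frames
# of 768-dimensional features.
last_hidden = outputs.last_hidden_state  # shape: (1, time, 768)
```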
## Core Capabilities
- High-quality music audio feature extraction
- Support for music generation tasks
- Flexible feature output from different transformer layers (see the layer-selection sketch after this list)
- Efficient processing at a 75 Hz feature rate
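Because features can be taken from any transformer layer, downstream pipelines typically stack all hidden states and either pick one layer or learn a weighting over them. A sketch, continuing from the extraction example above (the layer index and uniform weights are illustrative):

```python
# outputs.hidden_states contains the embedding output plus all 12
# transformer layers: a tuple of 13 tensors shaped (1, time, 768).
all_layers = torch.stack(outputs.hidden_states).squeeze(1)  # (13, time, 768)

# Option 1: take a single intermediate layer (index 7 is illustrative).
single_layer = all_layers[7]  # (time, 768)

# Option 2: a learnable weighted average over layers, shown here with
# uniform weights as a stand-in for trained parameters.
weights = torch.softmax(torch.zeros(all_layers.shape[0]), dim=0)
weighted = (weights[:, None, None] * all_layers).sum(dim=0)  # (time, 768)
```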
## Frequently Asked Questions
Q: What makes this model unique?
MERT-v1-95M stands out for its use of EnCodec codebooks as pseudo-labels, its higher audio sample rate (24 kHz), and its large training corpus (20,000 hours), making it more robust and versatile than previous versions.
Q: What are the recommended use cases?
The model is ideal for music understanding tasks, audio classification, and feature extraction for downstream music processing applications. It is particularly effective when high-quality music representations are needed under limited computational resources; a minimal probe sketch follows.
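For such resource-constrained setups, a common pattern is to freeze MERT and train only a small probe on time-averaged features. A minimal sketch, with a hypothetical `MusicTagProbe` class and an arbitrary class count, reusing `last_hidden` from the extraction sketch:

```python
import torch
import torch.nn as nn

class MusicTagProbe(nn.Module):
    """Linear probe over frozen MERT features (illustrative, not from the paper)."""

    def __init__(self, feature_dim: int = 768, num_classes: int = 10):
        super().__init__()
        self.classifier = nn.Linear(feature_dim, num_classes)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # features: (batch, time, 768) from a chosen MERT layer
        pooled = features.mean(dim=1)  # average over the 75 Hz time axis
        return self.classifier(pooled)

probe = MusicTagProbe()
logits = probe(last_hidden)  # (1, num_classes)
```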