MERT-v1-95M

Maintained By
m-a-p

Property         Value
Parameter Count  95M
Architecture     Transformer (12 layers, 768 dimensions)
License          CC-BY-NC-4.0
Paper            arXiv:2306.00107
Sample Rate      24 kHz
Feature Rate     75 Hz

What is MERT-v1-95M?

MERT-v1-95M is a music understanding model pre-trained with a masked language modeling (MLM) objective on 20,000 hours of audio. Relative to the earlier MERT-v0 models in the m-a-p family, it raises the input sample rate to 24 kHz and the output feature rate to 75 Hz, and switches to EnCodec-based pseudo-labels for pre-training.

Implementation Details

The model employs a Transformer with 12 layers and 768-dimensional hidden states. It ingests audio at 24 kHz and outputs feature frames at 75 Hz. During pre-training, pseudo-labels are drawn from the 8 codebooks of an EnCodec tokenizer, and the MLM objective is trained with in-batch noise mixture. A minimal loading sketch follows the list below.

  • Transformer-based architecture with 95M parameters
  • Pre-trained on 20,000 hours of music data
  • Pre-trained with a 5-second audio context window
  • Implements advanced MLM paradigm with noise mixture
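
As a sketch of what feature extraction looks like in practice, the snippet below loads the checkpoint through the Hugging Face `transformers` API. The repo id `m-a-p/MERT-v1-95M` is the public one; the audio file name is a placeholder, and the preprocessing steps are illustrative rather than mandated by the model.

```python
# Minimal feature-extraction sketch for MERT-v1-95M.
# Assumes transformers and torchaudio are installed; "song.wav" is a
# placeholder path, not part of the model card.
import torch
import torchaudio
from transformers import AutoModel, Wav2Vec2FeatureExtractor

model = AutoModel.from_pretrained("m-a-p/MERT-v1-95M", trust_remote_code=True)
processor = Wav2Vec2FeatureExtractor.from_pretrained(
    "m-a-p/MERT-v1-95M", trust_remote_code=True
)

waveform, sr = torchaudio.load("song.wav")   # placeholder input file
waveform = waveform.mean(dim=0)              # downmix to mono
if sr != processor.sampling_rate:            # model expects 24 kHz audio
    waveform = torchaudio.functional.resample(waveform, sr, processor.sampling_rate)

inputs = processor(
    waveform.numpy(),
    sampling_rate=processor.sampling_rate,
    return_tensors="pt",
)
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# last_hidden_state: (batch, time, 768); roughly 75 frames per second.
print(outputs.last_hidden_state.shape)
```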

Core Capabilities

  • High-quality music audio feature extraction
  • Support for music generation tasks
  • Flexible feature output from any of the 12 transformer layers (see the layer-extraction sketch below)
  • Efficient processing with 75Hz feature rate
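
For instance, a sketch of layer-wise extraction, continuing the `model` and `outputs` objects from the snippet above. The softmax-weighted layer aggregation at the end is a common downstream choice, shown here as an assumption rather than something the model card prescribes:

```python
# Hidden states from every layer: the input embedding plus the 12
# transformer layers, i.e. 13 tensors of shape (batch, time, 768).
all_layers = torch.stack(outputs.hidden_states)    # (13, batch, time, 768)
time_pooled = all_layers.mean(dim=2)               # (13, batch, 768)

# Illustrative aggregation: a softmax-weighted sum over layers; in a
# real downstream probe these weights would normally be learned.
weights = torch.softmax(torch.randn(all_layers.size(0)), dim=0)
clip_embedding = (weights[:, None, None] * time_pooled).sum(dim=0)  # (batch, 768)
```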

Frequently Asked Questions

Q: What makes this model unique?

MERT-v1-95M stands out for its use of EnCodec codebooks as pseudo-labels, higher-rate audio processing (24 kHz), and a large training corpus (20K hours), making it more robust and versatile than earlier MERT versions.

Q: What are the recommended use cases?

The model is ideal for music understanding tasks, audio classification, and feature extraction for downstream music processing applications. It's particularly effective when you need high-quality music representations while working with limited computational resources.
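
As an illustration of such a downstream setup (not part of the model card), a simple linear probe on top of the pooled features from the sketches above might look like this; the class count is hypothetical:

```python
import torch.nn as nn

NUM_CLASSES = 10                       # e.g. genre labels; illustrative only
probe = nn.Linear(768, NUM_CLASSES)    # MERT-v1-95M features are 768-d

logits = probe(clip_embedding)         # (batch, NUM_CLASSES)
prediction = logits.argmax(dim=-1)
```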
