MMS-TTS-CAT: Catalan Text-to-Speech Model

Property	Value
Parameter Count	36.3M
License	CC-BY-NC 4.0
Author	Facebook
Paper	Scaling Speech Technology to 1,000+ Languages (MMS paper)
Model Type	Text-to-Speech (VITS)

What is mms-tts-cat?

MMS-TTS-CAT is the Catalan text-to-speech checkpoint from Facebook's Massively Multilingual Speech (MMS) project, which aims to extend speech technology to a large number of languages. It uses the VITS architecture to synthesize a speech waveform directly from input text.

Implementation Details

VITS (Variational Inference with adversarial learning for end-to-end Text-to-Speech) is a conditional variational autoencoder (VAE) with three main components: a posterior encoder, a decoder, and a conditional prior. The conditional prior combines a Transformer-based text encoder with flow-based modules to predict spectrogram-based acoustic features, and a HiFi-GAN-style decoder then generates the waveform from them.

Stochastic duration predictor for varied speech rhythm
Flow-based modules for improved expressiveness
End-to-end training with variational lower bound and adversarial losses
Non-deterministic output: because the duration predictor is stochastic, the same text can yield different audio, so a fixed random seed is needed for reproducibility
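
The seed fixing mentioned in the last point can be sketched as follows. This is a minimal illustration, not an official recipe: it assumes the checkpoint is published on the Hugging Face Hub under the ID facebook/mms-tts-cat (an assumption based on the author and model name above), and the helper names synthesize and max_abs_difference are hypothetical. It uses VitsModel, AutoTokenizer, and set_seed from the Transformers library.

```python
def max_abs_difference(wave_a, wave_b):
    """Largest absolute sample difference between two waveforms.

    Returns infinity when the lengths differ, which can happen when two
    runs sample different durations for the same text.
    """
    if len(wave_a) != len(wave_b):
        return float("inf")
    return max((abs(a - b) for a, b in zip(wave_a, wave_b)), default=0.0)


def synthesize(text, seed, model_id="facebook/mms-tts-cat"):
    """Synthesize `text` with a fixed seed; returns a list of float samples.

    `model_id` is an assumed Hub ID. Heavy imports stay inside the function
    so that defining it does not require PyTorch or Transformers.
    """
    import torch
    from transformers import AutoTokenizer, VitsModel, set_seed

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = VitsModel.from_pretrained(model_id)
    inputs = tokenizer(text, return_tensors="pt")

    # Seed the Python, NumPy, and PyTorch RNGs right before the forward
    # pass, since the stochastic duration predictor draws random noise.
    set_seed(seed)
    with torch.no_grad():
        waveform = model(**inputs).waveform  # shape: (batch, num_samples)
    return waveform[0].tolist()


# Example usage (downloads the checkpoint on first call):
#   first = synthesize("Bon dia", seed=555)
#   second = synthesize("Bon dia", seed=555)
#   print(max_abs_difference(first, second))
```

With the same seed, library versions, and hardware, the two runs should produce matching audio; with different seeds, the waveforms may even differ in length. In real use the model and tokenizer would be loaded once and reused rather than reloaded on every call.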

Core Capabilities

High-quality Catalan speech synthesis from text input
Variable speech rhythm generation
Efficient inference using PyTorch
Integration with 🤗 Transformers library (v4.33+)
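
As an illustration of the Transformers integration, the sketch below synthesizes a sentence and saves it as a WAV file. It rests on the same assumption that the checkpoint is available on the Hugging Face Hub as facebook/mms-tts-cat; write_wav and text_to_wav are hypothetical helper names, and the WAV writer uses only the Python standard library rather than any particular audio package.

```python
import struct
import wave


def write_wav(path, samples, sampling_rate):
    """Write mono float samples in [-1.0, 1.0] as a 16-bit PCM WAV file."""
    frames = bytearray()
    for sample in samples:
        # Clip to the valid range, then scale to signed 16-bit integers.
        clipped = max(-1.0, min(1.0, float(sample)))
        frames += struct.pack("<h", int(round(clipped * 32767)))
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)          # mono
        wav_file.setsampwidth(2)          # 2 bytes = 16 bits per sample
        wav_file.setframerate(sampling_rate)
        wav_file.writeframes(bytes(frames))


def text_to_wav(text, path, model_id="facebook/mms-tts-cat"):
    """Synthesize `text` and save it to `path` (`model_id` is assumed)."""
    # Imported lazily so write_wav works without PyTorch or Transformers.
    import torch
    from transformers import AutoTokenizer, VitsModel

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = VitsModel.from_pretrained(model_id)
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        waveform = model(**inputs).waveform[0]  # first item of the batch
    # The sampling rate is read from the model config, not hard-coded.
    write_wav(path, waveform.tolist(), model.config.sampling_rate)


# Example usage (downloads the checkpoint on first call):
#   text_to_wav("Bon dia, com estàs?", "sortida.wav")
```

Reading the sampling rate from the model configuration keeps the saved file consistent with whatever rate the checkpoint was trained at.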

Frequently Asked Questions

Q: What makes this model unique?

It is a dedicated Catalan checkpoint from the MMS project, which aims to make speech technology available across many languages. Its VITS architecture synthesizes speech end to end, and the stochastic duration predictor lets it vary speech rhythm for the same input text.

Q: What are the recommended use cases?

The model suits applications that need Catalan text-to-speech conversion, such as accessibility tools, educational software, and content localization. Because it is released under the CC-BY-NC 4.0 license, its use is limited to non-commercial applications.
