MMS-TTS English Speech Synthesis Model

Property	Value
Parameter Count	36.3M
License	CC-BY-NC 4.0
Author	Facebook
Paper	Scaling Speech Technology to 1,000+ Languages

What is mms-tts-eng?

mms-tts-eng is the English text-to-speech checkpoint from Facebook's Massively Multilingual Speech (MMS) project. It uses the VITS (Variational Inference with adversarial learning for end-to-end Text-to-Speech) architecture, an end-to-end model that combines a conditional variational autoencoder with adversarial training to produce natural-sounding speech.

Implementation Details

The model consists of a posterior encoder, a decoder, and a conditional prior. A flow-based module, made up of a Transformer-based text encoder and multiple coupling layers, predicts spectrogram-based acoustic features. A stochastic duration predictor lets the model synthesize speech with different rhythms from the same input text.

Flow-based spectrogram prediction module
HiFi-GAN-style decoder with transposed convolutions
Stochastic duration prediction for varied speech patterns
End-to-end training with variational lower bound and adversarial losses
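
To make the "coupling layers" in the flow-based module more concrete, here is a minimal pure-Python sketch of a single additive coupling layer and its exact inverse. It is a toy illustration of the general technique, not the model's actual code: the function names are hypothetical, and `shift_net` is a fixed stand-in for what would be a learned neural network in a real flow.

```python
def shift_net(xa):
    """Hypothetical stand-in for a learned network that maps xa to a shift."""
    total = sum(xa)
    return [0.5 * total + i for i in range(len(xa))]


def coupling_forward(x):
    """Additive coupling: keep the first half, shift the second half."""
    assert len(x) % 2 == 0, "this toy layer expects an even number of features"
    half = len(x) // 2
    xa, xb = x[:half], x[half:]
    shift = shift_net(xa)
    yb = [b + s for b, s in zip(xb, shift)]
    return xa + yb


def coupling_inverse(y):
    """Undo coupling_forward: the untouched half lets us recompute the shift."""
    assert len(y) % 2 == 0, "this toy layer expects an even number of features"
    half = len(y) // 2
    ya, yb = y[:half], y[half:]
    shift = shift_net(ya)
    xb = [b - s for b, s in zip(yb, shift)]
    return ya + xb
```

Because the first half passes through unchanged, the inverse can recompute the same shift and subtract it, which is why a stack of such layers stays invertible however complex the shift network is.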

Core Capabilities

High-quality English speech synthesis
Non-deterministic output generation
Variable speech rhythm generation
Integration with the Transformers library (version 4.33 and later)
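
The Transformers integration listed above can be sketched as follows. This is a hedged example rather than an official recipe: it assumes the checkpoint is published on the Hugging Face Hub as `facebook/mms-tts-eng` and that `torch` and `transformers` (4.33 or later) are installed. The helper names `float_to_pcm16`, `save_wav`, and `synthesize` are made up for illustration, and the WAV writer uses only the Python standard library.

```python
import struct
import wave


def float_to_pcm16(samples):
    """Convert floats in [-1.0, 1.0] to little-endian 16-bit PCM bytes (clipping outliers)."""
    clipped = (max(-1.0, min(1.0, s)) for s in samples)
    return b"".join(struct.pack("<h", int(round(s * 32767))) for s in clipped)


def save_wav(path, samples, sampling_rate):
    """Write a mono 16-bit WAV file from a sequence of float samples."""
    with wave.open(path, "wb") as f:
        f.setnchannels(1)
        f.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        f.setframerate(sampling_rate)
        f.writeframes(float_to_pcm16(samples))


def synthesize(text, model_id="facebook/mms-tts-eng", seed=None):
    """Return (samples, sampling_rate) for `text`. The model_id is an assumed Hub name."""
    # Heavy dependencies are imported here so the helpers above work without them.
    import torch
    from transformers import AutoTokenizer, VitsModel, set_seed

    if seed is not None:
        set_seed(seed)  # fix the random seed so repeated calls give the same audio
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = VitsModel.from_pretrained(model_id)
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        waveform = model(**inputs).waveform  # shape: (batch, num_samples)
    return waveform[0].tolist(), model.config.sampling_rate
```

Calling `synthesize("Hello from MMS-TTS.", seed=555)` and passing the result to `save_wav("hello.wav", samples, rate)` would write the audio to disk. Without a seed, repeated calls are expected to produce slightly different audio, because generation is non-deterministic.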

Frequently Asked Questions

Q: What makes this model unique?

The stochastic duration predictor samples how long each input unit is spoken instead of predicting a single fixed value, so the same text can be rendered with different rhythms from run to run. Fixing the random seed makes the output reproducible. The model is also one checkpoint in the larger MMS project, which scales speech technology to more than 1,000 languages.
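
The effect of sampling durations can be shown with a small toy sketch. It is a simplification under stated assumptions, not the model's implementation: the real predictor is itself a learned, flow-based network, whereas here plain Gaussian noise is added to fixed log-durations. The names `sample_durations` and `expand`, and the `noise_scale` and `speaking_rate` parameters, are illustrative.

```python
import math
import random


def sample_durations(log_durations, noise_scale, rng, speaking_rate=1.0):
    """Perturb per-token log-durations with noise and convert them to frame counts."""
    durations = []
    for log_d in log_durations:
        noisy = log_d + noise_scale * rng.gauss(0.0, 1.0)
        # Round up so every token gets at least one frame.
        durations.append(max(1, math.ceil(math.exp(noisy) / speaking_rate)))
    return durations


def expand(tokens, durations):
    """Repeat each token for its sampled number of frames (length regulation)."""
    frames = []
    for token, d in zip(tokens, durations):
        frames.extend([token] * d)
    return frames
```

With `noise_scale=0.0` the durations are the same on every call. With a positive `noise_scale`, different random states give different rhythms for identical input, while reusing the same seed reproduces the same result.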

Q: What are the recommended use cases?

The model suits applications that need English text-to-speech conversion, including accessibility tools, automated content reading, and voice assistants. Because it is released under the CC-BY-NC 4.0 license, it may be used for non-commercial purposes only.
