mms-tts-eng

Maintained By: facebook

MMS-TTS English Speech Synthesis Model

Property         Value
Parameter Count  36.3M
License          CC-BY-NC 4.0
Author           Facebook
Paper            Scaling Speech Technology to 1,000+ Languages

What is mms-tts-eng?

MMS-TTS-eng is the English text-to-speech checkpoint from Facebook's Massively Multilingual Speech (MMS) project. It uses the VITS architecture (Variational Inference with adversarial learning for end-to-end Text-to-Speech), a conditional variational autoencoder trained with adversarial losses to produce natural-sounding speech.

Implementation Details

The model consists of a posterior encoder, a decoder, and a conditional prior. A flow-based module, made up of a Transformer-based text encoder and multiple coupling layers, predicts spectrogram-based acoustic features from the input text. A stochastic duration predictor lets the model generate different speech rhythms from the same input text.

  • Flow-based spectrogram prediction module
  • HiFi-GAN-style decoder with transposed convolutions
  • Stochastic duration prediction for varied speech patterns
  • End-to-end training with variational lower bound and adversarial losses
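
To illustrate why a stochastic duration predictor yields different rhythms for the same text, the toy sketch below scales fixed per-token durations by random Gaussian noise before rounding them up to whole frames. It is not the model's actual predictor, which is learned during training; the function name `sample_durations` and the parameters `base_durations`, `noise_scale`, and `speaking_rate` are hypothetical and chosen only for illustration.

```python
import math
import random


def sample_durations(base_durations, noise_scale=0.8, speaking_rate=1.0, seed=None):
    """Toy stand-in for a stochastic duration predictor.

    Each token's duration (in frames) is multiplied by exp(noise), where the
    noise is Gaussian and scaled by `noise_scale`, then divided by
    `speaking_rate` and rounded up. Every token gets at least one frame.
    """
    rng = random.Random(seed)
    durations = []
    for frames in base_durations:
        noise = rng.gauss(0.0, 1.0) * noise_scale
        scaled = frames * math.exp(noise) / speaking_rate
        durations.append(max(1, math.ceil(scaled)))
    return durations


# Hypothetical per-token durations for a four-token input.
base = [3.0, 5.0, 2.0, 4.0]

print(sample_durations(base, seed=1))            # one possible rhythm
print(sample_durations(base, seed=2))            # typically a different rhythm
print(sample_durations(base, noise_scale=0.0))   # no noise: [3, 5, 2, 4]
```

Setting `noise_scale` to zero removes the randomness, and reusing a seed reproduces a given rhythm, which mirrors the trade-off between varied and repeatable output.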

Core Capabilities

  • High-quality English speech synthesis
  • Non-deterministic output generation
  • Variable speech rhythm generation
  • Easy integration with Transformers library (v4.33+)
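
As a minimal sketch of the Transformers integration, the snippet below loads the checkpoint with `VitsModel` and `AutoTokenizer`, synthesizes a waveform, and writes it to a WAV file using only the standard library. It assumes the checkpoint is published on the Hugging Face Hub as `facebook/mms-tts-eng` and that `torch` and `transformers` are installed; `synthesize` and `write_wav` are hypothetical helper names, not part of any library.

```python
import struct
import wave


def synthesize(text, model_id="facebook/mms-tts-eng", seed=None):
    """Return (samples, sampling_rate) for `text`.

    The heavy dependencies are imported inside the function so that
    `write_wav` below can be used without them. The default model ID is
    an assumption about the checkpoint's Hub name.
    """
    import torch
    from transformers import AutoTokenizer, VitsModel, set_seed

    if seed is not None:
        set_seed(seed)  # fix the random noise so the output is repeatable

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = VitsModel.from_pretrained(model_id)
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        waveform = model(**inputs).waveform  # shape: (batch, num_samples)
    return waveform[0].tolist(), model.config.sampling_rate


def write_wav(path, samples, sampling_rate):
    """Write float samples in [-1.0, 1.0] as 16-bit mono PCM."""
    clipped = [max(-1.0, min(1.0, s)) for s in samples]
    pcm = struct.pack("<%dh" % len(clipped), *(int(s * 32767) for s in clipped))
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)
        wav_file.setframerate(sampling_rate)
        wav_file.writeframes(pcm)
```

Calling `samples, rate = synthesize("Hello from MMS.", seed=555)` and then `write_wav("speech.wav", samples, rate)` would download the checkpoint on first use and save the result. Because generation is non-deterministic, passing a seed is the simplest way to get the same audio on repeated runs.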

Frequently Asked Questions

Q: What makes this model unique?

The stochastic duration predictor samples durations rather than predicting a single fixed value, so the same text can be rendered with different rhythms across runs, which a deterministic TTS model cannot do. The model is also one checkpoint in the larger MMS project, which targets more than 1,000 languages.

Q: What are the recommended use cases?

This model suits applications that need English text-to-speech conversion, including accessibility tools, automated content reading, and voice assistants. Because it is released under the CC-BY-NC 4.0 license, it may be used for non-commercial purposes only.
