Mixer-TTS for efficient TTS
- Host: GitHub
- URL: https://github.com/nipponjo/mixer-tts-pytorch
- Owner: nipponjo
- Created: 2025-04-12T10:33:25.000Z (6 months ago)
- Default Branch: main
- Last Pushed: 2025-04-12T15:11:58.000Z (6 months ago)
- Last Synced: 2025-09-07T14:44:36.355Z (about 2 months ago)
- Topics: deep-learning, gan, ljspeech, mixer-tts, python, pytorch, speech, speech-synthesis, text-to-speech, torchaudio, tts, tts-model, voice-synthesis
- Language: Jupyter Notebook
- Homepage:
- Size: 1.21 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# mixer-tts-pytorch
[[Samples]](https://nipponjo.github.io/tts-mixer-samples/)
This repo contains an implementation of the Mixer-TTS model ([https://arxiv.org/abs/2110.03584](https://arxiv.org/abs/2110.03584)).
Pre-trained weights are available for the [LJ Speech](https://keithito.com/LJ-Speech-Dataset/) dataset.
The channel dimensions of the convolutions inside the model were chosen as 384, 128, and 80, resulting in models with 20.6M, 3.17M, and 1.74M parameters. The pre-trained models take IPA symbols as input. To install `phonemizer` and the `espeak-ng` backend, see the [phonemizer installation guide](https://bootphon.github.io/phonemizer/install.html).
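As a rough illustration (not the repo's exact preprocessing, which may handle punctuation and separators differently), phonemizing text to IPA with `phonemizer` and the `espeak` backend looks like this:

```python
# Convert text to IPA symbols with phonemizer's espeak-ng backend.
from phonemizer import phonemize

ipa = phonemize(
    "Printing, in the only sense with which we are at present concerned.",
    language="en-us",   # LJ Speech is US English
    backend="espeak",   # requires espeak-ng to be installed
    strip=True,         # drop the trailing separator
)
print(ipa)  # e.g. "pɹɪntɪŋ ɪn ðɪ oʊnli sɛns ..."
```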
A simple patch-based discriminator was used in training to generate more natural mel spectrograms.
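The README does not spell out the discriminator architecture, so the following is only a minimal PatchGAN-style sketch of the idea: instead of a single real/fake score per spectrogram, a small stack of strided convolutions emits a grid of per-patch logits.

```python
import torch
import torch.nn as nn

class PatchDiscriminator(nn.Module):
    """PatchGAN-style discriminator: classifies overlapping patches of a
    mel spectrogram as real or fake rather than the whole input at once.
    An illustrative sketch, not this repo's actual discriminator."""
    def __init__(self, channels=(16, 32, 64)):
        super().__init__()
        layers, in_ch = [], 1
        for out_ch in channels:
            layers += [
                nn.Conv2d(in_ch, out_ch, kernel_size=4, stride=2, padding=1),
                nn.LeakyReLU(0.2),
            ]
            in_ch = out_ch
        layers.append(nn.Conv2d(in_ch, 1, kernel_size=3, padding=1))  # per-patch logits
        self.net = nn.Sequential(*layers)

    def forward(self, mel):  # mel: (batch, n_mels, frames)
        # Treat the mel as a 1-channel image; each output cell scores one patch.
        return self.net(mel.unsqueeze(1))

disc = PatchDiscriminator()
logits = disc(torch.randn(2, 80, 200))
print(logits.shape)  # torch.Size([2, 1, 10, 25])
```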
Audio samples are available [here](https://nipponjo.github.io/tts-mixer-samples/).
**Pre-trained models**
All three checkpoint files can be downloaded by running `python download_files.py`.
|Dataset|dim|params|name|link|
|-------|---|------|-----|---|
|LJSpeech|80|1.74M|mixer_lj_80|[link](https://drive.google.com/file/d/1YTiA6S3okiuX-_AttUhJNVgiPzVYAyjv/view?usp=sharing)|
|LJSpeech|128|3.17M|mixer_lj_128|[link](https://drive.google.com/file/d/1wVvOyaBLxqrKAssXmEYG9mszZsqEaX5R/view?usp=sharing)|
|LJSpeech|384|20.6M|mixer_lj_384|[link](https://drive.google.com/file/d/16Rq99ZmXVfiDE_nsxmUBzF3XKEOUh5wx/view?usp=sharing)|

The pre-trained models output the 80-channel mel-spectrogram format first proposed for the [HiFi-GAN](https://github.com/jik876/hifi-gan) vocoder.
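Loading and inference details aren't documented above; the sketch below is a hypothetical outline (the file name, checkpoint layout, and `infer` method are assumptions, not the repo's confirmed API) of how the pieces would chain together:

```python
import torch

# Hypothetical sketch -- file name, checkpoint layout, and method names
# are assumptions for illustration only.
ckpt = torch.load("mixer_lj_80.pt", map_location="cpu")
print(type(ckpt))  # inspect what the checkpoint actually contains

# Typical flow once the acoustic model is reconstructed:
#   mel = model.infer(ipa_token_ids)  # -> (1, 80, frames) mel spectrogram
#   audio = vocoder(mel)              # any HiFi-GAN trained on the same
#                                     # 80-channel mel configuration
```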
**References**
The model was extracted from NVIDIA's [NeMo](https://github.com/NVIDIA/NeMo) framework to make it easier to modify and to reduce its dependencies. An energy embedding and optional speaker and emotion embeddings have been added.
[Mixer-TTS in NeMo](https://github.com/NVIDIA/NeMo/blob/main/nemo/collections/tts/models/mixer_tts.py)
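The conditioning mechanism for the added speaker and emotion embeddings isn't described here; one common pattern, sketched below purely as an assumption, is to broadcast-add learned embeddings onto the text-encoder output:

```python
import torch
import torch.nn as nn

class ConditionedEncoder(nn.Module):
    """Illustrative conditioning: broadcast-add learned speaker and emotion
    embeddings onto the encoder output. An assumption for illustration,
    not necessarily how this repo wires them in."""
    def __init__(self, d_model=384, n_speakers=4, n_emotions=5):
        super().__init__()
        self.spk_emb = nn.Embedding(n_speakers, d_model)
        self.emo_emb = nn.Embedding(n_emotions, d_model)

    def forward(self, enc_out, speaker_id, emotion_id):
        # enc_out: (batch, tokens, d_model); ids: (batch,)
        cond = self.spk_emb(speaker_id) + self.emo_emb(emotion_id)
        return enc_out + cond.unsqueeze(1)  # broadcast over the token axis

enc_out = torch.randn(2, 17, 384)
cond = ConditionedEncoder()
out = cond(enc_out, torch.tensor([0, 1]), torch.tensor([2, 0]))
print(out.shape)  # torch.Size([2, 17, 384])
```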
Paper:
```
@article{Tatanov2021MixerTTSNF,
title={Mixer-TTS: Non-Autoregressive, Fast and Compact Text-to-Speech Model Conditioned on Language Model Embeddings},
author={Oktai Tatanov and Stanislav Beliaev and Boris Ginsburg},
journal={ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
year={2021},
pages={7482-7486},
}
```