Mixer-TTS for efficient TTS
- Host: GitHub
- URL: https://github.com/nipponjo/mixer-tts-pytorch
- Owner: nipponjo
- Created: 2025-04-12T10:33:25.000Z (6 months ago)
- Default Branch: main
- Last Pushed: 2025-04-12T15:11:58.000Z (6 months ago)
- Last Synced: 2025-09-07T14:44:36.355Z (about 2 months ago)
- Topics: deep-learning, gan, ljspeech, mixer-tts, python, pytorch, speech, speech-synthesis, text-to-speech, torchaudio, tts, tts-model, voice-synthesis
- Language: Jupyter Notebook
- Homepage:
- Size: 1.21 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# mixer-tts-pytorch
[[Samples]](https://nipponjo.github.io/tts-mixer-samples/)
This repo contains an implementation of the Mixer-TTS model ([https://arxiv.org/abs/2110.03584](https://arxiv.org/abs/2110.03584)).
Pre-trained weights are available for the [LJ Speech](https://keithito.com/LJ-Speech-Dataset/) dataset.
The channel dimensions of the convolutions inside the model were chosen as 384, 128, and 80, resulting in models with 20.6M, 3.17M, and 1.74M parameters. The pre-trained models take IPA symbols as input. To install `phonemizer` and the `espeak-ng` backend, see the [phonemizer installation guide](https://bootphon.github.io/phonemizer/install.html).
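As a rough illustration (not the repo's exact preprocessing, which may handle punctuation and separators differently), phonemizing text to IPA with `phonemizer` and the `espeak` backend looks like this:

```python
# Convert text to IPA symbols with phonemizer's espeak-ng backend.
from phonemizer import phonemize

ipa = phonemize(
    "Printing, in the only sense with which we are at present concerned.",
    language="en-us",   # LJ Speech is US English
    backend="espeak",   # requires espeak-ng to be installed
    strip=True,         # drop the trailing separator
)
print(ipa)  # e.g. "pɹɪntɪŋ ɪn ðɪ oʊnli sɛns ..."
```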
A simple patch-based discriminator was used in training to generate more natural mel spectrograms.
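The README does not spell out the discriminator architecture, so the following is only a minimal PatchGAN-style sketch of the idea: instead of a single real/fake score per spectrogram, a small stack of strided convolutions emits a grid of per-patch logits.

```python
import torch
import torch.nn as nn

class PatchDiscriminator(nn.Module):
    """PatchGAN-style discriminator: classifies overlapping patches of a
    mel spectrogram as real or fake rather than the whole input at once.
    An illustrative sketch, not this repo's actual discriminator."""
    def __init__(self, channels=(16, 32, 64)):
        super().__init__()
        layers, in_ch = [], 1
        for out_ch in channels:
            layers += [
                nn.Conv2d(in_ch, out_ch, kernel_size=4, stride=2, padding=1),
                nn.LeakyReLU(0.2),
            ]
            in_ch = out_ch
        layers.append(nn.Conv2d(in_ch, 1, kernel_size=3, padding=1))  # per-patch logits
        self.net = nn.Sequential(*layers)

    def forward(self, mel):  # mel: (batch, n_mels, frames)
        # Treat the mel as a 1-channel image; each output cell scores one patch.
        return self.net(mel.unsqueeze(1))

disc = PatchDiscriminator()
logits = disc(torch.randn(2, 80, 200))
print(logits.shape)  # torch.Size([2, 1, 10, 25])
```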
Audio samples are available [here](https://nipponjo.github.io/tts-mixer-samples/).
**Pre-trained models**
All three checkpoint files can be downloaded by running `python download_files.py`.
|Dataset|dim|params|name|link|
|-------|---|------|-----|---|
|LJSpeech|80|1.74M|mixer_lj_80|[link](https://drive.google.com/file/d/1YTiA6S3okiuX-_AttUhJNVgiPzVYAyjv/view?usp=sharing)|
|LJSpeech|128|3.17M|mixer_lj_128|[link](https://drive.google.com/file/d/1wVvOyaBLxqrKAssXmEYG9mszZsqEaX5R/view?usp=sharing)|
|LJSpeech|384|20.6M|mixer_lj_384|[link](https://drive.google.com/file/d/16Rq99ZmXVfiDE_nsxmUBzF3XKEOUh5wx/view?usp=sharing)|

The pre-trained models output the 80-channel mel-spectrogram format first proposed for the [HiFi-GAN](https://github.com/jik876/hifi-gan) vocoder.
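Loading and inference details aren't documented above; the sketch below is a hypothetical outline (the file name, checkpoint layout, and `infer` method are assumptions, not the repo's confirmed API) of how the pieces would chain together:

```python
import torch

# Hypothetical sketch -- file name, checkpoint layout, and method names
# are assumptions for illustration only.
ckpt = torch.load("mixer_lj_80.pt", map_location="cpu")
print(type(ckpt))  # inspect what the checkpoint actually contains

# Typical flow once the acoustic model is reconstructed:
#   mel = model.infer(ipa_token_ids)  # -> (1, 80, frames) mel spectrogram
#   audio = vocoder(mel)              # any HiFi-GAN trained on the same
#                                     # 80-channel mel configuration
```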
**References**
The model was extracted from NVIDIA's [NeMo](https://github.com/NVIDIA/NeMo) framework to make it easier to modify and to reduce its dependencies. An energy embedding and optional speaker and emotion embeddings have been added.
[Mixer-TTS in NeMo](https://github.com/NVIDIA/NeMo/blob/main/nemo/collections/tts/models/mixer_tts.py)
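The conditioning mechanism for the added speaker and emotion embeddings isn't described here; one common pattern, sketched below purely as an assumption, is to broadcast-add learned embeddings onto the text-encoder output:

```python
import torch
import torch.nn as nn

class ConditionedEncoder(nn.Module):
    """Illustrative conditioning: broadcast-add learned speaker and emotion
    embeddings onto the encoder output. An assumption for illustration,
    not necessarily how this repo wires them in."""
    def __init__(self, d_model=384, n_speakers=4, n_emotions=5):
        super().__init__()
        self.spk_emb = nn.Embedding(n_speakers, d_model)
        self.emo_emb = nn.Embedding(n_emotions, d_model)

    def forward(self, enc_out, speaker_id, emotion_id):
        # enc_out: (batch, tokens, d_model); ids: (batch,)
        cond = self.spk_emb(speaker_id) + self.emo_emb(emotion_id)
        return enc_out + cond.unsqueeze(1)  # broadcast over the token axis

enc_out = torch.randn(2, 17, 384)
cond = ConditionedEncoder()
out = cond(enc_out, torch.tensor([0, 1]), torch.tensor([2, 0]))
print(out.shape)  # torch.Size([2, 17, 384])
```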
Paper:
```
@article{Tatanov2021MixerTTSNF,
title={Mixer-TTS: Non-Autoregressive, Fast and Compact Text-to-Speech Model Conditioned on Language Model Embeddings},
author={Oktai Tatanov and Stanislav Beliaev and Boris Ginsburg},
journal={ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
year={2021},
pages={7482-7486},
}
```