https://github.com/lucidrains/spear-tts-pytorch
Implementation of Spear-TTS - multi-speaker text-to-speech attention network, in Pytorch
artificial-intelligence attention deep-learning text-to-speech transformers
- Host: GitHub
- URL: https://github.com/lucidrains/spear-tts-pytorch
- Owner: lucidrains
- License: mit
- Created: 2023-06-19T15:48:42.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2023-10-30T17:24:47.000Z (almost 2 years ago)
- Last Synced: 2025-03-29T00:09:19.941Z (7 months ago)
- Topics: artificial-intelligence, attention, deep-learning, text-to-speech, transformers
- Language: Python
- Homepage:
- Size: 198 KB
- Stars: 268
- Watchers: 27
- Forks: 19
- Open Issues: 5
Metadata Files:
- Readme: README.md
- License: LICENSE
README
## Spear-TTS - Pytorch
Implementation of Spear-TTS - multi-speaker text-to-speech attention network, in Pytorch
The text-to-semantic module built here will be used for SoundStorm for conditioning.
## Appreciation
- Stability for their generous sponsorships to work on and open source cutting edge artificial intelligence research
- Lucas Newman for completing the backtranslation portion, as well as beam search decoding!
- Lucas Newman for completing the final text to semantic transformer training code!
## Install
```bash
$ pip install spear-tts-pytorch
```
## Usage
```python
import torch

from audiolm_pytorch import HubertWithKmeans

from spear_tts_pytorch import (
    TextToSemantic,
    SemanticToTextDatasetGenerator,
    GeneratedAudioTextDataset,
    MockDataset
)

# HuBERT + k-means quantizer, loaded from local checkpoint files
wav2vec = HubertWithKmeans(
    checkpoint_path = './hubert_base_ls960.pt',
    kmeans_path = './hubert_base_ls960_L9_km500.bin'
)

model = TextToSemantic(
    wav2vec = wav2vec,
    dim = 512,
    num_text_token_ids = 256,
    heads = 8,
    target_kv_heads = 2, # grouped query attention, for memory efficient decoding
    source_depth = 1,
    target_depth = 1
)

# mock dataset with 10 samples
ds = MockDataset(10)

# generate a dataset from the mock samples and write it to ./output_folder
dataset_generator = SemanticToTextDatasetGenerator(
    model = model,
    dataset = ds,
    folder = './output_folder'
)

dataset_generator(max_length = 2)

# load the generated dataset back from the same folder
generated_dataset = GeneratedAudioTextDataset(
    folder = './output_folder'
)

assert len(generated_dataset) == 10
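
# --- illustrative sketch, not part of the library's API ---
# Grouped query attention shares each key / value head across a group of query heads.
# Assuming the usual GQA constraint that `heads` is a multiple of `target_kv_heads`,
# the settings passed to TextToSemantic above work out as follows
# (`group_size` is a name made up for this sketch):
heads, target_kv_heads = 8, 2
assert heads % target_kv_heads == 0
group_size = heads // target_kv_heads  # number of query heads sharing each key / value head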
```
## Todo
- [x] add eos logic + generate, and hook up end-to-end generation in soundstorm
- [x] add first pretraining speech-to-speech with the reconstruction of 60% deleted tokens
- [x] add dropouts for this project, as it targets a low-resource setting
- [x] add total flexibility of which layers of encoder / decoder to freeze during training
- [x] add step for training on small speech -> text corpus and generating pseudo-labelled dataset + finetuning (thanks to @lucasnewman)
- [x] add final step of finetuning on text -> speech + pseudolabelled dataset
- [x] figure out the best way to store and manage the pseudo-labelled generated dataset
- [x] batched beam search decoding
- [x] allow for using rotary positions in decoder + flash attention, give Tri another citation
- [x] integrate speculative decoding with some improvisation - done in same model using early exit strategy
- [ ] add cached key / values for starter + single / grouped key values, make sure flash attention can support specialized causal mask before flash attention 2 is in pytorch core
- [ ] polish the audio-text generation workflow
- [ ] concatenate the real audio-text dataset with the generated one, or be able to convert a real audio-text dataset to the generated format
## Citations
```bibtex
@misc{kharitonov2023speak,
title = {Speak, Read and Prompt: High-Fidelity Text-to-Speech with Minimal Supervision},
author = {Eugene Kharitonov and Damien Vincent and Zalán Borsos and Raphaël Marinier and Sertan Girgin and Olivier Pietquin and Matt Sharifi and Marco Tagliasacchi and Neil Zeghidour},
year = {2023},
eprint = {2302.03540},
archivePrefix = {arXiv},
primaryClass = {cs.SD}
}
```
```bibtex
@inproceedings{dao2022flashattention,
title = {Flash{A}ttention: Fast and Memory-Efficient Exact Attention with {IO}-Awareness},
author = {Dao, Tri and Fu, Daniel Y. and Ermon, Stefano and Rudra, Atri and R{\'e}, Christopher},
booktitle = {Advances in Neural Information Processing Systems},
year = {2022}
}
```
```bibtex
@misc{shi2023enhance,
title = {Enhance audio generation controllability through representation similarity regularization},
author = {Yangyang Shi and Gael Le Lan and Varun Nagaraja and Zhaoheng Ni and Xinhao Mei and Ernie Chang and Forrest Iandola and Yang Liu and Vikas Chandra},
year = {2023},
eprint = {2309.08773},
archivePrefix = {arXiv},
primaryClass = {cs.SD}
}
```
```bibtex
@article{Ainslie2023GQATG,
title = {GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints},
author = {Joshua Ainslie and James Lee-Thorp and Michiel de Jong and Yury Zemlyanskiy and Federico Lebr{\'o}n and Sumit K. Sanghai},
journal = {ArXiv},
year = {2023},
volume = {abs/2305.13245},
url = {https://api.semanticscholar.org/CorpusID:258833177}
}
```
```bibtex
@inproceedings{Leviathan2022FastIF,
title = {Fast Inference from Transformers via Speculative Decoding},
author = {Yaniv Leviathan and Matan Kalman and Y. Matias},
booktitle = {International Conference on Machine Learning},
year = {2022},
url = {https://api.semanticscholar.org/CorpusID:254096365}
}
```