Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
A state-of-the-art end-to-end Text-to-Speech model that directly generates waveforms from text.
- Host: GitHub
- URL: https://github.com/soheil-mp/vits-conditional-variational-autoencoder-with-adversarial-learning-
- Owner: soheil-mp
- Created: 2024-12-05T16:46:24.000Z (29 days ago)
- Default Branch: master
- Last Pushed: 2024-12-05T16:48:25.000Z (29 days ago)
- Last Synced: 2024-12-09T09:58:04.469Z (25 days ago)
- Language: Python
- Size: 39.1 KB
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# VITS: State-of-the-Art Text-to-Speech Implementation
[![PyTorch](https://img.shields.io/badge/PyTorch-2.2%2B-orange?style=flat-square&logo=pytorch)](https://pytorch.org/)
[![License](https://img.shields.io/badge/license-MIT-blue?style=flat-square)](LICENSE)
[![Python](https://img.shields.io/badge/python-3.8%2B-blue?style=flat-square&logo=python)](https://www.python.org/)
[![arXiv](https://img.shields.io/badge/arXiv-2106.06103-b31b1b.svg?style=flat-square)](https://arxiv.org/abs/2106.06103)

*A PyTorch implementation of VITS: Conditional Variational Autoencoder with Adversarial Learning*
[Features](#features) • [Installation](#installation) • [Quick Start](#quick-start) • [Training](#training)
## Overview
This project implements VITS (Conditional Variational Autoencoder with Adversarial Learning), a state-of-the-art end-to-end Text-to-Speech model that directly generates waveforms from text. Key features include:
- End-to-end text-to-speech synthesis
- Parallel, non-autoregressive sampling for fast inference
- High-quality audio generation
- Multi-speaker support
- Emotion and style control

## Requirements
- Python 3.8+
- CUDA-compatible GPU (8GB+ VRAM)
- 16GB+ RAM
- 50GB+ disk space

## Installation
1. **Create and activate virtual environment**:
```bash
python -m venv venv
# Linux/Mac
source venv/bin/activate
# Windows
.\venv\Scripts\activate
```

2. **Install PyTorch**:
```bash
# Windows/Linux with CUDA 11.8
pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu118
# CPU only
pip install torch torchaudio
```

3. **Install dependencies**:
```bash
pip install -r requirements.txt
```

4. **Verify installation**:
```bash
python -c "import torch; print(f'PyTorch: {torch.__version__}, CUDA: {torch.cuda.is_available()}')"
```

## Dataset Preparation
### Linux/macOS
```bash
mkdir -p data/raw/LJSpeech-1.1
wget https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2 -P data/raw
tar -xvf data/raw/LJSpeech-1.1.tar.bz2 -C data/raw
rm data/raw/LJSpeech-1.1.tar.bz2
```

### Windows (PowerShell)
```powershell
New-Item -ItemType Directory -Force -Path "data\raw\LJSpeech-1.1"
Invoke-WebRequest -Uri "https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2" -OutFile "data\raw\LJSpeech-1.1.tar.bz2"
& 'C:\Program Files\7-Zip\7z.exe' x "data\raw\LJSpeech-1.1.tar.bz2" -o"data\raw"
& 'C:\Program Files\7-Zip\7z.exe' x "data\raw\LJSpeech-1.1.tar" -o"data\raw"
Remove-Item "data\raw\LJSpeech-1.1.tar*"
```

### Python (Cross-platform)
```python
import requests
import tarfile
from pathlib import Path

data_dir = Path("data/raw/LJSpeech-1.1")
data_dir.mkdir(parents=True, exist_ok=True)

url = "https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2"
archive_path = data_dir.parent / "LJSpeech-1.1.tar.bz2"

print("Downloading LJSpeech dataset...")
response = requests.get(url, stream=True)
with open(archive_path, "wb") as f:
    for chunk in response.iter_content(chunk_size=8192):
        f.write(chunk)

print("Extracting dataset...")
with tarfile.open(archive_path, "r:bz2") as tar:
    tar.extractall(path=data_dir.parent)
archive_path.unlink()
```
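As a quick sanity check (not part of this repo's scripts), you can confirm the extraction by counting audio files and transcript lines; LJSpeech 1.1 ships 13,100 clips, and both counts should match that number:

```python
from pathlib import Path

data_dir = Path("data/raw/LJSpeech-1.1")
wav_count = len(list((data_dir / "wavs").glob("*.wav")))
meta_lines = (data_dir / "metadata.csv").read_text(encoding="utf-8").splitlines()

# LJSpeech 1.1 contains 13,100 clips; both numbers should equal that.
print(f"wav files: {wav_count}, transcript lines: {len(meta_lines)}")
```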
## Training

1. **Prepare dataset**:
```bash
python scripts/prepare_dataset.py --config configs/vits_config.yaml
```

2. **Start training**:
```bash
# Single GPU
python scripts/train.py --config configs/vits_config.yaml

# Multi-GPU (e.g., 4 GPUs); a DDP sketch follows these steps
python scripts/train.py --config configs/vits_config.yaml --world_size 4
```

3. **Monitor training**:
```bash
# TensorBoard
tensorboard --logdir data/logs

# Weights & Biases monitoring is automatic if enabled in config
```
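The `--world_size` flag above suggests one training process per GPU. A minimal sketch of that pattern with `torch.multiprocessing` and DistributedDataParallel (illustrative only; the internals of this repo's `scripts/train.py` are not shown in the README):

```python
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP

def train_worker(rank: int, world_size: int):
    # One process per GPU; NCCL is the usual backend for CUDA training.
    dist.init_process_group(
        backend="nccl",
        init_method="tcp://127.0.0.1:29500",
        rank=rank,
        world_size=world_size,
    )
    torch.cuda.set_device(rank)
    model = torch.nn.Linear(80, 80).cuda(rank)  # stand-in for the VITS model
    model = DDP(model, device_ids=[rank])
    # ... build a DataLoader with DistributedSampler and run the training loop ...
    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = 4  # e.g., 4 GPUs
    mp.spawn(train_worker, args=(world_size,), nprocs=world_size)
```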
## Inference

```python
from src.inference import VITS

# Initialize model
vits = VITS(checkpoint="path/to/checkpoint")

# Basic synthesis
audio = vits.synthesize(
    text="Hello, world!",
    speaker_id=0,
    speed_factor=1.0,
)

# Save audio
vits.save_audio(audio, "output.wav")

# Batch processing
texts = [
    "First sentence.",
    "Second sentence.",
    "Third sentence.",
]
audios = vits.synthesize_batch(texts, speaker_id=0)
```
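Each clip from the batch call can then be written out with the same `save_audio` helper (assuming, as the example suggests, that `synthesize_batch` returns one audio array per input text):

```python
# Write each synthesized clip to its own numbered file.
for i, audio in enumerate(audios):
    vits.save_audio(audio, f"output_{i:03d}.wav")
```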
## Model Architecture

```
Text → [Text Encoder] → Hidden States
                            ↓
                   [Posterior Encoder]
                            ↓
                     [Flow Decoder] → Audio
                            ↓
             [Multi-Period Discriminator]
              [Multi-Scale Discriminator]
```

Key components (a simplified module skeleton follows this list):
1. **Text Encoder**: Transformer-based with multi-head attention
2. **Flow Decoder**: Normalizing flows with residual coupling
3. **Posterior Encoder**: WaveNet-style architecture
4. **Discriminators**: Multi-period and multi-scale for quality
5. **Voice Conversion**: Optional cross-speaker style transfer
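To make the diagram concrete, here is a heavily simplified, hypothetical PyTorch skeleton of the encoder/decoder pieces (names, layer counts, and shapes are illustrative and do not mirror this repo's actual classes; the normalizing flow and the discriminators are omitted for brevity):

```python
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    """Transformer-style encoder: phoneme IDs -> hidden states."""
    def __init__(self, n_symbols=256, d_model=192):
        super().__init__()
        self.embed = nn.Embedding(n_symbols, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=2, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)

    def forward(self, tokens):                   # (B, T_text)
        return self.encoder(self.embed(tokens))  # (B, T_text, d_model)

class PosteriorEncoder(nn.Module):
    """WaveNet-style stack: spectrogram -> sampled latent sequence."""
    def __init__(self, n_mels=80, d_latent=192):
        super().__init__()
        self.conv = nn.Conv1d(n_mels, 2 * d_latent, kernel_size=5, padding=2)

    def forward(self, spec):                     # (B, n_mels, T_spec)
        mean, logvar = self.conv(spec).chunk(2, dim=1)
        # Reparameterization trick: sample z ~ N(mean, exp(logvar))
        return mean + torch.randn_like(mean) * torch.exp(0.5 * logvar)

class WaveformDecoder(nn.Module):
    """Upsampling decoder: latents -> raw waveform (HiFi-GAN-like)."""
    def __init__(self, d_latent=192, upsample=256):
        super().__init__()
        self.up = nn.ConvTranspose1d(d_latent, 1, kernel_size=upsample, stride=upsample)

    def forward(self, z):                        # (B, d_latent, T_spec)
        return torch.tanh(self.up(z))            # (B, 1, T_audio)

# Shape check with dummy inputs.
tokens = torch.randint(0, 256, (1, 32))
spec = torch.randn(1, 80, 100)
hidden = TextEncoder()(tokens)
z = PosteriorEncoder()(spec)
audio = WaveformDecoder()(z)
print(hidden.shape, z.shape, audio.shape)  # (1, 32, 192) (1, 192, 100) (1, 1, 25600)
```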
## Troubleshooting

### Common Issues
1. **Out of Memory (OOM)** (see the sketch after this list):
   - Reduce batch size in the config
   - Enable gradient accumulation
   - Use mixed precision (fp16)

2. **Poor Audio Quality**:
   - Check preprocessing parameters
   - Verify loss convergence
   - Ensure proper normalization

3. **Slow Training**:
   - Enable mixed precision
   - Use DDP for multi-GPU training
   - Optimize dataloader workers
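For the OOM mitigations above, a minimal sketch of mixed precision combined with gradient accumulation in PyTorch (illustrative; this README does not document the repo's config keys for these options):

```python
import torch

model = torch.nn.Linear(80, 80).cuda()            # stand-in for the VITS generator
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4)
scaler = torch.cuda.amp.GradScaler()
accum_steps = 4                                   # effective batch = batch size * 4
loader = [torch.randn(8, 80) for _ in range(16)]  # stand-in for a DataLoader

for step, batch in enumerate(loader):
    with torch.cuda.amp.autocast():               # fp16 forward pass
        loss = model(batch.cuda()).pow(2).mean() / accum_steps
    scaler.scale(loss).backward()                 # scaled, fp16-safe backward
    if (step + 1) % accum_steps == 0:
        scaler.step(optimizer)                    # unscales grads, then steps
        scaler.update()
        optimizer.zero_grad(set_to_none=True)
```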
## Citation

```bibtex
@inproceedings{kim2021vits,
title={Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech},
author={Kim, Jaehyeon and Kong, Jungil and Son, Juhee},
booktitle={International Conference on Machine Learning},
year={2021}
}
```

## License
MIT License - see [LICENSE](LICENSE) file
## Acknowledgments
- [Official VITS Implementation](https://github.com/jaywalnut310/vits)
- [LJSpeech Dataset](https://keithito.com/LJ-Speech-Dataset/)
- [PyTorch](https://pytorch.org/)
Made with ❤️ by the TTS Team

[Report Bug](https://github.com/yourusername/tts/issues) · [Request Feature](https://github.com/yourusername/tts/issues)