# XTTS-2-UI: A User Interface for XTTS-2 Text-Based Voice Cloning

This repository contains the essential code for cloning any voice using just text and a 10-second audio sample of the target voice. XTTS-2-UI is simple to set up and use. [Example Results 🔊](#sample-audio-examples)

Works in [16 languages](#language-support) and has built-in voice recording/uploading.
Note: Don't expect EL-level quality; it is not there yet.

## Model
The model used is `tts_models/multilingual/multi-dataset/xtts_v2`. For more details, refer to [Hugging Face - XTTS-v2](https://huggingface.co/coqui/XTTS-v2) and its specific version [XTTS-v2 Version 2.0.2](https://huggingface.co/coqui/XTTS-v2/tree/v2.0.2).
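If you would rather script the model directly than go through the UI, here is a minimal sketch using the Coqui TTS Python API. The speaker file name is a placeholder; check the call signature against your installed TTS version.

```python
import torch
from TTS.api import TTS

device = "cuda" if torch.cuda.is_available() else "cpu"

# The model is downloaded automatically on first use (see Setup for cache paths).
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to(device)

# Clone the voice in the reference WAV and synthesize a sentence.
tts.tts_to_file(
    text="Hello, this is a cloned voice.",
    speaker_wav="targets/your_speaker.wav",  # placeholder reference clip
    language="en",
    file_path="output.wav",
)
```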



## Table of Contents

- [XTTS-2-UI: A User Interface for XTTS-2 Text-Based Voice Cloning](#xtts-2-ui-a-user-interface-for-xtts-2-text-based-voice-cloning)
- [Model](#model)
- [Table of Contents](#table-of-contents)
- [Setup](#setup)
- [Inference](#inference)
- [Target Voices Dataset](#target-voices-dataset)
- [Sample Audio Examples](#sample-audio-examples)
- [Language Support](#language-support)
- [Notes](#notes)
- [Credits](#credits)

## Setup

To set up this project, follow these steps in a terminal:

1. **Clone the Repository**

   - Clone the repository to your local machine:

     ```bash
     git clone https://github.com/pbanuru/xtts2-ui.git
     cd xtts2-ui
     ```

2. **Create a Virtual Environment:**

   - Run the following command to create a Python virtual environment:

     ```bash
     python -m venv venv
     ```

   - Activate the virtual environment:
     - Windows:

       ```bash
       # cmd prompt
       venv\Scripts\activate
       ```

       or

       ```bash
       # git bash
       source venv/Scripts/activate
       ```

     - Linux/Mac:

       ```bash
       source venv/bin/activate
       ```

3. **Install PyTorch:**

   - If you have an Nvidia CUDA-enabled GPU, first check your CUDA version:

     ```bash
     nvcc --version
     ```

     Then choose the matching installation command:
     - For CUDA 12.1:

       ```bash
       pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
       ```

     - For CUDA 11.8:

       ```bash
       pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
       ```

   - If you don't have a CUDA-enabled GPU, follow the instructions on the [PyTorch website](https://pytorch.org/get-started/locally/) to install the appropriate version of PyTorch for your system.

4. **Install Other Required Packages:**

   - Install direct dependencies:

     ```bash
     pip install -r requirements.txt
     ```

   - Upgrade the TTS package to the latest version:

     ```bash
     pip install --upgrade TTS
     ```

After completing these steps, the environment is ready and you can start using the project.
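As an optional sanity check, you can confirm that PyTorch sees your GPU and that the TTS package imports cleanly:

```python
# Optional environment check for the setup above.
import importlib.metadata

import torch
import TTS  # noqa: F401  (import check only)

print("TTS version:", importlib.metadata.version("TTS"))
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
```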

Models will be downloaded automatically upon first use.

Download paths:
- macOS: `/Users/USR/Library/Application Support/tts/tts_models--multilingual--multi-dataset--xtts_v2`
- Windows: `C:\Users\YOUR-USER-ACCOUNT\AppData\Local\tts\tts_models--multilingual--multi-dataset--xtts_v2`
- Linux: `/home/${USER}/.local/share/tts/tts_models--multilingual--multi-dataset--xtts_v2`
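If you want to check from code whether the model has already been cached, here is a small sketch built from the paths listed above (the exact layout may differ between TTS versions):

```python
import os
import platform
from pathlib import Path

MODEL_DIR = "tts_models--multilingual--multi-dataset--xtts_v2"

def xtts_cache_dir() -> Path:
    """Return the expected XTTS-v2 cache directory for this platform."""
    system = platform.system()
    if system == "Darwin":
        base = Path.home() / "Library" / "Application Support" / "tts"
    elif system == "Windows":
        base = Path(os.environ["LOCALAPPDATA"]) / "tts"
    else:  # Linux and others
        base = Path.home() / ".local" / "share" / "tts"
    return base / MODEL_DIR

print("Model already cached:", xtts_cache_dir().exists())
```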

## Inference
To run the application:

```bash
python app.py
# or
streamlit run app2.py
```
Alternatively, you can run inference directly from the terminal by providing input texts in `texts.json`, generating multiple audio files for multiple speakers (you may need to adjust `appTerminal.py`):
```bash
python appTerminal.py
```
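As a rough sketch of what such a batch run can look like (the `texts.json` structure and output names below are assumptions; adapt them to the actual files in this repo):

```python
import json
from pathlib import Path

from TTS.api import TTS

# Assumed structure: {"texts": ["First sentence.", "Second sentence."]}
texts = json.loads(Path("texts.json").read_text(encoding="utf-8"))["texts"]
speakers = sorted(Path("targets").glob("*.wav"))

Path("outputs").mkdir(exist_ok=True)
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")

for speaker in speakers:
    for i, text in enumerate(texts):
        tts.tts_to_file(
            text=text,
            speaker_wav=str(speaker),
            language="en",  # assumption: adjust per text
            file_path=f"outputs/{speaker.stem}_{i}.wav",
        )
```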
On initial use, you will need to agree to the terms:

```
[XTTS] Loading XTTS...
> tts_models/multilingual/multi-dataset/xtts_v2 has been updated, clearing model cache...
> You must agree to the terms of service to use this model.
| > Please see the terms of service at https://coqui.ai/cpml.txt
| > "I have read, understood and agreed to the Terms and Conditions." - [y/n]
| | >
```

If your model is re-downloading each run, please consult [Issue 4723 on GitHub](https://github.com/oobabooga/text-generation-webui/issues/4723#issuecomment-1826120220).

## Target Voices Dataset
The dataset consists of a single folder named `targets`, pre-populated with several voices for testing purposes.

To add more voices (if you don't want to go through the GUI), create a 24 kHz WAV file of approximately 10 seconds and place it under the `targets` folder.
You can use yt-dlp to download a voice from YouTube for cloning:
```bash
yt-dlp -x --audio-format wav "https://www.youtube.com/watch?"
```
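Downloaded audio is rarely 24 kHz out of the box; one way to resample and trim a clip with torchaudio (installed during Setup) before dropping it into `targets` (file names are placeholders):

```python
import torchaudio
import torchaudio.functional as F

TARGET_SR = 24000

waveform, sr = torchaudio.load("downloaded_clip.wav")  # placeholder input file
if sr != TARGET_SR:
    waveform = F.resample(waveform, orig_freq=sr, new_freq=TARGET_SR)

# Keep roughly the first 10 seconds, as suggested above.
waveform = waveform[:, : TARGET_SR * 10]
torchaudio.save("targets/new_speaker.wav", waveform, TARGET_SR)
```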

## Sample Audio Examples

| Language | Audio Sample Link |
|----------|-------------------|
| English | [▶️](demo_info/Rogger_sample_en.wav) |
| Russian | [▶️](demo_info/Rogger_sample_ru.wav) |
| Arabic | [▶️](demo_info/Rogger_sample_aa.wav) |

## Language Support
Arabic, Chinese, Czech, Dutch, English, French, German, Hungarian, Italian, Japanese [(see Notes)](#notes), Korean, Polish, Portuguese, Russian, Spanish, Turkish
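When calling the model from code, the language is selected via a short code passed as the `language` argument. The mapping below reflects the codes XTTS v2 generally expects; verify it against your installed TTS version:

```python
# Assumed XTTS-v2 language codes; confirm against your TTS install.
LANGUAGE_CODES = {
    "Arabic": "ar", "Chinese": "zh-cn", "Czech": "cs", "Dutch": "nl",
    "English": "en", "French": "fr", "German": "de", "Hungarian": "hu",
    "Italian": "it", "Japanese": "ja", "Korean": "ko", "Polish": "pl",
    "Portuguese": "pt", "Russian": "ru", "Spanish": "es", "Turkish": "tr",
}
```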

## Notes
If you would like to select **Japanese** as the target language, you must install a dictionary.
```bash
# Lite version
pip install fugashi[unidic-lite]
```
or for more serious processing:
```bash
# Full version
pip install fugashi[unidic]
python -m unidic download
```
More details [here](https://github.com/polm/fugashi#installing-a-dictionary).

## Credits
1. Heavily based on https://github.com/kanttouchthis/text_generation_webui_xtts/