# πŸŽ™οΈ Chatterbox TTS Colab - Easy Voice Cloning & Text-to-Speech

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1o_PnrXpxvAYozOYtnid74eqbHyOD9A45?usp=sharing)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Python 3.8+](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/)
[![GitHub stars](https://img.shields.io/github/stars/UKR-PROJECTS/chatterbox-tts-colab.svg?style=social&label=Star)](https://github.com/UKR-PROJECTS/chatterbox-tts-colab)

> 🚀 **One-click voice cloning and text-to-speech in Google Colab with Chatterbox TTS**

Transform any text into natural-sounding speech, clone voices from audio samples, and create professional voiceovers - all running free in Google Colab!

## 🚀 Quick Start
1. Click the "Open in Colab" button above
2. Run all cells in the notebook
3. Upload your voice sample (optional)
4. Enter your text and generate speech!

## ✨ Features
- 🎯 **Zero Setup**: Run immediately in Google Colab
- 🗣️ **Voice Cloning**: Clone any voice from a short audio sample
- 🌍 **Multilingual**: Support for multiple languages
- 🎛️ **Advanced Controls**: Fine-tune voice characteristics
- 💾 **Google Drive Integration**: Automatic saving to your drive
- 🔧 **Robust Error Handling**: Graceful fallbacks and clear error messages

## 🔊 Demo: Text & Audio Samples

Here's a quick demo so you can see (and hear) how Chatterbox-TTS-Colab performs.

---

### πŸ“ Sample Text
> β€œThis is a test of the Chatterbox TTS system. I hope this works properly now with the improved error handling and correct repository. The model should now load from ResembleAI/chatterbox instead of the old fluffyox repository.”

---

### 🎤 Original Voice Clip (for cloning)

https://github.com/user-attachments/assets/b34c7eb1-8fda-46c9-a62f-d94318d9f12a

---

### 🤖 AI-Generated TTS Output

https://github.com/user-attachments/assets/7ff42492-8928-41af-8d9a-d5e952566cbe

---

## 📦 Installation

The Colab notebook handles all installations automatically. If you want to run locally:

```bash
# Install required packages
pip install chatterbox-tts
pip install torch torchaudio
pip install gradio
pip install librosa soundfile

# Google Drive integration is handled by the google.colab package,
# which is preinstalled in Colab runtimes and not needed for local use
```

## 🎯 Usage

### Basic Text-to-Speech

```python
from chatterbox.tts import ChatterboxTTS
import torchaudio as ta

# Initialize the model
model = ChatterboxTTS.from_pretrained(device="cuda")

# Generate speech from text
text = "Hello world! This is Chatterbox TTS in action."
wav = model.generate(text)

# Save the audio
ta.save("output.wav", wav, model.sr)
```
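
The example above assumes a GPU runtime. A minimal variant that auto-selects the device, so the same code also runs on a CPU-only session (just more slowly):

```python
import torch
from chatterbox.tts import ChatterboxTTS

# Fall back to CPU when CUDA is unavailable (generation will be slower)
device = "cuda" if torch.cuda.is_available() else "cpu"
model = ChatterboxTTS.from_pretrained(device=device)
```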

### Voice Cloning

```python
# Clone a voice using reference audio
AUDIO_PROMPT_PATH = "path/to/your/reference_audio.wav"
text = "This text will be spoken in the cloned voice."

wav = model.generate(
    text,
    audio_prompt_path=AUDIO_PROMPT_PATH,
    exaggeration=0.5,  # Emotion intensity (0.0-1.0)
    cfg=0.5            # Classifier-free guidance (0.0-1.0)
)

ta.save("cloned_voice_output.wav", wav, model.sr)
```

### Batch Processing

```python
# Process multiple texts
texts = [
    "First sentence to synthesize.",
    "Second sentence with different content.",
    "Third sentence for batch processing."
]

for i, text in enumerate(texts):
    wav = model.generate(text, audio_prompt_path=AUDIO_PROMPT_PATH)
    ta.save(f"batch_output_{i}.wav", wav, model.sr)
```

## πŸŽ›οΈ Advanced Controls

### Emotion and Intensity Control

Chatterbox TTS offers unique emotion exaggeration control:

```python
# Subtle, natural speech
wav = model.generate(text, exaggeration=0.3, cfg=0.5)

# More dramatic, expressive speech
wav = model.generate(text, exaggeration=0.8, cfg=0.3)

# Highly exaggerated, theatrical speech
wav = model.generate(text, exaggeration=1.0, cfg=0.2)
```

### Parameter Guide

| Parameter | Range | Description | Recommended Use |
|-----------|-------|-------------|-----------------|
| `exaggeration` | 0.0-1.0 | Controls emotional intensity and expressiveness | 0.5 for natural speech, 0.7+ for dramatic |
| `cfg` | 0.0-1.0 | Classifier-free guidance for speech pacing | 0.5 for normal, 0.3 for slower pacing |
| `temperature` | 0.1-2.0 | Controls randomness in generation | 0.7 for balanced, 1.0+ for more variation |
| `top_p` | 0.1-1.0 | Nucleus sampling parameter | 0.9 for most cases |
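
To pick values from the table empirically, a small sketch that sweeps a few `exaggeration`/`cfg` combinations (reusing the `model` and `ta` names from the Usage section) renders the same line for side-by-side listening:

```python
# Render one sentence at several parameter combinations for A/B listening
text = "Parameter sweeps make it easy to pick a voice style."

for exaggeration in (0.3, 0.5, 0.8):
    for cfg in (0.3, 0.5, 0.7):
        wav = model.generate(text, exaggeration=exaggeration, cfg=cfg)
        ta.save(f"sweep_ex{exaggeration}_cfg{cfg}.wav", wav, model.sr)
```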

### Audio Quality Settings

```python
# High quality (slower generation)
wav = model.generate(
    text,
    audio_prompt_path=AUDIO_PROMPT_PATH,
    exaggeration=0.5,
    cfg=0.5,
    temperature=0.7,
    top_p=0.9,
    steps=30  # More steps = higher quality
)

# Fast generation (lower quality)
wav = model.generate(
    text,
    audio_prompt_path=AUDIO_PROMPT_PATH,
    steps=15  # Fewer steps = faster generation
)
```

## 🎤 Voice Cloning Guide

### Preparing Reference Audio

For best voice cloning results:

1. **Audio Quality**: Use clear, high-quality audio (WAV or MP3)
2. **Duration**: 3-30 seconds of speech is optimal
3. **Content**: Choose audio with clear pronunciation
4. **Background**: Minimal background noise
5. **Format**: Supported formats: WAV, MP3, FLAC, M4A (see the conversion sketch below)
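
If your reference clip is in a compressed format, the hedged sketch below standardizes it to a mono WAV. It reuses `librosa` and `soundfile` from elsewhere in this README; the 22.05 kHz mono target is a convention for this notebook, not a documented model requirement:

```python
import librosa
import soundfile as sf

def convert_reference(input_path, output_path="reference.wav", target_sr=22050):
    """Convert a reference clip (MP3/FLAC/M4A/...) to a mono WAV file."""
    # librosa decodes compressed formats via its audio backends (ffmpeg may be needed)
    audio, sr = librosa.load(input_path, sr=target_sr, mono=True)
    duration = len(audio) / sr
    if not 3 <= duration <= 30:
        print(f"⚠️ Clip is {duration:.1f}s; 3-30 seconds tends to clone best")
    sf.write(output_path, audio, sr)
    return output_path

# Hypothetical input file name
reference = convert_reference("my_voice.m4a")
```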

### Voice Cloning Tips

```python
# For different speaker types:

# Fast-speaking reference
wav = model.generate(text, audio_prompt_path=path, cfg=0.3, exaggeration=0.5)

# Slow, deliberate speaker
wav = model.generate(text, audio_prompt_path=path, cfg=0.7, exaggeration=0.4)

# Emotional, expressive speaker
wav = model.generate(text, audio_prompt_path=path, cfg=0.3, exaggeration=0.8)

# Professional, neutral speaker
wav = model.generate(text, audio_prompt_path=path, cfg=0.5, exaggeration=0.3)
```

### Audio Preprocessing

```python
import librosa
import soundfile as sf

def preprocess_audio(input_path, output_path):
    """Preprocess audio for better voice cloning"""
    # Load audio
    audio, sr = librosa.load(input_path, sr=22050)

    # Normalize volume
    audio = librosa.util.normalize(audio)

    # Remove silence
    audio, _ = librosa.effects.trim(audio, top_db=20)

    # Save preprocessed audio
    sf.write(output_path, audio, sr)
    return output_path

# Use preprocessed audio for cloning
processed_audio = preprocess_audio("raw_audio.wav", "processed_audio.wav")
wav = model.generate(text, audio_prompt_path=processed_audio)
```

## 💾 Google Drive Integration

### Automatic Saving

```python
from google.colab import drive
import os

# Mount Google Drive
drive.mount('/content/drive')

# Set up directories
output_dir = '/content/drive/MyDrive/ChatterboxTTS_Outputs'
os.makedirs(output_dir, exist_ok=True)

# Save with timestamp
import datetime
timestamp = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
output_path = f"{output_dir}/tts_output_{timestamp}.wav"

wav = model.generate(text, audio_prompt_path=AUDIO_PROMPT_PATH)
ta.save(output_path, wav, model.sr)
print(f"Audio saved to: {output_path}")
```

### Batch Processing with Drive

```python
# Process multiple files from Drive
input_dir = '/content/drive/MyDrive/ChatterboxTTS_Inputs'
output_dir = '/content/drive/MyDrive/ChatterboxTTS_Outputs'

# Read text files
for filename in os.listdir(input_dir):
    if filename.endswith('.txt'):
        with open(os.path.join(input_dir, filename), 'r') as f:
            text = f.read()

        wav = model.generate(text)
        output_path = os.path.join(output_dir, f"{filename[:-4]}.wav")
        ta.save(output_path, wav, model.sr)
```

## 🔧 Troubleshooting

### Common Issues and Solutions

#### 1. CUDA Out of Memory Error

```python
# Solution: Clear cache and reduce batch size
import torch
torch.cuda.empty_cache()

# Use smaller text chunks
def split_text(text, max_length=200):
    sentences = text.split('. ')
    chunks = []
    current_chunk = ""

    for sentence in sentences:
        if len(current_chunk + sentence) < max_length:
            current_chunk += sentence + ". "
        else:
            if current_chunk:
                chunks.append(current_chunk.strip())
            current_chunk = sentence + ". "

    if current_chunk:
        chunks.append(current_chunk.strip())

    return chunks

# Process in chunks
text_chunks = split_text(long_text)
audio_chunks = []

for chunk in text_chunks:
    wav = model.generate(chunk, audio_prompt_path=AUDIO_PROMPT_PATH)
    audio_chunks.append(wav)

# Concatenate chunks
final_audio = torch.cat(audio_chunks, dim=-1)
ta.save("long_text_output.wav", final_audio, model.sr)
```

#### 2. Audio Quality Issues

```python
# Solution: Adjust generation parameters
wav = model.generate(
    text,
    audio_prompt_path=AUDIO_PROMPT_PATH,
    exaggeration=0.4,  # Lower for more natural speech
    cfg=0.6,           # Higher for more controlled output
    temperature=0.6,   # Lower for more consistent quality
    steps=25           # More steps for better quality
)
```

#### 3. Voice Cloning Not Working

```python
# Check audio file format and quality
import librosa
import numpy as np

def check_audio_quality(audio_path):
    try:
        audio, sr = librosa.load(audio_path)
        duration = len(audio) / sr

        print(f"Audio duration: {duration:.2f} seconds")
        print(f"Sample rate: {sr} Hz")
        print(f"Audio shape: {audio.shape}")

        # Check for silence
        silence_threshold = 0.01
        non_silent_ratio = np.mean(np.abs(audio) > silence_threshold)
        print(f"Non-silent ratio: {non_silent_ratio:.2f}")

        if duration < 3:
            print("⚠️ Audio might be too short for good cloning")
        if non_silent_ratio < 0.5:
            print("⚠️ Audio might have too much silence")

        return True
    except Exception as e:
        print(f"❌ Error loading audio: {e}")
        return False

# Check your reference audio
check_audio_quality("your_reference_audio.wav")
```

#### 4. Slow Generation Speed

```python
# Optimization tips
import gc

def optimize_generation():
    # Clear memory
    torch.cuda.empty_cache()
    gc.collect()

    # Use mixed precision
    with torch.cuda.amp.autocast():
        wav = model.generate(
            text,
            audio_prompt_path=AUDIO_PROMPT_PATH,
            steps=15,  # Reduce steps for speed
            cfg=0.5
        )

    return wav
```

#### 5. Google Drive Mount Issues

```python
# Force remount Drive
from google.colab import drive
drive.flush_and_unmount()
drive.mount('/content/drive', force_remount=True)

# Check permissions
import os
test_path = '/content/drive/MyDrive/test_file.txt'
try:
    with open(test_path, 'w') as f:
        f.write('test')
    os.remove(test_path)
    print("✅ Drive access working")
except Exception as e:
    print(f"❌ Drive access issue: {e}")
```

### Error Messages and Solutions

| Error | Cause | Solution |
|-------|-------|----------|
| `RuntimeError: CUDA out of memory` | GPU memory exhausted | Clear cache, reduce text length, restart runtime |
| `FileNotFoundError` | Audio file path incorrect | Check file path, ensure file exists |
| `ValueError: Invalid audio format` | Unsupported audio format | Convert to WAV/MP3, check file integrity |
| `ModuleNotFoundError` | Missing dependencies | Run installation cell again |
| `ConnectionError` | Network issues | Check internet connection, restart runtime |
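
For the first and most common row, a small retry wrapper that combines the cache-clearing and chunking fixes from this section might look like the sketch below. It reuses the `split_text` helper defined above and assumes PyTorch 1.13+ for `torch.cuda.OutOfMemoryError`:

```python
import torch

def generate_with_retry(text, **kwargs):
    """Try a full generation; on CUDA OOM, clear the cache and fall back to chunks."""
    try:
        return model.generate(text, **kwargs)
    except torch.cuda.OutOfMemoryError:
        torch.cuda.empty_cache()
        parts = [model.generate(chunk, **kwargs) for chunk in split_text(text)]
        return torch.cat(parts, dim=-1)
```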

## 📚 Advanced Examples

### 1. Podcast Generation

```python
def generate_podcast_episode(script_file, speaker_voices, output_file):
    """Generate a multi-speaker podcast episode"""
    with open(script_file, 'r') as f:
        script = f.read()

    # Parse script (assumes format: "SPEAKER1: text")
    lines = script.split('\n')
    audio_segments = []

    for line in lines:
        if ':' in line:
            speaker, text = line.split(':', 1)
            speaker = speaker.strip()
            text = text.strip()

            if speaker in speaker_voices:
                voice_file = speaker_voices[speaker]
                wav = model.generate(text, audio_prompt_path=voice_file)
                audio_segments.append(wav)

                # Add pause between speakers (2-D so it concatenates with the audio)
                pause = torch.zeros(1, int(0.5 * model.sr))
                audio_segments.append(pause)

    # Concatenate all segments
    full_audio = torch.cat(audio_segments, dim=-1)
    ta.save(output_file, full_audio, model.sr)

# Usage
speaker_voices = {
    'HOST': '/content/drive/MyDrive/host_voice.wav',
    'GUEST': '/content/drive/MyDrive/guest_voice.wav'
}
generate_podcast_episode('script.txt', speaker_voices, 'podcast_episode.wav')
```

### 2. Audiobook Generation

```python
def generate_audiobook(text_file, narrator_voice, output_dir):
    """Generate an audiobook with chapters"""
    with open(text_file, 'r') as f:
        content = f.read()

    # Split into chapters
    chapters = content.split('CHAPTER')

    for i, chapter in enumerate(chapters[1:], 1):  # Skip first empty split
        chapter_text = f"Chapter {i}. {chapter}"

        # Split long chapters into segments
        segments = split_text(chapter_text, max_length=500)
        chapter_audio = []

        for segment in segments:
            wav = model.generate(segment, audio_prompt_path=narrator_voice)
            chapter_audio.append(wav)

            # Short pause between segments (2-D to match the generated audio)
            pause = torch.zeros(1, int(0.3 * model.sr))
            chapter_audio.append(pause)

        # Save chapter
        chapter_full = torch.cat(chapter_audio, dim=-1)
        chapter_file = f"{output_dir}/chapter_{i:02d}.wav"
        ta.save(chapter_file, chapter_full, model.sr)
        print(f"Generated: {chapter_file}")

# Usage
generate_audiobook(
    'book.txt',
    '/content/drive/MyDrive/narrator_voice.wav',
    '/content/drive/MyDrive/audiobook_output'
)
```

### 3. Multi-Language Support

```python
def generate_multilingual_content(texts_dict, voice_files_dict):
    """Generate content in multiple languages"""
    for language, text in texts_dict.items():
        voice_file = voice_files_dict.get(language)

        if voice_file:
            # Start from defaults, then adjust for specific languages
            exaggeration, cfg = 0.5, 0.5
            if language in ['spanish', 'italian']:
                exaggeration = 0.7  # More expressive for Romance languages
            elif language in ['japanese', 'mandarin']:
                cfg = 0.6  # More controlled for tonal languages

            wav = model.generate(
                text,
                audio_prompt_path=voice_file,
                exaggeration=exaggeration,
                cfg=cfg
            )

            output_file = f"output_{language}.wav"
            ta.save(output_file, wav, model.sr)
            print(f"Generated {language}: {output_file}")

# Usage
texts = {
    'english': "Hello, this is a test in English.",
    'spanish': "Hola, esta es una prueba en español.",
    'french': "Bonjour, ceci est un test en français."
}

voices = {
    'english': '/content/drive/MyDrive/english_voice.wav',
    'spanish': '/content/drive/MyDrive/spanish_voice.wav',
    'french': '/content/drive/MyDrive/french_voice.wav'
}

generate_multilingual_content(texts, voices)
```

## 🎨 Custom Voice Effects

### Emotion Presets

```python
# Define emotion presets
EMOTION_PRESETS = {
    'neutral': {'exaggeration': 0.3, 'cfg': 0.5, 'temperature': 0.7},
    'happy': {'exaggeration': 0.8, 'cfg': 0.4, 'temperature': 0.8},
    'sad': {'exaggeration': 0.6, 'cfg': 0.6, 'temperature': 0.6},
    'angry': {'exaggeration': 0.9, 'cfg': 0.3, 'temperature': 0.9},
    'calm': {'exaggeration': 0.2, 'cfg': 0.7, 'temperature': 0.5},
    'excited': {'exaggeration': 1.0, 'cfg': 0.3, 'temperature': 1.0},
    'whisper': {'exaggeration': 0.1, 'cfg': 0.8, 'temperature': 0.4}
}

def generate_with_emotion(text, voice_file, emotion='neutral'):
    """Generate speech with specific emotion"""
    params = EMOTION_PRESETS.get(emotion, EMOTION_PRESETS['neutral'])

    wav = model.generate(
        text,
        audio_prompt_path=voice_file,
        **params
    )

    return wav

# Usage
voice_file = "path/to/reference_voice.wav"  # reference clip for cloning
text = "I can't believe this is happening!"
emotions = ['happy', 'sad', 'angry', 'excited']

for emotion in emotions:
    wav = generate_with_emotion(text, voice_file, emotion)
    ta.save(f"emotion_{emotion}.wav", wav, model.sr)
```

## 🎯 Performance Optimization

### Memory Management

```python
import gc
import torch

class ChatterboxManager:
    def __init__(self):
        self.model = None
        self.device = "cuda" if torch.cuda.is_available() else "cpu"

    def load_model(self):
        """Load model only when needed"""
        if self.model is None:
            self.model = ChatterboxTTS.from_pretrained(device=self.device)
        return self.model

    def unload_model(self):
        """Free up GPU memory"""
        if self.model is not None:
            del self.model
            self.model = None
            torch.cuda.empty_cache()
            gc.collect()

    def generate_batch(self, texts, voice_file=None, **kwargs):
        """Generate multiple audio files efficiently"""
        model = self.load_model()
        results = []

        for text in texts:
            wav = model.generate(text, audio_prompt_path=voice_file, **kwargs)
            results.append(wav)

            # Clear cache periodically
            if len(results) % 5 == 0:
                torch.cuda.empty_cache()

        return results

# Usage
manager = ChatterboxManager()
texts = ["Text 1", "Text 2", "Text 3"]
audio_files = manager.generate_batch(texts, voice_file="voice.wav")
```

## 🔒 Security and Privacy

### Data Protection

```python
import os
import shutil
import tempfile

def secure_audio_processing(audio_data, output_path):
    """Process audio with temporary files for security"""
    with tempfile.NamedTemporaryFile(suffix='.wav', delete=False) as temp_file:
        temp_path = temp_file.name

    try:
        # Save to temporary file
        ta.save(temp_path, audio_data, model.sr)

        # Process and move to final location
        shutil.move(temp_path, output_path)

    finally:
        # Clean up temporary file if it still exists
        if os.path.exists(temp_path):
            os.remove(temp_path)
```

### Watermark Detection

```python
def detect_watermark(audio_path):
    """Check if audio contains Chatterbox watermark"""
    try:
        # This is a placeholder - actual watermark detection
        # would require Resemble AI's Perth watermark detector
        print("⚠️ All Chatterbox-generated audio contains watermarks")
        print("   Use responsibly and follow ethical guidelines")
        return True
    except Exception as e:
        print(f"Error checking watermark: {e}")
        return False
```
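
For actual detection, Resemble AI ships a Perth watermarker. The sketch below is modeled on the extraction example in the upstream Chatterbox README and assumes the `resemble-perth` package is installed (`pip install resemble-perth`):

```python
import librosa
import perth  # provided by the resemble-perth package

AUDIO_PATH = "tts_output.wav"  # hypothetical path to a generated file

# Load the audio at its native sample rate
watermarked_audio, sr = librosa.load(AUDIO_PATH, sr=None)

# Extract the Perth implicit watermark from the signal
watermarker = perth.PerthImplicitWatermarker()
watermark = watermarker.get_watermark(watermarked_audio, sample_rate=sr)
print(f"Extracted watermark: {watermark}")  # e.g., 0.0 = no watermark detected
```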

## 🤝 Contributing

We welcome contributions! Here's how you can help:

1. **Report Bugs**: Use the GitHub Issues tab
2. **Feature Requests**: Suggest new features via Issues
3. **Code Contributions**: Fork the repo and submit PRs
4. **Documentation**: Help improve this README and docs
5. **Examples**: Share your creative use cases

### Development Setup

```bash
git clone https://github.com/UKR-PROJECTS/chatterbox-tts-colab.git
cd chatterbox-tts-colab
pip install -r requirements.txt
```

## πŸ™ Acknowledgments

- **Resemble AI** for creating the incredible Chatterbox TTS model
- **Google Colab** for providing free GPU access
- **Hugging Face** for model hosting and distribution
- **PyTorch** and **Torchaudio** for the underlying framework
- **The Open Source Community** for continuous support and contributions

### Special Thanks

- Original Chatterbox TTS: [resemble-ai/chatterbox](https://github.com/resemble-ai/chatterbox)
- Resemble AI Team for open-sourcing this state-of-the-art model
- Contributors who help maintain and improve this Colab implementation

## 🌟 Star History

If you find this project useful, please consider giving it a star on GitHub! Your support helps us continue improving and maintaining this tool.

## 📞 Support

- **GitHub Issues**: [Report bugs or request features](https://github.com/UKR-PROJECTS/chatterbox-tts-colab/issues)
- **Discussions**: [Community discussions and Q&A](https://github.com/UKR-PROJECTS/chatterbox-tts-colab/discussions)
- **Email**: ukrpurojekuto@gmail.com

## 🚀 What's Next?

- [ ] Real-time voice conversion
- [ ] Voice morphing capabilities
- [ ] Improved multilingual support
- [ ] Enhanced emotion control
- [ ] Batch processing optimizations
- [ ] API endpoint integration
- [ ] Training capabilities

---

**Made with ❤️ by Ujjwal Nova**

[⭐ Star this repo](https://github.com/UKR-PROJECTS/chatterbox-tts-colab) | [🐛 Report Bug](https://github.com/UKR-PROJECTS/chatterbox-tts-colab/issues) | [💡 Request Feature](https://github.com/UKR-PROJECTS/chatterbox-tts-colab/issues)