https://github.com/isaiahbjork/csm-voice-cloning
Sesame CSM 1B Voice Cloning
https://github.com/isaiahbjork/csm-voice-cloning
ai modal python voice-cloning
Last synced: about 1 month ago
JSON representation
Sesame CSM 1B Voice Cloning
- Host: GitHub
- URL: https://github.com/isaiahbjork/csm-voice-cloning
- Owner: isaiahbjork
- Created: 2025-03-14T06:09:58.000Z (about 2 months ago)
- Default Branch: main
- Last Pushed: 2025-03-14T06:38:17.000Z (about 2 months ago)
- Last Synced: 2025-03-14T07:31:17.040Z (about 2 months ago)
- Topics: ai, modal, python, voice-cloning
- Language: Python
- Homepage:
- Size: 15.6 KB
- Stars: 6
- Watchers: 1
- Forks: 1
- Open Issues: 1
-
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
# Voice Cloning with CSM-1B
This repository contains tools to clone your voice using the Sesame CSM-1B model. It provides two methods for voice cloning:
1. Local execution on your own GPU
2. Cloud execution using Modal> **Note:** While this solution does capture some voice characteristics and provides a recognizable clone, it's not the best voice cloning solution available. The results are decent but not perfect. If you have ideas on how to improve the cloning quality, feel free to contribute!
## Prerequisites
- Python 3.10+
- CUDA-compatible GPU (for local execution)
- Hugging Face account with access to the CSM-1B model
- Hugging Face API token## Installation
1. Clone this repository:
```bash
git clone https://github.com/isaiahbjork/csm-voice-cloning.git
cd csm-voice-cloning
```2. Install the required dependencies:
```bash
pip install -r requirements.txt
```## Setting Up Your Hugging Face Token
You need to set your Hugging Face token to download the model. You can do this in two ways:
1. Set it as an environment variable:
```bash
export HF_TOKEN="your_hugging_face_token"
```2. Or directly in the `voice_clone.py` file:
```python
os.environ["HF_TOKEN"] = "your_hugging_face_token"
```## Accepting the Model on Hugging Face
Before using the model, you need to accept the terms on Hugging Face:
1. Visit the [Sesame CSM-1B model page](https://huggingface.co/sesame/csm-1b)
2. Click on "Access repository" and accept the terms
3. Make sure you're logged in with the same account that your HF_TOKEN belongs to## Preparing Your Voice Sample
1. Record a clear audio sample of your voice (2-3 minutes is recommended)
2. Save it as an MP3 or WAV file
3. Transcribe the audio using Whisper or another transcription tool to get the exact text## Running Voice Cloning Locally
1. Edit the `voice_clone.py` file to set your parameters directly in the code:
```python
# Set the path to your voice sample
context_audio_path = "path/to/your/voice/sample.mp3"# Set the transcription of your voice sample
# You need to use Whisper or another tool to transcribe your audio
context_text = "The exact transcription of your voice sample..."# Set the text you want to synthesize
text = "Text you want to synthesize with your voice."# Set the output filename
output_filename = "output.wav"
```2. Run the script:
```bash
python voice_clone.py
```## Running Voice Cloning on Modal
Modal provides cloud GPU resources for faster processing:
1. Install Modal:
```bash
pip install modal
```2. Set up Modal authentication:
```bash
modal token new
```3. Edit the `modal_voice_cloning.py` file to set your parameters directly in the code:
```python
# Set the path to your voice sample
context_audio_path = "path/to/your/voice/sample.mp3"# Set the transcription of your voice sample
# You need to use Whisper or another tool to transcribe your audio
context_text = "The exact transcription of your voice sample..."# Set the text you want to synthesize
text = "Text you want to synthesize with your voice."# Set the output filename
output_filename = "output.wav"
```4. Run the Modal script:
```bash
modal run modal_voice_cloning.py
```## Important Note on Model Sequence Length
If you encounter tensor dimension errors, you may need to adjust the model's maximum sequence length in `models.py`. The default sequence length is 2048, which works for most cases, but if you're using longer audio samples, you might need to increase this value.
Look for the `max_seq_len` parameter in the `llama3_2_1B()` and `llama3_2_100M()` functions in `models.py` and ensure they have the same value:
```python
def llama3_2_1B():
return llama3_2.llama3_2(
# other parameters...
max_seq_len=2048, # Increase this value if needed
# other parameters...
)
```## Example
Using a 2 minute and 50 second audio sample works fine with the default settings. For longer samples, you may need to adjust the sequence length as mentioned above.
## Troubleshooting
- **Tensor dimension errors**: Adjust the model sequence length as described above
- **CUDA out of memory**: Try reducing the audio sample length or use a GPU with more memory
- **Model download issues**: Ensure you've accepted the model terms on Hugging Face and your token is correct## License
This project uses the Sesame CSM-1B model, which is subject to its own license terms. Please refer to the [model page](https://huggingface.co/sesame/csm-1b) for details.