https://github.com/nidhiyashwanth/sesameailabs-csm
A conversational speech model (CSM) that generates natural-sounding speech with context awareness. Supports multi-speaker conversations and maintains contextual understanding across turns.
context-aware conversational-ai csm moshi sesame sesameailabs speech-generation
- Host: GitHub
- URL: https://github.com/nidhiyashwanth/sesameailabs-csm
- Owner: nidhiyashwanth
- Created: 2025-03-17T18:20:31.000Z (2 months ago)
- Default Branch: main
- Last Pushed: 2025-03-17T18:40:51.000Z (2 months ago)
- Last Synced: 2025-03-17T19:34:57.281Z (2 months ago)
- Topics: context-aware, conversational-ai, csm, moshi, sesame, sesameailabs, speech-generation
- Language: Jupyter Notebook
- Homepage:
- Size: 46.9 KB
- Stars: 0
- Watchers: 1
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
# SesameAILabs-csm
A conversational speech model (CSM) implementation by Sesame AI Labs that enables text-to-speech generation with context awareness and consistent audio quality.
## Description
SesameAILabs-csm is a text-to-speech model that generates natural-sounding speech with context awareness. It supports multiple speakers and is fine-tuned to keep audio quality consistent across turns, even in long conversations.
## Features
- Text-to-speech generation with context awareness
- Multi-speaker support
- Natural-sounding speech output
- Contextual conversation handling
- Consistent audio quality across conversations
- Support for custom audio input
- GPU acceleration support

## Installation
1. Clone the repository:
```bash
git clone https://github.com/SesameAILabs/csm.git
cd csm
```

2. Install the required packages:
```bash
pip install -r requirements.txt
```

3. Log in to Hugging Face (required for model download):
```python
from huggingface_hub import login
login()
```

## Requirements
- Python 3.8+
- CUDA-capable GPU (recommended)
- PyTorch 2.4.0
- torchaudio 2.4.0
- transformers 4.49.0
- huggingface_hub 0.28.1
- And other dependencies listed in requirements.txt

## Usage
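Before running the examples, it can help to confirm that the installed packages match the versions listed under Requirements and to see whether a CUDA device is visible. The following is a minimal sketch, not part of the repository: `check_versions` and `pick_device` are hypothetical helper names, the pins are copied from the Requirements list, and falling back to `"cpu"` assumes the model can run there (a GPU is listed only as recommended).

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the Requirements list; adjust if requirements.txt differs.
PINNED = {
    "torch": "2.4.0",
    "torchaudio": "2.4.0",
    "transformers": "4.49.0",
    "huggingface_hub": "0.28.1",
}


def check_versions(pinned=PINNED):
    """Return {package: (expected, installed_or_None)} for every mismatch."""
    mismatches = {}
    for name, expected in pinned.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None
        # Ignore local build tags such as "+cu121" when comparing.
        if installed is None or installed.split("+")[0] != expected:
            mismatches[name] = (expected, installed)
    return mismatches


def pick_device():
    """Return "cuda" when PyTorch reports a usable GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


if __name__ == "__main__":
    for name, (expected, installed) in check_versions().items():
        print(f"{name}: expected {expected}, found {installed or 'not installed'}")
    print("device:", pick_device())
```

Under the assumption above, the returned string could be passed on as `load_csm_1b(device=pick_device())`.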
### Basic Usage
```python
from generator import load_csm_1b
import torchaudio

# Initialize the generator
generator = load_csm_1b(device="cuda")

# Generate speech
audio = generator.generate(
    text="Hello from Sesame.",
    speaker=0,
    context=[],
    max_audio_length_ms=10_000,
)

# Save the generated audio
torchaudio.save("audio.wav", audio.unsqueeze(0).cpu(), generator.sample_rate)
```

### Contextual Conversation
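The `context` argument of `generator.generate` takes a list of `Segment` objects, each pairing a transcript with its speaker id and audio. For a multi-turn exchange, one possible approach is to append every generated turn to that list so later turns are conditioned on earlier ones. The sketch below illustrates this pattern only; `run_conversation`, `generate_fn`, `make_segment`, and `max_context` are hypothetical names, and capping the context length is an assumption rather than a documented requirement of the model.

```python
def run_conversation(turns, generate_fn, make_segment, context=None, max_context=None):
    """Generate each (speaker, text) turn in order, feeding earlier turns back as context.

    generate_fn(text=..., speaker=..., context=...) returns the audio for one turn.
    make_segment(text=..., speaker=..., audio=...) wraps a finished turn for reuse.
    max_context, if given, keeps only the most recent segments (an assumed safeguard).
    """
    context = list(context or [])
    outputs = []
    for speaker, text in turns:
        # Select the trailing window of segments to condition on.
        start = 0 if max_context is None else max(0, len(context) - max_context)
        audio = generate_fn(text=text, speaker=speaker, context=context[start:])
        outputs.append(audio)
        # Store the new turn so that following turns can see it.
        context.append(make_segment(text=text, speaker=speaker, audio=audio))
    return outputs, context


# With the real model (assumes `generator`, `Segment`, and `segments` as defined
# in the example in this section):
# outputs, context = run_conversation(
#     turns=[(1, "I'm doing well, thanks."), (0, "Glad to hear it.")],
#     generate_fn=lambda **kw: generator.generate(max_audio_length_ms=10_000, **kw),
#     make_segment=Segment,
#     context=segments,
# )
```

Passing the generator and the segment constructor in as arguments keeps the loop independent of the model, so it can be exercised with stand-ins before any weights are downloaded. The example that follows builds the initial `segments` list from a recorded prompt and generates a single reply.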
```python
from generator import load_csm_1b, Segment
import torchaudio

# Initialize the generator
generator = load_csm_1b(device="cuda")

# Define speakers, transcripts, and audio paths
speakers = [0]
transcripts = ["Hey how are you doing."]
audio_paths = ["conversational_b.wav"]

# Function to load and resample audio
def load_audio(audio_path):
    audio_tensor, sample_rate = torchaudio.load(audio_path)
    audio_tensor = torchaudio.functional.resample(
        audio_tensor.squeeze(0), orig_freq=sample_rate, new_freq=generator.sample_rate
    )
    return audio_tensor

# Create segments
segments = [
    Segment(text=transcript, speaker=speaker, audio=load_audio(audio_path))
    for transcript, speaker, audio_path in zip(transcripts, speakers, audio_paths)
]

# Generate audio with context
audio = generator.generate(
    text="Your response text here",
    speaker=1,
    context=segments,
    max_audio_length_ms=50_000,
)

# Save the generated audio
torchaudio.save("audio.wav", audio.unsqueeze(0).cpu(), generator.sample_rate)
```

## Model Details
The model is automatically downloaded from the Hugging Face Hub when first used. It includes:
- Encoder model
- Decoder model
- Multiple speaker embeddings
- Configuration files

## Author
Nidhi Yashwanth ([github.com/nidhiyashwanth](https://github.com/nidhiyashwanth))
## License
This project is licensed under the MIT License - see the LICENSE file for details.
## Acknowledgments
- Sesame AI Labs for developing and maintaining the model
- Hugging Face for hosting the model and providing the transformers library
- The PyTorch team for the deep learning framework