# CSM

**2025/03/13** - We are releasing the 1B CSM variant. The checkpoint is [hosted on Hugging Face](https://huggingface.co/sesame/csm_1b).

---

CSM (Conversational Speech Model) is a speech generation model from [Sesame](https://www.sesame.com) that generates RVQ audio codes from text and audio inputs.
The model architecture employs a [Llama](https://www.llama.com/) backbone and a smaller audio decoder that produces [Mimi](https://huggingface.co/kyutai/mimi) audio codes.

A fine-tuned variant of CSM powers the [interactive voice demo](https://www.sesame.com/voicedemo) shown in our [blog post](https://www.sesame.com/research/crossing_the_uncanny_valley_of_voice).

A hosted [Hugging Face space](https://huggingface.co/spaces/sesame/csm-1b) is also available for testing audio generation.

## Requirements

* A CUDA-compatible GPU
* The code has been tested on CUDA 12.4 and 12.6, but it may also work on other versions
* Python 3.10 is recommended; newer versions may also work
* `ffmpeg` may be required for some audio operations
* Access to the following Hugging Face models:
  * [Llama-3.2-1B](https://huggingface.co/meta-llama/Llama-3.2-1B)
  * [CSM-1B](https://huggingface.co/sesame/csm-1b)

### Setup

```bash
git clone git@github.com:SesameAILabs/csm.git
cd csm
python3.10 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

# You will need access to CSM-1B and Llama-3.2-1B
huggingface-cli login
```

### Windows Setup

The `triton` package cannot be installed on Windows. Use `pip install triton-windows` instead.

## Usage

Generate a sentence:

```python
from huggingface_hub import hf_hub_download
import torch
import torchaudio

from generator import load_csm_1b

# Pick the best available device.
if torch.backends.mps.is_available():
    device = "mps"
elif torch.cuda.is_available():
    device = "cuda"
else:
    device = "cpu"

model_path = hf_hub_download(repo_id="sesame/csm-1b", filename="ckpt.pt")
generator = load_csm_1b(model_path, device)
audio = generator.generate(
    text="Hello from Sesame.",
    speaker=0,
    context=[],
    max_audio_length_ms=10_000,
)

torchaudio.save("audio.wav", audio.unsqueeze(0).cpu(), generator.sample_rate)
```

CSM sounds best when provided with context.
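As an aside, the RVQ (residual vector quantization) codes that CSM predicts work by quantizing in stages, where each stage encodes the residual left over by the previous one. The toy, self-contained sketch below illustrates only that staged-residual idea; it is not the Mimi codec, whose codebooks are learned vectors rather than the hand-written scalar codebooks used here.

```python
# Toy residual quantizer: three stages, coarse to fine.
# Each stage picks the codebook entry closest to the remaining residual.
CODEBOOKS = [
    [-1.0, 0.0, 1.0],      # coarse level
    [-0.25, 0.0, 0.25],    # finer level
    [-0.05, 0.0, 0.05],    # finest level
]

def rvq_encode(x: float) -> list[int]:
    """Quantize x stage by stage; each stage encodes the residual."""
    codes = []
    residual = x
    for book in CODEBOOKS:
        idx = min(range(len(book)), key=lambda i: abs(book[i] - residual))
        codes.append(idx)
        residual -= book[idx]
    return codes

def rvq_decode(codes: list[int]) -> float:
    """Reconstruct by summing the selected entry from each codebook."""
    return sum(book[i] for book, i in zip(CODEBOOKS, codes))

codes = rvq_encode(0.8)     # → [2, 0, 2]: 1.0 - 0.25 + 0.05
approx = rvq_decode(codes)  # → 0.8
```

Stacking stages this way lets a small number of tiny codebooks cover a fine-grained range of values, which is why RVQ is a common choice for neural audio codecs.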
You can prompt or provide context to the model using a `Segment` for each speaker's utterance.

```python
from generator import Segment  # continues from the example above

speakers = [0, 1, 0, 0]
transcripts = [
    "Hey how are you doing.",
    "Pretty good, pretty good.",
    "I'm great.",
    "So happy to be speaking to you.",
]
audio_paths = [
    "utterance_0.wav",
    "utterance_1.wav",
    "utterance_2.wav",
    "utterance_3.wav",
]

def load_audio(audio_path):
    # Resample each utterance to the generator's sample rate.
    audio_tensor, sample_rate = torchaudio.load(audio_path)
    audio_tensor = torchaudio.functional.resample(
        audio_tensor.squeeze(0), orig_freq=sample_rate, new_freq=generator.sample_rate
    )
    return audio_tensor

segments = [
    Segment(text=transcript, speaker=speaker, audio=load_audio(audio_path))
    for transcript, speaker, audio_path in zip(transcripts, speakers, audio_paths)
]
audio = generator.generate(
    text="Me too, this is some cool stuff huh?",
    speaker=1,
    context=segments,
    max_audio_length_ms=10_000,
)

torchaudio.save("audio.wav", audio.unsqueeze(0).cpu(), generator.sample_rate)
```

## FAQ

**Does this model come with any voices?**

The model open-sourced here is a base generation model. It is capable of producing a variety of voices, but it has not been fine-tuned on any specific voice.

**Can I converse with the model?**

CSM is trained to be an audio generation model and not a general-purpose multimodal LLM. It cannot generate text. We suggest using a separate LLM for text generation.

**Does it support other languages?**

The model has some capacity for non-English languages due to data contamination in the training data, but it will likely not perform well.

## Misuse and abuse ⚠️

This project provides a high-quality speech generation model for research and educational purposes.
While we encourage responsible and ethical use, we **explicitly prohibit** the following:

- **Impersonation or Fraud**: Do not use this model to generate speech that mimics real individuals without their explicit consent.
- **Misinformation or Deception**: Do not use this model to create deceptive or misleading content, such as fake news or fraudulent calls.
- **Illegal or Harmful Activities**: Do not use this model for any illegal, harmful, or malicious purposes.

By using this model, you agree to comply with all applicable laws and ethical guidelines. We are **not responsible** for any misuse, and we strongly condemn unethical applications of this technology.

---

## Authors
Johan Schalkwyk, Ankit Kumar, Dan Lyth, Sefik Emre Eskimez, Zack Hodari, Cinjon Resnick, Ramon Sanabria, Raven Jiang, and the Sesame team.