{"id":26512900,"url":"https://github.com/canopyai/Orpheus-TTS","last_synced_at":"2025-03-21T04:02:03.904Z","repository":{"id":281289586,"uuid":"944828547","full_name":"canopyai/Orpheus-TTS","owner":"canopyai","description":"TTS Towards Human-Sounding Speech","archived":false,"fork":false,"pushed_at":"2025-03-18T23:07:40.000Z","size":3184,"stargazers_count":3,"open_issues_count":0,"forks_count":1,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-03-19T00:20:58.769Z","etag":null,"topics":["llm","realtime","tts"],"latest_commit_sha":null,"homepage":"https://canopylabs.ai","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/canopyai.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2025-03-08T03:41:54.000Z","updated_at":"2025-03-18T23:07:38.000Z","dependencies_parsed_at":"2025-03-19T11:46:49.354Z","dependency_job_id":null,"html_url":"https://github.com/canopyai/Orpheus-TTS","commit_stats":null,"previous_names":["amuvarma13/orpheus-tts-0.1","canopyai/orpheus-tts"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/canopyai%2FOrpheus-TTS","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/canopyai%2FOrpheus-TTS/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/canopyai%2FOrpheus-TTS/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/canopyai%2FOrpheus-TTS/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/canopyai","do
wnload_url":"https://codeload.github.com/canopyai/Orpheus-TTS/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":244734103,"owners_count":20501017,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["llm","realtime","tts"],"created_at":"2025-03-21T04:02:01.481Z","updated_at":"2025-03-21T04:02:03.880Z","avatar_url":"https://github.com/canopyai.png","language":"Python","funding_links":[],"categories":["Python","TTS (Text-to-Speech) | 文本转语音","语音合成","Colab Notebooks","4. Text-to-speech (TTS)"],"sub_categories":["Open Source TTS Models | 开源 TTS 模型","资源传输下载","Orpheus TTS","Open source"],"readme":"# Orpheus TTS\n## Overview\nOrpheus TTS is an open-source text-to-speech system built on the Llama-3b backbone. Orpheus demonstrates the emergent capabilities of using LLMs for speech synthesis. 
We compare the models below against leading closed models such as Eleven Labs and PlayHT in our blog post.\n\n[Check out our blog post](https://canopylabs.ai/model-releases)\n\n\nhttps://github.com/user-attachments/assets/ce17dd3a-f866-4e67-86e4-0025e6e87b8a\n\n## Abilities\n\n- **Human-Like Speech**: Natural intonation, emotion, and rhythm that is superior to SOTA closed source models\n- **Zero-Shot Voice Cloning**: Clone voices without prior fine-tuning\n- **Guided Emotion and Intonation**: Control speech and emotion characteristics with simple tags\n- **Low Latency**: ~200ms streaming latency for realtime applications, reducible to ~100ms with input streaming\n\n## Models\n\nWe provide two models in this release, along with the data processing scripts and sample datasets that make it straightforward to create your own finetune.\n\n1. [**Finetuned Prod**](https://huggingface.co/canopylabs/orpheus-tts-0.1-finetune-prod) – A finetuned model for everyday TTS applications\n\n2. [**Pretrained**](https://huggingface.co/canopylabs/orpheus-tts-0.1-pretrained) – Our base model trained on 100k+ hours of English speech data\n\n\n### Inference\n\n#### Simple setup on Colab\n1. [Colab For Tuned Model](https://colab.research.google.com/drive/1KhXT56UePPUHhqitJNUxq63k-pQomz3N?usp=sharing) (not streaming, see below for realtime streaming) – A finetuned model for everyday TTS applications.\n2. [Colab For Pretrained Model](https://colab.research.google.com/drive/10v9MIEbZOr_3V8ZcPAIh8MN7q2LjcstS?usp=sharing) – This notebook is set up for conditioned generation but can be extended to a range of tasks.\n\n#### Streaming Inference Example\n\n1. Clone this repo\n   ```bash\n   git clone https://github.com/canopyai/Orpheus-TTS.git\n   ```\n2. 
Navigate and install packages\n   ```bash\n   cd Orpheus-TTS \u0026\u0026 pip install orpheus-speech # uses vllm under the hood for fast inference\n   ```\n   vllm released a slightly buggy version on March 18th, so if you hit errors, revert by running `pip install vllm==0.7.3` after `pip install orpheus-speech`.\n3. Run the example below:\n   ```python\n   from orpheus_tts import OrpheusModel\n   import wave\n   import time\n   \n   model = OrpheusModel(model_name=\"canopylabs/orpheus-tts-0.1-finetune-prod\")\n   prompt = '''Man, the way social media has, um, completely changed how we interact is just wild, right? Like, we're all connected 24/7 but somehow people feel more alone than ever. And don't even get me started on how it's messing with kids' self-esteem and mental health and whatnot.'''\n\n   start_time = time.monotonic()\n   # generate_speech returns an iterable of raw audio chunks (streamed output)\n   syn_tokens = model.generate_speech(\n      prompt=prompt,\n      voice=\"tara\",\n      )\n\n   # Write the stream to a mono, 16-bit, 24 kHz WAV file\n   with wave.open(\"output.wav\", \"wb\") as wf:\n      wf.setnchannels(1)\n      wf.setsampwidth(2)\n      wf.setframerate(24000)\n\n      total_frames = 0\n      chunk_counter = 0\n      for audio_chunk in syn_tokens: # output streaming\n         chunk_counter += 1\n         # bytes per frame = sample width * channel count\n         frame_count = len(audio_chunk) // (wf.getsampwidth() * wf.getnchannels())\n         total_frames += frame_count\n         wf.writeframes(audio_chunk)\n      duration = total_frames / wf.getframerate()\n\n      end_time = time.monotonic()\n      print(f\"It took {end_time - start_time:.2f} seconds to generate {duration:.2f} seconds of audio\")\n   ```\n\n#### Prompting\n\n1. The `finetune-prod` model: your text prompt is formatted as `{name}: I went to the ...`. The options for `name`, in order of conversational realism (subjective benchmarks), are \"tara\", \"leah\", \"jess\", \"leo\", \"dan\", \"mia\", \"zac\", \"zoe\". Our Python package does this formatting for you, and the notebook also prepends the appropriate string. 
You can additionally add the following emotive tags: `\u003claugh\u003e`, `\u003cchuckle\u003e`, `\u003csigh\u003e`, `\u003ccough\u003e`, `\u003csniffle\u003e`, `\u003cgroan\u003e`, `\u003cyawn\u003e`, `\u003cgasp\u003e`.\n\n2. The pretrained model: you can either generate speech conditioned on text alone, or generate speech conditioned on one or more existing text-speech pairs in the prompt. Since this model hasn't been explicitly trained on the zero-shot voice cloning objective, the more text-speech pairs you pass in the prompt, the more reliably it will generate in the correct voice.\n\n\u003c!-- 3. The research model: the prompt that should get passed to the model has `prompt + \" \" + \"\u003c{emotion}\u003e\"` at the end. It should also not have the `{name}:` prefix as it is only trained on one voice. This model is not designed to be used in production. Rather, its main goal is to show how LLMs can easily support tags to guide controllable emotional generations, and for now will perform worse on other metrics.\n --\u003e\n\nAdditionally, use regular LLM generation args like `temperature`, `top_p`, etc. as you would for a regular LLM. `repetition_penalty\u003e=1.1` is required for stable generations. Increasing `repetition_penalty` and `temperature` makes the model speak faster.\n\n\n## Finetune Model\n\nHere is an overview of how to finetune your model on any text and speech.\nThis is a very simple process analogous to tuning an LLM using Trainer and Transformers.\n\nYou should start to see high quality results after ~50 examples, but for best results, aim for 300 examples/speaker.\n\n1. Your dataset should be a Hugging Face dataset in [this format](https://huggingface.co/datasets/canopylabs/zac-sample-dataset)\n2. We prepare the data using [this notebook](https://colab.research.google.com/drive/1wg_CPCA-MzsWtsujwy-1Ovhv-tn8Q1nD?usp=sharing). 
This pushes an intermediate dataset to your Hugging Face account, which you can feed to the training script in `finetune/train.py`. Preprocessing should take less than 1 minute/thousand rows.\n3. Modify the `finetune/config.yaml` file to include your dataset and training properties, and run the training script. You can additionally run any kind of Hugging Face compatible process, such as LoRA, to tune the model.\n   ```bash\n    pip install transformers datasets wandb trl flash_attn torch\n    huggingface-cli login \u003center your HF token\u003e\n    wandb login \u003cwandb token\u003e\n    accelerate launch train.py\n   ```\n## Also Check Out\n\nWhile we can't verify these implementations are completely accurate/bug free, they have been recommended on a couple of forums, so we include them here:\n\n1. [A lightweight client for running Orpheus TTS locally using LM Studio API](https://github.com/isaiahbjork/orpheus-tts-local) \n2. [Gradio WebUI that runs smoothly on WSL and CUDA](https://github.com/Saganaki22/OrpheusTTS-WebUI)\n\n\n# Checklist\n\n- [x] Release 3b pretrained model and finetuned models\n- [ ] Release pretrained and finetuned models in sizes: 1b, 400m, 150m parameters\n- [ ] Fix glitch in realtime streaming package that occasionally skips frames.\n- [ ] Fix voice cloning Colab notebook implementation\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcanopyai%2FOrpheus-TTS","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcanopyai%2FOrpheus-TTS","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcanopyai%2FOrpheus-TTS/lists"}