{"id":30597888,"url":"https://github.com/microsoft/VibeVoice","last_synced_at":"2025-08-29T22:04:24.525Z","repository":{"id":311630342,"uuid":"1044296738","full_name":"microsoft/VibeVoice","owner":"microsoft","description":"Frontier Open-Source Text-to-Speech","archived":false,"fork":false,"pushed_at":"2025-08-25T17:22:46.000Z","size":16697,"stargazers_count":47,"open_issues_count":2,"forks_count":3,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-08-25T17:35:37.112Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/microsoft.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":"SECURITY.md","support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-08-25T13:24:01.000Z","updated_at":"2025-08-25T17:29:21.000Z","dependencies_parsed_at":"2025-08-25T17:46:41.710Z","dependency_job_id":null,"html_url":"https://github.com/microsoft/VibeVoice","commit_stats":null,"previous_names":["microsoft/vibevoice"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/microsoft/VibeVoice","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/microsoft%2FVibeVoice","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/microsoft%2FVibeVoice/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/microsoft%2FVibeVoice/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/microsoft%2FVibeVoice/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/microsoft","download_url":"https://codeload.github.com/microsoft/VibeVoice/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/microsoft%2FVibeVoice/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":272547269,"owners_count":24953436,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-08-28T02:00:10.768Z","response_time":74,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-08-29T22:02:21.542Z","updated_at":"2025-08-29T22:04:24.517Z","avatar_url":"https://github.com/microsoft.png","language":"Python","readme":"\u003cdiv align=\"center\"\u003e\n\n## 🎙️ VibeVoice: A Frontier Long Conversational Text-to-Speech Model\n[![Project Page](https://img.shields.io/badge/Project-Page-blue?logo=microsoft)](https://microsoft.github.io/VibeVoice)\n[![Hugging Face](https://img.shields.io/badge/HuggingFace-Collection-orange?logo=huggingface)](https://huggingface.co/collections/microsoft/vibevoice-68a2ef24a875c44be47b034f)\n[![Technical Report](https://img.shields.io/badge/Technical-Report-red?logo=adobeacrobatreader)](https://arxiv.org/pdf/2508.19205)\n[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/microsoft/VibeVoice/blob/main/demo/VibeVoice_colab.ipynb)\n[![Live Playground](https://img.shields.io/badge/Live-Playground-green?logo=gradio)](https://aka.ms/VibeVoice-Demo)\n\n\u003c/div\u003e\n\u003c!-- \u003cdiv align=\"center\"\u003e\n\u003cimg src=\"Figures/log.png\" alt=\"VibeVoice Logo\" width=\"200\"\u003e\n\u003c/div\u003e --\u003e\n\n\u003cdiv align=\"center\"\u003e\n\u003cimg src=\"Figures/VibeVoice_logo.png\" alt=\"VibeVoice Logo\" width=\"300\"\u003e\n\u003c/div\u003e\n\nVibeVoice is a novel framework designed for generating **expressive**, **long-form**, **multi-speaker** conversational audio, such as podcasts, from text. It addresses significant challenges in traditional Text-to-Speech (TTS) systems, particularly in scalability, speaker consistency, and natural turn-taking.\n\nA core innovation of VibeVoice is its use of continuous speech tokenizers (Acoustic and Semantic) operating at an ultra-low frame rate of 7.5 Hz. These tokenizers efficiently preserve audio fidelity while significantly boosting computational efficiency for processing long sequences. VibeVoice employs a [next-token diffusion](https://arxiv.org/abs/2412.08635) framework, leveraging a Large Language Model (LLM) to understand textual context and dialogue flow, and a diffusion head to generate high-fidelity acoustic details.\n\nThe model can synthesize speech up to **90 minutes** long with up to **4 distinct speakers**, surpassing the typical 1-2 speaker limits of many prior models. \n\n\n\u003cp align=\"left\"\u003e\n  \u003cimg src=\"Figures/MOS-preference.png\" alt=\"MOS Preference Results\" height=\"260px\"\u003e\n  \u003cimg src=\"Figures/VibeVoice.jpg\" alt=\"VibeVoice Overview\" height=\"250px\" style=\"margin-right: 10px;\"\u003e\n\u003c/p\u003e\n\n### 🔥 News\n\n- **[2025-08-26] 🎉 We Open Source the [VibeVoice-7B-Preview](https://huggingface.co/WestZhang/VibeVoice-Large-pt) model weights!**\n\n### 📋 TODO\n\n- [ ] Merge models into official Hugging Face repository\n- [ ] Release example training code and documentation\n\n### 🎵 Demo Examples\n\n\n**Video Demo**\n\nWe produced this video with [Wan2.2](https://github.com/Wan-Video/Wan2.2). We sincerely appreciate the Wan-Video team for their great work.\n\n**English**\n\u003cdiv align=\"center\"\u003e\n\nhttps://github.com/user-attachments/assets/0967027c-141e-4909-bec8-091558b1b784\n\n\u003c/div\u003e\n\n\n**Chinese**\n\u003cdiv align=\"center\"\u003e\n\nhttps://github.com/user-attachments/assets/322280b7-3093-4c67-86e3-10be4746c88f\n\n\u003c/div\u003e\n\n**Cross-Lingual**\n\u003cdiv align=\"center\"\u003e\n\nhttps://github.com/user-attachments/assets/838d8ad9-a201-4dde-bb45-8cd3f59ce722\n\n\u003c/div\u003e\n\n**Spontaneous Singing**\n\u003cdiv align=\"center\"\u003e\n\nhttps://github.com/user-attachments/assets/6f27a8a5-0c60-4f57-87f3-7dea2e11c730\n\n\u003c/div\u003e\n\n\n**Long Conversation with 4 people**\n\u003cdiv align=\"center\"\u003e\n\nhttps://github.com/user-attachments/assets/a357c4b6-9768-495c-a576-1618f6275727\n\n\u003c/div\u003e\n\nFor more examples, see the [Project Page](https://microsoft.github.io/VibeVoice).\n\nTry it on [Colab](https://colab.research.google.com/github/microsoft/VibeVoice/blob/main/demo/VibeVoice_colab.ipynb) or [Demo](https://aka.ms/VibeVoice-Demo).\n\n\n\n## Models\n| Model | Context Length | Generation Length |  Weight |\n|-------|----------------|----------|----------|\n| VibeVoice-0.5B-Streaming | - | - | On the way |\n| VibeVoice-1.5B | 64K | ~90 min | [HF link](https://huggingface.co/microsoft/VibeVoice-1.5B) |\n| VibeVoice-7B-Preview| 32K | ~45 min | [HF link](https://huggingface.co/WestZhang/VibeVoice-Large-pt) |\n\n## Installation\nWe recommend to use NVIDIA Deep Learning Container to manage the CUDA environment. \n\n1. Launch docker\n```bash\n# NVIDIA PyTorch Container 24.07 / 24.10 / 24.12 verified. \n# Later versions are also compatible.\nsudo docker run --privileged --net=host --ipc=host --ulimit memlock=-1:-1 --ulimit stack=-1:-1 --gpus all --rm -it  nvcr.io/nvidia/pytorch:24.07-py3\n\n## If flash attention is not included in your docker environment, you need to install it manually\n## Refer to https://github.com/Dao-AILab/flash-attention for installation instructions\n# pip install flash-attn --no-build-isolation\n```\n\n2. Install from github\n```bash\ngit clone https://github.com/microsoft/VibeVoice.git\ncd VibeVoice/\n\npip install -e .\n```\n\n## Usages\n\n### 🚨 Tips\nWe observed users may encounter occasional instability when synthesizing Chinese speech. We recommend:\n\n- Using English punctuation even for Chinese text, preferably only commas and periods.\n- Using the 7B model variant, which is considerably more stable.\n\n### Usage 1: Launch Gradio demo\n```bash\napt update \u0026\u0026 apt install ffmpeg -y # for demo\n\n# For 1.5B model\npython demo/gradio_demo.py --model_path microsoft/VibeVoice-1.5B --share\n\n# For 7B model\npython demo/gradio_demo.py --model_path WestZhang/VibeVoice-Large-pt --share\n```\n\n### Usage 2: Inference from files directly\n```bash\n# We provide some LLM generated example scripts under demo/text_examples/ for demo\n# 1 speaker\npython demo/inference_from_file.py --model_path WestZhang/VibeVoice-Large-pt --txt_path demo/text_examples/1p_abs.txt --speaker_names Alice\n\n# or more speakers\npython demo/inference_from_file.py --model_path WestZhang/VibeVoice-Large-pt --txt_path demo/text_examples/2p_music.txt --speaker_names Alice Frank\n```\n\n## FAQ\n#### Q1: Is this a pretrained model?\n**A:** Yes, it's a pretrained model without any post-training or benchmark-specific optimizations. In a way, this makes VibeVoice very versatile and fun to use.\n\n#### Q2: Randomly trigger Sounds / Music / BGM.\n**A:** As you can see from our demo page, the background music or sounds are spontaneous. This means we can't directly control whether they are generated or not. The model is content-aware, and these sounds are triggered based on the input text and the chosen voice prompt.\n\nHere are a few things we've noticed:\n*   If the voice prompt you use contains background music, the generated speech is more likely to have it as well. (The 7B model is quite stable and effective at this—give it a try on the demo!)\n*   If the voice prompt is clean (no BGM), but the input text includes introductory words or phrases like \"Welcome to,\" \"Hello,\" or \"However,\" background music might still appear.\n*   Speaker voice related, using \"Alice\" results in random BGM than others.\n*   In other scenarios, the 7B model is more stable and has a lower probability of generating unexpected background music.\n\nIn fact, we intentionally decided not to denoise our training data because we think it's an interesting feature for BGM to show up at just the right moment. You can think of it as a little easter egg we left for you.\n\n#### Q3: Text normalization?\n**A:** We don't perform any text normalization during training or inference. Our philosophy is that a large language model should be able to handle complex user inputs on its own. However, due to the nature of the training data, you might still run into some corner cases.\n\n#### Q4: Singing Capability.\n**A:** Our training data **doesn't contain any music data**. The ability to sing is an emergent capability of the model (which is why it might sound off-key, even on a famous song like 'See You Again'). (The 7B model is more likely to exhibit this than the 1.5B).\n\n#### Q5: Some Chinese pronunciation errors.\n**A:** The volume of Chinese data in our training set is significantly smaller than the English data. Additionally, certain special characters (e.g., Chinese quotation marks) may occasionally cause pronunciation issues.\n\n#### Q6: Instability of cross-lingual transfer.\n**A:** The model does exhibit strong cross-lingual transfer capabilities, including the preservation of accents, but its performance can be unstable. This is an emergent ability of the model that we have not specifically optimized. It's possible that a satisfactory result can be achieved through repeated sampling.\n\n## Risks and limitations\n\nWhile efforts have been made to optimize it through various techniques, it may still produce outputs that are unexpected, biased, or inaccurate. VibeVoice inherits any biases, errors, or omissions produced by its base model (specifically, Qwen2.5 1.5b in this release).\nPotential for Deepfakes and Disinformation: High-quality synthetic speech can be misused to create convincing fake audio content for impersonation, fraud, or spreading disinformation. Users must ensure transcripts are reliable, check content accuracy, and avoid using generated content in misleading ways. Users are expected to use the generated content and to deploy the models in a lawful manner, in full compliance with all applicable laws and regulations in the relevant jurisdictions. It is best practice to disclose the use of AI when sharing AI-generated content.\n\nEnglish and Chinese only: Transcripts in languages other than English or Chinese may result in unexpected audio outputs.\n\nNon-Speech Audio: The model focuses solely on speech synthesis and does not handle background noise, music, or other sound effects.\n\nOverlapping Speech: The current model does not explicitly model or generate overlapping speech segments in conversations.\n\nWe do not recommend using VibeVoice in commercial or real-world applications without further testing and development. This model is intended for research and development purposes only. Please use responsibly.\n","funding_links":[],"categories":["Developer Tools","Repos","Python","🤖 AI \u0026 Machine Learning","Others","Colab Notebooks","\u003cspan id=\"speech\"\u003eSpeech\u003c/span\u003e","📋 Contents","精选文章"],"sub_categories":["Python Tools","VibeVoice 1.5B TTS","\u003cspan id=\"tool\"\u003eLLM (LLM \u0026 Tool)\u003c/span\u003e","🧠 2. Open Foundation Models","语音处理"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmicrosoft%2FVibeVoice","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmicrosoft%2FVibeVoice","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmicrosoft%2FVibeVoice/lists"}