{"id":14561904,"url":"https://github.com/huggingface/speech-to-speech","last_synced_at":"2026-02-06T14:30:31.242Z","repository":{"id":252108296,"uuid":"839428333","full_name":"huggingface/speech-to-speech","owner":"huggingface","description":"Speech To Speech: an effort for an open-sourced and modular GPT4-o","archived":false,"fork":false,"pushed_at":"2025-04-15T09:38:44.000Z","size":306,"stargazers_count":4191,"open_issues_count":71,"forks_count":476,"subscribers_count":49,"default_branch":"main","last_synced_at":"2025-09-30T18:02:31.341Z","etag":null,"topics":["ai","assistant","language-model","machine-learning","python","speech","speech-synthesis","speech-to-text","speech-translation"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/huggingface.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-08-07T15:32:09.000Z","updated_at":"2025-09-30T00:48:11.000Z","dependencies_parsed_at":"2024-08-19T09:44:45.853Z","dependency_job_id":"eb520b9f-9161-4e40-b520-200bb5c72bd5","html_url":"https://github.com/huggingface/speech-to-speech","commit_stats":null,"previous_names":["eustlb/speech-to-speech","huggingface/speech-to-speech"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/huggingface/speech-to-speech","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/huggingface%2Fspeech-to-speech","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/huggingface%2Fspeech-to-speech/tags","releases_url":"https://rep
os.ecosyste.ms/api/v1/hosts/GitHub/repositories/huggingface%2Fspeech-to-speech/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/huggingface%2Fspeech-to-speech/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/huggingface","download_url":"https://codeload.github.com/huggingface/speech-to-speech/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/huggingface%2Fspeech-to-speech/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":279019321,"owners_count":26086711,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-14T02:00:06.444Z","response_time":60,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai","assistant","language-model","machine-learning","python","speech","speech-synthesis","speech-to-text","speech-translation"],"created_at":"2024-09-07T02:01:03.072Z","updated_at":"2025-10-14T15:30:15.780Z","avatar_url":"https://github.com/huggingface.png","language":"Python","readme":"\u003cdiv align=\"center\"\u003e\n  \u003cdiv\u003e\u0026nbsp;\u003c/div\u003e\n  \u003cimg src=\"logo.png\" width=\"600\"/\u003e \n\u003c/div\u003e\n\n# Speech To Speech: an effort for an open-sourced and modular GPT4-o\n\n\n## 📖 Quick Index\n* [Approach](#approach)\n  - [Structure](#structure)\n  - [Modularity](#modularity)\n* [Setup](#setup)\n* 
[Usage](#usage)\n  - [Docker Server approach](#docker-server)\n  - [Server/Client approach](#serverclient-approach)\n  - [Local approach](#local-approach-running-on-mac)\n* [Command-line usage](#command-line-usage)\n  - [Model parameters](#model-parameters)\n  - [Generation parameters](#generation-parameters)\n  - [Notable parameters](#notable-parameters)\n\n## Approach\n\n### Structure\nThis repository implements a speech-to-speech cascaded pipeline consisting of the following parts:\n1. **Voice Activity Detection (VAD)**\n2. **Speech to Text (STT)**\n3. **Language Model (LM)**\n4. **Text to Speech (TTS)**\n\n### Modularity\nThe pipeline provides a fully open and modular approach, with a focus on leveraging models available through the Transformers library on the Hugging Face hub. The code is designed for easy modification, and we already support device-specific and external library implementations:\n\n**VAD** \n- [Silero VAD v5](https://github.com/snakers4/silero-vad)\n\n**STT**\n- Any [Whisper](https://huggingface.co/docs/transformers/en/model_doc/whisper) model checkpoint on the Hugging Face Hub through Transformers 🤗, including [whisper-large-v3](https://huggingface.co/openai/whisper-large-v3) and [distil-large-v3](https://huggingface.co/distil-whisper/distil-large-v3)\n- [Lightning Whisper MLX](https://github.com/mustafaaljadery/lightning-whisper-mlx?tab=readme-ov-file#lightning-whisper-mlx)\n- [Paraformer - FunASR](https://github.com/modelscope/FunASR)\n\n**LLM**\n- Any instruction-following model on the [Hugging Face Hub](https://huggingface.co/models?pipeline_tag=text-generation\u0026sort=trending) via Transformers 🤗\n- [mlx-lm](https://github.com/ml-explore/mlx-examples/blob/main/llms/README.md)\n- [OpenAI API](https://platform.openai.com/docs/quickstart)\n\n**TTS**\n- [Parler-TTS](https://github.com/huggingface/parler-tts) 🤗\n- [MeloTTS](https://github.com/myshell-ai/MeloTTS)\n- [ChatTTS](https://github.com/2noise/ChatTTS?tab=readme-ov-file)\n\n## 
Setup\n\nClone the repository:\n```bash\ngit clone https://github.com/huggingface/speech-to-speech.git\ncd speech-to-speech\n```\n\nInstall the required dependencies using [uv](https://github.com/astral-sh/uv):\n```bash\nuv pip install -r requirements.txt\n```\n\nFor Mac users, use the `requirements_mac.txt` file instead:\n```bash\nuv pip install -r requirements_mac.txt\n```\n\nIf you want to use Melo TTS, you also need to run:\n```bash\npython -m unidic download\n```\n\n\n## Usage\n\nThe pipeline can be run in two ways:\n- **Server/Client approach**: Models run on a server; audio input and output are streamed from a client.\n- **Local approach**: The whole pipeline runs on a single machine.\n\n### Server/Client Approach\n\n1. Run the pipeline on the server:\n   ```bash\n   python s2s_pipeline.py --recv_host 0.0.0.0 --send_host 0.0.0.0\n   ```\n\n2. Run the client locally to handle microphone input and receive generated audio:\n   ```bash\n   python listen_and_play.py --host \u003cIP address of your server\u003e\n   ```\n\n### Local Approach (Mac)\n\n1. For optimal settings on Mac:\n   ```bash\n   python s2s_pipeline.py --local_mac_optimal_settings\n   ```\n\nThis flag:\n   - adds `--device mps` to run all models on MPS\n   - sets LightningWhisperMLX for STT\n   - sets MLX LM for the language model\n   - sets MeloTTS for TTS\n\n### Docker Server\n\n#### Install the NVIDIA Container Toolkit\n\nhttps://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html\n\n#### Start the Docker container\n```bash\ndocker compose up\n```\n\n### Recommended usage with CUDA\n\nLeverage Torch Compile for Whisper and Parler-TTS. 
**Using Parler-TTS enables audio output streaming, further reducing the overall latency** 🚀:\n\n```bash\npython s2s_pipeline.py \\\n    --lm_model_name microsoft/Phi-3-mini-4k-instruct \\\n    --stt_compile_mode reduce-overhead \\\n    --tts_compile_mode default \\\n    --recv_host 0.0.0.0 \\\n    --send_host 0.0.0.0\n```\n\nFor the moment, compile modes that capture CUDA graphs (`reduce-overhead`, `max-autotune`) are not compatible with streaming Parler-TTS.\n\n### Multi-language Support\n\nThe pipeline currently supports English, French, Spanish, Chinese, Japanese, and Korean.  \nTwo use cases are considered:\n\n- **Single-language conversation**: Enforce the language setting using the `--language` flag, specifying the target language code (default is 'en').\n- **Language switching**: Set `--language` to 'auto'. In this case, Whisper detects the language of each spoken prompt, and the LLM is prompted with \"`Please reply to my message in ...`\" to ensure the response is in the detected language.\n\nPlease note that you must use STT and LLM checkpoints compatible with the target language(s). For the TTS part, Parler-TTS is not yet multilingual (though that feature is coming soon! 🤗). 
In the meantime, you should use Melo (which supports English, French, Spanish, Chinese, Japanese, and Korean) or Chat-TTS.\n\n#### With the server version:\n\nFor automatic language detection:\n\n```bash\npython s2s_pipeline.py \\\n    --stt_model_name large-v3 \\\n    --language auto \\\n    --mlx_lm_model_name mlx-community/Meta-Llama-3.1-8B-Instruct\n```\n\nOr for one language in particular, Chinese in this example:\n\n```bash\npython s2s_pipeline.py \\\n    --stt_model_name large-v3 \\\n    --language zh \\\n    --mlx_lm_model_name mlx-community/Meta-Llama-3.1-8B-Instruct\n```\n\n#### Local Mac Setup\n\nFor automatic language detection:\n\n```bash\npython s2s_pipeline.py \\\n    --local_mac_optimal_settings \\\n    --device mps \\\n    --stt_model_name large-v3 \\\n    --language auto \\\n    --mlx_lm_model_name mlx-community/Meta-Llama-3.1-8B-Instruct-4bit\n```\n\nOr for one language in particular, Chinese in this example:\n\n```bash\npython s2s_pipeline.py \\\n    --local_mac_optimal_settings \\\n    --device mps \\\n    --stt_model_name large-v3 \\\n    --language zh \\\n    --mlx_lm_model_name mlx-community/Meta-Llama-3.1-8B-Instruct-4bit\n```\n\n## Command-line Usage\n\n\u003e **_NOTE:_** References for all the CLI arguments can be found directly in the [arguments classes](https://github.com/huggingface/speech-to-speech/tree/d5e460721e578fef286c7b64e68ad6a57a25cf1b/arguments_classes) or by running `python s2s_pipeline.py -h`.\n\n### Module-level Parameters\nSee the [ModuleArguments](https://github.com/huggingface/speech-to-speech/blob/d5e460721e578fef286c7b64e68ad6a57a25cf1b/arguments_classes/module_arguments.py) class. 
It allows setting:\n- a common `--device` (if one wants each part to run on the same device)\n- `--mode`: `local` or `server`\n- the chosen STT implementation\n- the chosen LM implementation\n- the chosen TTS implementation\n- the logging level\n\n### VAD parameters\nSee the [VADHandlerArguments](https://github.com/huggingface/speech-to-speech/blob/d5e460721e578fef286c7b64e68ad6a57a25cf1b/arguments_classes/vad_arguments.py) class. Notably:\n- `--thresh`: Threshold value to trigger voice activity detection.\n- `--min_speech_ms`: Minimum duration of detected voice activity to be considered speech.\n- `--min_silence_ms`: Minimum length of silence intervals for segmenting speech, balancing sentence cutting and latency reduction.\n\n\n### STT, LM and TTS parameters\n\n`model_name`, `torch_dtype`, and `device` are exposed for each implementation of the Speech to Text, Language Model, and Text to Speech parts. Specify the targeted pipeline part with the corresponding prefix (e.g. `stt`, `lm`, or `tts`; check the implementations' [arguments classes](https://github.com/huggingface/speech-to-speech/tree/d5e460721e578fef286c7b64e68ad6a57a25cf1b/arguments_classes) for more details).\n\nFor example:\n```bash\n--lm_model_name google/gemma-2b-it\n```\n\n### Generation parameters\n\nOther generation parameters of the model's `generate` method can be set using the part's prefix + `_gen_`, e.g., `--stt_gen_max_new_tokens 128`. 
These parameters can be added to the pipeline part's arguments class if not already exposed.\n\n## Citations\n\n### Silero VAD\n```bibtex\n@misc{Silero_VAD,\n  author = {Silero Team},\n  title = {Silero VAD: pre-trained enterprise-grade Voice Activity Detector (VAD), Number Detector and Language Classifier},\n  year = {2021},\n  publisher = {GitHub},\n  journal = {GitHub repository},\n  howpublished = {\\url{https://github.com/snakers4/silero-vad}},\n  commit = {insert_some_commit_here},\n  email = {hello@silero.ai}\n}\n```\n\n### Distil-Whisper\n```bibtex\n@misc{gandhi2023distilwhisper,\n      title={Distil-Whisper: Robust Knowledge Distillation via Large-Scale Pseudo Labelling},\n      author={Sanchit Gandhi and Patrick von Platen and Alexander M. Rush},\n      year={2023},\n      eprint={2311.00430},\n      archivePrefix={arXiv},\n      primaryClass={cs.CL}\n}\n```\n\n### Parler-TTS\n```bibtex\n@misc{lacombe-etal-2024-parler-tts,\n  author = {Yoach Lacombe and Vaibhav Srivastav and Sanchit Gandhi},\n  title = {Parler-TTS},\n  year = {2024},\n  publisher = {GitHub},\n  journal = {GitHub repository},\n  howpublished = {\\url{https://github.com/huggingface/parler-tts}}\n}\n```\n","funding_links":[],"categories":["Table of Contents \u003c!-- omit in toc --\u003e"],"sub_categories":["Audio Synthesis for Video"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhuggingface%2Fspeech-to-speech","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fhuggingface%2Fspeech-to-speech","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhuggingface%2Fspeech-to-speech/lists"}