{"id":22010385,"url":"https://github.com/gpustack/vox-box","last_synced_at":"2025-04-07T07:00:22.597Z","repository":{"id":264121937,"uuid":"891837489","full_name":"gpustack/vox-box","owner":"gpustack","description":"A text-to-speech and speech-to-text server compatible with the OpenAI API, supporting Whisper, FunASR, Bark, and CosyVoice backends.","archived":false,"fork":false,"pushed_at":"2025-01-14T04:13:44.000Z","size":473,"stargazers_count":89,"open_issues_count":7,"forks_count":9,"subscribers_count":6,"default_branch":"main","last_synced_at":"2025-03-31T06:01:03.643Z","etag":null,"topics":["asr","audio-processing","openai-compatible-api","python","speech-to-text","stt","text-to-speech","tts"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/gpustack.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-11-21T03:26:00.000Z","updated_at":"2025-03-30T15:01:48.000Z","dependencies_parsed_at":"2024-12-11T11:32:28.421Z","dependency_job_id":"b396f852-7c58-46b9-8f2b-0fc7b9690c79","html_url":"https://github.com/gpustack/vox-box","commit_stats":null,"previous_names":["gpustack/vox-box"],"tags_count":11,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gpustack%2Fvox-box","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gpustack%2Fvox-box/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gpustack%2Fvox-box/releases","manifests_url":"https://repos.ecosyste
.ms/api/v1/hosts/GitHub/repositories/gpustack%2Fvox-box/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/gpustack","download_url":"https://codeload.github.com/gpustack/vox-box/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247608150,"owners_count":20965952,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["asr","audio-processing","openai-compatible-api","python","speech-to-text","stt","text-to-speech","tts"],"created_at":"2024-11-30T02:12:50.813Z","updated_at":"2025-04-07T07:00:22.545Z","avatar_url":"https://github.com/gpustack.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":"# Vox Box\n\nA text-to-speech and speech-to-text server compatible with the OpenAI API, with Whisper, FunASR, Bark, and CosyVoice backends.\n\n## Requirements\n\n- Python 3.10 or greater\n- For NVIDIA GPU support, the following NVIDIA libraries must be installed:\n  - [cuBLAS for CUDA 12](https://developer.nvidia.com/cublas)\n  - [cuDNN 9 for CUDA 12](https://developer.nvidia.com/cudnn)\n\n## Installation\n\nYou can install the project using pip:\n\n```bash\npip install vox-box\n\n# On macOS, manually install `openfst`, `pynini`, and `wetextprocessing` after installing `vox-box` to make `cosyvoice` work:\nbrew install openfst\nexport CPLUS_INCLUDE_PATH=$(brew --prefix openfst)/include\nexport LIBRARY_PATH=$(brew --prefix openfst)/lib\npip install pynini==2.1.6\npip install wetextprocessing==1.0.4.1\n```\n\n## 
Usage\n\n```bash\nvox-box start --huggingface-repo-id Systran/faster-whisper-small --data-dir ./cache/data-dir --host 0.0.0.0 --port 80\n\n# Windows\nvox-box start --huggingface-repo-id Systran/faster-whisper-small --data-dir C:\\Users\\michelia\\AppData\\Roaming\\vox-box --host 0.0.0.0 --port 8082\n```\n\n### Options\n- -d, --debug: Enable debug mode.\n- --host: Host to bind the server to. Default is 0.0.0.0.\n- --port: Port to bind the server to. Default is 80.\n- --model: model path.\n- --device: Binding device, e.g., cuda:0. Default is cpu.\n- --huggingface-repo-id: Huggingface repo id for the model.\n- --model-scope-model-id: Model scope model id for the model.\n- --data-dir: Directory to store downloaded model data. Default is OS specific.\n\n## Supported Models\n\n| Model                           | Type           | Link                                                                                                                                                                                        | Verified Platforms                                                      |\n| ------------------------------- | -------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------- |\n| Faster-whisper-large-v3         | speech-to-text | [Hugging Face](https://huggingface.co/Systran/faster-whisper-large-v3), [ModelScope](https://modelscope.cn/models/gpustack/faster-whisper-large-v3)                                         | Linux \u0026#9989;, Windows \u0026#9989;, MacOS \u0026#9989;                           |\n| Faster-whisper-large-v2         | speech-to-text | [Hugging Face](https://huggingface.co/Systran/faster-whisper-large-v2), [ModelScope](https://modelscope.cn/models/gpustack/faster-whisper-large-v2)                                         | Linux 
\u0026#9989;, Windows \u0026#9989;, MacOS \u0026#9989;                           |\n| Faster-whisper-large-v1         | speech-to-text | [Hugging Face](https://huggingface.co/Systran/faster-whisper-large-v1), [ModelScope](https://modelscope.cn/models/gpustack/faster-whisper-large-v1)                                         |                                                                         |\n| Faster-whisper-medium           | speech-to-text | [Hugging Face](https://huggingface.co/Systran/faster-whisper-medium), [ModelScope](https://modelscope.cn/models/gpustack/faster-whisper-medium)                                             | Linux \u0026#9989;, Windows \u0026#9989;, MacOS \u0026#9989;                           |\n| Faster-whisper-medium.en        | speech-to-text | [Hugging Face](https://huggingface.co/Systran/faster-whisper-medium.en), [ModelScope](https://modelscope.cn/models/gpustack/faster-whisper-medium.en)                                       |                                                                         |\n| Faster-whisper-small            | speech-to-text | [Hugging Face](https://huggingface.co/Systran/faster-whisper-small), [ModelScope](https://modelscope.cn/models/gpustack/faster-whisper-small)                                               | Linux \u0026#9989;, Windows \u0026#9989;, MacOS \u0026#9989;                           |\n| Faster-whisper-small.en         | speech-to-text | [Hugging Face](https://huggingface.co/Systran/faster-whisper-small.en), [ModelScope](https://modelscope.cn/models/gpustack/faster-whisper-small.en)                                         |                                                                         |\n| Faster-distil-whisper-large-v3  | speech-to-text | [Hugging Face](https://huggingface.co/Systran/faster-distil-whisper-large-v3), [ModelScope](https://modelscope.cn/models/gpustack/faster-distil-whisper-large-v3)                           | MacOS \u0026#9989;                                   
                        |\n| Faster-distil-whisper-large-v2  | speech-to-text | [Hugging Face](https://huggingface.co/Systran/faster-distil-whisper-large-v2), [ModelScope](https://modelscope.cn/models/gpustack/faster-distil-whisper-large-v2)                           | MacOS \u0026#9989;                                                           |\n| Faster-distil-whisper-medium.en | speech-to-text | [Hugging Face](https://huggingface.co/Systran/faster-distil-whisper-medium.en), [ModelScope](https://modelscope.cn/models/gpustack/faster-distil-whisper-medium.en)                         |                                                                         |\n| Faster-whisper-tiny             | speech-to-text | [Hugging Face](https://huggingface.co/Systran/faster-whisper-tiny), [ModelScope](https://modelscope.cn/models/gpustack/faster-whisper-tiny)                                                 |                                                                         |\n| Faster-whisper-tiny.en          | speech-to-text | [Hugging Face](https://huggingface.co/Systran/faster-whisper-tiny.en), [ModelScope](https://modelscope.cn/models/gpustack/faster-whisper-tiny.en)                                           |                                                                         |\n| Paraformer-zh                   | speech-to-text | [Hugging Face](https://huggingface.co/funasr/paraformer-zh), [ModelScope](https://www.modelscope.cn/models/iic/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch) |                                                                         |\n| Paraformer-zh-streaming         | speech-to-text | [Hugging Face](https://huggingface.co/funasr/paraformer-zh-streaming), [ModelScope](https://modelscope.cn/models/iic/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-online)     | Linux \u0026#9989;, MacOS \u0026#9989;                                            |\n| Paraformer-en                   | 
speech-to-text | [Hugging Face](https://huggingface.co/funasr/paraformer-en), [ModelScope](https://www.modelscope.cn/models/iic/speech_paraformer-large-vad-punc_asr_nat-en-16k-common-vocab10020)           |                                                                         |\n| Conformer-en                    | speech-to-text | [Hugging Face](https://huggingface.co/funasr/conformer-en), [Modelscope](https://modelscope.cn/models/iic/speech_conformer_asr-en-16k-vocab4199-pytorch)                                    |                                                                         |\n| SenseVoiceSmall                 | speech-to-text | [Hugging Face](https://huggingface.co/FunAudioLLM/SenseVoiceSmall), [ModelScope](https://www.modelscope.cn/models/iic/SenseVoiceSmall)                                                      | Linux \u0026#9989;, Windows \u0026#9989;, MacOS \u0026#9989;                           |\n| Bark                            | text-to-speech | [Hugging Face](https://huggingface.co/suno/bark)                                                                                                                                            | Linux \u0026#9989;, Windows, MacOS \u0026#9989;                                   |\n| Bark-small                      | text-to-speech | [Hugging Face](https://huggingface.co/suno/bark-small)                                                                                                                                      | Linux \u0026#9989;, Windows, MacOS \u0026#9989;                                   |\n| CosyVoice2-0.5B                 | text-to-speech | [Hugging Face](https://huggingface.co/FunAudioLLM/CosyVoice2-0.5B), [ModelScope](https://modelscope.cn/models/iic/CosyVoice2-0.5B)                                                          | Linux(ARM not supported) \u0026#9989;, Windows(Not supported), macOS \u0026#9989; |\n| CosyVoice-300M-Instruct         | text-to-speech | [Hugging 
Face](https://huggingface.co/FunAudioLLM/CosyVoice-300M-Instruct), [ModelScope](https://modelscope.cn/models/iic/CosyVoice-300M-Instruct)                                          | Linux(ARM not supported) \u0026#9989;, Windows(Not supported), macOS \u0026#9989; |\n| CosyVoice-300M-SFT              | text-to-speech | [Hugging Face](https://huggingface.co/FunAudioLLM/CosyVoice-300M-SFT), [ModelScope](https://modelscope.cn/models/iic/CosyVoice-300M-SFT)                                                    | Linux(ARM not supported) \u0026#9989;, Windows(Not supported), macOS \u0026#9989; |\n| CosyVoice-300M                  | text-to-speech | [Hugging Face](https://huggingface.co/FunAudioLLM/CosyVoice-300M), [ModelScope](https://modelscope.cn/models/iic/CosyVoice-300M)                                                            | Linux(ARM not supported) \u0026#9989;, Windows(Not supported), macOS \u0026#9989; |\n| CosyVoice-300M-25Hz             | text-to-speech | [ModelScope](https://modelscope.cn/models/iic/CosyVoice-300M-25Hz)                                                                                                                          | Linux(ARM not supported) \u0026#9989;, Windows(Not supported), macOS \u0026#9989; |\n\n## Supported APIs\n\n### Create speech \n\n**Endpoint**: `POST /v1/audio/speech`\n\nGenerates audio from the input text. Compatible with the [OpenAI audio/speech API](https://platform.openai.com/docs/api-reference/audio/createSpeech).\n\n**Example Request**:\n```bash\ncurl http://localhost/v1/audio/speech \\\n  -H \"Authorization: Bearer $OPENAI_API_KEY\" \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\n    \"model\": \"cosyvoice\",\n    \"input\": \"Hello world\",\n    \"voice\": \"English Female\"\n  }' \\\n  --output speech.mp3\n```\n\n**Response**:\nThe audio file content.\n\n### Create transcription \n\n**Endpoint**: `POST /v1/audio/transcriptions`\n\nTranscribes audio into the input language. 
Compatible with the [OpenAI audio/transcription API](https://platform.openai.com/docs/api-reference/audio/createTranscription).\n\n**Example Request**:\n```bash\ncurl http://localhost/v1/audio/transcriptions \\\n  -H \"Authorization: Bearer $OPENAI_API_KEY\" \\\n  -H \"Content-Type: multipart/form-data\" \\\n  -F file=\"@/path/to/file/audio.mp3\" \\\n  -F model=\"whisper-large-v3\"\n```\n\n**Response**:\n```json\n{\n  \"text\": \"Hello world.\"\n}\n```\n\n### List Models\n\n**Endpoint**: `GET /v1/models`\n\nReturns the currently running models.\n\n### Get Model\n\n**Endpoint**: `GET /v1/models/{model_id}`\n\nReturns the currently running model.\n\n### Get Voices\n\n**Endpoint**: `GET /v1/voices`\n\nReturns the voices supported by the currently running model.\n\n### Health Check\n\n**Endpoint**: `GET /health`\n\nReturns the health check result of Vox Box.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgpustack%2Fvox-box","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgpustack%2Fvox-box","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgpustack%2Fvox-box/lists"}