{"id":30702977,"url":"https://github.com/QuentinFuxa/WhisperLiveKit","last_synced_at":"2025-09-02T16:03:34.187Z","repository":{"id":282868373,"uuid":"905697354","full_name":"QuentinFuxa/WhisperLiveKit","owner":"QuentinFuxa","description":"Python package for Real-time, Local Speech-to-Text and Speaker Diarization. FastAPI Server \u0026 Web Interface","archived":false,"fork":false,"pushed_at":"2025-08-26T16:33:21.000Z","size":6046,"stargazers_count":1007,"open_issues_count":55,"forks_count":178,"subscribers_count":11,"default_branch":"main","last_synced_at":"2025-08-26T23:10:14.578Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/QuentinFuxa.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2024-12-19T10:49:09.000Z","updated_at":"2025-08-26T22:30:05.000Z","dependencies_parsed_at":"2025-04-09T09:23:18.982Z","dependency_job_id":"970d06d9-1205-4202-9608-1c395a7515ed","html_url":"https://github.com/QuentinFuxa/WhisperLiveKit","commit_stats":null,"previous_names":["quentinfuxa/whisperlivekit"],"tags_count":15,"template":false,"template_full_name":null,"purl":"pkg:github/QuentinFuxa/WhisperLiveKit","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/QuentinFuxa%2FWhisperLiveKit","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/QuentinFuxa%2FWhisperLiveKit/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/QuentinFuxa%2FWhisperLiveKit/releases","manifests_
url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/QuentinFuxa%2FWhisperLiveKit/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/QuentinFuxa","download_url":"https://codeload.github.com/QuentinFuxa/WhisperLiveKit/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/QuentinFuxa%2FWhisperLiveKit/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":273309572,"owners_count":25082545,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-09-02T02:00:09.530Z","response_time":77,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-09-02T16:02:10.573Z","updated_at":"2025-09-02T16:03:34.171Z","avatar_url":"https://github.com/QuentinFuxa.png","language":"Python","funding_links":[],"categories":["Python","STT (Speech-to-Text) | 语音转文本","Repos","语音合成","Table of Contents"],"sub_categories":["Realtime Whisper Implementations | Whisper 实时流式实现","资源传输下载","Transcription"],"readme":"\u003ch1 align=\"center\"\u003eWhisperLiveKit\u003c/h1\u003e\n\n\u003cp align=\"center\"\u003e\n\u003cimg src=\"https://raw.githubusercontent.com/QuentinFuxa/WhisperLiveKit/refs/heads/main/demo.png\" alt=\"WhisperLiveKit Demo\" width=\"730\"\u003e\n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\u003cb\u003eReal-time, Fully Local Speech-to-Text with Speaker 
Identification\u003c/b\u003e\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n\u003ca href=\"https://pypi.org/project/whisperlivekit/\"\u003e\u003cimg alt=\"PyPI Version\" src=\"https://img.shields.io/pypi/v/whisperlivekit?color=g\"\u003e\u003c/a\u003e\n\u003ca href=\"https://pepy.tech/project/whisperlivekit\"\u003e\u003cimg alt=\"PyPI Downloads\" src=\"https://static.pepy.tech/personalized-badge/whisperlivekit?period=total\u0026units=international_system\u0026left_color=grey\u0026right_color=brightgreen\u0026left_text=installations\"\u003e\u003c/a\u003e\n\u003ca href=\"https://pypi.org/project/whisperlivekit/\"\u003e\u003cimg alt=\"Python Versions\" src=\"https://img.shields.io/badge/python-3.9--3.13-dark_green\"\u003e\u003c/a\u003e\n\u003ca href=\"https://github.com/QuentinFuxa/WhisperLiveKit/blob/main/LICENSE\"\u003e\u003cimg alt=\"License\" src=\"https://img.shields.io/badge/License-MIT/Dual Licensed-dark_green\"\u003e\u003c/a\u003e\n\u003c/p\u003e\n\n\nReal-time speech transcription directly to your browser, with a ready-to-use backend+server and a simple frontend. ✨\n\n#### Powered by Leading Research:\n\n- [SimulStreaming](https://github.com/ufal/SimulStreaming) (SOTA 2025) - Ultra-low latency transcription with AlignAtt policy\n- [WhisperStreaming](https://github.com/ufal/whisper_streaming) (SOTA 2023) - Low latency transcription with LocalAgreement policy\n- [Streaming Sortformer](https://arxiv.org/abs/2507.18446) (SOTA 2025) - Advanced real-time speaker diarization\n- [Diart](https://github.com/juanmc2005/diart) (SOTA 2021) - Real-time speaker diarization\n- [Silero VAD](https://github.com/snakers4/silero-vad) (2024) - Enterprise-grade Voice Activity Detection\n\n\n\u003e **Why not just run a simple Whisper model on every audio batch?** Whisper is designed for complete utterances, not real-time chunks. Processing small segments loses context, cuts off words mid-syllable, and produces poor transcription. 
WhisperLiveKit uses state-of-the-art simultaneous speech research for intelligent buffering and incremental processing.\n\n\n### Architecture\n\n\u003cimg alt=\"Architecture\" src=\"https://raw.githubusercontent.com/QuentinFuxa/WhisperLiveKit/refs/heads/main/architecture.png\" /\u003e\n\n*The backend supports multiple concurrent users. Voice Activity Detection reduces overhead when no voice is detected.*\n\n### Installation \u0026 Quick Start\n\n```bash\npip install whisperlivekit\n```\n\n\u003e  **FFmpeg is required** and must be installed before using WhisperLiveKit\n\u003e \n\u003e | OS | How to install |\n\u003e |-----------|-------------|\n\u003e  | Ubuntu/Debian | `sudo apt install ffmpeg` |\n\u003e | MacOS | `brew install ffmpeg` |\n\u003e | Windows | Download .exe from https://ffmpeg.org/download.html and add to PATH |\n\n#### Quick Start\n1. **Start the transcription server:**\n   ```bash\n   whisperlivekit-server --model base --language en\n   ```\n\n2. **Open your browser** and navigate to `http://localhost:8000`. 
Start speaking and watch your words appear in real-time!\n\n\n\u003e - See [tokenizer.py](https://github.com/QuentinFuxa/WhisperLiveKit/blob/main/whisperlivekit/simul_whisper/whisper/tokenizer.py) for the list of all available languages.\n\u003e - For HTTPS requirements, see the **Parameters** section for SSL configuration options.\n\n \n\n#### Optional Dependencies\n\n| Optional | `pip install` |\n|-----------|-------------|\n| **Speaker diarization with Sortformer** | `git+https://github.com/NVIDIA/NeMo.git@main#egg=nemo_toolkit[asr]` |\n| Speaker diarization with Diart | `diart` |\n| Original Whisper backend | `openai-whisper` |\n| Improved timestamps backend | `whisper-timestamped` |\n| Apple Silicon optimization backend | `mlx-whisper` |\n| OpenAI API backend | `openai` |\n\nSee **Parameters \u0026 Configuration** below on how to use them.\n\n\n\n### Usage Examples\n\n**Command-line Interface**: Start the transcription server with various options:\n\n```bash\n# Use a better model than the default (small)\nwhisperlivekit-server --model large-v3\n\n# Advanced configuration with diarization and language\nwhisperlivekit-server --host 0.0.0.0 --port 8000 --model medium --diarization --language fr\n```\n\n\n**Python API Integration**: Check [basic_server](https://github.com/QuentinFuxa/WhisperLiveKit/blob/main/whisperlivekit/basic_server.py) for a more complete example of how to use the functions and classes.\n\n```python\nfrom whisperlivekit import TranscriptionEngine, AudioProcessor, parse_args\nfrom fastapi import FastAPI, WebSocket, WebSocketDisconnect\nfrom fastapi.responses import HTMLResponse\nfrom contextlib import asynccontextmanager\nimport asyncio\n\ntranscription_engine = None\n\n@asynccontextmanager\nasync def lifespan(app: FastAPI):\n    global transcription_engine\n    transcription_engine = TranscriptionEngine(model=\"medium\", diarization=True, lan=\"en\")\n    yield\n\napp = FastAPI(lifespan=lifespan)\n\nasync def handle_websocket_results(websocket: WebSocket, 
results_generator):\n    async for response in results_generator:\n        await websocket.send_json(response)\n    await websocket.send_json({\"type\": \"ready_to_stop\"})\n\n@app.websocket(\"/asr\")\nasync def websocket_endpoint(websocket: WebSocket):\n    global transcription_engine\n\n    # Create a new AudioProcessor for each connection, passing the shared engine\n    audio_processor = AudioProcessor(transcription_engine=transcription_engine)    \n    results_generator = await audio_processor.create_tasks()\n    results_task = asyncio.create_task(handle_websocket_results(websocket, results_generator))\n    await websocket.accept()\n    while True:\n        message = await websocket.receive_bytes()\n        await audio_processor.process_audio(message)        \n```\n\n**Frontend Implementation**: The package includes an HTML/JavaScript implementation [here](https://github.com/QuentinFuxa/WhisperLiveKit/blob/main/whisperlivekit/web/live_transcription.html). You can also import it using `from whisperlivekit import get_inline_ui_html` \u0026 `page = get_inline_ui_html()`\n\n\n## Parameters \u0026 Configuration\n\nMany parameters can be changed. But what *should* you change?\n- the `--model` size. List and recommendations [here](https://github.com/QuentinFuxa/WhisperLiveKit/blob/main/available_models.md)\n- the `--language`. List [here](https://github.com/QuentinFuxa/WhisperLiveKit/blob/main/whisperlivekit/simul_whisper/whisper/tokenizer.py). If you use `auto`, the model attempts to detect the language automatically, but it tends to be biased towards English.\n- the `--backend`: you can switch to `--backend faster-whisper` if `simulstreaming` does not work correctly or if you prefer to avoid the dual-license requirements.\n- `--warmup-file`, if you have one\n- `--host`, `--port`, `--ssl-certfile`, `--ssl-keyfile`, if you set up a server\n- `--diarization`, if you want to use it.\n\nI don't recommend changing the rest. 
But below are your options.\n\n| Parameter | Description | Default |\n|-----------|-------------|---------|\n| `--model` | Whisper model size. | `small` |\n| `--language` | Source language code or `auto` | `auto` |\n| `--task` | `transcribe` or `translate` | `transcribe` |\n| `--backend` | Processing backend | `simulstreaming` |\n| `--min-chunk-size` | Minimum audio chunk size (seconds) | `1.0` |\n| `--no-vac` | Disable Voice Activity Controller | `False` |\n| `--no-vad` | Disable Voice Activity Detection | `False` |\n| `--warmup-file` | Audio file path for model warmup | `jfk.wav` |\n| `--host` | Server host address | `localhost` |\n| `--port` | Server port | `8000` |\n| `--ssl-certfile` | Path to the SSL certificate file (for HTTPS support) | `None` |\n| `--ssl-keyfile` | Path to the SSL private key file (for HTTPS support) | `None` |\n\n\n| WhisperStreaming backend options | Description | Default |\n|-----------|-------------|---------|\n| `--confidence-validation` | Use confidence scores for faster validation | `False` |\n| `--buffer_trimming` | Buffer trimming strategy (`sentence` or `segment`) | `segment` |\n\n\n| SimulStreaming backend options | Description | Default |\n|-----------|-------------|---------|\n| `--frame-threshold` | AlignAtt frame threshold (lower = faster, higher = more accurate) | `25` |\n| `--beams` | Number of beams for beam search (1 = greedy decoding) | `1` |\n| `--decoder` | Force decoder type (`beam` or `greedy`) | `auto` |\n| `--audio-max-len` | Maximum audio buffer length (seconds) | `30.0` |\n| `--audio-min-len` | Minimum audio length to process (seconds) | `0.0` |\n| `--cif-ckpt-path` | Path to CIF model for word boundary detection | `None` |\n| `--never-fire` | Never truncate incomplete words | `False` |\n| `--init-prompt` | Initial prompt for the model | `None` |\n| `--static-init-prompt` | Static prompt that doesn't scroll | `None` |\n| `--max-context-tokens` | Maximum context tokens | `None` |\n| `--model-path` | Direct path 
to .pt model file. Download it if not found | `./base.pt` |\n| `--preloaded-model-count` | Optional. Number of models to preload in memory to speed up loading (set up to the expected number of concurrent users) | `1` |\n\n| Diarization options | Description | Default |\n|-----------|-------------|---------|\n| `--diarization` | Enable speaker identification | `False` |\n| `--diarization-backend` |  `diart` or `sortformer` | `sortformer` |\n| `--segmentation-model` | Hugging Face model ID for Diart segmentation model. [Available models](https://github.com/juanmc2005/diart/tree/main?tab=readme-ov-file#pre-trained-models) | `pyannote/segmentation-3.0` |\n| `--embedding-model` | Hugging Face model ID for Diart embedding model. [Available models](https://github.com/juanmc2005/diart/tree/main?tab=readme-ov-file#pre-trained-models) | `speechbrain/spkrec-ecapa-voxceleb` |\n\n\n\u003e For diarization using Diart, you need access to pyannote.audio models:\n\u003e 1. [Accept user conditions](https://huggingface.co/pyannote/segmentation) for the `pyannote/segmentation` model\n\u003e 2. [Accept user conditions](https://huggingface.co/pyannote/segmentation-3.0) for the `pyannote/segmentation-3.0` model\n\u003e 3. [Accept user conditions](https://huggingface.co/pyannote/embedding) for the `pyannote/embedding` model\n\u003e 4. Log in to Hugging Face: `huggingface-cli login`\n\n### 🚀 Deployment Guide\n\nTo deploy WhisperLiveKit in production:\n \n1. **Server Setup**: Install a production ASGI server \u0026 launch it with multiple workers\n   ```bash\n   pip install uvicorn gunicorn\n   gunicorn -k uvicorn.workers.UvicornWorker -w 4 your_app:app\n   ```\n\n2. **Frontend**: Host your customized version of the `html` example \u0026 ensure its WebSocket URL points to your server\n\n3. 
**Nginx Configuration** (recommended for production):\n   ```nginx\n   server {\n       listen 80;\n       server_name your-domain.com;\n\n       location / {\n           proxy_pass http://localhost:8000;\n           # WebSocket upgrades require HTTP/1.1 on the proxied connection\n           proxy_http_version 1.1;\n           proxy_set_header Upgrade $http_upgrade;\n           proxy_set_header Connection \"upgrade\";\n           proxy_set_header Host $host;\n       }\n   }\n   ```\n\n4. **HTTPS Support**: For secure deployments, use \"wss://\" instead of \"ws://\" in the WebSocket URL\n\n## 🐋 Docker\n\nDeploy the application easily using Docker with GPU or CPU support.\n\n### Prerequisites\n- Docker installed on your system\n- For GPU support: NVIDIA Docker runtime installed\n\n### Quick Start\n\n**With GPU acceleration (recommended):**\n```bash\ndocker build -t wlk .\ndocker run --gpus all -p 8000:8000 --name wlk wlk\n```\n\n**CPU only:**\n```bash\ndocker build -f Dockerfile.cpu -t wlk .\ndocker run -p 8000:8000 --name wlk wlk\n```\n\n### Advanced Usage\n\n**Custom configuration:**\n```bash\n# Example with custom model and language\ndocker run --gpus all -p 8000:8000 --name wlk wlk --model large-v3 --language fr\n```\n\n### Memory Requirements\n- **Large models**: Ensure your Docker runtime has sufficient memory allocated\n\n\n#### Customization\n\n- `--build-arg` Options:\n  - `EXTRAS=\"whisper-timestamped\"` - Add extras to the image's installation (no spaces). 
Remember to set necessary container options!\n  - `HF_PRECACHE_DIR=\"./.cache/\"` - Pre-load a model cache for faster first-time start\n  - `HF_TKN_FILE=\"./token\"` - Add your Hugging Face Hub access token to download gated models\n\n## 🔮 Use Cases\nCapture discussions in real-time for meeting transcription, help hearing-impaired users follow conversations through accessibility tools, transcribe podcasts or videos automatically for content creation, transcribe support calls with speaker identification for customer service...\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FQuentinFuxa%2FWhisperLiveKit","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FQuentinFuxa%2FWhisperLiveKit","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FQuentinFuxa%2FWhisperLiveKit/lists"}