{"id":40857841,"url":"https://github.com/dimastatz/whisper-flow","last_synced_at":"2026-01-22T00:03:34.911Z","repository":{"id":242509566,"uuid":"809767236","full_name":"dimastatz/whisper-flow","owner":"dimastatz","description":"Whisper-Flow is a framework designed to enable real-time transcription of audio content using OpenAI’s Whisper model. Rather than processing entire files after upload (“batch mode”), Whisper-Flow accepts a continuous stream of audio chunks and produces incremental transcripts immediately.","archived":false,"fork":false,"pushed_at":"2025-02-26T07:06:41.000Z","size":75875,"stargazers_count":350,"open_issues_count":0,"forks_count":43,"subscribers_count":3,"default_branch":"main","last_synced_at":"2025-11-30T17:49:35.440Z","etag":null,"topics":["pypi-package","python","speech-to-text","transcription","whisper"],"latest_commit_sha":null,"homepage":"https://github.com/dimastatz/whisper-flow","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/dimastatz.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-06-03T12:13:05.000Z","updated_at":"2025-11-30T16:34:39.000Z","dependencies_parsed_at":"2024-10-19T21:22:12.862Z","dependency_job_id":null,"html_url":"https://github.com/dimastatz/whisper-flow","commit_stats":null,"previous_names":["dimastatz/whisper-flow"],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/dimastatz/whisper-flow","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dimastatz%2Fwhisper-flow","tags_url":"https://repos.ecosyste.ms/api
/v1/hosts/GitHub/repositories/dimastatz%2Fwhisper-flow/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dimastatz%2Fwhisper-flow/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dimastatz%2Fwhisper-flow/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/dimastatz","download_url":"https://codeload.github.com/dimastatz/whisper-flow/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dimastatz%2Fwhisper-flow/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28647492,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-21T21:29:11.980Z","status":"ssl_error","status_checked_at":"2026-01-21T21:24:31.872Z","response_time":86,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["pypi-package","python","speech-to-text","transcription","whisper"],"created_at":"2026-01-22T00:03:34.837Z","updated_at":"2026-01-22T00:03:34.902Z","avatar_url":"https://github.com/dimastatz.png","language":"Python","readme":"\u003cdiv align=\"center\"\u003e\n\u003ch1 align=\"center\"\u003e Whisper Flow \u003c/h1\u003e \n\u003ch3\u003eReal-Time Transcription Using OpenAI Whisper\u003c/br\u003e\u003c/h3\u003e\n\u003cimg src=\"https://img.shields.io/badge/Progress-100%25-red\"\u003e 
\u003cimg src=\"https://img.shields.io/badge/Feedback-Welcome-green\"\u003e\n\u003c/br\u003e\n\u003c/br\u003e\n\u003ckbd\u003e\n\u003cimg src=\"https://github.com/dimastatz/whisper-flow/blob/da8b67c6180566b987854b2fb94670fee92e6682/docs/imgs/whisper-flow.png?raw=true\" width=\"256px\"\u003e \n\u003c/kbd\u003e\n\u003c/div\u003e\n\n## About The Project\n\n### OpenAI Whisper \nOpenAI [Whisper](https://github.com/openai/whisper) is a versatile speech recognition model designed for general use. Trained on a vast and varied audio dataset, Whisper can handle tasks such as multilingual speech recognition, speech translation, and language identification. It is commonly used for batch transcription, where you provide the entire audio or video file to Whisper, which then converts the speech into text. This process is not done in real-time; instead, Whisper processes the files and returns the text afterward, similar to handing over a recording and receiving the transcript later.\n\n### Whisper Flow \nUsing Whisper Flow, you can generate real-time transcriptions for your media content. Unlike batch transcriptions, where media files are uploaded and processed, streaming media is delivered to Whisper Flow in real time, and the service returns a transcript immediately.\n\n### What is Streaming\nStreaming content is sent as a series of sequential data packets, or 'chunks,' which Whisper Flow transcribes on the spot. The benefits of using streaming over batch processing include the ability to incorporate real-time speech-to-text functionality into your applications and achieving faster transcription times. However, this speed may come at the expense of accuracy in some cases.\n\n### Stream Windowing\nIn scenarios involving time-streaming, it's typical to perform operations on data within specific time frames known as temporal windows. 
One common approach is using the [tumbling window](https://learn.microsoft.com/en-us/azure/stream-analytics/stream-analytics-window-functions#tumbling-window) technique, which involves gathering events into segments until a certain condition is met.\n\n\u003cdiv align=\"center\"\u003e\n\u003cimg src=\"https://github.com/dimastatz/whisper-flow/blob/main/docs/imgs/streaming.png?raw=true\"\u003e \n\u003cdiv\u003eTumbling Window\u003c/div\u003e\n\u003c/div\u003e\u003cbr/\u003e\n\n### Streaming Results\nWhisper Flow splits the audio stream into segments based on natural speech patterns, like speaker changes or pauses. The transcription is sent back as a series of events, with each response containing more transcribed speech until the entire segment is complete.\n\n| Transcript                                    | EndTime  | IsPartial |\n| :-------------------------------------------- | :------: | --------: |\n| Reality                                       |   0.55   | True      |\n| Reality is created                            |   1.05   | True      |\n| Reality is created by the                     |   1.50   | True      |\n| Reality is created by the mind                |   2.15   | True      |\n| Reality is created by the mind                |   2.65   | False     |\n| we can                                        |   3.05   | True      |\n| we can change                                 |   3.45   | True      |\n| we can change reality                         |   4.05   | True      |\n| we can change reality by changing             |   4.45   | True      |\n| we can change reality by changing our mind    |   5.05   | True      |\n| we can change reality by changing our mind    |   5.55   | False     |\n\n### Benchmarking\nThe evaluation metrics for comparing the performance of Whisper Flow are Word Error Rate (WER) and latency. Latency is measured as the time between two subsequent partial results, with the goal of achieving sub-second latency. 
We're not starting from scratch, as several quality benchmarks have already been performed for different ASR engines. I will rely on the research article [\"Benchmarking Open Source and Paid Services for Speech to Text\"](https://www.frontiersin.org/articles/10.3389/fdata.2023.1210559/full) for guidance. For benchmarking the current implementation of Whisper Flow, I use [LibriSpeech](https://www.openslr.org/12).\n\n```bash\n| Partial | Latency | Result |\n\nTrue  175.47  when we took\nTrue  185.14  When we took her.\nTrue  237.83  when we took our seat.\nTrue  176.42  when we took our seats.\nTrue  198.59  when we took our seats at the\nTrue  186.72  when we took our seats at the\nTrue  210.04  when we took our seats at the breakfast.\nTrue  220.36  when we took our seats at the breakfast table.\nTrue  203.46  when we took our seats at the breakfast table.\nTrue  242.63  When we took our seats at the breakfast table, it will\nTrue  237.41  When we took our seats at the breakfast table, it was with\nTrue  246.36  When we took our seats at the breakfast table, it was with the\nTrue  278.96  When we took our seats at the breakfast table, it was with the feeling.\nTrue  285.03  When we took our seats at the breakfast table, it was with the feeling of being.\nTrue  295.39  When we took our seats at the breakfast table, it was with the feeling of being no\nTrue  270.88  When we took our seats at the breakfast table, it was with the feeling of being no longer\nTrue  320.43  When we took our seats at the breakfast table, it was with the feeling of being no longer looked\nTrue  303.66  When we took our seats at the breakfast table, it was with the feeling of being no longer looked upon.\nTrue  470.73  When we took our seats at the breakfast table, it was with the feeling of being no longer\nTrue  353.25  When we took our seats at the breakfast table, it was with the feeling of being no longer looked upon as connected.\nTrue  345.74  When we took our seats at the breakfast 
table, it was with the feeling of being no longer looked upon as connected in any way.\nTrue  368.66  When we took our seats at the breakfast table, it was with the feeling of being no longer looked upon as connected in any way with the\nTrue  400.25  When we took our seats at the breakfast table, it was with the feeling of being no longer looked upon as connected in any way with this case.\nTrue  382.71  When we took our seats at the breakfast table, it was with the feeling of being no longer looked upon as connected in any way with this case.\nFalse 405.02  When we took our seats at the breakfast table, it was with the feeling of being no longer looked upon as connected in any way with this case.\n```\n\nWhen running this benchmark on a MacBook Air with an [M1 chip and 16GB of RAM](https://support.apple.com/en-il/111883#:~:text=Testing%20conducted%20by%20Apple%20in,to%208%20clicks%20from%20bottom.), we achieve impressive performance metrics. The latency is consistently well below 500ms, ensuring real-time responsiveness. Additionally, the word error rate is around 7%, demonstrating the accuracy of the transcription.\n\n```bash\nLatency Stats:\ncount     26.000000\nmean     275.223077\nstd       84.525695\nmin      154.700000\n25%      205.105000\n50%      258.620000\n75%      339.412500\nmax      470.700000\n```\n\n### How To Use It\n\n#### As a Web Server\nTo run WhisperFlow as a web server, start by cloning the repository to your local machine.\n```bash\ngit clone https://github.com/dimastatz/whisper-flow.git\n```\nThen navigate to the WhisperFlow folder, create a local venv with all the dependencies, and run the web server on port 8181.\n```bash\ncd whisper-flow\n./run.sh -local\nsource .venv/bin/activate\n./run.sh -benchmark\n```\n\n#### As a Python Package\nSet up a WebSocket endpoint for real-time transcription by retrieving the transcription model and creating asynchronous functions for transcribing audio chunks and sending JSON responses. 
Manage the WebSocket connection by continuously processing incoming audio data, and handle the disconnect exception to stop the session and close the connection when the client terminates.\n\nStart by installing the whisperflow Python package:\n\n```bash\npip install whisperflow\n```\n\nNow import the whisperflow streaming and transcriber modules and wire them into a WebSocket endpoint (the snippet assumes a FastAPI application):\n\n```Python\nfrom fastapi import FastAPI, WebSocket\n\nimport whisperflow.streaming as st\nimport whisperflow.transcriber as ts\n\napp = FastAPI()\n\n@app.websocket(\"/ws\")\nasync def websocket_endpoint(websocket: WebSocket):\n    model = ts.get_model()\n\n    async def transcribe_async(chunks: list):\n        return await ts.transcribe_pcm_chunks_async(model, chunks)\n\n    async def send_back_async(data: dict):\n        await websocket.send_json(data)\n\n    await websocket.accept()\n    session = st.TrancribeSession(transcribe_async, send_back_async)\n\n    try:\n        while True:\n            data = await websocket.receive_bytes()\n            session.add_chunk(data)\n    except Exception:\n        # client disconnected or transcription failed: tear down the session\n        await session.stop()\n        await websocket.close()\n```\n\n#### Roadmap\n- [X] Release v1.0-RC - Includes transcription streaming implementation.\n- [X] Release v1.1 - Bug fixes and implementation of the most requested changes.\n- [ ] Release v1.2 - Prepare the package for integration with the py-speech package.\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdimastatz%2Fwhisper-flow","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdimastatz%2Fwhisper-flow","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdimastatz%2Fwhisper-flow/lists"}
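The tumbling-window technique described in the README's Stream Windowing section can be sketched in a few lines. This is an illustrative sketch only, not part of the whisperflow API: `tumbling_windows` is a hypothetical helper that groups fixed-duration audio chunks into non-overlapping windows, closing each window once its accumulated duration reaches a threshold.

```python
def tumbling_windows(chunks, chunk_ms=100, window_ms=500):
    """Group fixed-duration chunks into non-overlapping (tumbling) windows.

    Hypothetical helper for illustration: each chunk is assumed to carry
    `chunk_ms` milliseconds of audio, and a window closes once it holds at
    least `window_ms` milliseconds.
    """
    windows, current, elapsed = [], [], 0
    for chunk in chunks:
        current.append(chunk)
        elapsed += chunk_ms
        if elapsed >= window_ms:  # window condition met: emit and reset
            windows.append(current)
            current, elapsed = [], 0
    if current:  # flush the trailing partial window
        windows.append(current)
    return windows
```

In practice the closing condition need not be purely time-based; as the README notes, segments may close on natural speech boundaries such as pauses or speaker changes.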
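The README's Benchmarking section evaluates transcripts by Word Error Rate (WER). For reference, a minimal WER computation (word-level Levenshtein edit distance divided by the reference length) can be sketched as below; this is a generic sketch, not whisperflow's benchmarking code.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(ref)][len(hyp)] / len(ref)
```

For example, scoring the partial transcript "when we took her seats" against the reference "when we took our seats" counts one substitution over five reference words, giving a WER of 0.2.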