{"id":13454334,"url":"https://github.com/Const-me/Whisper","last_synced_at":"2025-03-24T05:33:42.510Z","repository":{"id":65330021,"uuid":"586310592","full_name":"Const-me/Whisper","owner":"Const-me","description":"High-performance GPGPU inference of OpenAI's Whisper automatic speech recognition (ASR) model","archived":false,"fork":false,"pushed_at":"2024-08-03T02:35:39.000Z","size":4520,"stargazers_count":8376,"open_issues_count":145,"forks_count":718,"subscribers_count":87,"default_branch":"master","last_synced_at":"2024-10-29T15:27:24.292Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mpl-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Const-me.png","metadata":{"files":{"readme":"Readme.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-01-07T17:25:57.000Z","updated_at":"2024-10-29T01:51:59.000Z","dependencies_parsed_at":"2024-11-16T17:36:51.093Z","dependency_job_id":"6de973a8-0edc-4a7f-94ea-3a714c4b0186","html_url":"https://github.com/Const-me/Whisper","commit_stats":{"total_commits":183,"total_committers":1,"mean_commits":183.0,"dds":0.0,"last_synced_commit":"9440de17fa4be62d3ca3bd8b48ebdc48baf12ac6"},"previous_names":[],"tags_count":17,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Const-me%2FWhisper","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Const-me%2FWhisper/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Const-me%2FWhisper/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHu
b/repositories/Const-me%2FWhisper/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Const-me","download_url":"https://codeload.github.com/Const-me/Whisper/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":245217428,"owners_count":20579291,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-07-31T08:00:53.171Z","updated_at":"2025-03-24T05:33:40.428Z","avatar_url":"https://github.com/Const-me.png","language":"C++","readme":"﻿This project is a Windows port of the [whisper.cpp](https://github.com/ggerganov/whisper.cpp) implementation.\u003cbr/\u003e\nWhich in turn is a C++ port of [OpenAI's Whisper](https://github.com/openai/whisper) automatic speech recognition (ASR) model.\n\n# Quick Start Guide\n\nDownload WhisperDesktop.zip from the “Releases” section of this repository, unpack the ZIP, and run WhisperDesktop.exe.\n\nOn the first screen it will ask you to download a model.\u003cbr/\u003e\nI recommend `ggml-medium.bin` (1.42GB in size), because I’ve mostly tested the software with that model.\u003cbr/\u003e\n![Load Model Screen](gui-load-model.png)\n\nThe next screen allows to transcribe an audio file.\u003cbr/\u003e\n![Transcribe Screen](gui-transcribe.png)\n\nThere’s another screen which allows to capture and transcribe or translate live audio from a microphone.\u003cbr/\u003e\n![Capture Screen](gui-capture.png)\n\n# Features\n\n* Vendor-agnostic GPGPU based on DirectCompute; another name for that technology is “compute shaders in Direct3D 11”\n\n* Plain C++ 
implementation, no runtime dependencies except essential OS components\n\n* Much faster than OpenAI’s implementation.\u003cbr/\u003e\nOn my desktop computer with GeForce [1080Ti](https://en.wikipedia.org/wiki/GeForce_10_series#GeForce_10_(10xx)_series_for_desktops) GPU,\nmedium model, [3:24 min speech](https://upload.wikimedia.org/wikipedia/commons/1/1f/George_W_Bush_Columbia_FINAL.ogg)\ntook 45 seconds to transcribe with PyTorch and CUDA, but only 19 seconds with my implementation and DirectCompute.\u003cbr/\u003e\nFunfact: that’s 9.63 gigabytes runtime dependencies, versus 431 kilobytes `Whisper.dll`\n\n* Mixed F16 / F32 precision: Windows \n[requires support](https://learn.microsoft.com/en-us/windows/win32/direct3ddxgi/format-support-for-direct3d-feature-level-10-0-hardware#dxgi_format_r16_floatfcs-54)\nof `R16_FLOAT` buffers since D3D version 10.0\n\n* Built-in performance profiler which measures execution time of individual compute shaders\n\n* Low memory usage\n\n* Media Foundation for audio handling, supports most audio and video formats (with the notable exception of Ogg Vorbis),\nand most audio capture devices which work on Windows (except some professional ones, which only implementing [ASIO](https://en.wikipedia.org/wiki/Audio_Stream_Input/Output) API).\n\n* Voice activity detection for audio capture.\u003cbr/\u003e\nThe implementation is based on the [2009 article](https://www.researchgate.net/publication/255667085_A_simple_but_efficient_real-time_voice_activity_detection_algorithm)\n“A simple but efficient real-time voice activity detection algorithm” by Mohammad Moattar and Mahdi Homayoonpoor.\n\n* Easy to use COM-style API. 
Idiomatic C# wrapper [available on nuget](https://www.nuget.org/packages/WhisperNet/).\u003cbr/\u003e\nVersion 1.10 [introduced](https://github.com/Const-me/Whisper/tree/master/WhisperPS)\nscripting support for PowerShell 5.1, that’s the older “Windows PowerShell” version which comes pre-installed on Windows.\n\n* Pre-built binaries available\n\nThe only supported platform is 64-bit Windows.\u003cbr/\u003e\nShould work on Windows 8.1 or newer, but I have only tested on Windows 10.\u003cbr/\u003e\nThe library requires a Direct3D 11.0 capable GPU, which in 2023 simply means “any hardware GPU”.\nThe most recent GPU without D3D 11.0 support was Intel [Sandy Bridge](https://en.wikipedia.org/wiki/Sandy_Bridge) from 2011.\n\nOn the CPU side, the library requires [AVX1](https://en.wikipedia.org/wiki/Advanced_Vector_Extensions) and [F16C](https://en.wikipedia.org/wiki/F16C) support.\n\n# Developer Guide\n\n## Build Instructions\n\n1. Clone this repository\n\n2. Open `WhisperCpp.sln` in Visual Studio 2022. I’m using the freeware community edition, version 17.4.4.\n\n3. Switch to `Release` configuration\n\n4. Build and run `CompressShaders` C# project, in the `Tools` subfolder of the solution.\nTo run that project, right click in visual studio, “Set as startup project”, then in the main menu of VS “Debug / Start Without Debugging”.\nWhen completed successfully, you should see a console window with a line like that:\u003cbr/\u003e\n`Compressed 46 compute shaders, 123.5 kb -\u003e 18.0 kb`\n\n5. 
Build `Whisper` project to get the native DLL, or `WhisperNet` for the C# wrapper and nuget package, or the examples.\n\n## Other Notes\n\nIf you gonna consume the library in a software built with Visual C++ 2022 or newer, you probably redistribute Visual C++ runtime DLLs in the form of the `.msm` merge module,\nor [vc_redist.x64.exe](https://learn.microsoft.com/en-us/cpp/windows/latest-supported-vc-redist?view=msvc-170) binary.\u003cbr/\u003e\nIf you do that, right click on the `Whisper` project, Properties, C/C++, Code Generation,\nswitch “Runtime Library” setting from `Multi-threaded (/MT)` to `Multi-threaded DLL (/MD)`,\nand rebuild: the binary will become smaller.\n\nThe library includes [RenderDoc](https://renderdoc.org/) GPU debugger integration.\u003cbr/\u003e\nWhen launched your program from RenderDoc, hold F12 key to capture the compute calls.\u003cbr/\u003e\nIf you gonna debug HLSL shaders, use the debug build of the DLL, it includes debug build of the shaders and you’ll get better UX in the debugger.\n\nThe repository includes a lot of code which was only used for development:\ncouple alternative model implementations, compatible FP64 versions of some compute shaders, debug tracing and the tool to compare the traces, etc.\u003cbr/\u003e\nThat stuff is disabled by preprocessor macros or `constexpr` flags, I hope it’s fine to keep here.\n\n## Performance Notes\n\nI have a limited selection of GPUs in this house.\u003cbr/\u003e\nSpecifically, I have optimized for nVidia 1080Ti, Radeon Vega 8 inside Ryzen 7 5700G, and Radeon Vega 7 inside Ryzen 5 5600U.\u003cbr/\u003e\n[Here’s the summary](https://github.com/Const-me/Whisper/blob/master/SampleClips/summary.tsv).\n\nThe nVidia delivers relative speed 5.8 for the large model, 10.6 for the medium model.\u003cbr/\u003e\nThe AMD Ryzen 5 5600U APU delivers relative speed about 2.2 for the medium model. 
Not great, but still, much faster than realtime.\n\nI have also tested on [nVidia 1650](https://en.wikipedia.org/wiki/GeForce_16_series#Desktop): slower than 1080Ti but pretty good, much faster than realtime.\u003cbr/\u003e\nI have also tested on Intel HD Graphics 4000 inside Core i7-3612QM, the relative speed was 0.14 for medium model, 0.44 for small model.\nThat’s much slower than realtime, but I was happy to find my software works even on the integrated mobile GPU [launched](https://ark.intel.com/products/64901) in 2012.\n\nI’m not sure the performance is ideal on discrete AMD GPUs, or integrated Intel GPUs, have not specifically optimized for them.\u003cbr/\u003e\nIdeally, they might need slightly different builds of a couple of the most expensive compute shaders, `mulMatTiled.hlsl` and `mulMatByRowTiled.hlsl`\u003cbr/\u003e\nAnd maybe other adjustments, like the `useReshapedMatMul()` value in `Whisper/D3D/device.h` header file.\n\nI don’t know how to measure that, but I have a feeling the bottleneck is memory, not compute.\u003cbr/\u003e\nSomeone on Hacker News [has tested](https://news.ycombinator.com/item?id=34408429) on [3060Ti](https://en.wikipedia.org/wiki/GeForce_30_series#Desktop),\nthe version with GDDR6 memory.\nCompared to 1080Ti, that GPU has 1.3x FP32 FLOPS, but 0.92x VRAM bandwidth.\nThe app was about 10% slower on the 3060Ti.\n\n## Further Optimisations\n\nI have only spent a few days optimizing performance of these shaders.\u003cbr/\u003e\nIt might be possible to do much better, here’s a few ideas.\n\n* Newer GPUs like Radeon Vega or nVidia 1650 have higher FP16 performance compared to FP32, yet my compute shaders are only using FP32 data type.\u003cbr/\u003e\n[Half The Precision, Twice The Fun](https://therealmjp.github.io/posts/shader-fp16/)\n\n* In the current version, FP16 tensors are using shader resource views to upcast loaded values, and unordered access views to downcast stored ones.\u003cbr/\u003e\nMight be a good idea to switch to 
[byte address buffers](https://learn.microsoft.com/en-us/windows/win32/direct3d11/direct3d-11-advanced-stages-cs-resources#byte-address-buffer),\nload/store complete 4-byte values, and upcast / downcast in HLSL with `f16tof32` / `f32tof16` intrinsics.\n\n* In the current version all shaders are compiled offline, and `Whisper.dll` includes DXBC bytecode.\u003cbr/\u003e\nThe HLSL compiler `D3DCompiler_47.dll` is an OS component, and is pretty fast.\nFor the expensive compute shaders, it’s probably a good idea to ship HLSL instead of DXBC,\nand [compile](https://learn.microsoft.com/en-us/windows/win32/api/d3dcompiler/nf-d3dcompiler-d3dcompile) on startup\nwith environment-specific [values](https://learn.microsoft.com/en-us/windows/win32/api/d3dcommon/ns-d3dcommon-d3d_shader_macro) for the macros.\n\n* It might be a good idea to upgrade the whole thing from D3D11 to D3D12.\u003cbr/\u003e\nThe newer API is harder to use, but includes potentially useful features not exposed in D3D11:\n[wave intrinsics](https://github.com/Microsoft/DirectXShaderCompiler/wiki/Wave-Intrinsics),\nand [explicit FP16](https://github.com/microsoft/DirectXShaderCompiler/wiki/16-Bit-Scalar-Types).\n\n## Missing Features\n\nAutomatic language detection is not implemented.\n\nIn the current version there’s high latency for realtime audio capture.\u003cbr/\u003e\nSpecifically, depending on voice detection, the latency is about 5-10 seconds.\u003cbr/\u003e\nAt least in my tests, the model wasn’t happy when I supplied pieces of audio that were too short.\u003cbr/\u003e\nI have increased the latency and called it a day, but ideally this needs a better fix for optimal UX.\n\n# Final Words\n\nFrom my perspective, this is an unpaid hobby project, which I completed over the 2022-23 winter holidays.\u003cbr/\u003e\nThe code probably has bugs.\u003cbr/\u003e\nThe software is provided “as is”, without warranty of any kind.\n\nThanks to [Georgi Gerganov](https://github.com/ggerganov) for the [whisper.cpp](https://github.com/ggerganov/whisper.cpp) implementation,\nand the models in GGML binary format.\u003cbr/\u003e\nI don’t program in Python, and I don’t know anything about the ML ecosystem.\u003cbr/\u003e\nI wouldn’t have even started this project without a good C++ reference implementation to test my version against.\n\nThat whisper.cpp project has an example which [uses](https://github.com/ggerganov/whisper.cpp/blob/master/examples/talk/gpt-2.cpp)\nthe same GGML implementation to run another OpenAI model, [GPT-2](https://en.wikipedia.org/wiki/GPT-2).\u003cbr/\u003e\nIt shouldn’t be hard to support that ML model with the compute shaders and relevant infrastructure already implemented in this project.\n\nIf you find this useful, I’ll be very grateful if you consider a donation to the [“Come Back Alive” foundation](https://savelife.in.ua/en/).","funding_links":[],"categories":["C++","\u003ca name=\"cpp\"\u003e\u003c/a\u003eC++","精选文章","Repos","Summary","Frameworks","B站"],"sub_categories":["语音识别-生成字幕"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FConst-me%2FWhisper","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FConst-me%2FWhisper","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FConst-me%2FWhisper/lists"}