{"id":51264507,"url":"https://github.com/thc1006/ct2-maxwell-final","last_synced_at":"2026-06-29T14:32:27.331Z","repository":{"id":365375766,"uuid":"1271747334","full_name":"thc1006/ct2-maxwell-final","owner":"thc1006","description":"Frozen CTranslate2 build re-introducing NVIDIA Compute Capability 5.0 (sm_50, Maxwell) so faster-whisper runs on Maxwell GPUs (Quadro K2200, GTX 750/9xx, 940MX, 960M). Packages upstream PR #1766; CUDA 12.9 + cuDNN 9.10; validated on a K2200.","archived":false,"fork":false,"pushed_at":"2026-06-17T04:04:18.000Z","size":62,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-06-17T05:22:12.561Z","etag":null,"topics":["ctranslate2","cuda","faster-whisper","maxwell","sm50","speech-to-text","whisper"],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/thc1006.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-06-17T01:23:03.000Z","updated_at":"2026-06-17T04:04:21.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/thc1006/ct2-maxwell-final","commit_stats":null,"previous_names":["thc1006/ct2-maxwell-final"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/thc1006/ct2-maxwell-final","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/thc1006%2Fct2-maxwell-final","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/thc1006%2Fct2-maxwell-final/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/thc1006%2Fct2-maxwell-final/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/thc1006%2Fct2-maxwell-final/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/thc1006","download_url":"https://codeload.github.com/thc1006/ct2-maxwell-final/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/thc1006%2Fct2-maxwell-final/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34931587,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-29T02:00:05.398Z","response_time":58,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ctranslate2","cuda","faster-whisper","maxwell","sm50","speech-to-text","whisper"],"created_at":"2026-06-29T14:32:23.289Z","updated_at":"2026-06-29T14:32:27.325Z","avatar_url":"https://github.com/thc1006.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# ct2-maxwell-final\n\nA **final, frozen** build of [CTranslate2](https://github.com/OpenNMT/CTranslate2) that\nre-introduces NVIDIA Compute Capability **5.0 (sm_50, Maxwell GM107/GM108)** support, so\nthat [faster-whisper](https://github.com/SYSTRAN/faster-whisper) runs on Maxwell GPUs\ninstead of failing to launch any CUDA kernel. Stock CTranslate2 wheels (and any CUDA 12\nbuild of upstream) no longer emit sm_50 SASS, so on a Maxwell card faster-whisper aborts\nthe moment it tries to run on the GPU. This repo carries the upstream sm_50 patch, pins\nthe *last* toolchain that can still compile it, and packages a working build.\n\nIt is **frozen by design.** CUDA 13.0 removes sm_50 codegen from `nvcc`, and cuDNN 9.11\nraises the minimum compute capability to 7.5 (Turing), dropping Maxwell entirely. This\nproject therefore targets the last Maxwell-supporting toolchain (CUDA 12.9 + cuDNN 9.10.x)\nand **will not be maintained past it.** There is no upgrade path; that is the point.\n\n## The error this fixes\n\nOn a Maxwell GPU, a stock faster-whisper / CTranslate2 install dies on the GPU path with:\n\n```\ncudaErrorNoKernelImageForDevice: no kernel image is available for execution on the device\n```\n\nThis means the loaded `libctranslate2.so` contains no kernels compiled for your card's\narchitecture (sm_50). It is not a driver problem and not a faster-whisper bug — the SASS\nfor Maxwell simply was never emitted.\n\n### Who is affected\n\nAny GPU with compute capability **5.0 (Maxwell GM107/GM108)**, including:\n\n- Quadro K2200 (4 GB)\n- GeForce GTX 750 / 750 Ti\n- GeForce GTX 9xx (e.g. GTX 950/960-class GM-parts at cc 5.0)\n- GeForce 940MX (2 GB)\n- GeForce 960M (laptop)\n\n(Higher Maxwell parts at cc 5.2, e.g. GTX 970/980, hit the same upstream gap; this build\ntargets cc 5.0 specifically — see `CUDA_ARCH_LIST` below.)\n\n## Frozen pins\n\nThese are the ceiling that still supports sm_50. The shell scripts, CI workflow, and this\nREADME all agree with [`BUILD_SPEC.md`](BUILD_SPEC.md), which is the single source of truth.\n\n| Component      | Pin                                      | Why this is the last one |\n|----------------|------------------------------------------|--------------------------|\n| CUDA Toolkit   | **12.9** (`cuda-toolkit-12-9`)           | last `nvcc` that emits sm_50 SASS; CUDA 13.0 removes it |\n| cuDNN          | **9.10.x** (`\u003c= 9.10`, never 9.11)       | 9.11 raises min compute capability to 7.5 (Turing), dropping Maxwell |\n| CTranslate2    | **v4.8.0** + [`patches/1766-sm50.patch`](patches/1766-sm50.patch) | latest release; patch applies clean |\n| CUDA arch      | `-DCUDA_ARCH_LIST=\"5.0\"`                 | only kernel we need; keeps build small and fast |\n| NVIDIA driver  | **DO NOT TOUCH**                         | install the toolkit only; never `cuda`, `cuda-12-9`, or `cuda-drivers*`, which can replace your working driver |\n\nThe build/validation host runs driver 580.159.03 (R580, the last Maxwell driver branch;\n\u003e= the 575.51.03 floor for CUDA 12.9). The install script holds existing NVIDIA driver\npackages, captures the driver version before and after, and aborts if it changes.\n\n## Quickstart (build from source)\n\nThis must run on an **actual sm_50 Linux host** — the build emits and validates Maxwell\nSASS, so the GPU has to be present. Target platform is **Ubuntu 24.04 x86_64** with a\nworking NVIDIA driver already installed (you need `nvidia-smi` to report cc 5.0).\n\n```bash\ngit clone https://github.com/thc1006/ct2-maxwell-final.git\ncd ct2-maxwell-final\n\n# 1. Install the frozen toolchain: CUDA Toolkit 12.9 (no driver) + cuDNN 9.10.x.\n#    Holds your NVIDIA driver, pins cuDNN, guards disk, aborts if the driver moves.\nbash scripts/01_install_toolchain.sh\n\n# 2. Clone CTranslate2 v4.8.0, apply patches/1766-sm50.patch, build the C++ lib,\n#    build the Python wheel into ./venv, then install faster-whisper and\n#    force-reinstall our patched wheel last so the sm_50 build wins.\nbash scripts/02_build_ct2.sh\n\n# 3. Validate: prove the GPU path works and benchmark GPU vs CPU.\nsource venv/bin/activate\nsource cuda-env.sh\npython scripts/03_validate.py\n```\n\nNotes:\n\n- The scripts are idempotent and re-runnable, and use `set -euo pipefail`.\n- Step 1 installs **only** `cuda-toolkit-12-9`. It never installs driver metapackages.\n- If apt offers no cuDNN 9.10.x (only 9.11+), step 1 aborts and tells you to use the\n  cuDNN 8 fallback below.\n\n## Using it with faster-whisper\n\nAfter step 2, the project venv has faster-whisper plus the patched sm_50 `ctranslate2`:\n\n```python\nfrom faster_whisper import WhisperModel\n\n# On Maxwell sm_50 the only usable CUDA compute type is float32.\n# Requesting int8/float16 on CUDA raises an error (no efficient int8/FP16 on this GPU).\nmodel = WhisperModel(\"small\", device=\"cuda\", compute_type=\"float32\")\n\n# CPU alternative (int8 IS valid on CPU; on old CPUs float32 may even be faster):\n# model = WhisperModel(\"small\", device=\"cpu\", compute_type=\"int8\")\n\nsegments, info = model.transcribe(\"audio.wav\", beam_size=1)\nfor s in segments:\n    print(s.text)\n```\n\n**VRAM caveat.** Pick the model to fit the card:\n\n- **Quadro K2200 (4 GB):** tiny / base / small (cuda `float32`) are comfortable.\n- **GeForce 940MX (2 GB):** stick to **tiny / base / small**. `large` will not fit\n  in 2 GB and will OOM. Do not assume a model that runs on the K2200 also runs on a 2 GB\n  card.\n\n## Is the GPU worth it on Maxwell? (measured)\n\nsm_50 has **no native FP16** and **no `dp4a` int8** acceleration, so on this GPU CTranslate2\nreports `float32` as the only usable CUDA compute type — requesting `int8` (or `float16`) on\nCUDA raises an error here, so use `float32` on the GPU (int8 is for the CPU path). The common\nintuition \"an old Maxwell GPU can't beat a CPU\" turns out to be\n**wrong for sustained work**: measured on a Quadro K2200, GPU `float32` transcribes **4–5x\nfaster than the CPU baseline** on a 66 s clip.\n\nMeasured on a Quadro K2200 (sm_50, 4 GB, driver 580.159.03), Ubuntu 24.04, CTranslate2 v4.8.0\n+ PR #1766, faster-whisper, `beam_size=1`, VAD off, CPU pinned to 4 threads. Each `transcribe`\nfigure is the **median of 5 timed runs after one warmup**; the one-time `WhisperModel(...)`\nload (CUDA context + weights to VRAM + first cuDNN/cuBLAS setup) is reported separately as\n`load` and is **not** a per-clip cost. RTF = transcribe / audio seconds (lower is faster);\n`x_rt` = times faster than real time.\n\n**Throughput — 66 s clip (the number that matters for real files):**\n\n| model | config       | load  | transcribe (median) | RTF   | x_rt |\n|-------|--------------|-------|---------------------|-------|------|\n| tiny  | cuda float32 | 0.86s | 7.40s               | 0.112 | 8.9x |\n| tiny  | cpu  int8    | 0.53s | 32.92s              | 0.499 | 2.0x |\n| tiny  | cpu  float32 | 0.44s | 29.35s              | 0.445 | 2.2x |\n| small | cuda float32 | 9.88s | 27.36s              | 0.415 | 2.4x |\n| small | cpu  int8    | 0.89s | 140.51s             | 2.129 | 0.5x |\n| small | cpu  float32 | 1.15s | 113.74s             | 1.723 | 0.6x |\n\nGPU vs CPU-int8 on the 66 s clip: **tiny 4.45x, small 5.13x faster.**\n\n**Latency — single 11 s clip (fixed per-call overhead dominates):**\n\n| model | config       | load  | transcribe (median) | end-to-end (load + 1 clip) |\n|-------|--------------|-------|---------------------|----------------------------|\n| tiny  | cuda float32 | 0.86s | 0.19s               | ~1.0s                      |\n| tiny  | cpu  float32 | 0.44s | 0.82s               | ~1.3s                      |\n| small | cuda float32 | 9.88s | 1.20s               | ~11.1s                     |\n| small | cpu  int8    | 0.89s | 5.33s               | ~6.2s                      |\n\n**What this means.** For batch / long audio the GPU wins decisively (4–5x). For a *single\none-off short clip* with a larger model, the GPU's ~10 s one-time load can make the CPU faster\nend-to-end — but that load amortizes immediately over any repeated use. Note also that **int8\nwas slower than float32 on this CPU** (the old host CPU lacks modern int8 / VNNI instructions),\nso on Maxwell-era machines \"use int8 on CPU\" is not automatically faster — measure it. For the\n`small` model this CPU cannot keep up with real time (RTF \u003e 1) while the GPU stays at ~0.4 RTF.\n\n**Benchmark caveat (read before trusting these numbers).** All numbers are from a single Quadro\nK2200 on one machine. The 940MX is the same sm_50 ISA but a **different, slower, 2 GB part** —\nthese numbers do not transfer to it; on 2 GB stick to tiny / base / small. The short-clip rows\nmeasure latency; only the 66 s rows support a throughput claim. CPU threads were pinned\n(CTranslate2 4.8.0's default `intra_threads=0` can oversubscribe, upstream issue #2063).\nReproduce on your own card with `bench/run_bench.py` (and the 5-way compute-type sanity check\nin `scripts/03_validate.py`) before deciding the GPU is worth it for your workload.\n\n## Prebuilt wheel\n\nThe latest release ships a prebuilt wheel, built by CI ([`build-wheels.yml`](.github/workflows/build-wheels.yml)):\n\n- **[v0.1.0](https://github.com/thc1006/ct2-maxwell-final/releases/tag/v0.1.0)** — `ctranslate2-4.8.0-cp312-cp312-manylinux_2_39_x86_64.whl`\n\nIt is a self-contained `manylinux` wheel (auditwheel-repaired, bundles `libctranslate2.so`), for:\n\n- **sm_50 SASS only** (Maxwell GM107/GM108; no other architectures),\n- **Python 3.12**, **Linux x86_64**, glibc \u003e= 2.39 (Ubuntu 24.04+),\n- runtime **CUDA 12.9 + cuDNN 9.10** (driver \u003e= 575; the Maxwell R580 branch satisfies this).\n\nInstall it on a Maxwell host with that runtime. Install faster-whisper first, then force the\nsm_50 wheel last so it wins over the PyPI build that has no Maxwell kernels:\n\n```bash\npip install faster-whisper\npip install --force-reinstall --no-deps \\\n  https://github.com/thc1006/ct2-maxwell-final/releases/download/v0.1.0/ctranslate2-4.8.0-cp312-cp312-manylinux_2_39_x86_64.whl\n```\n\nFor any other Python version, glibc, CUDA/cuDNN, or platform, build from source with the\nscripts above. Each new `v*` tag rebuilds and attaches a fresh wheel.\n\n## Credit\n\nThe actual fix is **OpenNMT/CTranslate2 PR #1766 by Giulio Paci ([@giuliopaci](https://github.com/giuliopaci)),\nwhich is still open.** It re-adds sm_50 to the CUDA 12 arch list and guards the AWQ\ndequantize kernel for pre-sm_53 devices. All credit for the working code goes to him.\n\n- PR: https://github.com/OpenNMT/CTranslate2/pull/1766\n- Issue (Maxwell / CUDA 12 codegen): https://github.com/OpenNMT/CTranslate2/issues/1765\n- Original Quadro K2200 report: https://github.com/OpenNMT/CTranslate2/issues/1666\n\n**This repository merely packages and freezes that patch** against a known-good toolchain.\nIt contributes no kernel code of its own; see [`patches/1766-sm50.patch`](patches/1766-sm50.patch).\n\n## Conservative fallback\n\nIf the cuDNN 9.10 / Maxwell path proves fragile (e.g. apt no longer offers a 9.10.x build,\nor you hit cuDNN runtime issues), fall back to **cuDNN 8 + `ctranslate2==4.4.0`**. cuDNN 8\nfully supported sm_50, and CTranslate2 4.4.0 predates the upstream changes that dropped it.\nThis is older and slower-moving, but rock-solid on Maxwell.\n\n## License\n\nThe scripts, patch packaging, and configuration in this repository are released under the\n**MIT License**. CTranslate2 itself is also **MIT-licensed**; the sm_50 patch is carried\nunder the same terms as upstream.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fthc1006%2Fct2-maxwell-final","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fthc1006%2Fct2-maxwell-final","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fthc1006%2Fct2-maxwell-final/lists"}