An open API service indexing awesome lists of open source software.

https://github.com/thc1006/ct2-maxwell-final

Frozen CTranslate2 build re-introducing NVIDIA Compute Capability 5.0 (sm_50, Maxwell) so faster-whisper runs on Maxwell GPUs (Quadro K2200, GTX 750/9xx, 940MX, 960M). Packages upstream PR #1766; CUDA 12.9 + cuDNN 9.10; validated on a K2200.
https://github.com/thc1006/ct2-maxwell-final

ctranslate2 cuda faster-whisper maxwell sm50 speech-to-text whisper

Last synced: about 13 hours ago
JSON representation

Frozen CTranslate2 build re-introducing NVIDIA Compute Capability 5.0 (sm_50, Maxwell) so faster-whisper runs on Maxwell GPUs (Quadro K2200, GTX 750/9xx, 940MX, 960M). Packages upstream PR #1766; CUDA 12.9 + cuDNN 9.10; validated on a K2200.

Awesome Lists containing this project

README

          

# ct2-maxwell-final

A **final, frozen** build of [CTranslate2](https://github.com/OpenNMT/CTranslate2) that
re-introduces NVIDIA Compute Capability **5.0 (sm_50, Maxwell GM107/GM108)** support, so
that [faster-whisper](https://github.com/SYSTRAN/faster-whisper) runs on Maxwell GPUs
instead of failing to launch any CUDA kernel. Stock CTranslate2 wheels (and any CUDA 12
build of upstream) no longer emit sm_50 SASS, so on a Maxwell card faster-whisper aborts
the moment it tries to run on the GPU. This repo carries the upstream sm_50 patch, pins
the *last* toolchain that can still compile it, and packages a working build.

It is **frozen by design.** CUDA 13.0 removes sm_50 codegen from `nvcc`, and cuDNN 9.11
raises the minimum compute capability to 7.5 (Turing), dropping Maxwell entirely. This
project therefore targets the last Maxwell-supporting toolchain (CUDA 12.9 + cuDNN 9.10.x)
and **will not be maintained past it.** There is no upgrade path; that is the point.

## The error this fixes

On a Maxwell GPU, a stock faster-whisper / CTranslate2 install dies on the GPU path with:

```
cudaErrorNoKernelImageForDevice: no kernel image is available for execution on the device
```

This means the loaded `libctranslate2.so` contains no kernels compiled for your card's
architecture (sm_50). It is not a driver problem and not a faster-whisper bug — the SASS
for Maxwell simply was never emitted.

### Who is affected

Any GPU with compute capability **5.0 (Maxwell GM107/GM108)**, including:

- Quadro K2200 (4 GB)
- GeForce GTX 750 / 750 Ti
- GeForce GTX 9xx (e.g. GTX 950/960-class GM-parts at cc 5.0)
- GeForce 940MX (2 GB)
- GeForce 960M (laptop)

(Higher Maxwell parts at cc 5.2, e.g. GTX 970/980, hit the same upstream gap; this build
targets cc 5.0 specifically — see `CUDA_ARCH_LIST` below.)

## Frozen pins

These are the ceiling that still supports sm_50. The shell scripts, CI workflow, and this
README all agree with [`BUILD_SPEC.md`](BUILD_SPEC.md), which is the single source of truth.

| Component | Pin | Why this is the last one |
|----------------|------------------------------------------|--------------------------|
| CUDA Toolkit | **12.9** (`cuda-toolkit-12-9`) | last `nvcc` that emits sm_50 SASS; CUDA 13.0 removes it |
| cuDNN | **9.10.x** (`<= 9.10`, never 9.11) | 9.11 raises min compute capability to 7.5 (Turing), dropping Maxwell |
| CTranslate2 | **v4.8.0** + [`patches/1766-sm50.patch`](patches/1766-sm50.patch) | latest release; patch applies clean |
| CUDA arch | `-DCUDA_ARCH_LIST="5.0"` | only kernel we need; keeps build small and fast |
| NVIDIA driver | **DO NOT TOUCH** | install the toolkit only; never `cuda`, `cuda-12-9`, or `cuda-drivers*`, which can replace your working driver |

The build/validation host runs driver 580.159.03 (R580, the last Maxwell driver branch;
>= the 575.51.03 floor for CUDA 12.9). The install script holds existing NVIDIA driver
packages, captures the driver version before and after, and aborts if it changes.

## Quickstart (build from source)

This must run on an **actual sm_50 Linux host** — the build emits and validates Maxwell
SASS, so the GPU has to be present. Target platform is **Ubuntu 24.04 x86_64** with a
working NVIDIA driver already installed (you need `nvidia-smi` to report cc 5.0).

```bash
git clone https://github.com/thc1006/ct2-maxwell-final.git
cd ct2-maxwell-final

# 1. Install the frozen toolchain: CUDA Toolkit 12.9 (no driver) + cuDNN 9.10.x.
# Holds your NVIDIA driver, pins cuDNN, guards disk, aborts if the driver moves.
bash scripts/01_install_toolchain.sh

# 2. Clone CTranslate2 v4.8.0, apply patches/1766-sm50.patch, build the C++ lib,
# build the Python wheel into ./venv, then install faster-whisper and
# force-reinstall our patched wheel last so the sm_50 build wins.
bash scripts/02_build_ct2.sh

# 3. Validate: prove the GPU path works and benchmark GPU vs CPU.
source venv/bin/activate
source cuda-env.sh
python scripts/03_validate.py
```

Notes:

- The scripts are idempotent and re-runnable, and use `set -euo pipefail`.
- Step 1 installs **only** `cuda-toolkit-12-9`. It never installs driver metapackages.
- If apt offers no cuDNN 9.10.x (only 9.11+), step 1 aborts and tells you to use the
cuDNN 8 fallback below.

## Using it with faster-whisper

After step 2, the project venv has faster-whisper plus the patched sm_50 `ctranslate2`:

```python
from faster_whisper import WhisperModel

# On Maxwell sm_50 the only usable CUDA compute type is float32.
# Requesting int8/float16 on CUDA raises an error (no efficient int8/FP16 on this GPU).
model = WhisperModel("small", device="cuda", compute_type="float32")

# CPU alternative (int8 IS valid on CPU; on old CPUs float32 may even be faster):
# model = WhisperModel("small", device="cpu", compute_type="int8")

segments, info = model.transcribe("audio.wav", beam_size=1)
for s in segments:
print(s.text)
```

**VRAM caveat.** Pick the model to fit the card:

- **Quadro K2200 (4 GB):** tiny / base / small (cuda `float32`) are comfortable.
- **GeForce 940MX (2 GB):** stick to **tiny / base / small**. `large` will not fit
in 2 GB and will OOM. Do not assume a model that runs on the K2200 also runs on a 2 GB
card.

## Is the GPU worth it on Maxwell? (measured)

sm_50 has **no native FP16** and **no `dp4a` int8** acceleration, so on this GPU CTranslate2
reports `float32` as the only usable CUDA compute type — requesting `int8` (or `float16`) on
CUDA raises an error here, so use `float32` on the GPU (int8 is for the CPU path). The common
intuition "an old Maxwell GPU can't beat a CPU" turns out to be
**wrong for sustained work**: measured on a Quadro K2200, GPU `float32` transcribes **4–5x
faster than the CPU baseline** on a 66 s clip.

Measured on a Quadro K2200 (sm_50, 4 GB, driver 580.159.03), Ubuntu 24.04, CTranslate2 v4.8.0
+ PR #1766, faster-whisper, `beam_size=1`, VAD off, CPU pinned to 4 threads. Each `transcribe`
figure is the **median of 5 timed runs after one warmup**; the one-time `WhisperModel(...)`
load (CUDA context + weights to VRAM + first cuDNN/cuBLAS setup) is reported separately as
`load` and is **not** a per-clip cost. RTF = transcribe / audio seconds (lower is faster);
`x_rt` = times faster than real time.

**Throughput — 66 s clip (the number that matters for real files):**

| model | config | load | transcribe (median) | RTF | x_rt |
|-------|--------------|-------|---------------------|-------|------|
| tiny | cuda float32 | 0.86s | 7.40s | 0.112 | 8.9x |
| tiny | cpu int8 | 0.53s | 32.92s | 0.499 | 2.0x |
| tiny | cpu float32 | 0.44s | 29.35s | 0.445 | 2.2x |
| small | cuda float32 | 9.88s | 27.36s | 0.415 | 2.4x |
| small | cpu int8 | 0.89s | 140.51s | 2.129 | 0.5x |
| small | cpu float32 | 1.15s | 113.74s | 1.723 | 0.6x |

GPU vs CPU-int8 on the 66 s clip: **tiny 4.45x, small 5.13x faster.**

**Latency — single 11 s clip (fixed per-call overhead dominates):**

| model | config | load | transcribe (median) | end-to-end (load + 1 clip) |
|-------|--------------|-------|---------------------|----------------------------|
| tiny | cuda float32 | 0.86s | 0.19s | ~1.0s |
| tiny | cpu float32 | 0.44s | 0.82s | ~1.3s |
| small | cuda float32 | 9.88s | 1.20s | ~11.1s |
| small | cpu int8 | 0.89s | 5.33s | ~6.2s |

**What this means.** For batch / long audio the GPU wins decisively (4–5x). For a *single
one-off short clip* with a larger model, the GPU's ~10 s one-time load can make the CPU faster
end-to-end — but that load amortizes immediately over any repeated use. Note also that **int8
was slower than float32 on this CPU** (the old host CPU lacks modern int8 / VNNI instructions),
so on Maxwell-era machines "use int8 on CPU" is not automatically faster — measure it. For the
`small` model this CPU cannot keep up with real time (RTF > 1) while the GPU stays at ~0.4 RTF.

**Benchmark caveat (read before trusting these numbers).** All numbers are from a single Quadro
K2200 on one machine. The 940MX is the same sm_50 ISA but a **different, slower, 2 GB part** —
these numbers do not transfer to it; on 2 GB stick to tiny / base / small. The short-clip rows
measure latency; only the 66 s rows support a throughput claim. CPU threads were pinned
(CTranslate2 4.8.0's default `intra_threads=0` can oversubscribe, upstream issue #2063).
Reproduce on your own card with `bench/run_bench.py` (and the 5-way compute-type sanity check
in `scripts/03_validate.py`) before deciding the GPU is worth it for your workload.

## Prebuilt wheel

The latest release ships a prebuilt wheel, built by CI ([`build-wheels.yml`](.github/workflows/build-wheels.yml)):

- **[v0.1.0](https://github.com/thc1006/ct2-maxwell-final/releases/tag/v0.1.0)** — `ctranslate2-4.8.0-cp312-cp312-manylinux_2_39_x86_64.whl`

It is a self-contained `manylinux` wheel (auditwheel-repaired, bundles `libctranslate2.so`), for:

- **sm_50 SASS only** (Maxwell GM107/GM108; no other architectures),
- **Python 3.12**, **Linux x86_64**, glibc >= 2.39 (Ubuntu 24.04+),
- runtime **CUDA 12.9 + cuDNN 9.10** (driver >= 575; the Maxwell R580 branch satisfies this).

Install it on a Maxwell host with that runtime. Install faster-whisper first, then force the
sm_50 wheel last so it wins over the PyPI build that has no Maxwell kernels:

```bash
pip install faster-whisper
pip install --force-reinstall --no-deps \
https://github.com/thc1006/ct2-maxwell-final/releases/download/v0.1.0/ctranslate2-4.8.0-cp312-cp312-manylinux_2_39_x86_64.whl
```

For any other Python version, glibc, CUDA/cuDNN, or platform, build from source with the
scripts above. Each new `v*` tag rebuilds and attaches a fresh wheel.

## Credit

The actual fix is **OpenNMT/CTranslate2 PR #1766 by Giulio Paci ([@giuliopaci](https://github.com/giuliopaci)),
which is still open.** It re-adds sm_50 to the CUDA 12 arch list and guards the AWQ
dequantize kernel for pre-sm_53 devices. All credit for the working code goes to him.

- PR: https://github.com/OpenNMT/CTranslate2/pull/1766
- Issue (Maxwell / CUDA 12 codegen): https://github.com/OpenNMT/CTranslate2/issues/1765
- Original Quadro K2200 report: https://github.com/OpenNMT/CTranslate2/issues/1666

**This repository merely packages and freezes that patch** against a known-good toolchain.
It contributes no kernel code of its own; see [`patches/1766-sm50.patch`](patches/1766-sm50.patch).

## Conservative fallback

If the cuDNN 9.10 / Maxwell path proves fragile (e.g. apt no longer offers a 9.10.x build,
or you hit cuDNN runtime issues), fall back to **cuDNN 8 + `ctranslate2==4.4.0`**. cuDNN 8
fully supported sm_50, and CTranslate2 4.4.0 predates the upstream changes that dropped it.
This is older and slower-moving, but rock-solid on Maxwell.

## License

The scripts, patch packaging, and configuration in this repository are released under the
**MIT License**. CTranslate2 itself is also **MIT-licensed**; the sm_50 patch is carried
under the same terms as upstream.