{"id":48956209,"url":"https://github.com/forkni/cuda-link","last_synced_at":"2026-05-30T12:00:34.658Z","repository":{"id":340003546,"uuid":"1155915380","full_name":"forkni/cuda-link","owner":"forkni","description":"Zero-copy bidirectional GPU texture sharing between TouchDesigner and Python via CUDA IPC. Sub-microsecond per-frame overhead with ring buffer architecture and GPU-side synchronization.","archived":false,"fork":false,"pushed_at":"2026-05-29T07:48:50.000Z","size":96820,"stargazers_count":17,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"master","last_synced_at":"2026-05-29T09:25:17.285Z","etag":null,"topics":["cuda","cupy","gpu","inter-process-communication","ipc","python","pytorch","real-time","shared-memory","texture-sharing","touchdesigner","zero-copy"],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/forkni.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-02-12T03:38:11.000Z","updated_at":"2026-05-29T07:47:43.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/forkni/cuda-link","commit_stats":null,"previous_names":["forkni/cuda-link"],"tags_count":11,"template":false,"template_full_name":null,"purl":"pkg:github/forkni/cuda-link","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/forkni%2Fcuda-link","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/Git
Hub/repositories/forkni%2Fcuda-link/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/forkni%2Fcuda-link/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/forkni%2Fcuda-link/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/forkni","download_url":"https://codeload.github.com/forkni/cuda-link/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/forkni%2Fcuda-link/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33691312,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-05-30T02:00:06.278Z","response_time":92,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cuda","cupy","gpu","inter-process-communication","ipc","python","pytorch","real-time","shared-memory","texture-sharing","touchdesigner","zero-copy"],"created_at":"2026-04-18T00:05:57.303Z","updated_at":"2026-05-30T12:00:34.642Z","avatar_url":"https://github.com/forkni.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# cuda-link\n\nZero-copy GPU texture transfer between TouchDesigner and Python processes using CUDA IPC.\n\n## Overview\n\nThis component enables **zero-copy GPU texture sharing** between TouchDesigner and Python processes using CUDA Inter-Process Communication (IPC). 
It eliminates CPU memory copies for real-time AI pipelines, video processing, and other GPU-accelerated workflows.\n\n### Key Features\n\n- **Zero-copy GPU transfer** - Textures stay on GPU, no CPU memory copies\n- **Bidirectional IPC** - TD → Python (input capture) AND Python → TD (AI output display)\n- **Low-overhead IPC** - `export_frame()` 22–367 µs p50 (512×512 → 4K float32, EXPORT_SYNC=1); `get_frame_numpy()` D2H 0.18–5.7 ms p50 (PCIe 4.0 ~22–24 GB/s); IPC notification ~136–286 µs cross-process (see [docs/BENCHMARKS.md](docs/BENCHMARKS.md))\n- **Ring buffer architecture** - N-slot pipeline prevents producer/consumer blocking\n- **GPU-side synchronization** - CUDA IPC events eliminate CPU polling\n- **Triple output modes** - PyTorch tensors (GPU, zero-copy), CuPy arrays (GPU, zero-copy), or numpy arrays (CPU, D2H copy)\n- **Production-ready** - Tested at 30+ FPS for hours, handles dynamic resolution changes\n\n### Performance\n\nMeasured on RTX 4090 / PCIe 4.0 x16 / Windows 11 / driver 596.36. 
All timings are measured on the Python side.\n\n| Operation | p50 | Notes |\n|-----------|-----|-------|\n| `export_frame()` — 512×512 RGBA float32 | 22 µs | Standalone, EXPORT_SYNC=1; GPU D2D + stream_synchronize |\n| `export_frame()` — 1080p RGBA float32 | 117 µs | Standalone, EXPORT_SYNC=1 |\n| `export_frame()` — 4K RGBA float32 | 367 µs | Standalone, EXPORT_SYNC=1 |\n| `get_frame_numpy()` D2H — 512×512 float32 | 0.18 ms | Standalone, ~22 GB/s |\n| `get_frame_numpy()` D2H — 1080p float32 | 1.32 ms | Standalone, ~24 GB/s PCIe 4.0 |\n| `get_frame_numpy()` D2H — 4K float32 | 5.7 ms | Standalone, ~21 GB/s PCIe 4.0 |\n| `get_frame()` / `get_frame_cupy()` GPU | \u003c5 µs | Zero-copy tensor/array view, no D2H |\n| IPC notification latency | ~136–286 µs | Producer publish → consumer detect (cross-process) |\n| Initialization | ~50–100 µs | One-time IPC handle opening |\n\n## Requirements\n\n- **OS**: Windows 10/11 (the only platform cuda-link supports)\n- **CUDA**: 12.x (tested with 12.4)\n- **GPU**: NVIDIA GPU with CUDA compute capability 3.5+\n- **TouchDesigner**: 2022.x or later (for producer side)\n- **Python**: 3.9+ (for consumer side)\n\n### Python Dependencies\n\n**Required**: None (pure ctypes CUDA wrapper)\n\n**Optional**:\n\n- `torch\u003e=2.0` - For zero-copy GPU tensor output (recommended for AI pipelines)\n- `cupy-cuda12x\u003e=12.0` - For zero-copy GPU array output (CuPy/JAX workflows)\n- `numpy\u003e=1.21` - For CPU array output (for OpenCV, etc.)\n\n## Quick Start\n\n### 1. TouchDesigner Side (Exporter)\n\n**Option A: Use the .tox component** (recommended)\n\n1. Drag `TOXES/CUDAIPCLink_v1.7.1.tox` into your TD network\n2. Wire your source TOP to the `input` In TOP\n3. Set `Ipcmemname` parameter (e.g., `\"my_texture_ipc\"`)\n4. 
Enable `Active` toggle\n\nThe component displays its transfer state in the read-only **Status** custom parameter:\n`\"\u003cW\u003ex\u003cH\u003e \u003cdtype\u003e \u003cch\u003ech\"` during active transfer, `\"WARNING: ...\"` or `\"ERROR: ...\"` on\nfaults, and `\"Idle\"` when inactive. A `warning_emitter` Script TOP inside the COMP also\nshows a local warning badge when the component is open. See [`td_exporter/HELP_DOC.md`](td_exporter/HELP_DOC.md)\nfor per-parameter documentation.\n\n**Option B: Build from source**\n\nSee [`docs/TOX_BUILD_GUIDE.md`](docs/TOX_BUILD_GUIDE.md) for step-by-step assembly.\n\n**Option C: Library mode (cleaner .tox — fewer Text DATs)**\n\nInstall `cuda_link` into a Python environment TouchDesigner can see. The `CUDALinkBootstrap`\nDAT then loads the package automatically — the 14 mirror Text DATs (Env, SHMProtocol,\nExporter, Importer, …) are no longer needed in the `.tox`. Run the multi-target installer\n(one-time):\n\n```bat\nbuild_wheel.cmd                    REM build dist\\cuda_link-1.7.1-py3-none-any.whl\ninstall_td_library.cmd             REM interactive menu — choose one of 5 install modes\n```\n\n**Install modes** (`python scripts/install_td_library.py --help`):\n\n| Mode | Flag | Description |\n|------|------|-------------|\n| 1 | `--target DIR` | Install into a custom folder; set `CUDALINK_LIB_PATH=DIR` before launching TD |\n| 2 | `--venv DIR` | Install into an existing venv that TD is configured to use |\n| 3 | `--conda ENV` | Install into a conda environment |\n| 4 | `--python EXE` | Install into a parallel Python; auto-writes TD Preferences — no env var needed |\n| 5 | `--td-python EXE` | Install directly into TD's bundled Python (`app.pythonExecutable`) |\n\n**Mode 4 (recommended for most setups):** auto-discovers both the registered system Python\n(`py -3`) and the TouchDesigner install path; sets `Python64 Path` in TD Preferences so\nlibrary mode activates on the next TD launch with zero env-var 
configuration.\n\n```bat\nREM Non-interactive mode 4 (auto-discover Python + TD):\ninstall_td_library.cmd --mode 4 --non-interactive\n\nREM Dry-run to preview what would be written:\ninstall_td_library.cmd --mode 4 --dry-run\n```\n\nThe `TDHost`/`TDConfig`/`TDSender`/`TDReceiver` glue DATs remain in the COMP unchanged.\nIf `CUDALINK_LIB_PATH` is unset and mode 4 was not used, the bootstrap no-ops and the\nclassic mirror DATs take over silently. See [`docs/TOX_BUILD_GUIDE.md`](docs/TOX_BUILD_GUIDE.md)\nfor full instructions.\n\n### 2. Python Side (Importer)\n\n#### Install the package\n\n```bash\n# Option A: Build wheel and install (recommended — portable, no source needed):\ncd C:\\path\\to\\CUDA_IPC\nbuild_wheel.cmd                             # Builds dist\\cuda_link-1.7.1-py3-none-any.whl\n\npip install \"dist\\cuda_link-1.7.1-py3-none-any.whl[torch]\"   # PyTorch GPU tensors\npip install \"dist\\cuda_link-1.7.1-py3-none-any.whl[cupy]\"    # CuPy GPU arrays\npip install \"dist\\cuda_link-1.7.1-py3-none-any.whl[numpy]\"   # NumPy CPU arrays\npip install \"dist\\cuda_link-1.7.1-py3-none-any.whl[all]\"     # All output modes\n\n# Option B: Editable install from source (for development — changes apply immediately):\npip install -e \".[torch]\"\npip install -e \".[all]\"\n\n# From PyPI (coming soon):\n# pip install cuda-link[torch]\n```\n\n#### Use in your Python script\n\n```python\nfrom cuda_link import Importer, ImportSpec, ImportOutcome\n\nimporter = Importer.open(\n    ImportSpec(\n        shm_name=\"my_texture_ipc\",\n        shape=(1080, 1920, 4),  # height, width, channels (RGBA) — or None for auto-detect\n        dtype=\"float32\",         # \"float32\", \"float16\", \"uint8\" — or None for auto-detect\n        timeout_ms=5000.0,       # Wait up to 5s for producer to appear (default)\n    )\n)\n\n# Option 1: Get torch.Tensor (GPU, zero-copy)\nresult = importer.get_frame()\nif result.outcome is ImportOutcome.NEW_FRAME:\n    tensor = result.frame  # 
torch.Tensor on GPU, shape (1080, 1920, 4)\n    # Use directly in AI model:\n    # output = model(tensor)\n\n# Option 2: Get numpy array (CPU, involves D2H copy)\nresult = importer.get_frame_numpy()\nif result.outcome is ImportOutcome.NEW_FRAME:\n    array = result.frame  # numpy.ndarray on CPU\n    # Use in OpenCV, PIL, etc.:\n    # cv2.imwrite(\"frame.png\", array)\n\n# Option 3: Get CuPy array (GPU, zero-copy)\nresult = importer.get_frame_cupy()\nif result.outcome is ImportOutcome.NEW_FRAME:\n    cupy_arr = result.frame  # cupy.ndarray on GPU\n    # Use in CuPy/JAX workflows\n\n# Context manager (recommended — ensures cleanup on exit)\nwith Importer.open(ImportSpec(shm_name=\"my_texture_ipc\")) as importer:\n    for _ in range(100):\n        result = importer.get_frame()\n        if result.outcome is ImportOutcome.NEW_FRAME:\n            tensor = result.frame\n\n# Explicit cleanup\nimporter.close()\n```\n\n### 3. Python → TouchDesigner (AI Output)\n\nSend AI-generated frames **back to TD** for display:\n\n```python\nfrom cuda_link import Exporter, FrameSpec, GpuFrame\n\nwith Exporter.open(\n    FrameSpec(\n        shm_name=\"ai_output_ipc\",  # Must match TD Receiver's Ipcmemname parameter\n        height=512, width=512,\n        channels=4, dtype=\"uint8\",\n        num_slots=2,               # Ring buffer slots (double-buffering)\n    )\n) as exporter:\n    # Export each AI-generated frame (~10-20μs overhead at 512x512)\n    exporter.export(GpuFrame(\n        ptr=output_tensor.data_ptr(),\n        size=output_tensor.nbytes,\n    ))\n```\n\nOn the TD side, set `CUDAIPCExtension` **Mode** to `Receiver` with matching `Ipcmemname`.\n\n## Architecture\n\n```\nDirection A: TD (Producer) → Python (Consumer)\n──────────────────────────────────────────────\nCUDAIPCExtension facade\n  └── TDSenderEngine (thin TD adapter)   Importer\n        │ cuda_memory() → GpuFrame         │ get_frame() / get_frame_numpy()\n        │ delegates to Exporter            │ Waits on IPC 
event\n        └─→ SharedMemory ←─────────────────┘\n\nDirection B: Python (Producer) → TD (Consumer)\n───────────────────────────────────────────────\nExporter                           CUDAIPCExtension facade\n  │ export(GpuFrame(ptr, size))      └── TDReceiverEngine\n  │ cudaMemcpy D2D → ring buf             │ import_frame(script_top)\n  └─→ SharedMemory ←──────────────────────┘ copyCUDAMemory()\n\nBoth directions share the same v0.5.0 binary protocol.\n```\n\nThe TD extension uses a **facade-with-delegation** pattern: `CUDAIPCExtension` (~300 LOC) holds either a `TDSenderEngine` or `TDReceiverEngine` and delegates all work to it. `TDSenderEngine` is a thin TD-only adapter (~415 LOC) over the canonical `Exporter` — it owns pixel-format bridging, the `cuda_memory()`→`GpuFrame` translation, dynamic geometry reopen, and `HolderBarrier` lifecycle; all GPU ring-buffer logic delegates to `Exporter`. Mode switches replace the engine entirely — zero cross-mode state leak. All TouchDesigner runtime access (`ownerComp.par.*`, `top.cudaMemory()`, `copyCUDAMemory()`) goes through the `TDHost`/`TOPHandle` adapter seam, making the engine logic testable without a TD runtime.\n\n### Ring Buffer (3 Slots)\n\nThe system uses a 3-slot ring buffer to allow producer and consumer to work in parallel:\n\n- **Slot 0**: Producer writes frame N\n- **Slot 1**: Producer writes frame N+1 while consumer reads frame N\n- **Slot 2**: Producer writes frame N+2 while consumer reads frame N+1\n- Wraps back to Slot 0 for frame N+3\n\nThis prevents blocking - producer never waits for consumer, consumer is always 1 frame behind.\n\n### SharedMemory Protocol (433 bytes for 3 slots)\n\n```\n[0-3]     magic \"CIPD\" (4B)       - Protocol validation (0x43495044)\n[4-11]    version (8B)             - Increments on TD re-initialization\n[12-15]   num_slots (4B)           - Number of ring buffer slots (3)\n[16-19]   write_idx (4B)           - Current write index (atomic counter)\n\nPer slot (128 bytes 
each):\n[20+slot*128 : 84+slot*128]   cudaIpcMemHandle_t (64B)  - GPU memory handle\n[84+slot*128 : 148+slot*128]  cudaIpcEventHandle_t (64B) - GPU event handle\n\n[20+NUM_SLOTS*128]        shutdown_flag (1B)   - Reasserted to 0 every frame; set to 1 on exit\n[21+NUM_SLOTS*128]        metadata (20B)       - width/height/num_comps/dtype/buffer_size\n[41+NUM_SLOTS*128]        timestamp (8B)       - Producer perf_counter() for latency\n```\n\nFor 3 slots: `20 + (3 × 128) + 1 + 20 + 8 = 433 bytes`\n\n## Documentation\n\n- **[TOX Build Guide](docs/TOX_BUILD_GUIDE.md)** - Step-by-step .tox assembly in TouchDesigner\n- **[Architecture](docs/ARCHITECTURE.md)** - Protocol spec, ring buffer design, GPU sync\n- **[Integration Examples](docs/INTEGRATION_EXAMPLES.md)** - TD→PyTorch, TD→OpenCV, multi-stream\n\n## Testing\n\nRun the full test suite:\n\n```bash\ncd C:\\path\\to\\CUDA_IPC\n\n# Protocol tests (no CUDA needed)\npytest tests/test_shm_protocol.py -v\n\n# Unit tests (requires CUDA)\npytest tests/test_cuda_ipc_wrapper.py -v\n\n# All tests\npytest tests/ -v\n\n# Skip slow multi-process tests\npytest tests/ -v -m \"not slow\"\n```\n\n## Benchmarks\n\nAll results on RTX 4090 / PCIe 4.0 x16 / Windows 11 / driver 596.36. RGBA (4-channel) frames.\n\nKey highlights:\n\n- **`export_frame()` standalone** — 22 µs p50 (512×512) → 367 µs (4K) with EXPORT_SYNC=1. CUDA Graphs saves \u003c5% at these sizes (GPU D2D copy dominates).\n- **`get_frame_numpy()` D2H** — 0.18 ms p50 (512×512) → 5.7 ms (4K) at ~22–24 GB/s PCIe 4.0.\n- **Full IPC roundtrip** — IPC notification latency ~136–286 µs cross-process (resolution-independent signaling).\n- **vs CPU SharedMemory** — ~3.4× faster E2E at 1080p, ~2.1× at 512×512. Producer write 4–19× faster (no CPU transit). 
With `get_frame()` / `get_frame_cupy()` (zero-copy), the consumer read collapses to \u003c5 µs.\n\nFull tables, per-resolution breakdowns, and CUDA Graphs A/B comparison: **[docs/BENCHMARKS.md](docs/BENCHMARKS.md)**\n\n### Performance Tuning (env vars)\n\n| Variable | Default | Effect |\n|---|---|---|\n| `CUDALINK_USE_GRAPHS` | `1` | CUDA Graphs for `export()` (Python-side `Exporter`). Collapses the `stream_wait_event + memcpy_async + record_event` triplet into a single `cudaGraphLaunch`, cutting WDDM kernel-mode transitions from 3 → 2 per frame. With `CUDALINK_EXPORT_SYNC=1` the GPU D2D copy dominates wall-clock time and the net savings are small (\u003c5% at 1080p on PCIe 4.0); see [docs/BENCHMARKS.md](docs/BENCHMARKS.md) for measured A/B. Set to `0` to revert to the legacy stream path (e.g., if a driver version rejects graph capture). |\n| `CUDALINK_TD_USE_GRAPHS` | `0` | CUDA Graphs for the TouchDesigner-side `CUDAIPCExtension` Sender. Same mechanism as `CUDALINK_USE_GRAPHS`, gated independently because TD ships `cudart64_110.dll` and the per-frame `cudaGraphExecMemcpyNodeSetParams1D` API requires CUDA 11.3+. Auto-disabled on older runtimes (probed via `cudaRuntimeGetVersion` at `initialize()`). Disabled by default. Set to `1` to opt in; falls back to the legacy `cudaMemcpyAsync` stream path automatically on any capture or launch failure. |\n| `CUDALINK_D2H_STREAMS` | `1` | Number of parallel streams for `get_frame_numpy()` D2H copy. Values `2`/`4` may help on PCIe 3.0 systems or GPUs with dual DMA engines; on PCIe 4.0 a single stream already saturates the bus (~23–24 GB/s). Check `nvidia-smi -q \\| findstr \"Async Engines\"` before tuning. |\n| `CUDALINK_EXPORT_SYNC` | `0` | Block CPU on the IPC stream after each `export_frame()`. Default off — the CUDA IPC event already provides correct cross-process GPU ordering. Set to `1` to restore synchronous behaviour (recommended for concurrent TD Sender+Receiver topologies and when `CUDALINK_USE_GRAPHS=0`). 
Note: `TDConfig` (`td_exporter/TDConfig.py`) still defaults to `True` for TDR-cascade safety in shared-process topologies. |\n| `CUDALINK_ACTIVATION_BARRIER` | `1` | Python-lib side of the cross-process activation barrier (F9). Reads a tiny SHM counter each `export_frame()` and skips publishing while a TD Sender is in its WDDM-saturating init window. No-op in single-pair topologies (counter stays at 0); gracefully skipped if the SHM segment is absent. Set to `0` to opt out. |\n| `CUDALINK_TD_ACTIVATION_BARRIER` | `1` | TD-side counterpart of `CUDALINK_ACTIVATION_BARRIER` — increments the same SHM counter around Sender `initialize()` so the Python producer backs off. Same no-op / graceful-absence behaviour. Set to `0` to opt out. |\n| `CUDALINK_TD_PERSIST_STREAM` | `1` | Skip `stream_destroy` in TD Sender `cleanup()` so the IPC CUDA stream survives `deactivate`→`reactivate` cycles (F8). Free in single-pair (no deactivation ever happens); load-bearing in concurrent — without it, stream recreate on each reactivation collides with in-flight Receiver work, doubling first-settle `post=` latency (Phase 3.6 confirmed). Set to `0` to opt out. |\n| `CUDALINK_TD_STREAM_PRIO` | `normal` | CUDA stream priority for the TD Sender's IPC stream. Default `normal` is safe for both single-pair and concurrent topologies — in single-pair only one stream exists per process so priority is moot; in concurrent, equal priorities prevent WDDM contention accumulation across reactivation cycles (high/high contention produces non-recovering cycle-3 shutdowns, Phase 3.6 Step C confirmed). Set to `high` only for explicit single-pair lowest-latency optimisation. |\n| `CUDALINK_EXPORT_FLUSH_PROBE` | `1` | Insert a non-blocking `cudaStreamQuery(ipc_stream)` after `check_sticky_error` when `EXPORT_SYNC=0`. Forces WDDM-deferred CUDA submissions to drain each frame, preventing Windows Task Manager's 3D-engine counter from inflating when true compute load (per NVML) is low. 
NVML readings are unchanged — purely cosmetic/observability. Set to `0` to disable. |\n| `CUDALINK_EXPORT_PROFILE` | `0` | Enable fine-grained per-region sub-timers in `export_frame()` and emit a `[PROFILE] pre=…us interop=…us post=…us memcpy=…us record=…us sync=…us sticky=…us flush_probe=…us shm=…us unacc=…us` line every 97 frames. Force-enables `verbose_performance` (TD) / `debug` (lib). Diagnostic-only; negligible overhead when on, zero when unset. |\n| `CUDALINK_NVTX` | `0` | Enable NVTX range annotations on top-level phase boundaries (`cudalink.exporter.flush_probe`, `cudalink.receiver.import_frame`, `cudalink.receiver.event_wait`, etc.) for Nsight Systems GPU timeline correlation. Zero overhead when off. Set to `1` before running any `nsys` capture; see [docs/PROFILING.md](docs/PROFILING.md). |\n| `CUDALINK_NVTX_VERBOSE` | `0` | Enable additional sub-operation NVTX ranges (sticky-error check, D2A copy submit, SHM header read) inside the top-level phase ranges. Only useful for deep per-frame breakdown captures; implies `CUDALINK_NVTX=1`. |\n| `CUDALINK_TD_GRAPHS_DEFERRED` | `0` | Defer CUDA Graph capture to after the second `export_frame()` call (TD Sender). Avoids a first-frame graph-capture stall in latency-sensitive topologies where the graph build cost would be visible. |\n| `CUDALINK_TD_INIT_PACE` | `0` | Throttle the TD Sender init sequence to reduce WDDM saturation during activation windows (experimental). Adds a small sleep between consecutive CUDA API calls at `initialize()` time; useful when concurrent Sender+Receiver activation produces WDDM kernel-mode queue backpressure. |\n| `CUDALINK_TD_BARRIER_SETTLE_FRAMES` | `30` | Number of frames the TD activation barrier counter remains armed after a Sender `initialize()` completes, giving the Python producer time to back off before publishing resumes. Increase if your Python producer's poll loop is slower than 30 frames at your target rate; decrease for tighter single-pair topologies. 
|\n| `CUDALINK_NVML` | `0` | Append NVML GPU telemetry (utilization %, clocks MHz, PCIe Tx/Rx MB/s, temperature °C, power W, throttle reasons) to the 97-frame periodic stats line emitted by `CUDALINK_EXPORT_PROFILE`. Requires `nvidia-ml-py` (`pip install nvidia-ml-py`). Zero overhead when off. |\n\nFor GPU-timeline profiling (Nsight Systems / Nsight Compute / compute-sanitizer) see [docs/PROFILING.md](docs/PROFILING.md).\n\n\n## Troubleshooting\n\n### \"SharedMemory not found\"\n\n**Cause**: Python importer started before TD exporter initialized.\n\n**Solution**: Ensure the TD component is active before starting the Python process. If starting both together, use `timeout_ms` to give the producer time to initialize:\n\n```python\nfrom cuda_link import Importer, ImportSpec\nimporter = Importer.open(ImportSpec(shm_name=\"my_project_ipc\", timeout_ms=10000.0))  # Wait up to 10s\n```\n\n### \"CUDA IPC overhead unexpectedly high\"\n\n**Cause**: In standalone Python processes (WDDM), `export_frame()` with EXPORT_SYNC=1 typically measures 22–367 µs p50 (512×512 → 4K float32 RGBA, RTX 4090 / PCIe 4.0). Values 2–5× higher than these baselines may indicate GPU driver overhead, context contention, or PCIe bandwidth sharing with other D2H workloads.\n\n**Solution**: Compare against the baseline numbers in [docs/BENCHMARKS.md](docs/BENCHMARKS.md). Contributors with a local clone may reproduce using `python benchmarks/bench_graphs.py` (standalone) or `python benchmarks/bench_sweep.py --quick` (multiprocess).\n\n### \"Version mismatch\" or stale frames\n\n**Cause**: TD re-exported IPC handles (network reset, resolution change).\n\n**Solution**: The importer automatically detects version changes and re-opens handles. 
No action needed.\n\n### GPU memory leak\n\n**Cause**: Importer not cleaned up properly.\n\n**Solution**: Use the context manager pattern for automatic cleanup:\n\n```python\nfrom cuda_link import Importer, ImportSpec, ImportOutcome\nwith Importer.open(ImportSpec(shm_name=\"my_project_ipc\")) as importer:\n    # importer.close() is called automatically on exit\n    result = importer.get_frame()\n    if result.outcome is ImportOutcome.NEW_FRAME:\n        tensor = result.frame\n```\n\nOr call `importer.close()` explicitly in a `finally` block.\n\n## Distribution\n\ncuda-link uses a **dual distribution model** to support both use cases:\n\n### For Python Consumers (StreamDiffusion, AI/ML pipelines)\n\n#### Method 1: Build wheel (recommended — portable, installs into any environment)\n\n```bash\ngit clone https://github.com/forkni/cuda-link.git\ncd cuda-link\n\n# Run the build script (uses PEP 517 isolated build via python -m build)\nbuild_wheel.cmd\n# Output: dist\\cuda_link-1.7.1-py3-none-any.whl  (~30 KB)\n\n# Install into any Python environment — conda, venv, system Python, TouchDesigner Python:\npip install \"dist\\cuda_link-1.7.1-py3-none-any.whl[torch]\"   # PyTorch GPU tensors\npip install \"dist\\cuda_link-1.7.1-py3-none-any.whl[cupy]\"    # CuPy GPU arrays\npip install \"dist\\cuda_link-1.7.1-py3-none-any.whl[numpy]\"   # NumPy CPU arrays\npip install \"dist\\cuda_link-1.7.1-py3-none-any.whl[all]\"     # All output modes\n\n# Force reinstall to update:\npip install --force-reinstall \"dist\\cuda_link-1.7.1-py3-none-any.whl[torch]\"\n```\n\nThe wheel is a self-contained archive — copy it anywhere and install without needing the source tree.\n\n#### Method 2: Editable install from source (for development)\n\n```bash\ngit clone https://github.com/forkni/cuda-link.git\ncd cuda-link\npip install -e \".[torch]\"   # Changes to src/cuda_link/ apply immediately, no rebuild needed\npip install -e \".[all]\"     # All output modes\n```\n\n#### Method 3: From PyPI 
(coming soon)\n\n```bash\n# pip install cuda-link[torch]\n```\n\n**Usage**:\n\n```python\nfrom cuda_link import Importer, ImportSpec, ImportOutcome\n\nimporter = Importer.open(ImportSpec(shm_name=\"my_project_ipc\"))\nresult = importer.get_frame()\nif result.outcome is ImportOutcome.NEW_FRAME:\n    tensor = result.frame  # torch.Tensor, GPU zero-copy\n```\n\nThe `cuda-link` package contains only the **consumer-side** Python code (`src/cuda_link/`). The TouchDesigner extension is distributed separately.\n\n### For TouchDesigner Integration\n\n**Option A: Use the .tox component** (recommended)\n\nDrag `TOXES/CUDAIPCLink_v1.7.1.tox` into your TouchDesigner network.\n\n\u003e **Older versions:** Previous `.tox` releases are available as downloadable assets on the\n\u003e [GitHub Releases page](https://github.com/forkni/cuda-link/releases) — pick the tag\n\u003e matching the TouchDesigner build you target.\n\n**Option B: Build from source**\n\nFollow the manual build guide at [`docs/TOX_BUILD_GUIDE.md`](docs/TOX_BUILD_GUIDE.md) to assemble the `.tox` from `td_exporter/` source files.\n\nThe TouchDesigner extension (`td_exporter/`) is **not included in the pip package** because it uses TD-specific APIs (`parent()`, `op()`, `me`, COMP-scoped imports) that cannot run outside TouchDesigner.\n\n### Use Cases\n\n| Use Case | TD Side | Python Side |\n|----------|---------|-------------|\n| **TD → Python** (StreamDiffusion, AI pipelines) | `.tox` Sender mode | `pip install dist\\cuda_link-*.whl[torch]` |\n| **Python → TD** (AI output display) | `.tox` Receiver mode | `pip install dist\\cuda_link-*.whl[torch]` |\n| **TD → TD** (two instances communicating) | `.tox` on both sides | Not needed |\n\nBoth sides communicate through the 433-byte SharedMemory protocol — zero import dependencies between TD and Python code.\n\n---\n\n## Changelog\n\nSee [CHANGELOG.md](CHANGELOG.md) for the full history.\n\n---\n\n## License\n\nMIT License - See LICENSE file\n\n## Credits\n\nOriginal 
implementation by Forkni (forkni@gmail.com).\nExtracted and refactored from the StreamDiffusionTD project.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fforkni%2Fcuda-link","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fforkni%2Fcuda-link","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fforkni%2Fcuda-link/lists"}