https://github.com/forkni/cuda-link
Zero-copy bidirectional GPU texture sharing between TouchDesigner and Python via CUDA IPC. Sub-microsecond per-frame overhead with ring buffer architecture and GPU-side synchronization.
https://github.com/forkni/cuda-link
cuda cupy gpu inter-process-communication ipc python pytorch real-time shared-memory texture-sharing touchdesigner zero-copy
Last synced: 3 days ago
JSON representation
Zero-copy bidirectional GPU texture sharing between TouchDesigner and Python via CUDA IPC. Sub-microsecond per-frame overhead with ring buffer architecture and GPU-side synchronization.
- Host: GitHub
- URL: https://github.com/forkni/cuda-link
- Owner: forkni
- License: mit
- Created: 2026-02-12T03:38:11.000Z (4 months ago)
- Default Branch: master
- Last Pushed: 2026-05-29T07:48:50.000Z (4 days ago)
- Last Synced: 2026-05-29T09:25:17.285Z (4 days ago)
- Topics: cuda, cupy, gpu, inter-process-communication, ipc, python, pytorch, real-time, shared-memory, texture-sharing, touchdesigner, zero-copy
- Language: Python
- Size: 92.3 MB
- Stars: 17
- Watchers: 0
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE
Awesome Lists containing this project
README
# cuda-link
Zero-copy GPU texture transfer between TouchDesigner and Python processes using CUDA IPC.
## Overview
This component enables **zero-copy GPU texture sharing** between TouchDesigner and Python processes using CUDA Inter-Process Communication (IPC). It eliminates CPU memory copies for real-time AI pipelines, video processing, and other GPU-accelerated workflows.
### Key Features
- **Zero-copy GPU transfer** - Textures stay on GPU, no CPU memory copies
- **Bidirectional IPC** - TD → Python (input capture) AND Python → TD (AI output display)
- **Low-overhead IPC** - `export_frame()` 22–367 µs p50 (512×512 → 4K float32, EXPORT_SYNC=1); `get_frame_numpy()` D2H 0.18–5.7 ms p50 (PCIe 4.0 ~22–24 GB/s); IPC notification ~136–286 µs cross-process (see [docs/BENCHMARKS.md](docs/BENCHMARKS.md))
- **Ring buffer architecture** - N-slot pipeline prevents producer/consumer blocking
- **GPU-side synchronization** - CUDA IPC events eliminate CPU polling
- **Triple output modes** - PyTorch tensors (GPU, zero-copy), CuPy arrays (GPU, zero-copy), or numpy arrays (CPU, D2H copy)
- **Production-ready** - Tested at 30+ FPS for hours, handles dynamic resolution changes
### Performance
Measured on RTX 4090 / PCIe 4.0 x16 / Windows 11 / driver 596.36. All Python-side.
| Operation | p50 | Notes |
|-----------|-----|-------|
| `export_frame()` — 512×512 RGBA float32 | 22 µs | Standalone, EXPORT_SYNC=1; GPU D2D + stream_synchronize |
| `export_frame()` — 1080p RGBA float32 | 117 µs | Standalone, EXPORT_SYNC=1 |
| `export_frame()` — 4K RGBA float32 | 367 µs | Standalone, EXPORT_SYNC=1 |
| `get_frame_numpy()` D2H — 512×512 float32 | 0.18 ms | Standalone, ~22 GB/s |
| `get_frame_numpy()` D2H — 1080p float32 | 1.32 ms | Standalone, ~24 GB/s PCIe 4.0 |
| `get_frame_numpy()` D2H — 4K float32 | 5.7 ms | Standalone, ~21 GB/s PCIe 4.0 |
| `get_frame()` / `get_frame_cupy()` GPU | <5 µs | Zero-copy tensor/array view, no D2H |
| IPC notification latency | ~136–286 µs | Producer publish → consumer detect (cross-process) |
| Initialization | ~50–100 µs | One-time IPC handle opening |
## Requirements
- **OS**: Windows 10/11 (CUDA IPC is Windows-only)
- **CUDA**: 12.x (tested with 12.4)
- **GPU**: NVIDIA GPU with CUDA compute capability 3.5+
- **TouchDesigner**: 2022.x or later (for producer side)
- **Python**: 3.9+ (for consumer side)
### Python Dependencies
**Required**: None (pure ctypes CUDA wrapper)
**Optional**:
- `torch>=2.0` - For zero-copy GPU tensor output (recommended for AI pipelines)
- `cupy-cuda12x>=12.0` - For zero-copy GPU array output (CuPy/JAX workflows)
- `numpy>=1.21` - For CPU array output (for OpenCV, etc.)
## Quick Start
### 1. TouchDesigner Side (Exporter)
**Option A: Use the .tox component** (recommended)
1. Drag `TOXES/CUDAIPCLink_v1.7.1.tox` into your TD network
2. Wire your source TOP to the `input` In TOP
3. Set `Ipcmemname` parameter (e.g., `"my_texture_ipc"`)
4. Enable `Active` toggle
The component displays its transfer state in the read-only **Status** custom parameter:
`"x ch"` during active transfer, `"WARNING: ..."` or `"ERROR: ..."` on
faults, and `"Idle"` when inactive. A `warning_emitter` Script TOP inside the COMP also
shows a local warning badge when the component is open. See [`td_exporter/HELP_DOC.md`](td_exporter/HELP_DOC.md)
for per-parameter documentation.
**Option B: Build from source**
See [`docs/TOX_BUILD_GUIDE.md`](docs/TOX_BUILD_GUIDE.md) for step-by-step assembly.
**Option C: Library mode (cleaner .tox — fewer Text DATs)**
Install `cuda_link` into a Python environment TouchDesigner can see. The `CUDALinkBootstrap`
DAT then loads the package automatically — the 14 mirror Text DATs (Env, SHMProtocol,
Exporter, Importer, …) are no longer needed in the `.tox`. Run the multi-target installer
(one-time):
```bat
build_wheel.cmd REM build dist\cuda_link-1.7.1-py3-none-any.whl
install_td_library.cmd REM interactive menu — choose one of 5 install modes
```
**Install modes** (`python scripts/install_td_library.py --help`):
| Mode | Flag | Description |
|------|------|-------------|
| 1 | `--target DIR` | Install into a custom folder; set `CUDALINK_LIB_PATH=DIR` before launching TD |
| 2 | `--venv DIR` | Install into an existing venv that TD is configured to use |
| 3 | `--conda ENV` | Install into a conda environment |
| 4 | `--python EXE` | Install into a parallel Python; auto-writes TD Preferences — no env var needed |
| 5 | `--td-python EXE` | Install directly into TD's bundled Python (`app.pythonExecutable`) |
**Mode 4 (recommended for most setups):** auto-discovers both the registered system Python
(`py -3`) and the TouchDesigner install path; sets `Python64 Path` in TD Preferences so
library mode activates on the next TD launch with zero env-var configuration.
```bat
REM Non-interactive mode 4 (auto-discover Python + TD):
install_td_library.cmd --mode 4 --non-interactive
REM Dry-run to preview what would be written:
install_td_library.cmd --mode 4 --dry-run
```
The `TDHost`/`TDConfig`/`TDSender`/`TDReceiver` glue DATs remain in the COMP unchanged.
If `CUDALINK_LIB_PATH` is unset and mode 4 was not used, the bootstrap no-ops and the
classic mirror DATs take over silently. See [`docs/TOX_BUILD_GUIDE.md`](docs/TOX_BUILD_GUIDE.md)
for full instructions.
### 2. Python Side (Importer)
#### Install the package
```bash
# Option A: Build wheel and install (recommended — portable, no source needed):
cd C:\path\to\CUDA_IPC
build_wheel.cmd # Builds dist\cuda_link-1.7.1-py3-none-any.whl
pip install "dist\cuda_link-1.7.1-py3-none-any.whl[torch]" # PyTorch GPU tensors
pip install "dist\cuda_link-1.7.1-py3-none-any.whl[cupy]" # CuPy GPU arrays
pip install "dist\cuda_link-1.7.1-py3-none-any.whl[numpy]" # NumPy CPU arrays
pip install "dist\cuda_link-1.7.1-py3-none-any.whl[all]" # All output modes
# Option B: Editable install from source (for development — changes apply immediately):
pip install -e ".[torch]"
pip install -e ".[all]"
# From PyPI (coming soon):
# pip install cuda-link[torch]
```
#### Use in your Python script
```python
from cuda_link import Importer, ImportSpec, ImportOutcome
importer = Importer.open(
ImportSpec(
shm_name="my_texture_ipc",
shape=(1080, 1920, 4), # height, width, channels (RGBA) — or None for auto-detect
dtype="float32", # "float32", "float16", "uint8" — or None for auto-detect
timeout_ms=5000.0, # Wait up to 5s for producer to appear (default)
)
)
# Option 1: Get torch.Tensor (GPU, zero-copy)
result = importer.get_frame()
if result.outcome is ImportOutcome.NEW_FRAME:
tensor = result.frame # torch.Tensor on GPU, shape (1080, 1920, 4)
# Use directly in AI model:
# output = model(tensor)
# Option 2: Get numpy array (CPU, involves D2H copy)
result = importer.get_frame_numpy()
if result.outcome is ImportOutcome.NEW_FRAME:
array = result.frame # numpy.ndarray on CPU
# Use in OpenCV, PIL, etc.:
# cv2.imwrite("frame.png", array)
# Option 3: Get CuPy array (GPU, zero-copy)
result = importer.get_frame_cupy()
if result.outcome is ImportOutcome.NEW_FRAME:
cupy_arr = result.frame # cupy.ndarray on GPU
# Use in CuPy/JAX workflows
# Context manager (recommended — ensures cleanup on exit)
with Importer.open(ImportSpec(shm_name="my_texture_ipc")) as importer:
for _ in range(100):
result = importer.get_frame()
if result.outcome is ImportOutcome.NEW_FRAME:
tensor = result.frame
# Explicit cleanup
importer.close()
```
### 3. Python → TouchDesigner (AI Output)
Send AI-generated frames **back to TD** for display:
```python
from cuda_link import Exporter, FrameSpec, GpuFrame
with Exporter.open(
FrameSpec(
shm_name="ai_output_ipc", # Must match TD Receiver's Ipcmemname parameter
height=512, width=512,
channels=4, dtype="uint8",
num_slots=2, # Ring buffer slots (double-buffering)
)
) as exporter:
# Export each AI-generated frame (~10-20μs overhead at 512x512)
exporter.export(GpuFrame(
ptr=output_tensor.data_ptr(),
size=output_tensor.nbytes,
))
```
On the TD side, set `CUDAIPCExtension` **Mode** to `Receiver` with matching `Ipcmemname`.
## Architecture
```
Direction A: TD (Producer) → Python (Consumer)
──────────────────────────────────────────────
CUDAIPCExtension facade
└── TDSenderEngine (thin TD adapter) Importer
│ cuda_memory() → GpuFrame │ get_frame() / get_frame_numpy()
│ delegates to Exporter │ Waits on IPC event
└─→ SharedMemory ←─────────────────┘
Direction B: Python (Producer) → TD (Consumer)
───────────────────────────────────────────────
Exporter CUDAIPCExtension facade
│ export(GpuFrame(ptr, size)) └── TDReceiverEngine
│ cudaMemcpy D2D → ring buf │ import_frame(script_top)
└─→ SharedMemory ←──────────────────────┘ copyCUDAMemory()
Both directions share the same v0.5.0 binary protocol.
```
The TD extension uses a **facade-with-delegation** pattern: `CUDAIPCExtension` (~300 LOC) holds either a `TDSenderEngine` or `TDReceiverEngine` and delegates all work to it. `TDSenderEngine` is a thin TD-only adapter (~415 LOC) over the canonical `Exporter` — it owns pixel-format bridging, the `cuda_memory()`→`GpuFrame` translation, dynamic geometry reopen, and `HolderBarrier` lifecycle; all GPU ring-buffer logic delegates to `Exporter`. Mode switches replace the engine entirely — zero cross-mode state leak. All TouchDesigner runtime access (`ownerComp.par.*`, `top.cudaMemory()`, `copyCUDAMemory()`) goes through the `TDHost`/`TOPHandle` adapter seam, making the engine logic testable without a TD runtime.
### Ring Buffer (3 Slots)
The system uses a 3-slot ring buffer to allow producer and consumer to work in parallel:
- **Slot 0**: Producer writes frame N
- **Slot 1**: Producer writes frame N+1 while consumer reads frame N
- **Slot 2**: Producer writes frame N+2 while consumer reads frame N+1
- Wraps back to Slot 0 for frame N+3
This prevents blocking - producer never waits for consumer, consumer is always 1 frame behind.
### SharedMemory Protocol (433 bytes for 3 slots)
```
[0-3] magic "CIPD" (4B) - Protocol validation (0x43495044)
[4-11] version (8B) - Increments on TD re-initialization
[12-15] num_slots (4B) - Number of ring buffer slots (3)
[16-19] write_idx (4B) - Current write index (atomic counter)
Per slot (128 bytes each):
[20+slot*128 : 84+slot*128] cudaIpcMemHandle_t (64B) - GPU memory handle
[84+slot*128 : 148+slot*128] cudaIpcEventHandle_t (64B) - GPU event handle
[20+NUM_SLOTS*128] shutdown_flag (1B) - Reasserted to 0 every frame; set to 1 on exit
[21+NUM_SLOTS*128] metadata (20B) - width/height/num_comps/dtype/buffer_size
[41+NUM_SLOTS*128] timestamp (8B) - Producer perf_counter() for latency
```
For 3 slots: `20 + (3 × 128) + 1 + 20 + 8 = 433 bytes`
## Documentation
- **[TOX Build Guide](docs/TOX_BUILD_GUIDE.md)** - Step-by-step .tox assembly in TouchDesigner
- **[Architecture](docs/ARCHITECTURE.md)** - Protocol spec, ring buffer design, GPU sync
- **[Integration Examples](docs/INTEGRATION_EXAMPLES.md)** - TD→PyTorch, TD→OpenCV, multi-stream
## Testing
Run the full test suite:
```bash
cd C:\path\to\CUDA_IPC
# Protocol tests (no CUDA needed)
pytest tests/test_shm_protocol.py -v
# Unit tests (requires CUDA)
pytest tests/test_cuda_ipc_wrapper.py -v
# All tests
pytest tests/ -v
# Skip slow multi-process tests
pytest tests/ -v -m "not slow"
```
## Benchmarks
All results on RTX 4090 / PCIe 4.0 x16 / Windows 11 / driver 596.36. RGBA (4-channel) frames.
Key highlights:
- **`export_frame()` standalone** — 22 µs p50 (512×512) → 367 µs (4K) with EXPORT_SYNC=1. CUDA Graphs saves <5% at these sizes (GPU D2D copy dominates).
- **`get_frame_numpy()` D2H** — 0.18 ms p50 (512×512) → 5.7 ms (4K) at ~22–24 GB/s PCIe 4.0.
- **Full IPC roundtrip** — IPC notification latency ~136–286 µs cross-process (resolution-independent signaling).
- **vs CPU SharedMemory** — ~3.4× faster E2E at 1080p, ~2.1× at 512×512. Producer write 4–19× faster (no CPU transit). With `get_frame()` / `get_frame_cupy()` (zero-copy), the consumer read collapses to <5 µs.
Full tables, per-resolution breakdowns, and CUDA Graphs A/B comparison: **[docs/BENCHMARKS.md](docs/BENCHMARKS.md)**
### Performance Tuning (env vars)
| Variable | Default | Effect |
|---|---|---|
| `CUDALINK_USE_GRAPHS` | `1` | CUDA Graphs for `export()` (Python-side `Exporter`). Collapses the `stream_wait_event + memcpy_async + record_event` triplet into a single `cudaGraphLaunch`, cutting WDDM kernel-mode transitions from 3 → 2 per frame. With EXPORT_SYNC=1 (default) the GPU D2D copy dominates wall-clock time and the net savings are small (<5% at 1080p on PCIe 4.0); see [docs/BENCHMARKS.md](docs/BENCHMARKS.md) for measured A/B. Set to `0` to revert to the legacy stream path (e.g., if a driver version rejects graph capture). |
| `CUDALINK_TD_USE_GRAPHS` | `0` | CUDA Graphs for the TouchDesigner-side `CUDAIPCExtension` Sender. Same mechanism as `CUDALINK_USE_GRAPHS`, gated independently because TD ships `cudart64_110.dll` and the per-frame `cudaGraphExecMemcpyNodeSetParams1D` API requires CUDA 11.3+. Auto-disabled on older runtimes (probed via `cudaRuntimeGetVersion` at `initialize()`). Disabled by default. Set to `1` to opt in; falls back to the legacy `cudaMemcpyAsync` stream path automatically on any capture or launch failure. |
| `CUDALINK_D2H_STREAMS` | `1` | Number of parallel streams for `get_frame_numpy()` D2H copy. Values `2`/`4` may help on PCIe 3.0 systems or GPUs with dual DMA engines; on PCIe 4.0 a single stream already saturates the bus (~23–24 GB/s). Check `nvidia-smi -q \| findstr "Async Engines"` before tuning. |
| `CUDALINK_EXPORT_SYNC` | `0` | Block CPU on the IPC stream after each `export_frame()`. Default off — the CUDA IPC event already provides correct cross-process GPU ordering. Set to `1` to restore synchronous behaviour (recommended for concurrent TD Sender+Receiver topologies and when `CUDALINK_USE_GRAPHS=0`). Note: `TDConfig` (`td_exporter/TDConfig.py`) still defaults to `True` for TDR-cascade safety in shared-process topologies. |
| `CUDALINK_ACTIVATION_BARRIER` | `1` | Python-lib side of the cross-process activation barrier (F9). Reads a tiny SHM counter each `export_frame()` and skips publishing while a TD Sender is in its WDDM-saturating init window. No-op in single-pair topologies (counter stays at 0); gracefully skipped if the SHM segment is absent. Set to `0` to opt out. |
| `CUDALINK_TD_ACTIVATION_BARRIER` | `1` | TD-side counterpart of `CUDALINK_ACTIVATION_BARRIER` — increments the same SHM counter around Sender `initialize()` so the Python producer backs off. Same no-op / graceful-absence behaviour. Set to `0` to opt out. |
| `CUDALINK_TD_PERSIST_STREAM` | `1` | Skip `stream_destroy` in TD Sender `cleanup()` so the IPC CUDA stream survives `deactivate`→`reactivate` cycles (F8). Free in single-pair (no deactivation ever happens); load-bearing in concurrent — without it, stream recreate on each reactivation collides with in-flight Receiver work, doubling first-settle `post=` latency (Phase 3.6 confirmed). Set to `0` to opt out. |
| `CUDALINK_TD_STREAM_PRIO` | `normal` | CUDA stream priority for the TD Sender's IPC stream. Default `normal` is safe for both single-pair and concurrent topologies — in single-pair only one stream exists per process so priority is moot; in concurrent, equal priorities prevent WDDM contention accumulation across reactivation cycles (high/high contention produces non-recovering cycle-3 shutdowns, Phase 3.6 Step C confirmed). Set to `high` only for explicit single-pair lowest-latency optimisation. |
| `CUDALINK_EXPORT_FLUSH_PROBE` | `1` | Insert a non-blocking `cudaStreamQuery(ipc_stream)` after `check_sticky_error` when `EXPORT_SYNC=0`. Forces WDDM-deferred CUDA submissions to drain each frame, preventing Windows Task Manager's 3D-engine counter from inflating when true compute load (per NVML) is low. NVML readings are unchanged — purely cosmetic/observability. Set to `0` to disable. |
| `CUDALINK_EXPORT_PROFILE` | `0` | Enable fine-grained per-region sub-timers in `export_frame()` and emit a `[PROFILE] pre=…us interop=…us post=…us memcpy=…us record=…us sync=…us sticky=…us flush_probe=…us shm=…us unacc=…us` line every 97 frames. Force-enables `verbose_performance` (TD) / `debug` (lib). Diagnostic-only; negligible overhead when on, zero when unset. |
| `CUDALINK_NVTX` | `0` | Enable NVTX range annotations on top-level phase boundaries (`cudalink.exporter.flush_probe`, `cudalink.receiver.import_frame`, `cudalink.receiver.event_wait`, etc.) for Nsight Systems GPU timeline correlation. Zero overhead when off. Set to `1` before running any `nsys` capture; see [docs/PROFILING.md](docs/PROFILING.md). |
| `CUDALINK_NVTX_VERBOSE` | `0` | Enable additional sub-operation NVTX ranges (sticky-error check, D2A copy submit, SHM header read) inside the top-level phase ranges. Only useful for deep per-frame breakdown captures; implies `CUDALINK_NVTX=1`. |
| `CUDALINK_TD_GRAPHS_DEFERRED` | `0` | Defer CUDA Graph capture to after the second `export_frame()` call (TD Sender). Avoids a first-frame graph-capture stall in latency-sensitive topologies where the graph build cost would be visible. |
| `CUDALINK_TD_INIT_PACE` | `0` | Throttle the TD Sender init sequence to reduce WDDM saturation during activation windows (experimental). Adds a small sleep between consecutive CUDA API calls at `initialize()` time; useful when concurrent Sender+Receiver activation produces WDDM kernel-mode queue backpressure. |
| `CUDALINK_TD_BARRIER_SETTLE_FRAMES` | `30` | Number of frames the TD activation barrier counter remains armed after a Sender `initialize()` completes, giving the Python producer time to back off before publishing resumes. Increase if your Python producer's poll loop is slower than 30 frames at your target rate; decrease for tighter single-pair topologies. |
| `CUDALINK_NVML` | `0` | Append NVML GPU telemetry (utilization %, clocks MHz, PCIe Tx/Rx MB/s, temperature °C, power W, throttle reasons) to the 97-frame periodic stats line emitted by `CUDALINK_EXPORT_PROFILE`. Requires `nvidia-ml-py` (`pip install nvidia-ml-py`). Zero overhead when off. |
For GPU-timeline profiling (Nsight Systems / Nsight Compute / compute-sanitizer) see [docs/PROFILING.md](docs/PROFILING.md).
## Troubleshooting
### "SharedMemory not found"
**Cause**: Python importer started before TD exporter initialized.
**Solution**: Ensure the TD component is active before starting the Python process. If starting both together, use `timeout_ms` to give the producer time to initialize:
```python
from cuda_link import Importer, ImportSpec
importer = Importer.open(ImportSpec(shm_name="my_project_ipc", timeout_ms=10000.0)) # Wait up to 10s
```
### "CUDA IPC overhead unexpectedly high"
**Cause**: In standalone Python processes (WDDM), `export_frame()` with EXPORT_SYNC=1 typically measures 42–400 µs p50 (512×512 → 4K float32 RGBA, RTX 4090 / PCIe 4.0). Values 2–5× higher than these baselines may indicate GPU driver overhead, context contention, or PCIe bandwidth sharing with other D2H workloads.
**Solution**: Compare against the baseline numbers in [docs/BENCHMARKS.md](docs/BENCHMARKS.md). Contributors with a local clone may reproduce using `python benchmarks/bench_graphs.py` (standalone) or `python benchmarks/bench_sweep.py --quick` (multiprocess).
### "Version mismatch" or stale frames
**Cause**: TD re-exported IPC handles (network reset, resolution change).
**Solution**: The importer automatically detects version changes and re-opens handles. No action needed.
### GPU memory leak
**Cause**: Importer not cleaned up properly.
**Solution**: Use the context manager pattern for automatic cleanup:
```python
from cuda_link import Importer, ImportSpec, ImportOutcome
with Importer.open(ImportSpec(shm_name="my_project_ipc")) as importer:
# importer.close() is called automatically on exit
result = importer.get_frame()
if result.outcome is ImportOutcome.NEW_FRAME:
tensor = result.frame
```
Or call `importer.close()` explicitly in a `finally` block.
## Distribution
cuda-link uses a **dual distribution model** to support both use cases:
### For Python Consumers (StreamDiffusion, AI/ML pipelines)
#### Method 1: Build wheel (recommended — portable, installs into any environment)
```bash
git clone https://github.com/forkni/cuda-link.git
cd cuda-link
# Run the build script (uses PEP 517 isolated build via python -m build)
build_wheel.cmd
# Output: dist\cuda_link-1.7.1-py3-none-any.whl (~30 KB)
# Install into any Python environment — conda, venv, system Python, TouchDesigner Python:
pip install "dist\cuda_link-1.7.1-py3-none-any.whl[torch]" # PyTorch GPU tensors
pip install "dist\cuda_link-1.7.1-py3-none-any.whl[cupy]" # CuPy GPU arrays
pip install "dist\cuda_link-1.7.1-py3-none-any.whl[numpy]" # NumPy CPU arrays
pip install "dist\cuda_link-1.7.1-py3-none-any.whl[all]" # All output modes
# Force reinstall to update:
pip install --force-reinstall "dist\cuda_link-1.7.1-py3-none-any.whl[torch]"
```
The wheel is a self-contained archive — copy it anywhere and install without needing the source tree.
#### Method 2: Editable install from source (for development)
```bash
git clone https://github.com/forkni/cuda-link.git
cd cuda-link
pip install -e ".[torch]" # Changes to src/cuda_link/ apply immediately, no rebuild needed
pip install -e ".[all]" # All output modes
```
#### Method 3: From PyPI (coming soon)
```bash
# pip install cuda-link[torch]
```
**Usage**:
```python
from cuda_link import Importer, ImportSpec, ImportOutcome
importer = Importer.open(ImportSpec(shm_name="my_project_ipc"))
result = importer.get_frame()
if result.outcome is ImportOutcome.NEW_FRAME:
tensor = result.frame # torch.Tensor, GPU zero-copy
```
The `cuda-link` package contains only the **consumer-side** Python code (`src/cuda_link/`). The TouchDesigner extension is distributed separately.
### For TouchDesigner Integration
**Option A: Use the .tox component** (recommended)
Drag `TOXES/CUDAIPCLink_v1.7.1.tox` into your TouchDesigner network.
> **Older versions:** Previous `.tox` releases are available as downloadable assets on the
> [GitHub Releases page](https://github.com/forkni/cuda-link/releases) — pick the tag
> matching the TouchDesigner build you target.
**Option B: Build from source**
Follow the manual build guide at [`docs/TOX_BUILD_GUIDE.md`](docs/TOX_BUILD_GUIDE.md) to assemble the `.tox` from `td_exporter/` source files.
The TouchDesigner extension (`td_exporter/`) is **not included in the pip package** because it uses TD-specific APIs (`parent()`, `op()`, `me`, COMP-scoped imports) that cannot run outside TouchDesigner.
### Use Cases
| Use Case | TD Side | Python Side |
|----------|---------|-------------|
| **TD → Python** (StreamDiffusion, AI pipelines) | `.tox` Sender mode | `pip install dist\cuda_link-*.whl[torch]` |
| **Python → TD** (AI output display) | `.tox` Receiver mode | `pip install dist\cuda_link-*.whl[torch]` |
| **TD → TD** (two instances communicating) | `.tox` on both sides | Not needed |
Both sides communicate through the 433-byte SharedMemory protocol — zero import dependencies between TD and Python code.
---
## Changelog
See [CHANGELOG.md](CHANGELOG.md) for the full history.
---
## License
MIT License - See LICENSE file
## Credits
Original implementation by Forkni (forkni@gmail.com).
Extracted and refactored from the StreamDiffusionTD project.