https://github.com/forkni/cuda-link

Zero-copy bidirectional GPU texture sharing between TouchDesigner and Python via CUDA IPC. Sub-microsecond per-frame overhead with ring buffer architecture and GPU-side synchronization.
https://github.com/forkni/cuda-link
cuda cupy gpu inter-process-communication ipc python pytorch real-time shared-memory texture-sharing touchdesigner zero-copy
Last synced: about 2 months ago
JSON representation
Zero-copy bidirectional GPU texture sharing between TouchDesigner and Python via CUDA IPC. Sub-microsecond per-frame overhead with ring buffer architecture and GPU-side synchronization.
Host: GitHub
URL: https://github.com/forkni/cuda-link
Owner: forkni
License: mit
Created: 2026-02-12T03:38:11.000Z (5 months ago)
Default Branch: master
Last Pushed: 2026-05-29T07:48:50.000Z (about 2 months ago)
Last Synced: 2026-05-29T09:25:17.285Z (about 2 months ago)
Topics: cuda, cupy, gpu, inter-process-communication, ipc, python, pytorch, real-time, shared-memory, texture-sharing, touchdesigner, zero-copy
Language: Python
Size: 92.3 MB
Stars: 17
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE
Awesome Lists containing this project

README

          # cuda-link

Zero-copy GPU texture transfer between TouchDesigner and Python processes using CUDA IPC.

## Overview

This component enables **zero-copy GPU texture sharing** between TouchDesigner and Python processes using CUDA Inter-Process Communication (IPC). It eliminates CPU memory copies for real-time AI pipelines, video processing, and other GPU-accelerated workflows.

### Key Features

- **Zero-copy GPU transfer** - Textures stay on GPU, no CPU memory copies

- **Bidirectional IPC** - TD → Python (input capture) AND Python → TD (AI output display)

- **Low-overhead IPC** - `export_frame()` 22–367 µs p50 (512×512 → 4K float32, EXPORT_SYNC=1); `get_frame_numpy()` D2H 0.18–5.7 ms p50 (PCIe 4.0 ~22–24 GB/s); IPC notification ~136–286 µs cross-process (see [docs/BENCHMARKS.md](docs/BENCHMARKS.md))

- **Ring buffer architecture** - N-slot pipeline prevents producer/consumer blocking

- **GPU-side synchronization** - CUDA IPC events eliminate CPU polling

- **Triple output modes** - PyTorch tensors (GPU, zero-copy), CuPy arrays (GPU, zero-copy), or numpy arrays (CPU, D2H copy)

- **Production-ready** - Tested at 30+ FPS for hours, handles dynamic resolution changes

### Performance

Measured on RTX 4090 / PCIe 4.0 x16 / Windows 11 / driver 596.36. All Python-side.

| Operation | p50 | Notes |

|-----------|-----|-------|

| `export_frame()` — 512×512 RGBA float32 | 22 µs | Standalone, EXPORT_SYNC=1; GPU D2D + stream_synchronize |

| `export_frame()` — 1080p RGBA float32 | 117 µs | Standalone, EXPORT_SYNC=1 |

| `export_frame()` — 4K RGBA float32 | 367 µs | Standalone, EXPORT_SYNC=1 |

| `get_frame_numpy()` D2H — 512×512 float32 | 0.18 ms | Standalone, ~22 GB/s |

| `get_frame_numpy()` D2H — 1080p float32 | 1.32 ms | Standalone, ~24 GB/s PCIe 4.0 |

| `get_frame_numpy()` D2H — 4K float32 | 5.7 ms | Standalone, ~21 GB/s PCIe 4.0 |

| `get_frame()` / `get_frame_cupy()` GPU | <5 µs | Zero-copy tensor/array view, no D2H |

| IPC notification latency | ~136–286 µs | Producer publish → consumer detect (cross-process) |

| Initialization | ~50–100 µs | One-time IPC handle opening |

## Requirements

- **OS**: Windows 10/11 (CUDA IPC is Windows-only)

- **CUDA**: 12.x (tested with 12.4)

- **GPU**: NVIDIA GPU with CUDA compute capability 3.5+

- **TouchDesigner**: 2022.x or later (for producer side)

- **Python**: 3.9+ (for consumer side)

### Python Dependencies

**Required**: None (pure ctypes CUDA wrapper)

**Optional**:

- `torch>=2.0` - For zero-copy GPU tensor output (recommended for AI pipelines)

- `cupy-cuda12x>=12.0` - For zero-copy GPU array output (CuPy/JAX workflows)

- `numpy>=1.21` - For CPU array output (for OpenCV, etc.)

## Quick Start

### 1. TouchDesigner Side (Exporter)

**Option A: Use the .tox component** (recommended)

1. Drag `TOXES/CUDAIPCLink_v1.7.1.tox` into your TD network

2. Wire your source TOP to the `input` In TOP

3. Set `Ipcmemname` parameter (e.g., `"my_texture_ipc"`)

4. Enable `Active` toggle

The component displays its transfer state in the read-only **Status** custom parameter:

`"x  ch"` during active transfer, `"WARNING: ..."` or `"ERROR: ..."` on

faults, and `"Idle"` when inactive. A `warning_emitter` Script TOP inside the COMP also

shows a local warning badge when the component is open. See [`td_exporter/HELP_DOC.md`](td_exporter/HELP_DOC.md)

for per-parameter documentation.

**Option B: Build from source**

See [`docs/TOX_BUILD_GUIDE.md`](docs/TOX_BUILD_GUIDE.md) for step-by-step assembly.

**Option C: Library mode (cleaner .tox — fewer Text DATs)**

Install `cuda_link` into a Python environment TouchDesigner can see. The `CUDALinkBootstrap`

DAT then loads the package automatically — the 14 mirror Text DATs (Env, SHMProtocol,

Exporter, Importer, …) are no longer needed in the `.tox`. Run the multi-target installer

(one-time):

```bat

build_wheel.cmd                    REM build dist\cuda_link-1.7.1-py3-none-any.whl

install_td_library.cmd             REM interactive menu — choose one of 5 install modes

```

**Install modes** (`python scripts/install_td_library.py --help`):

| Mode | Flag | Description |

|------|------|-------------|

| 1 | `--target DIR` | Install into a custom folder; set `CUDALINK_LIB_PATH=DIR` before launching TD |

| 2 | `--venv DIR` | Install into an existing venv that TD is configured to use |

| 3 | `--conda ENV` | Install into a conda environment |

| 4 | `--python EXE` | Install into a parallel Python; auto-writes TD Preferences — no env var needed |

| 5 | `--td-python EXE` | Install directly into TD's bundled Python (`app.pythonExecutable`) |

**Mode 4 (recommended for most setups):** auto-discovers both the registered system Python

(`py -3`) and the TouchDesigner install path; sets `Python64 Path` in TD Preferences so

library mode activates on the next TD launch with zero env-var configuration.

```bat

REM Non-interactive mode 4 (auto-discover Python + TD):

install_td_library.cmd --mode 4 --non-interactive

REM Dry-run to preview what would be written:

install_td_library.cmd --mode 4 --dry-run

```

The `TDHost`/`TDConfig`/`TDSender`/`TDReceiver` glue DATs remain in the COMP unchanged.

If `CUDALINK_LIB_PATH` is unset and mode 4 was not used, the bootstrap no-ops and the

classic mirror DATs take over silently. See [`docs/TOX_BUILD_GUIDE.md`](docs/TOX_BUILD_GUIDE.md)

for full instructions.

### 2. Python Side (Importer)

#### Install the package

```bash

# Option A: Build wheel and install (recommended — portable, no source needed):

cd C:\path\to\CUDA_IPC

build_wheel.cmd                             # Builds dist\cuda_link-1.7.1-py3-none-any.whl

pip install "dist\cuda_link-1.7.1-py3-none-any.whl[torch]"   # PyTorch GPU tensors

pip install "dist\cuda_link-1.7.1-py3-none-any.whl[cupy]"    # CuPy GPU arrays

pip install "dist\cuda_link-1.7.1-py3-none-any.whl[numpy]"   # NumPy CPU arrays

pip install "dist\cuda_link-1.7.1-py3-none-any.whl[all]"     # All output modes

# Option B: Editable install from source (for development — changes apply immediately):

pip install -e ".[torch]"

pip install -e ".[all]"

# From PyPI (coming soon):

# pip install cuda-link[torch]

```

#### Use in your Python script

```python

from cuda_link import Importer, ImportSpec, ImportOutcome

importer = Importer.open(

    ImportSpec(

        shm_name="my_texture_ipc",

        shape=(1080, 1920, 4),  # height, width, channels (RGBA) — or None for auto-detect

        dtype="float32",         # "float32", "float16", "uint8" — or None for auto-detect

        timeout_ms=5000.0,       # Wait up to 5s for producer to appear (default)

    )

)

# Option 1: Get torch.Tensor (GPU, zero-copy)

result = importer.get_frame()

if result.outcome is ImportOutcome.NEW_FRAME:

    tensor = result.frame  # torch.Tensor on GPU, shape (1080, 1920, 4)

    # Use directly in AI model:

    # output = model(tensor)

# Option 2: Get numpy array (CPU, involves D2H copy)

result = importer.get_frame_numpy()

if result.outcome is ImportOutcome.NEW_FRAME:

    array = result.frame  # numpy.ndarray on CPU

    # Use in OpenCV, PIL, etc.:

    # cv2.imwrite("frame.png", array)

# Option 3: Get CuPy array (GPU, zero-copy)

result = importer.get_frame_cupy()

if result.outcome is ImportOutcome.NEW_FRAME:

    cupy_arr = result.frame  # cupy.ndarray on GPU

    # Use in CuPy/JAX workflows

# Context manager (recommended — ensures cleanup on exit)

with Importer.open(ImportSpec(shm_name="my_texture_ipc")) as importer:

    for _ in range(100):

        result = importer.get_frame()

        if result.outcome is ImportOutcome.NEW_FRAME:

            tensor = result.frame

# Explicit cleanup

importer.close()

```

### 3. Python → TouchDesigner (AI Output)

Send AI-generated frames **back to TD** for display:

```python

from cuda_link import Exporter, FrameSpec, GpuFrame

with Exporter.open(

    FrameSpec(

        shm_name="ai_output_ipc",  # Must match TD Receiver's Ipcmemname parameter

        height=512, width=512,

        channels=4, dtype="uint8",

        num_slots=2,               # Ring buffer slots (double-buffering)

    )

) as exporter:

    # Export each AI-generated frame (~10-20μs overhead at 512x512)

    exporter.export(GpuFrame(

        ptr=output_tensor.data_ptr(),

        size=output_tensor.nbytes,

    ))

```

On the TD side, set `CUDAIPCExtension` **Mode** to `Receiver` with matching `Ipcmemname`.

## Architecture

```

Direction A: TD (Producer) → Python (Consumer)

──────────────────────────────────────────────

CUDAIPCExtension facade

  └── TDSenderEngine (thin TD adapter)   Importer

        │ cuda_memory() → GpuFrame         │ get_frame() / get_frame_numpy()

        │ delegates to Exporter            │ Waits on IPC event

        └─→ SharedMemory ←─────────────────┘

Direction B: Python (Producer) → TD (Consumer)

───────────────────────────────────────────────

Exporter                           CUDAIPCExtension facade

  │ export(GpuFrame(ptr, size))      └── TDReceiverEngine

  │ cudaMemcpy D2D → ring buf             │ import_frame(script_top)

  └─→ SharedMemory ←──────────────────────┘ copyCUDAMemory()

Both directions share the same v0.5.0 binary protocol.

```

The TD extension uses a **facade-with-delegation** pattern: `CUDAIPCExtension` (~300 LOC) holds either a `TDSenderEngine` or `TDReceiverEngine` and delegates all work to it. `TDSenderEngine` is a thin TD-only adapter (~415 LOC) over the canonical `Exporter` — it owns pixel-format bridging, the `cuda_memory()`→`GpuFrame` translation, dynamic geometry reopen, and `HolderBarrier` lifecycle; all GPU ring-buffer logic delegates to `Exporter`. Mode switches replace the engine entirely — zero cross-mode state leak. All TouchDesigner runtime access (`ownerComp.par.*`, `top.cudaMemory()`, `copyCUDAMemory()`) goes through the `TDHost`/`TOPHandle` adapter seam, making the engine logic testable without a TD runtime.

### Ring Buffer (3 Slots)

The system uses a 3-slot ring buffer to allow producer and consumer to work in parallel:

- **Slot 0**: Producer writes frame N

- **Slot 1**: Producer writes frame N+1 while consumer reads frame N

- **Slot 2**: Producer writes frame N+2 while consumer reads frame N+1

- Wraps back to Slot 0 for frame N+3

This prevents blocking - producer never waits for consumer, consumer is always 1 frame behind.

### SharedMemory Protocol (433 bytes for 3 slots)

```

[0-3]     magic "CIPD" (4B)       - Protocol validation (0x43495044)

[4-11]    version (8B)             - Increments on TD re-initialization

[12-15]   num_slots (4B)           - Number of ring buffer slots (3)

[16-19]   write_idx (4B)           - Current write index (atomic counter)

Per slot (128 bytes each):

[20+slot*128 : 84+slot*128]   cudaIpcMemHandle_t (64B)  - GPU memory handle

[84+slot*128 : 148+slot*128]  cudaIpcEventHandle_t (64B) - GPU event handle

[20+NUM_SLOTS*128]        shutdown_flag (1B)   - Reasserted to 0 every frame; set to 1 on exit

[21+NUM_SLOTS*128]        metadata (20B)       - width/height/num_comps/dtype/buffer_size

[41+NUM_SLOTS*128]        timestamp (8B)       - Producer perf_counter() for latency

```

For 3 slots: `20 + (3 × 128) + 1 + 20 + 8 = 433 bytes`

## Documentation

- **[TOX Build Guide](docs/TOX_BUILD_GUIDE.md)** - Step-by-step .tox assembly in TouchDesigner

- **[Architecture](docs/ARCHITECTURE.md)** - Protocol spec, ring buffer design, GPU sync

- **[Integration Examples](docs/INTEGRATION_EXAMPLES.md)** - TD→PyTorch, TD→OpenCV, multi-stream

## Testing

Run the full test suite:

```bash

cd C:\path\to\CUDA_IPC

# Protocol tests (no CUDA needed)

pytest tests/test_shm_protocol.py -v

# Unit tests (requires CUDA)

pytest tests/test_cuda_ipc_wrapper.py -v

# All tests

pytest tests/ -v

# Skip slow multi-process tests

pytest tests/ -v -m "not slow"

```

## Benchmarks

All results on RTX 4090 / PCIe 4.0 x16 / Windows 11 / driver 596.36. RGBA (4-channel) frames.

Key highlights:

- **`export_frame()` standalone** — 22 µs p50 (512×512) → 367 µs (4K) with EXPORT_SYNC=1. CUDA Graphs saves <5% at these sizes (GPU D2D copy dominates).

- **`get_frame_numpy()` D2H** — 0.18 ms p50 (512×512) → 5.7 ms (4K) at ~22–24 GB/s PCIe 4.0.

- **Full IPC roundtrip** — IPC notification latency ~136–286 µs cross-process (resolution-independent signaling).

- **vs CPU SharedMemory** — ~3.4× faster E2E at 1080p, ~2.1× at 512×512. Producer write 4–19× faster (no CPU transit). With `get_frame()` / `get_frame_cupy()` (zero-copy), the consumer read collapses to <5 µs.

Full tables, per-resolution breakdowns, and CUDA Graphs A/B comparison: **[docs/BENCHMARKS.md](docs/BENCHMARKS.md)**

### Performance Tuning (env vars)

| Variable | Default | Effect |

|---|---|---|

| `CUDALINK_USE_GRAPHS` | `1` | CUDA Graphs for `export()` (Python-side `Exporter`). Collapses the `stream_wait_event + memcpy_async + record_event` triplet into a single `cudaGraphLaunch`, cutting WDDM kernel-mode transitions from 3 → 2 per frame. With EXPORT_SYNC=1 (default) the GPU D2D copy dominates wall-clock time and the net savings are small (<5% at 1080p on PCIe 4.0); see [docs/BENCHMARKS.md](docs/BENCHMARKS.md) for measured A/B. Set to `0` to revert to the legacy stream path (e.g., if a driver version rejects graph capture). |

| `CUDALINK_TD_USE_GRAPHS` | `0` | CUDA Graphs for the TouchDesigner-side `CUDAIPCExtension` Sender. Same mechanism as `CUDALINK_USE_GRAPHS`, gated independently because TD ships `cudart64_110.dll` and the per-frame `cudaGraphExecMemcpyNodeSetParams1D` API requires CUDA 11.3+. Auto-disabled on older runtimes (probed via `cudaRuntimeGetVersion` at `initialize()`). Disabled by default. Set to `1` to opt in; falls back to the legacy `cudaMemcpyAsync` stream path automatically on any capture or launch failure. |

| `CUDALINK_D2H_STREAMS` | `1` | Number of parallel streams for `get_frame_numpy()` D2H copy. Values `2`/`4` may help on PCIe 3.0 systems or GPUs with dual DMA engines; on PCIe 4.0 a single stream already saturates the bus (~23–24 GB/s). Check `nvidia-smi -q \| findstr "Async Engines"` before tuning. |

| `CUDALINK_EXPORT_SYNC` | `0` | Block CPU on the IPC stream after each `export_frame()`. Default off — the CUDA IPC event already provides correct cross-process GPU ordering. Set to `1` to restore synchronous behaviour (recommended for concurrent TD Sender+Receiver topologies and when `CUDALINK_USE_GRAPHS=0`). Note: `TDConfig` (`td_exporter/TDConfig.py`) still defaults to `True` for TDR-cascade safety in shared-process topologies. |

| `CUDALINK_ACTIVATION_BARRIER` | `1` | Python-lib side of the cross-process activation barrier (F9). Reads a tiny SHM counter each `export_frame()` and skips publishing while a TD Sender is in its WDDM-saturating init window. No-op in single-pair topologies (counter stays at 0); gracefully skipped if the SHM segment is absent. Set to `0` to opt out. |

| `CUDALINK_TD_ACTIVATION_BARRIER` | `1` | TD-side counterpart of `CUDALINK_ACTIVATION_BARRIER` — increments the same SHM counter around Sender `initialize()` so the Python producer backs off. Same no-op / graceful-absence behaviour. Set to `0` to opt out. |

| `CUDALINK_TD_PERSIST_STREAM` | `1` | Skip `stream_destroy` in TD Sender `cleanup()` so the IPC CUDA stream survives `deactivate`→`reactivate` cycles (F8). Free in single-pair (no deactivation ever happens); load-bearing in concurrent — without it, stream recreate on each reactivation collides with in-flight Receiver work, doubling first-settle `post=` latency (Phase 3.6 confirmed). Set to `0` to opt out. |

| `CUDALINK_TD_STREAM_PRIO` | `normal` | CUDA stream priority for the TD Sender's IPC stream. Default `normal` is safe for both single-pair and concurrent topologies — in single-pair only one stream exists per process so priority is moot; in concurrent, equal priorities prevent WDDM contention accumulation across reactivation cycles (high/high contention produces non-recovering cycle-3 shutdowns, Phase 3.6 Step C confirmed). Set to `high` only for explicit single-pair lowest-latency optimisation. |

| `CUDALINK_EXPORT_FLUSH_PROBE` | `1` | Insert a non-blocking `cudaStreamQuery(ipc_stream)` after `check_sticky_error` when `EXPORT_SYNC=0`. Forces WDDM-deferred CUDA submissions to drain each frame, preventing Windows Task Manager's 3D-engine counter from inflating when true compute load (per NVML) is low. NVML readings are unchanged — purely cosmetic/observability. Set to `0` to disable. |

| `CUDALINK_EXPORT_PROFILE` | `0` | Enable fine-grained per-region sub-timers in `export_frame()` and emit a `[PROFILE] pre=…us interop=…us post=…us memcpy=…us record=…us sync=…us sticky=…us flush_probe=…us shm=…us unacc=…us` line every 97 frames. Force-enables `verbose_performance` (TD) / `debug` (lib). Diagnostic-only; negligible overhead when on, zero when unset. |

| `CUDALINK_NVTX` | `0` | Enable NVTX range annotations on top-level phase boundaries (`cudalink.exporter.flush_probe`, `cudalink.receiver.import_frame`, `cudalink.receiver.event_wait`, etc.) for Nsight Systems GPU timeline correlation. Zero overhead when off. Set to `1` before running any `nsys` capture; see [docs/PROFILING.md](docs/PROFILING.md). |

| `CUDALINK_NVTX_VERBOSE` | `0` | Enable additional sub-operation NVTX ranges (sticky-error check, D2A copy submit, SHM header read) inside the top-level phase ranges. Only useful for deep per-frame breakdown captures; implies `CUDALINK_NVTX=1`. |

| `CUDALINK_TD_GRAPHS_DEFERRED` | `0` | Defer CUDA Graph capture to after the second `export_frame()` call (TD Sender). Avoids a first-frame graph-capture stall in latency-sensitive topologies where the graph build cost would be visible. |

| `CUDALINK_TD_INIT_PACE` | `0` | Throttle the TD Sender init sequence to reduce WDDM saturation during activation windows (experimental). Adds a small sleep between consecutive CUDA API calls at `initialize()` time; useful when concurrent Sender+Receiver activation produces WDDM kernel-mode queue backpressure. |

| `CUDALINK_TD_BARRIER_SETTLE_FRAMES` | `30` | Number of frames the TD activation barrier counter remains armed after a Sender `initialize()` completes, giving the Python producer time to back off before publishing resumes. Increase if your Python producer's poll loop is slower than 30 frames at your target rate; decrease for tighter single-pair topologies. |

| `CUDALINK_NVML` | `0` | Append NVML GPU telemetry (utilization %, clocks MHz, PCIe Tx/Rx MB/s, temperature °C, power W, throttle reasons) to the 97-frame periodic stats line emitted by `CUDALINK_EXPORT_PROFILE`. Requires `nvidia-ml-py` (`pip install nvidia-ml-py`). Zero overhead when off. |

For GPU-timeline profiling (Nsight Systems / Nsight Compute / compute-sanitizer) see [docs/PROFILING.md](docs/PROFILING.md).

## Troubleshooting

### "SharedMemory not found"

**Cause**: Python importer started before TD exporter initialized.

**Solution**: Ensure the TD component is active before starting the Python process. If starting both together, use `timeout_ms` to give the producer time to initialize:

```python

from cuda_link import Importer, ImportSpec

importer = Importer.open(ImportSpec(shm_name="my_project_ipc", timeout_ms=10000.0))  # Wait up to 10s

```

### "CUDA IPC overhead unexpectedly high"

**Cause**: In standalone Python processes (WDDM), `export_frame()` with EXPORT_SYNC=1 typically measures 42–400 µs p50 (512×512 → 4K float32 RGBA, RTX 4090 / PCIe 4.0). Values 2–5× higher than these baselines may indicate GPU driver overhead, context contention, or PCIe bandwidth sharing with other D2H workloads.

**Solution**: Compare against the baseline numbers in [docs/BENCHMARKS.md](docs/BENCHMARKS.md). Contributors with a local clone may reproduce using `python benchmarks/bench_graphs.py` (standalone) or `python benchmarks/bench_sweep.py --quick` (multiprocess).

### "Version mismatch" or stale frames

**Cause**: TD re-exported IPC handles (network reset, resolution change).

**Solution**: The importer automatically detects version changes and re-opens handles. No action needed.

### GPU memory leak

**Cause**: Importer not cleaned up properly.

**Solution**: Use the context manager pattern for automatic cleanup:

```python

from cuda_link import Importer, ImportSpec, ImportOutcome

with Importer.open(ImportSpec(shm_name="my_project_ipc")) as importer:

    # importer.close() is called automatically on exit

    result = importer.get_frame()

    if result.outcome is ImportOutcome.NEW_FRAME:

        tensor = result.frame

```

Or call `importer.close()` explicitly in a `finally` block.

## Distribution

cuda-link uses a **dual distribution model** to support both use cases:

### For Python Consumers (StreamDiffusion, AI/ML pipelines)

#### Method 1: Build wheel (recommended — portable, installs into any environment)

```bash

git clone https://github.com/forkni/cuda-link.git

cd cuda-link

# Run the build script (uses PEP 517 isolated build via python -m build)

build_wheel.cmd

# Output: dist\cuda_link-1.7.1-py3-none-any.whl  (~30 KB)

# Install into any Python environment — conda, venv, system Python, TouchDesigner Python:

pip install "dist\cuda_link-1.7.1-py3-none-any.whl[torch]"   # PyTorch GPU tensors

pip install "dist\cuda_link-1.7.1-py3-none-any.whl[cupy]"    # CuPy GPU arrays

pip install "dist\cuda_link-1.7.1-py3-none-any.whl[numpy]"   # NumPy CPU arrays

pip install "dist\cuda_link-1.7.1-py3-none-any.whl[all]"     # All output modes

# Force reinstall to update:

pip install --force-reinstall "dist\cuda_link-1.7.1-py3-none-any.whl[torch]"

```

The wheel is a self-contained archive — copy it anywhere and install without needing the source tree.

#### Method 2: Editable install from source (for development)

```bash

git clone https://github.com/forkni/cuda-link.git

cd cuda-link

pip install -e ".[torch]"   # Changes to src/cuda_link/ apply immediately, no rebuild needed

pip install -e ".[all]"     # All output modes

```

#### Method 3: From PyPI (coming soon)

```bash

# pip install cuda-link[torch]

```

**Usage**:

```python

from cuda_link import Importer, ImportSpec, ImportOutcome

importer = Importer.open(ImportSpec(shm_name="my_project_ipc"))

result = importer.get_frame()

if result.outcome is ImportOutcome.NEW_FRAME:

    tensor = result.frame  # torch.Tensor, GPU zero-copy

```

The `cuda-link` package contains only the **consumer-side** Python code (`src/cuda_link/`). The TouchDesigner extension is distributed separately.

### For TouchDesigner Integration

**Option A: Use the .tox component** (recommended)

Drag `TOXES/CUDAIPCLink_v1.7.1.tox` into your TouchDesigner network.

> **Older versions:** Previous `.tox` releases are available as downloadable assets on the

> [GitHub Releases page](https://github.com/forkni/cuda-link/releases) — pick the tag

> matching the TouchDesigner build you target.

**Option B: Build from source**

Follow the manual build guide at [`docs/TOX_BUILD_GUIDE.md`](docs/TOX_BUILD_GUIDE.md) to assemble the `.tox` from `td_exporter/` source files.

The TouchDesigner extension (`td_exporter/`) is **not included in the pip package** because it uses TD-specific APIs (`parent()`, `op()`, `me`, COMP-scoped imports) that cannot run outside TouchDesigner.

### Use Cases

| Use Case | TD Side | Python Side |

|----------|---------|-------------|

| **TD → Python** (StreamDiffusion, AI pipelines) | `.tox` Sender mode | `pip install dist\cuda_link-*.whl[torch]` |

| **Python → TD** (AI output display) | `.tox` Receiver mode | `pip install dist\cuda_link-*.whl[torch]` |

| **TD → TD** (two instances communicating) | `.tox` on both sides | Not needed |

Both sides communicate through the 433-byte SharedMemory protocol — zero import dependencies between TD and Python code.

---

## Changelog

See [CHANGELOG.md](CHANGELOG.md) for the full history.

---

## License

MIT License - See LICENSE file

## Credits

Original implementation by Forkni (forkni@gmail.com).

Extracted and refactored from the StreamDiffusionTD project.
ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/forkni/cuda-link

Awesome Lists containing this project

README