https://github.com/wendylabsinc/tensorrt-swift

TensorRT Swift 6.2 Bindings for Linux
https://github.com/wendylabsinc/tensorrt-swift
cuda nvidia swift tensor tensorrt
Last synced: 5 months ago
JSON representation
TensorRT Swift 6.2 Bindings for Linux
Host: GitHub
URL: https://github.com/wendylabsinc/tensorrt-swift
Owner: wendylabsinc
License: apache-2.0
Created: 2025-12-15T21:27:25.000Z (6 months ago)
Default Branch: main
Last Pushed: 2026-01-01T20:37:25.000Z (6 months ago)
Last Synced: 2026-01-14T00:05:40.906Z (5 months ago)
Topics: cuda, nvidia, swift, tensor, tensorrt
Language: Swift
Homepage: https://wendy.sh/docs/
Size: 348 KB
Stars: 2
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE.txt
Awesome Lists containing this project

README

          # TensorRT Swift (Linux)

[![CI](https://github.com/wendylabsinc/tensorrt-swift/actions/workflows/ci.yml/badge.svg)](https://github.com/wendylabsinc/tensorrt-swift/actions/workflows/ci.yml)

![Swift 6.2+](https://img.shields.io/badge/Swift-6.2%2B-F05138?logo=swift&logoColor=white)

![Linux](https://img.shields.io/badge/Platform-Linux-FCC624?logo=linux&logoColor=black)

![TensorRT](https://img.shields.io/badge/TensorRT-10.x-76B900?logo=nvidia&logoColor=white)

![CUDA](https://img.shields.io/badge/CUDA-12.6-76B900?logo=nvidia&logoColor=white)

Swift Package that provides Swift-first APIs for working with NVIDIA TensorRT on Linux, with a separate TensorRTLLM product for LLM-specific extensions.

> **Note**: The `TensorRT` product wraps the **TensorRT** inference engine. The `TensorRTLLM` product is a thin extension layer today; full TensorRT-LLM integration (in-flight batching, KV-cache management, tensor parallelism) is planned for future releases.

This repository is **work in progress** and **subject to breaking changes** while the low-level foundations are being established.

Swift 6.2 features are used aggressively where feasible:

- `InlineArray` to keep common small metadata (like shapes/strides) allocation-free

- `Span` / `MutableSpan` / `Data.bytes` for safer, more composable views over contiguous memory

- Actor-based `ExecutionContext` for thread-safe inference

## System Requirements

### Required Libraries

The package links against the following system libraries at **build time** and **runtime**:

| Library | Package | Purpose |

|---------|---------|---------|

| `libnvinfer.so` | TensorRT | Core inference engine |

| `libnvinfer_plugin.so` | TensorRT | Built-in plugins |

| `libnvonnxparser.so` | TensorRT | ONNX model import |

| `libcuda.so` | CUDA Driver | GPU access |

### Installation

#### Option 1: NVIDIA Container (Recommended)

Use the official TensorRT container which includes all dependencies:

```bash

docker run --gpus all -it nvcr.io/nvidia/tensorrt:24.08-py3

```

#### Option 1b: Jetson Container (Orin Nano, AGX Thor)

Jetson uses aarch64 containers and must match the host JetPack/L4T release. See

`docs/jetson-container.md` for a full recipe.

#### Option 2: System Installation (Ubuntu/Debian)

```bash

# 1. Install CUDA 12.6

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb

sudo dpkg -i cuda-keyring_1.1-1_all.deb

sudo apt-get update

sudo apt-get install -y cuda-toolkit-12-6

# 2. Install TensorRT 10.x

sudo apt-get install -y libnvinfer10 libnvinfer-plugin10 libnvonnxparser10 libnvinfer-dev

# 3. Add CUDA to your path

export PATH=/usr/local/cuda-12.6/bin:$PATH

export LD_LIBRARY_PATH=/usr/local/cuda-12.6/lib64:$LD_LIBRARY_PATH

```

#### Option 3: From NVIDIA Developer Downloads

1. Download [CUDA Toolkit 12.6](https://developer.nvidia.com/cuda-downloads)

2. Download [TensorRT 10.x](https://developer.nvidia.com/tensorrt) (requires NVIDIA Developer account)

3. Follow NVIDIA's installation guides

### Verifying Installation

```bash

# Check CUDA

nvcc --version

# Check TensorRT

dpkg -l | grep nvinfer

# or

ls /usr/lib/x86_64-linux-gnu/libnvinfer*

```

### Swift Installation

Install Swift 6.2+ via [Swiftly](https://swiftlang.github.io/swiftly/):

```bash

curl -L https://swiftlang.github.io/swiftly/swiftly-install.sh | bash

swiftly install 6.2

```

### Development Workflow (macOS/Windows)

You can write code on macOS or Windows, but **building and running must happen on Linux** with

TensorRT/CUDA libraries available. The recommended workflow is:

- Develop locally on macOS/Windows.

- Build/test inside a Linux container (Option 1 / 1b) or on a Linux host.

Cross-compiling from macOS/Windows to Linux is possible but fragile and not recommended.

## What Works Today

### Core APIs

| API | Description |

|-----|-------------|

| `TensorRTRuntime.buildEngine(onnxURL:options:)` | Build TensorRT engine from ONNX |

| `TensorRTRuntime.deserializeEngine(from:)` | Load serialized engine plan |

| `Engine.save(to:)` / `Engine.load(from:)` | Persist/load engines to disk |

| `ExecutionContext.enqueue(_:)` | Execute inference (host buffers) |

| `ExecutionContext.enqueueDevice(...)` | Execute with device pointers |

| `ExecutionContext.warmup(iterations:)` | Warmup for stable latency |

### GPU & Device APIs

| API | Description |

|-----|-------------|

| `TensorRTSystem.cudaDeviceCount()` | Number of available GPUs |

| `TensorRTSystem.deviceProperties(device:)` | GPU name, compute capability, memory |

| `TensorRTSystem.memoryInfo(device:)` | Free/total GPU memory |

| `TensorRTSystem.CUDAStream` | RAII stream wrapper |

| `TensorRTSystem.CUDAEvent` | RAII event wrapper |

### Dynamic Shapes & Profiles

| API | Description |

|-----|-------------|

| `ExecutionContext.reshape(bindings:)` | Set input shapes at runtime |

| `ExecutionContext.setOptimizationProfile(named:)` | Switch optimization profiles |

| `OptimizationProfile` | Define min/opt/max shapes |

### LLM Extensions (TensorRTLLM)

| API | Description |

|-----|-------------|

| `ExecutionContext.stream(...)` | Streaming inference (AsyncSequence) |

| `StreamingConfiguration` | Configure token-by-token generation |

| `StreamingInferenceStep` | Per-step metadata and outputs |

### Swift-y Conveniences

```swift

// TensorShape with array literal

let shape: TensorShape = [1, 3, 224, 224]

print(shape)        // "TensorShape[1, 3, 224, 224]"

print(shape[0])     // 1

// Engine persistence

try engine.save(to: URL(fileURLWithPath: "model.engine"))

let loaded = try Engine.load(from: URL(fileURLWithPath: "model.engine"))

// Query GPU before loading

let mem = try TensorRTSystem.memoryInfo()

print("Free GPU memory: \(mem.free / 1_000_000_000) GB")

```

## Quick Start

### Add the package to your `Package.swift`

```swift

// swift-tools-version: 6.2

import PackageDescription

let package = Package(

    name: "MyApp",

    dependencies: [

        .package(url: "https://github.com/wendylabsinc/tensorrt-swift", from: "0.0.1"),

    ],

    targets: [

        .executableTarget(

            name: "MyApp",

            dependencies: [

                .product(name: "TensorRT", package: "tensorrt-swift"),

            ]

        ),

    ]

)

```

To use the LLM extension module for streaming inference and other LLM utilities:

```swift

.product(name: "TensorRTLLM", package: "tensorrt-swift")

```

### Query GPU and TensorRT version

```swift

import TensorRT

// Check TensorRT version

let version = try TensorRTRuntimeProbe.inferRuntimeVersion()

print("TensorRT version: \(version)")

// Check GPU

let props = try TensorRTSystem.deviceProperties()

print("GPU: \(props.name)")

print("Compute Capability: \(props.computeCapability)")

print("Memory: \(props.totalMemory / 1_000_000_000) GB")

let mem = try TensorRTSystem.memoryInfo()

print("Free: \(mem.free / 1_000_000_000) GB / \(mem.total / 1_000_000_000) GB")

```

### Build an engine from ONNX and run inference

```swift

import TensorRT

let runtime = TensorRTRuntime()

let engine = try runtime.buildEngine(

    onnxURL: URL(fileURLWithPath: "model.onnx"),

    options: EngineBuildOptions(

        precision: [.fp32],

        workspaceSizeBytes: 1 << 28

    )

)

// Save for later use (avoid rebuild)

try engine.save(to: URL(fileURLWithPath: "model.engine"))

let ctx = try engine.makeExecutionContext()

// Warmup for stable latency

let warmup = try await ctx.warmup(iterations: 10)

print("Warmup avg: \(warmup.average ?? .zero)")

// Run inference

let inputDesc = engine.description.inputs[0].descriptor

let input: [Float] = (0..`.

The wrapper keeps build artifacts in `/tmp` by default; override with `SWIFT_BUILD_PATH` if needed.

### Beginner Examples

| Example | Description | Command |

|---------|-------------|---------|

| **HelloTensorRT** | Minimal "hello world" - probe version, build identity engine, run inference | `./scripts/swiftw run HelloTensorRT` |

| **ONNXInference** | Load ONNX model, build engine, run inference with throughput measurement | `./scripts/swiftw run ONNXInference` |

| **BatchProcessing** | Process multiple batches, latency statistics (p50/p95/p99) | `./scripts/swiftw run BatchProcessing` |

### Intermediate Examples

| Example | Description | Command |

|---------|-------------|---------|

| **DynamicBatching** | Dynamic shapes for variable batch sizes at runtime | `./scripts/swiftw run DynamicBatching` |

| **MultiProfile** | Multiple optimization profiles for different workloads | `./scripts/swiftw run MultiProfile` |

| **AsyncInference** | Non-blocking inference with CUDA streams and events | `./scripts/swiftw run AsyncInference` |

| **ImageClassifier** | End-to-end pipeline: preprocess → inference → postprocess | `./scripts/swiftw run ImageClassifier` |

| **DeviceMemoryPipeline** | Keep tensors on GPU, avoid H2D/D2H transfers | `./scripts/swiftw run DeviceMemoryPipeline` |

### LLM Examples (TensorRTLLM)

LLM examples live under `ExamplesLLM/`.

| Example | Description | Command |

|---------|-------------|---------|

| **StreamingLLM** | Token-by-token generation with KV-cache pattern | `./scripts/swiftw run StreamingLLM` |

### Advanced Examples

| Example | Description | Command |

|---------|-------------|---------|

| **MultiGPU** | Distribute inference across multiple GPUs | `./scripts/swiftw run MultiGPU` |

| **CUDAEventPipelining** | Overlap compute with data transfer using events | `./scripts/swiftw run CUDAEventPipelining` |

| **BenchmarkSuite** | Comprehensive throughput/latency measurement | `./scripts/swiftw run BenchmarkSuite` |

| **FP16Quantization** | Compare FP32 vs FP16 precision and performance | `./scripts/swiftw run FP16Quantization` |

### Real-World Examples

| Example | Description | Command |

|---------|-------------|---------|

| **TextEmbedding** | Sentence transformer for semantic search | `./scripts/swiftw run TextEmbedding` |

| **ObjectDetection** | YOLO-style detection with NMS postprocessing | `./scripts/swiftw run ObjectDetection` |

| **WhisperTranscription** | Audio transcription pipeline (encoder pattern) | `./scripts/swiftw run WhisperTranscription` |

| **VisionTransformer** | ViT image classification with patch embeddings | `./scripts/swiftw run VisionTransformer` |

### Example Output: BenchmarkSuite

```

=== TensorRT Benchmark Suite ===

┌──────────┬────────────┬────────────┬────────────┬────────────┐

│ Elements │ Throughput │ p50        │ p95        │ p99        │

├──────────┼────────────┼────────────┼────────────┼────────────┤

│ 64       │ 91.0K      │ 10.4 µs    │ 12.5 µs    │ 22.8 µs    │

│ 1024     │ 75.5K      │ 11.5 µs    │ 22.1 µs    │ 23.1 µs    │

│ 16384    │ 31.3K      │ 31.8 µs    │ 33.2 µs    │ 37.1 µs    │

└──────────┴────────────┴────────────┴────────────┴────────────┘

```

## Tests

Run:

```bash

./scripts/swiftw test

```

This wrapper keeps build artifacts in `/tmp` by default to avoid `.build` permission issues. Override with

`SWIFT_BUILD_PATH=/your/path ./scripts/swiftw test` if needed.

The test suite includes end-to-end GPU tests that build engines (TensorRT builder and `nvonnxparser`),

deserialize them, and run inference (host buffers, device pointers, external streams, and CUDA events).

## Troubleshooting

### `libnvinfer.so: cannot open shared object file`

TensorRT libraries are not in your library path. Add them:

```bash

export LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu:$LD_LIBRARY_PATH

# or wherever TensorRT is installed

```

### `CUDA driver version is insufficient`

Your NVIDIA driver is too old for CUDA 12.6. Update your driver:

```bash

sudo apt-get install nvidia-driver-550  # or newer

```

### Swift can't find CUDA headers

Ensure CUDA is installed and the include path is correct:

```bash

ls /usr/local/cuda/include/cuda.h

# If not found, create symlink or adjust Package.swift

```

## License

See [LICENSE.txt](LICENSE.txt).
ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/wendylabsinc/tensorrt-swift

Awesome Lists containing this project

README