https://github.com/rdcm/triton-ng

Rust SDK for writing custom backends for NVIDIA Triton Inference Server

> **WIP** — work in progress, API is unstable

# triton-ng

Rust SDK for [NVIDIA Triton Inference Server](https://github.com/triton-inference-server/server).

Provides two things:
- A safe Rust API for writing **custom Triton backends** (compiled as `.so` and loaded by Triton)
- A high-level async **gRPC client** for sending inference requests to a running Triton server

## Crates

| Crate | Description |
|---|---|
| `triton-ng-sys` | Raw FFI bindings generated by bindgen from `tritonbackend.h` |
| `triton-ng` | Safe Rust wrapper over `triton-ng-sys` |
| `triton-ng-macros` | Proc-macros for `triton-ng` |
| `triton-ng-client` | High-level async gRPC client |
| `example/custom-backend` | Example custom backend (MNIST, proxies to ONNX model) |
| `example/app` | Example client application |

## Writing a custom backend

Implement the `Backend` trait and register it with `declare_backend!`:

```rust
use triton_ng::backend::Backend;
use triton_ng::{BackendHandle, DataType, Error, InferenceRequest, Response};

struct MyBackend;

impl Backend for MyBackend {
    // Backend-level setup hook; nothing to do in this example.
    fn initialize(backend: &BackendHandle) -> Result<(), Error> {
        Ok(())
    }

    // Handles a batch of requests for one model instance.
    fn model_instance_execute(
        model: triton_ng::Model,
        requests: &[triton_ng::Request],
    ) -> Result<(), Error> {
        for request in requests {
            let input = request.get_input("INPUT")?;
            let data = input.as_fp32_vec()?;

            // ... run inference on `data` ...
            // Placeholder output so the example is complete: 10 values for shape [1, 10].
            let result = vec![0.0f32; 10];

            let mut response = Response::new(request)?;
            response
                .create_output("OUTPUT", DataType::Fp32, &[1, 10])?
                .write_fp32_vec(&result)?;
            response.send()?;
        }
        Ok(())
    }
}

triton_ng::declare_backend!(MyBackend);
```
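
To make the tensor handling above more concrete, here is a minimal, std-only sketch of what converting between a raw FP32 buffer and a `Vec<f32>` can look like. It is not the crate's implementation: `bytes_to_f32` and `f32_to_bytes` are hypothetical helper names, and the sketch assumes the buffer holds tightly packed values in host-native byte order.

```rust
/// Hypothetical helper: decode a raw buffer of packed FP32 values.
/// Assumes host-native byte order. Returns `None` if the buffer length
/// is not a multiple of 4 bytes.
fn bytes_to_f32(bytes: &[u8]) -> Option<Vec<f32>> {
    if bytes.len() % 4 != 0 {
        return None;
    }
    Some(
        bytes
            .chunks_exact(4)
            .map(|c| f32::from_ne_bytes([c[0], c[1], c[2], c[3]]))
            .collect(),
    )
}

/// Hypothetical helper: encode FP32 values into a packed byte buffer
/// in host-native byte order.
fn f32_to_bytes(values: &[f32]) -> Vec<u8> {
    values.iter().flat_map(|v| v.to_ne_bytes()).collect()
}
```

Encoding a slice with `f32_to_bytes` and decoding the result with `bytes_to_f32` returns the original values, while a buffer whose length is not a multiple of 4 is rejected with `None` instead of being silently truncated.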

Build the crate as a `cdylib` so that it produces a shared library (`.so`) Triton can load:

```toml
# Cargo.toml
[lib]
crate-type = ["cdylib"]
```

## Using the gRPC client

```rust
use triton_ng_client::{InferInput, InferOptions, TritonClient, TritonClientConfig};

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    let client = TritonClient::new(TritonClientConfig::new("http://localhost:8001")).await?;

    // Read the input shape from the model metadata and compute the element count.
    // This assumes every dimension is fixed; a dynamic dimension (-1) would not
    // give a meaningful product here.
    let meta = client.model_metadata("my_model", None, None).await?;
    let n: usize = meta.inputs[0].shape.iter().map(|&d| d as usize).product();

    // Send one all-zero FP32 tensor and request the "OUTPUT" tensor back.
    let response = client
        .infer(
            "my_model",
            None,
            [InferInput::fp32("INPUT", meta.inputs[0].shape.clone(), vec![0.0f32; n])],
            ["OUTPUT"],
            InferOptions::default(),
        )
        .await?;

    println!("{:?}", response.outputs[0].data);
    Ok(())
}
```
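
The element-count computation in the example casts each dimension to `usize`, which only works when every dimension is fixed. A slightly more defensive version could look like the sketch below. It uses only the standard library; `element_count` and `resolve_dynamic` are hypothetical helpers, and the `i64` dimension type is an assumption based on the cast in the example.

```rust
use std::convert::TryFrom;

/// Hypothetical helper: number of elements in a fully specified shape.
/// Returns `None` if any dimension is negative (dynamic, e.g. -1)
/// or if the product overflows `usize`.
fn element_count(shape: &[i64]) -> Option<usize> {
    shape.iter().try_fold(1usize, |acc, &d| {
        let d = usize::try_from(d).ok()?;
        acc.checked_mul(d)
    })
}

/// Hypothetical helper: replace every dynamic (negative) dimension
/// with a concrete size, such as the batch size you intend to send.
fn resolve_dynamic(shape: &[i64], fill: i64) -> Vec<i64> {
    shape.iter().map(|&d| if d < 0 { fill } else { d }).collect()
}
```

For a shape such as `[-1, 10]`, `element_count` returns `None`; calling `resolve_dynamic(&[-1, 10], 1)` first yields `[1, 10]`, for which the count is 10.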

To connect over TLS:

```rust
use triton_ng_client::{ClientTlsConfig, TritonClientConfig};

let config = TritonClientConfig::new("https://triton.example.com:8001")
    .with_tls(ClientTlsConfig::new()); // uses system roots
```

## Getting started

### Prerequisites

- Rust stable
- NVIDIA driver 570+ (580+ for Blackwell / RTX 50xx)
- [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html)
- Docker

### First run

```bash
git submodule update --init --recursive
make build # compile custom backend → target/release/libtriton_custom_backend.so
make download-model # download mnist_onnx + create model version dirs
make docker-env-up # start Triton (mounts .so and models/)
```

### Run the example app

```bash
cargo run --manifest-path=example/app/Cargo.toml --release
```

Triton must be running with both models (the custom-backend model and the ONNX model it proxies to) in the READY state.
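
Since the example models classify MNIST digits, a client will typically want to turn the output scores into a predicted digit. The sketch below shows one way to do that with the standard library only. It assumes the output is a flat slice of 10 `f32` scores, one per digit; `argmax` is a hypothetical helper and is not part of the example app as described here.

```rust
/// Hypothetical helper: index of the largest score, or `None` for an
/// empty slice. `total_cmp` gives a total order over `f32`, so the
/// comparison never panics, even if a score is NaN.
fn argmax(scores: &[f32]) -> Option<usize> {
    scores
        .iter()
        .enumerate()
        .max_by(|a, b| a.1.total_cmp(b.1))
        .map(|(i, _)| i)
}
```

With scores `[0.1, 0.7, 0.2]`, `argmax` returns `Some(1)`; applied to a 10-element MNIST output, the returned index is the predicted digit.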

### Run integration tests

```bash
make tests # cargo nextest run --workspace
```

Tests require a running Triton instance (`make docker-env-up`).

### Rebuild after backend changes

```bash
make build
make docker-env-down && make docker-env-up
```

## Features

| Feature | Description |
|---|---|
| `cuda` | Enable GPU and pinned memory allocation in `ResponseAllocator` |

```toml
# Cargo.toml
[dependencies]
triton-ng = { version = "0.1", features = ["cuda"] }
```

## License

MIT