https://github.com/trymirai/uzu
A high-performance inference engine for AI models
- Host: GitHub
- URL: https://github.com/trymirai/uzu
- Owner: trymirai
- License: MIT
- Created: 2025-06-23T21:55:11.000Z (4 months ago)
- Default Branch: main
- Last Pushed: 2025-08-05T03:33:37.000Z (2 months ago)
- Last Synced: 2025-08-05T05:24:40.730Z (2 months ago)
- Topics: ai, high-performance, inference, llm, metal, rust
- Language: Rust
- Homepage: https://trymirai.com
- Size: 215 KB
- Stars: 1,174
- Watchers: 8
- Forks: 27
- Open Issues: 2
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- awesome-repositories - trymirai/uzu - A high-performance inference engine for AI models (Rust)
README
# uzu
A high-performance inference engine for AI models on Apple Silicon. Key features:
- Simple, high-level API
- [Hybrid architecture](https://docs.trymirai.com/components/inference-engine#before-we-start), where layers can be computed as GPU kernels or via MPSGraph (a low-level API beneath CoreML with [ANE](https://trymirai.com/blog/iphone-hardware) access)
- Unified model configurations, making it easy to add support for new models
- Traceable computations to ensure correctness against the source-of-truth implementation
- Utilizes unified memory on Apple devices

## Overview
For a detailed explanation of the architecture, please refer to the [documentation](https://docs.trymirai.com/components/inference-engine).
### [Models](https://trymirai.com/models)
`uzu` uses its own model format. To export a specific model, use [lalamo](https://github.com/trymirai/lalamo). First, get the list of supported models:
```bash
uv run lalamo list-models
```

Then, export the specific one:
```bash
uv run lalamo convert meta-llama/Llama-3.2-1B-Instruct --precision float16
```

Alternatively, you can download a prepared model using the sample script:
```bash
./scripts/download_test_model.sh $MODEL_PATH
```
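For example, with a hypothetical destination directory (the name is illustrative), the resulting path is what both the CLI and the `Session` API below expect as the model path:

```bash
# Hypothetical destination; the script fetches a prepared test model into it
MODEL_PATH="models/test-model"
./scripts/download_test_model.sh "$MODEL_PATH"
```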
### Bindings
- [uzu-swift](https://github.com/trymirai/uzu-swift) - a prebuilt Swift framework, ready to use with SPM
### CLI
You can run `uzu` in [CLI](https://docs.trymirai.com/components/cli) mode:
```bash
cargo run --release -p cli -- help
```

```bash
Usage: uzu_cli [COMMAND]
Commands:
  run    Run a model with the specified path
  serve  Start a server with the specified model path
  help   Print this message or the help of the given subcommand(s)
```
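Based on the help output above, a typical invocation passes the model path to the `run` or `serve` subcommand. Treat this as a sketch; the exact arguments may differ:

```bash
# Run a model interactively (MODEL_PATH points to an exported model)
cargo run --release -p cli -- run "$MODEL_PATH"

# Or start a local server backed by the same model
cargo run --release -p cli -- serve "$MODEL_PATH"
```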
## Quick Start

First, add the `uzu` dependency to your `Cargo.toml`:
```toml
[dependencies]
uzu = { git = "https://github.com/trymirai/uzu", branch = "main", package = "uzu" }
```

Then, create an inference `Session` with a specific model and configuration:
```rust
use std::path::PathBuf;
use uzu::{
    backends::metal::sampling_config::SamplingConfig,
    session::{
        session::Session, session_config::SessionConfig,
        session_input::SessionInput, session_output::SessionOutput,
        session_run_config::SessionRunConfig,
    },
};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Load the model and apply the default session configuration
    let model_path = PathBuf::from("MODEL_PATH");
    let mut session = Session::new(model_path.clone())?;
    session.load_with_session_config(SessionConfig::default())?;

    let input = SessionInput::Text("Tell about London".to_string());

    // Cap generation at 128 tokens with default sampling settings
    let tokens_limit = 128;
    let run_config = SessionRunConfig::new_with_sampling_config(
        tokens_limit,
        Some(SamplingConfig::default()),
    );

    // The progress callback returns `true` to continue generation
    let output =
        session.run(input, run_config, Some(|_: SessionOutput| true))?;
    println!("{}", output.text);
    Ok(())
}
```
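The callback in the example above ignores intermediate results. Below is a minimal streaming sketch, assuming the callback receives a `SessionOutput` whose `text` field holds the generation so far (mirroring the final `output.text`) and that returning `false` stops generation early; both details are assumptions, not confirmed API behavior:

```rust
// Sketch: stream partial output from the progress callback.
let output = session.run(
    input,
    run_config,
    Some(|partial: SessionOutput| {
        // Assumption: `partial.text` is the text generated so far
        print!("\r{}", partial.text);
        true // continue; `false` is assumed to cancel generation
    }),
)?;
```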
## Benchmarks

Here are the performance metrics for various models:
| `Apple M2`, `tokens/s` | Llama-3.2-1B-Instruct | Qwen2.5-1.5B-Instruct | Qwen3-0.6B | Qwen3-4B | R1-Distill-Qwen-1.5B | SmolLM2-1.7B-Instruct | Gemma-3-1B-Instruct |
| ---------------------- | --------------------- | --------------------- | ---------- | -------- | -------------------- | --------------------- | ------------------- |
| `uzu` | 35.17 | 28.32 | 68.9 | 11.28 | 20.47 | 25.01 | 41.50 |
| `llama.cpp`            | 32.48                 | 25.85                 | 5.37       | 1.08     | 2.81                 | 23.74                 | 37.68               |

> Note that all performance comparisons were done using bf16/f16 precision. Comparing quantized models isn't entirely fair, as different engines use different quantization approaches. For running llama.cpp, we used LM Studio (v0.3.17, Metal llama.cpp runtime v1.39.0). It's also worth mentioning that using the `release` build profile is crucial for obtaining the most accurate performance metrics.
## License
This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details.