An open API service indexing awesome lists of open source software.

https://github.com/ohdearquant/lattice

pure rust inference engine
https://github.com/ohdearquant/lattice

Last synced: 26 days ago
JSON representation

pure rust inference engine

Awesome Lists containing this project

README

          

# Lattice

Pure Rust inference engine for transformer models on Apple Silicon.

[![License](https://img.shields.io/badge/license-Apache--2.0-blue.svg)](LICENSE)
[![Crates.io](https://img.shields.io/crates/v/lattice-embed.svg)](https://crates.io/crates/lattice-embed)
[![CI](https://github.com/ohdearquant/lattice/actions/workflows/ci.yml/badge.svg)](https://github.com/ohdearquant/lattice/actions)

No ONNX. No Python. No CUDA. No external ML runtime. Lattice implements the full compute graph
— weight loading, tokenization, forward pass, and vector operations — in Rust, with hand-written
Metal shaders and SIMD kernels.

![Benchmark: Lattice vs Ollama — fair end-to-end decode on Qwen3.5-0.8B](docs/bench_results/lattice_benchmark.png)

**Lattice is the only inference engine that correctly runs Qwen3.5's hybrid GatedDeltaNet architecture at 4-bit on Apple Silicon** — with QuaRot-Q4 and LoRA hot-swap that neither Ollama nor MLX support. On a fair end-to-end decode measurement it is **~1.9× faster than Ollama/llama.cpp**. Apple's MLX (Metal-native) decodes faster than Lattice at raw throughput — Lattice's edge is portability (pure Rust, no Python/framework) plus those Q4 + adapter capabilities. Full table and methodology below.

```bash
# Reproduce on your hardware (macOS + ollama + uv):
./scripts/bench_apples_to_apples.sh
```

Built for inference on CPU and macOS GPU. Optimized for AVX2 (x86), NEON (ARM),
and Metal (Apple Silicon) — not CUDA. If you need NVIDIA GPU inference, use
[candle](https://github.com/huggingface/candle) or [mistral.rs](https://github.com/EricLBuehler/mistral.rs).
Lattice targets the other 90% of deployments: servers, edge, laptops, and library dependencies
that shouldn't drag in a 300 MB ONNX runtime.

---

## Quick Start

```toml
[dependencies]
lattice-embed = "0.1"
tokio = { version = "1", features = ["full"] }
```

```rust
use lattice_embed::{EmbeddingService, EmbeddingModel, NativeEmbeddingService};

#[tokio::main]
async fn main() -> Result<(), Box> {
let service = NativeEmbeddingService::default();

// Single embedding (BGE-small-en-v1.5, 384 dimensions)
let embedding = service
.embed_one("The quick brown fox jumps over the lazy dog", EmbeddingModel::default())
.await?;

println!("Dimensions: {}", embedding.len()); // 384

// Batch
let texts = vec![
"First document".to_string(),
"Second document".to_string(),
];
let embeddings = service.embed(&texts, EmbeddingModel::BgeSmallEnV15).await?;

// SIMD-accelerated similarity
let similarity = lattice_embed::utils::cosine_similarity(&embeddings[0], &embeddings[1]);
println!("Similarity: {:.4}", similarity);

Ok(())
}
```

Model weights are downloaded from HuggingFace on first use and cached at `~/.lattice/models`
(or `$LATTICE_MODEL_CACHE`).

---

## Features

| Feature | Description |
| ----------------------------- | -------------------------------------------------------------------------------------------- |
| Pure Rust compute | Hand-written SIMD kernels (AVX2/NEON). No C++, no ONNX, no CUDA. |
| Two transformer architectures | BERT/BGE encoder-only (mean pooling) and Qwen3 decoder-only (causal GQA, last-token pooling) |
| 9 supported models | BGE, mE5, MiniLM, Qwen3-Embedding families — see table below |
| Metal GPU backend | Native Apple Silicon acceleration via Metal MSL shaders. WGPU fallback for cross-platform. |
| Three pure Rust tokenizers | WordPiece, SentencePiece, BPE — no Hugging Face tokenizers C extension |
| Safetensors native | Memory-mapped weight loading from HuggingFace `.safetensors` format |
| MRL support | Matryoshka truncation for Qwen3 models (configurable output dimension >= 32) |
| LRU embedding cache | `CachedEmbeddingService` with sharded in-memory cache and hit/miss stats |
| LoRA adapter injection | Inject trained LoRA adapters into inference models at runtime |
| Knowledge distillation | Train small models from Claude/GPT/Gemini teacher soft labels |
| Optimal transport | Sinkhorn-Knopp solver (log-domain, epsilon-scaling) for embedding drift detection |
| Tiny fast networks | `lattice-fann`: sub-5ms classifiers with pre-allocated buffers, zero-alloc forward pass |

---

## Architecture

```
Application
|
v
lattice-embed (public API — embedding service, SIMD distance ops, LRU cache)
|
v
lattice-inference (transformer kernel — BERT/Qwen3 forward pass, tokenizers, weights)
|
+---> CPU (primary) Metal (macOS) WGPU (fallback)
AVX2/NEON kernels Apple Silicon Vulkan/DX12

lattice-fann (standalone — tiny network primitives, <5ms CPU inference)
lattice-transport (standalone — optimal transport math, Wasserstein distances)
lattice-tune (depends on fann + inference — LoRA, distillation, model registry)
```

The three leaf crates (`inference`, `fann`, `transport`) have zero intra-workspace dependencies
and can be used standalone.

---

## Crates

| Crate | Description | LOC |
| ---------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | ------ |
| [`lattice-embed`](crates/embed/) | Embedding service — `EmbeddingService` trait, `NativeEmbeddingService`, `CachedEmbeddingService`, SIMD cosine/dot/euclidean, backfill, migration | ~14 k |
| [`lattice-inference`](crates/inference/) | Transformer kernel — safetensors loading, BERT/BGE/Qwen3 forward pass, WordPiece/SentencePiece/BPE tokenizers, Metal/WGPU backends, LoRA hooks, KV cache, speculative decoding | ~69 k |
| [`lattice-fann`](crates/fann/) | Fast neural network primitives — `NetworkBuilder`, pre-allocated layers, zero-alloc forward pass, backprop trainer, FANN binary format | ~7.5 k |
| [`lattice-tune`](crates/tune/) | Training infrastructure — knowledge distillation pipeline, dataset management, LoRA adapter management, model registry with semver lineage | ~13 k |
| [`lattice-transport`](crates/transport/) | Optimal transport math — Sinkhorn-Knopp (balanced + unbalanced), Wasserstein barycenters, embedding drift detection, log-domain throughout | ~5.3 k |

---

## Supported Models

All local models load from HuggingFace safetensors format.

| Model | Variant | Architecture | Dimensions | Max Tokens | Tokenizer |
| ----------------------------------- | ----------------------------------------------------------- | ------------------ | ---------- | ---------- | ------------- |
| `BgeSmallEnV15` | BAAI/bge-small-en-v1.5 | BERT encoder | 384 | 512 | WordPiece |
| `BgeBaseEnV15` | BAAI/bge-base-en-v1.5 | BERT encoder | 768 | 512 | WordPiece |
| `BgeLargeEnV15` | BAAI/bge-large-en-v1.5 | BERT encoder | 1024 | 512 | WordPiece |
| `MultilingualE5Small` | intfloat/multilingual-e5-small | BERT encoder | 384 | 512 | SentencePiece |
| `MultilingualE5Base` | intfloat/multilingual-e5-base | BERT encoder | 768 | 512 | SentencePiece |
| `AllMiniLmL6V2` | sentence-transformers/all-MiniLM-L6-v2 | BERT encoder | 384 | 256 | WordPiece |
| `ParaphraseMultilingualMiniLmL12V2` | sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 | BERT encoder | 384 | 128 | WordPiece |
| `Qwen3Embedding0_6B` | Qwen/Qwen3-Embedding-0.6B | Decoder (GQA+RoPE) | 1024 | 8192 | BPE |
| `Qwen3Embedding4B` | Qwen/Qwen3-Embedding-4B | Decoder (GQA+RoPE) | 2560* | 8192 | BPE |

*Qwen3-Embedding-4B supports MRL truncation to any dimension >= 32.

E5 models expect asymmetric `"query: "` / `"passage: "` prefixes for retrieval. Qwen3 models
expect an instruction prefix. The `EmbeddingModel` enum exposes `query_instruction()` and
`document_instruction()` so callers can apply these correctly.

---

## Selecting a Model

```rust
use lattice_embed::EmbeddingModel;

// Fast general-purpose English retrieval
let model = EmbeddingModel::BgeSmallEnV15; // 384-dim, fastest

// Balanced quality/speed
let model = EmbeddingModel::BgeBaseEnV15; // 768-dim

// Best quality, English
let model = EmbeddingModel::BgeLargeEnV15; // 1024-dim

// Multilingual retrieval
let model = EmbeddingModel::MultilingualE5Base; // 768-dim, 100+ languages

// Long context + multilingual (requires GPU for practical throughput)
let model = EmbeddingModel::Qwen3Embedding0_6B; // 1024-dim, 8K context

// MRL: variable output dimension
use lattice_embed::ModelConfig;
let config = ModelConfig::try_new(EmbeddingModel::Qwen3Embedding4B, Some(512))?;
```

---

## Feature Flags

### lattice-embed

| Feature | Default | Description |
| ----------- | ------- | ------------------------------------------- |
| `native` | yes | Pure Rust inference via `lattice-inference` |
| `metal-gpu` | no | Metal GPU acceleration (macOS) |
| `avx512` | no | AVX-512 SIMD kernels (requires nightly) |

### lattice-inference

| Feature | Default | Description |
| ----------- | ------- | ------------------------------------------------------ |
| `f16` | no | Half-precision weights |
| `metal-gpu` | no | Metal compute backend |
| `wgpu-gpu` | no | WGPU cross-platform GPU backend |
| `download` | yes | HuggingFace weight download with checksum verification |
| `backfill` | no | Re-embedding coordinator (requires `rusqlite`) |

```toml
# GPU acceleration on macOS
lattice-embed = { version = "0.1", features = ["metal-gpu"] }

# Cross-platform GPU
lattice-embed = { version = "0.1", features = ["wgpu-gpu"] }
```

---

## Vector Operations

`lattice-embed` exposes SIMD-accelerated vector utilities as a stable public API:

```rust
use lattice_embed::utils;

// Runtime dispatch: AVX2 on x86_64, NEON on aarch64, scalar fallback elsewhere
let sim = utils::cosine_similarity(&a, &b);
let dot = utils::dot_product(&a, &b);
let dist = utils::euclidean_distance(&a, &b);

utils::normalize(&mut vector); // in-place L2 normalization

// Batch operations
let sims = utils::batch_cosine_similarity(&pairs);
```

Measured performance on normalized 384-dim vectors (internal benchmarks, subject to hardware):

| Operation | Scalar | SIMD |
| ---------------------------- | -------- | ------- |
| cosine similarity (384-dim) | ~650 ns | ~90 ns |
| cosine similarity (768-dim) | ~1300 ns | ~180 ns |
| cosine similarity (1024-dim) | ~1700 ns | ~240 ns |

---

## lattice-fann: Fast Neural Networks

For tiny classifiers that need to run in under 5 ms on CPU:

```rust
use lattice_fann::{NetworkBuilder, Activation, BackpropTrainer, TrainingConfig, Trainer};

// Build a network
let mut network = NetworkBuilder::new()
.input(784)
.hidden(128, Activation::ReLU)
.hidden(64, Activation::ReLU)
.output(10, Activation::Softmax)
.build()?;

println!("{}", network.architecture()); // "784 -> ReLU(128) -> ReLU(64) -> Softmax(10)"
println!("Parameters: {}", network.total_params());

// Forward pass (no heap allocation)
let output = network.forward(&input)?;

// Serialize to compact binary (magic "FANN")
let bytes = network.to_bytes();
let restored = lattice_fann::Network::from_bytes(&bytes)?;
```

---

## lattice-transport: Optimal Transport

Entropy-regularized optimal transport for measuring embedding geometry drift:

```rust
// Sinkhorn-Knopp in log-domain (numerically stable, no Gibbs kernel materialization)
// Balanced OT, unbalanced OT (KL-relaxed), Wasserstein barycenters
// Pre-allocated SinkhornWorkspace for zero-alloc inner loops
```

Primary use case: detect when an embedding model update has shifted the distribution of stored
vectors enough to warrant re-indexing.

---

## Benchmarks

### Qwen3.5-0.8B Decode Throughput (Apple M2 Max)

Fair end-to-end measurement — **slope method**: `tok/s = (N₂−N₁) / (T(N₂)−T(N₁))` for a fixed
prompt, so prompt prefill, model load, and per-call overhead cancel and every engine is measured
the *same* way (greedy, median of 5 runs, N₁=32, N₂=256).

| Engine | Quant | decode tok/s | Notes |
|--------|-------|--------------|-------|
| MLX | Q8 (g64) | **247** | Apple's Metal-native framework — fastest |
| **Lattice** | f16 | **157** | Pure Rust + Metal; ~1.9× Ollama |
| Ollama | Q8_0 | **84** | llama.cpp Metal backend |

Cross-check: Ollama's slope (84) matches its own `eval_duration` rate (87), confirming the
methodology is sound.

**Honest caveats.** Lattice's Qwen3.5 decode path is **f16** (`MetalQwen35State::new`); MLX is Q8
and Ollama is Q8_0, so this is not quant-matched. There is **no end-to-end Q8 path** for Qwen3.5
in Lattice — the earlier "Q8 139" number was an f16 `forward_step` *micro-benchmark* mislabeled
"Q8" (the bench's `load_q8_state` builds the same f16 state as the f16 path). 157 tok/s (f16,
true e2e) is Lattice's actual decode rate for this model; the only other real e2e path is **Q4**
(`from_q4_dir`). **MLX decodes faster than Lattice.** Lattice's value is portability plus
capabilities no other engine has on this model:

| Lattice-only capability | MLX | Ollama |
|---|---|---|
| QuaRot 4-bit (rotated quant) | ✗ | ✗ |
| Q4 + LoRA r8 hot-swap (no reload) | ✗ | ✗ |
| Pure Rust, zero Python / framework | ✗ | ✗ |

A previous version of this table claimed "+8% vs MLX"; that was a measurement artifact (a
criterion `forward_step` micro-bench compared against MLX end-to-end with prefill counted in the
decode rate) and has been corrected. All three engines implement the full GDN recurrence for
Qwen3.5's hybrid architecture (18 GatedDeltaNet + 6 GQA layers). Reproducible via
`./scripts/bench_apples_to_apples.sh`.

### Embedding & Kernel Benchmarks

```bash
# Embedding throughput
cargo bench --package lattice-embed

# Metal GPU decode (macOS only, requires model weights)
cargo bench -p lattice-inference --features metal-gpu,f16 -- metal_decode

# Attention kernel
cargo bench --package lattice-inference --bench attention_bench
```

Performance depends on hardware, model size, batch size, and sequence length. Run the benchmarks
on your target hardware to get representative numbers.

---

## Documentation

- [Architecture](docs/architecture.md) — crate dependency graph, design decisions, stability tiers
- ADR directory: `docs/adr/` — architectural decision records

---

## License

Apache-2.0. See [LICENSE](LICENSE).

---

Built by [Ocean (HaiyangLi)](https://github.com/ohdearquant). Powers
[khive](https://khive.ai), a cognitive infrastructure for AI agents.