{"id":49901736,"url":"https://github.com/ohdearquant/lattice","last_synced_at":"2026-05-25T23:03:07.697Z","repository":{"id":357652082,"uuid":"1237862355","full_name":"ohdearquant/lattice","owner":"ohdearquant","description":"pure rust inference engine","archived":false,"fork":false,"pushed_at":"2026-05-17T01:34:50.000Z","size":2013,"stargazers_count":12,"open_issues_count":13,"forks_count":2,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-05-17T08:02:48.676Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ohdearquant.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"docs/CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":"AGENTS.md","dco":null,"cla":null}},"created_at":"2026-05-13T15:22:45.000Z","updated_at":"2026-05-16T23:06:39.000Z","dependencies_parsed_at":null,"dependency_job_id":"3582a6c4-0bd3-4afa-abad-99df745a9eb1","html_url":"https://github.com/ohdearquant/lattice","commit_stats":null,"previous_names":["ohdearquant/lattice"],"tags_count":5,"template":false,"template_full_name":null,"purl":"pkg:github/ohdearquant/lattice","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ohdearquant%2Flattice","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ohdearquant%2Flattice/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ohdearquant%2Flattice/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ohdearquant%2Flattice/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ohdearquant","download_url":"https://codeload.github.com/ohdearquant/lattice/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ohdearquant%2Flattice/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33172173,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-18T05:43:36.989Z","status":"ssl_error","status_checked_at":"2026-05-18T05:43:19.133Z","response_time":71,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2026-05-16T07:15:28.341Z","updated_at":"2026-05-25T23:03:07.690Z","avatar_url":"https://github.com/ohdearquant.png","language":"Rust","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Lattice\n\nPure Rust inference engine for transformer models on Apple Silicon.\n\n[![License](https://img.shields.io/badge/license-Apache--2.0-blue.svg)](LICENSE)\n[![Crates.io](https://img.shields.io/crates/v/lattice-embed.svg)](https://crates.io/crates/lattice-embed)\n[![CI](https://github.com/ohdearquant/lattice/actions/workflows/ci.yml/badge.svg)](https://github.com/ohdearquant/lattice/actions)\n\nNo ONNX. No Python. No CUDA. No external ML runtime. Lattice implements the full compute graph\n— weight loading, tokenization, forward pass, and vector operations — in Rust, with hand-written\nMetal shaders and SIMD kernels.\n\n![Benchmark: Lattice vs Ollama — fair end-to-end decode on Qwen3.5-0.8B](docs/bench_results/lattice_benchmark.png)\n\n**Lattice is the only inference engine that correctly runs Qwen3.5's hybrid GatedDeltaNet architecture at 4-bit on Apple Silicon** — with QuaRot-Q4 and LoRA hot-swap that neither Ollama nor MLX support. On a fair end-to-end decode measurement it is **~1.9× faster than Ollama/llama.cpp**. Apple's MLX (Metal-native) decodes faster than Lattice at raw throughput — Lattice's edge is portability (pure Rust, no Python/framework) plus those Q4 + adapter capabilities. Full table and methodology below.\n\n```bash\n# Reproduce on your hardware (macOS + ollama + uv):\n./scripts/bench_apples_to_apples.sh\n```\n\nBuilt for inference on CPU and macOS GPU. Optimized for AVX2 (x86), NEON (ARM),\nand Metal (Apple Silicon) — not CUDA. If you need NVIDIA GPU inference, use\n[candle](https://github.com/huggingface/candle) or [mistral.rs](https://github.com/EricLBuehler/mistral.rs).\nLattice targets the other 90% of deployments: servers, edge, laptops, and library dependencies\nthat shouldn't drag in a 300 MB ONNX runtime.\n\n---\n\n## Quick Start\n\n```toml\n[dependencies]\nlattice-embed = \"0.1\"\ntokio = { version = \"1\", features = [\"full\"] }\n```\n\n```rust\nuse lattice_embed::{EmbeddingService, EmbeddingModel, NativeEmbeddingService};\n\n#[tokio::main]\nasync fn main() -\u003e Result\u003c(), Box\u003cdyn std::error::Error\u003e\u003e {\n    let service = NativeEmbeddingService::default();\n\n    // Single embedding (BGE-small-en-v1.5, 384 dimensions)\n    let embedding = service\n        .embed_one(\"The quick brown fox jumps over the lazy dog\", EmbeddingModel::default())\n        .await?;\n\n    println!(\"Dimensions: {}\", embedding.len()); // 384\n\n    // Batch\n    let texts = vec![\n        \"First document\".to_string(),\n        \"Second document\".to_string(),\n    ];\n    let embeddings = service.embed(\u0026texts, EmbeddingModel::BgeSmallEnV15).await?;\n\n    // SIMD-accelerated similarity\n    let similarity = lattice_embed::utils::cosine_similarity(\u0026embeddings[0], \u0026embeddings[1]);\n    println!(\"Similarity: {:.4}\", similarity);\n\n    Ok(())\n}\n```\n\nModel weights are downloaded from HuggingFace on first use and cached at `~/.lattice/models`\n(or `$LATTICE_MODEL_CACHE`).\n\n---\n\n## Features\n\n| Feature                       | Description                                                                                  |\n| ----------------------------- | -------------------------------------------------------------------------------------------- |\n| Pure Rust compute             | Hand-written SIMD kernels (AVX2/NEON). No C++, no ONNX, no CUDA.                             |\n| Two transformer architectures | BERT/BGE encoder-only (mean pooling) and Qwen3 decoder-only (causal GQA, last-token pooling) |\n| 9 supported models            | BGE, mE5, MiniLM, Qwen3-Embedding families — see table below                                 |\n| Metal GPU backend             | Native Apple Silicon acceleration via Metal MSL shaders. WGPU fallback for cross-platform.   |\n| Three pure Rust tokenizers    | WordPiece, SentencePiece, BPE — no Hugging Face tokenizers C extension                       |\n| Safetensors native            | Memory-mapped weight loading from HuggingFace `.safetensors` format                          |\n| MRL support                   | Matryoshka truncation for Qwen3 models (configurable output dimension \u003e= 32)                 |\n| LRU embedding cache           | `CachedEmbeddingService` with sharded in-memory cache and hit/miss stats                     |\n| LoRA adapter injection        | Inject trained LoRA adapters into inference models at runtime                                |\n| Knowledge distillation        | Train small models from Claude/GPT/Gemini teacher soft labels                                |\n| Optimal transport             | Sinkhorn-Knopp solver (log-domain, epsilon-scaling) for embedding drift detection            |\n| Tiny fast networks            | `lattice-fann`: sub-5ms classifiers with pre-allocated buffers, zero-alloc forward pass      |\n\n---\n\n## Architecture\n\n```\nApplication\n    |\n    v\nlattice-embed          (public API — embedding service, SIMD distance ops, LRU cache)\n    |\n    v\nlattice-inference      (transformer kernel — BERT/Qwen3 forward pass, tokenizers, weights)\n    |\n    +---\u003e CPU (primary)      Metal (macOS)     WGPU (fallback)\n          AVX2/NEON kernels   Apple Silicon      Vulkan/DX12\n\n\nlattice-fann           (standalone — tiny network primitives, \u003c5ms CPU inference)\nlattice-transport      (standalone — optimal transport math, Wasserstein distances)\nlattice-tune           (depends on fann + inference — LoRA, distillation, model registry)\n```\n\nThe three leaf crates (`inference`, `fann`, `transport`) have zero intra-workspace dependencies\nand can be used standalone.\n\n---\n\n## Crates\n\n| Crate                                    | Description                                                                                                                                                                    | LOC    |\n| ---------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | ------ |\n| [`lattice-embed`](crates/embed/)         | Embedding service — `EmbeddingService` trait, `NativeEmbeddingService`, `CachedEmbeddingService`, SIMD cosine/dot/euclidean, backfill, migration                               | ~14 k  |\n| [`lattice-inference`](crates/inference/) | Transformer kernel — safetensors loading, BERT/BGE/Qwen3 forward pass, WordPiece/SentencePiece/BPE tokenizers, Metal/WGPU backends, LoRA hooks, KV cache, speculative decoding | ~69 k  |\n| [`lattice-fann`](crates/fann/)           | Fast neural network primitives — `NetworkBuilder`, pre-allocated layers, zero-alloc forward pass, backprop trainer, FANN binary format                                         | ~7.5 k |\n| [`lattice-tune`](crates/tune/)           | Training infrastructure — knowledge distillation pipeline, dataset management, LoRA adapter management, model registry with semver lineage                                     | ~13 k  |\n| [`lattice-transport`](crates/transport/) | Optimal transport math — Sinkhorn-Knopp (balanced + unbalanced), Wasserstein barycenters, embedding drift detection, log-domain throughout                                     | ~5.3 k |\n\n---\n\n## Supported Models\n\nAll local models load from HuggingFace safetensors format.\n\n| Model                               | Variant                                                     | Architecture       | Dimensions | Max Tokens | Tokenizer     |\n| ----------------------------------- | ----------------------------------------------------------- | ------------------ | ---------- | ---------- | ------------- |\n| `BgeSmallEnV15`                     | BAAI/bge-small-en-v1.5                                      | BERT encoder       | 384        | 512        | WordPiece     |\n| `BgeBaseEnV15`                      | BAAI/bge-base-en-v1.5                                       | BERT encoder       | 768        | 512        | WordPiece     |\n| `BgeLargeEnV15`                     | BAAI/bge-large-en-v1.5                                      | BERT encoder       | 1024       | 512        | WordPiece     |\n| `MultilingualE5Small`               | intfloat/multilingual-e5-small                              | BERT encoder       | 384        | 512        | SentencePiece |\n| `MultilingualE5Base`                | intfloat/multilingual-e5-base                               | BERT encoder       | 768        | 512        | SentencePiece |\n| `AllMiniLmL6V2`                     | sentence-transformers/all-MiniLM-L6-v2                      | BERT encoder       | 384        | 256        | WordPiece     |\n| `ParaphraseMultilingualMiniLmL12V2` | sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 | BERT encoder       | 384        | 128        | WordPiece     |\n| `Qwen3Embedding0_6B`                | Qwen/Qwen3-Embedding-0.6B                                   | Decoder (GQA+RoPE) | 1024       | 8192       | BPE           |\n| `Qwen3Embedding4B`                  | Qwen/Qwen3-Embedding-4B                                     | Decoder (GQA+RoPE) | 2560*      | 8192       | BPE           |\n\n*Qwen3-Embedding-4B supports MRL truncation to any dimension \u003e= 32.\n\nE5 models expect asymmetric `\"query: \"` / `\"passage: \"` prefixes for retrieval. Qwen3 models\nexpect an instruction prefix. The `EmbeddingModel` enum exposes `query_instruction()` and\n`document_instruction()` so callers can apply these correctly.\n\n---\n\n## Selecting a Model\n\n```rust\nuse lattice_embed::EmbeddingModel;\n\n// Fast general-purpose English retrieval\nlet model = EmbeddingModel::BgeSmallEnV15;   // 384-dim, fastest\n\n// Balanced quality/speed\nlet model = EmbeddingModel::BgeBaseEnV15;   // 768-dim\n\n// Best quality, English\nlet model = EmbeddingModel::BgeLargeEnV15;  // 1024-dim\n\n// Multilingual retrieval\nlet model = EmbeddingModel::MultilingualE5Base;  // 768-dim, 100+ languages\n\n// Long context + multilingual (requires GPU for practical throughput)\nlet model = EmbeddingModel::Qwen3Embedding0_6B;  // 1024-dim, 8K context\n\n// MRL: variable output dimension\nuse lattice_embed::ModelConfig;\nlet config = ModelConfig::try_new(EmbeddingModel::Qwen3Embedding4B, Some(512))?;\n```\n\n---\n\n## Feature Flags\n\n### lattice-embed\n\n| Feature     | Default | Description                                 |\n| ----------- | ------- | ------------------------------------------- |\n| `native`    | yes     | Pure Rust inference via `lattice-inference` |\n| `metal-gpu` | no      | Metal GPU acceleration (macOS)              |\n| `avx512`    | no      | AVX-512 SIMD kernels (requires nightly)     |\n\n### lattice-inference\n\n| Feature     | Default | Description                                            |\n| ----------- | ------- | ------------------------------------------------------ |\n| `f16`       | no      | Half-precision weights                                 |\n| `metal-gpu` | no      | Metal compute backend                                  |\n| `wgpu-gpu`  | no      | WGPU cross-platform GPU backend                        |\n| `download`  | yes     | HuggingFace weight download with checksum verification |\n| `backfill`  | no      | Re-embedding coordinator (requires `rusqlite`)         |\n\n```toml\n# GPU acceleration on macOS\nlattice-embed = { version = \"0.1\", features = [\"metal-gpu\"] }\n\n# Cross-platform GPU\nlattice-embed = { version = \"0.1\", features = [\"wgpu-gpu\"] }\n```\n\n---\n\n## Vector Operations\n\n`lattice-embed` exposes SIMD-accelerated vector utilities as a stable public API:\n\n```rust\nuse lattice_embed::utils;\n\n// Runtime dispatch: AVX2 on x86_64, NEON on aarch64, scalar fallback elsewhere\nlet sim = utils::cosine_similarity(\u0026a, \u0026b);\nlet dot = utils::dot_product(\u0026a, \u0026b);\nlet dist = utils::euclidean_distance(\u0026a, \u0026b);\n\nutils::normalize(\u0026mut vector);  // in-place L2 normalization\n\n// Batch operations\nlet sims = utils::batch_cosine_similarity(\u0026pairs);\n```\n\nMeasured performance on normalized 384-dim vectors (internal benchmarks, subject to hardware):\n\n| Operation                    | Scalar   | SIMD    |\n| ---------------------------- | -------- | ------- |\n| cosine similarity (384-dim)  | ~650 ns  | ~90 ns  |\n| cosine similarity (768-dim)  | ~1300 ns | ~180 ns |\n| cosine similarity (1024-dim) | ~1700 ns | ~240 ns |\n\n---\n\n## lattice-fann: Fast Neural Networks\n\nFor tiny classifiers that need to run in under 5 ms on CPU:\n\n```rust\nuse lattice_fann::{NetworkBuilder, Activation, BackpropTrainer, TrainingConfig, Trainer};\n\n// Build a network\nlet mut network = NetworkBuilder::new()\n    .input(784)\n    .hidden(128, Activation::ReLU)\n    .hidden(64, Activation::ReLU)\n    .output(10, Activation::Softmax)\n    .build()?;\n\nprintln!(\"{}\", network.architecture()); // \"784 -\u003e ReLU(128) -\u003e ReLU(64) -\u003e Softmax(10)\"\nprintln!(\"Parameters: {}\", network.total_params());\n\n// Forward pass (no heap allocation)\nlet output = network.forward(\u0026input)?;\n\n// Serialize to compact binary (magic \"FANN\")\nlet bytes = network.to_bytes();\nlet restored = lattice_fann::Network::from_bytes(\u0026bytes)?;\n```\n\n---\n\n## lattice-transport: Optimal Transport\n\nEntropy-regularized optimal transport for measuring embedding geometry drift:\n\n```rust\n// Sinkhorn-Knopp in log-domain (numerically stable, no Gibbs kernel materialization)\n// Balanced OT, unbalanced OT (KL-relaxed), Wasserstein barycenters\n// Pre-allocated SinkhornWorkspace for zero-alloc inner loops\n```\n\nPrimary use case: detect when an embedding model update has shifted the distribution of stored\nvectors enough to warrant re-indexing.\n\n---\n\n## Benchmarks\n\n### Qwen3.5-0.8B Decode Throughput (Apple M2 Max)\n\nFair end-to-end measurement — **slope method**: `tok/s = (N₂−N₁) / (T(N₂)−T(N₁))` for a fixed\nprompt, so prompt prefill, model load, and per-call overhead cancel and every engine is measured\nthe *same* way (greedy, median of 5 runs, N₁=32, N₂=256).\n\n| Engine | Quant | decode tok/s | Notes |\n|--------|-------|--------------|-------|\n| MLX | Q8 (g64) | **247** | Apple's Metal-native framework — fastest |\n| **Lattice** | f16 | **157** | Pure Rust + Metal; ~1.9× Ollama |\n| Ollama | Q8_0 | **84** | llama.cpp Metal backend |\n\nCross-check: Ollama's slope (84) matches its own `eval_duration` rate (87), confirming the\nmethodology is sound.\n\n**Honest caveats.** Lattice's Qwen3.5 decode path is **f16** (`MetalQwen35State::new`); MLX is Q8\nand Ollama is Q8_0, so this is not quant-matched. There is **no end-to-end Q8 path** for Qwen3.5\nin Lattice — the earlier \"Q8 139\" number was an f16 `forward_step` *micro-benchmark* mislabeled\n\"Q8\" (the bench's `load_q8_state` builds the same f16 state as the f16 path). 157 tok/s (f16,\ntrue e2e) is Lattice's actual decode rate for this model; the only other real e2e path is **Q4**\n(`from_q4_dir`). **MLX decodes faster than Lattice.** Lattice's value is portability plus\ncapabilities no other engine has on this model:\n\n| Lattice-only capability | MLX | Ollama |\n|---|---|---|\n| QuaRot 4-bit (rotated quant) | ✗ | ✗ |\n| Q4 + LoRA r8 hot-swap (no reload) | ✗ | ✗ |\n| Pure Rust, zero Python / framework | ✗ | ✗ |\n\nA previous version of this table claimed \"+8% vs MLX\"; that was a measurement artifact (a\ncriterion `forward_step` micro-bench compared against MLX end-to-end with prefill counted in the\ndecode rate) and has been corrected. All three engines implement the full GDN recurrence for\nQwen3.5's hybrid architecture (18 GatedDeltaNet + 6 GQA layers). Reproducible via\n`./scripts/bench_apples_to_apples.sh`.\n\n### Embedding \u0026 Kernel Benchmarks\n\n```bash\n# Embedding throughput\ncargo bench --package lattice-embed\n\n# Metal GPU decode (macOS only, requires model weights)\ncargo bench -p lattice-inference --features metal-gpu,f16 -- metal_decode\n\n# Attention kernel\ncargo bench --package lattice-inference --bench attention_bench\n```\n\nPerformance depends on hardware, model size, batch size, and sequence length. Run the benchmarks\non your target hardware to get representative numbers.\n\n---\n\n## Documentation\n\n- [Architecture](docs/architecture.md) — crate dependency graph, design decisions, stability tiers\n- ADR directory: `docs/adr/` — architectural decision records\n\n---\n\n## License\n\nApache-2.0. See [LICENSE](LICENSE).\n\n---\n\nBuilt by [Ocean (HaiyangLi)](https://github.com/ohdearquant). Powers\n[khive](https://khive.ai), a cognitive infrastructure for AI agents.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fohdearquant%2Flattice","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fohdearquant%2Flattice","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fohdearquant%2Flattice/lists"}