https://github.com/debanjan06/latency-serve-edge
Native Rust edge inference engine with zero-copy memmap2 tensor loading, register-fused Linear+ReLU kernels, and scenario-aware MoE routing via rayon work-stealing — achieving 352µs lightweight and 1.39ms dense expert execution.
- Host: GitHub
- URL: https://github.com/debanjan06/latency-serve-edge
- Owner: debanjan06
- Created: 2026-05-30T21:48:24.000Z (27 days ago)
- Default Branch: main
- Last Pushed: 2026-05-31T12:34:41.000Z (26 days ago)
- Last Synced: 2026-05-31T14:09:07.944Z (26 days ago)
- Topics: edge-computing, high-performance, inference-engine, memory-mapping, mixture-of-experts, operator-fusion, parallel-computing, rust, systems-programming, zero-copy
- Language: Rust
- Homepage:
- Size: 11.7 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# LatencyServe-Edge
A high-performance, low-latency inference engine written in native Rust. Designed for hardware-constrained edge deployments, LatencyServe-Edge targets memory-bandwidth bottlenecks and cache pressure by avoiding redundant allocations and by keeping intermediate values in CPU registers instead of writing them back to memory.
The architecture pairs OS-level virtual memory mapping with a scenario-aware Mixture-of-Experts (MoE) routing engine to dynamically scale computational workloads between power-efficient single-threaded execution and multi-threaded parallel work-stealing loops.
## Core Architectural Pillars
- **Zero-Copy Tensor Ingestion:** Uses `memmap2` to map raw binary weight files directly into the application's virtual address space. Tensors are exposed as zero-copy slice views (`&[f32]`), so setting up a tensor takes O(1) time regardless of file size: the OS pages weights in on first access, and the engine never copies them into a separate heap buffer.
- **Inline Graph Fusion:** Eliminates intermediate memory write-backs by fusing adjacent layers (Linear + ReLU) into a single loop computing `Y = max(0, XW + B)`. Each output's accumulator can stay in a CPU register until the activation has been applied, so only the final value is stored. This avoids a second pass over the output and improves L1/L2 cache locality.
- **Scenario-Aware MoE Routing:** Inspects characteristics of the incoming features (e.g., spatial variance metrics) to assign each inference task to one of two execution paths:
  - *Lightweight Expert:* Single-threaded execution track for minimal power draw on simple payloads.
  - *Dense Expert:* Multi-threaded execution track driven by a `rayon` work-stealing thread pool for large parameter blocks.
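
As a rough illustration of how the fused kernel and the router could fit together, the sketch below uses only the standard library. The names `fused_linear_relu`, `fused_linear_relu_parallel`, `route`, `Expert`, and the variance threshold are hypothetical and not taken from the repository. The weight layout (one contiguous row per output neuron) is an assumption, and `std::thread::scope` with fixed chunks stands in for `rayon`'s work-stealing so the example has no dependencies.

```rust
use std::thread;

/// Which execution path the router selects (hypothetical names).
#[derive(Debug, PartialEq)]
enum Expert {
    Lightweight, // single-threaded
    Dense,       // multi-threaded
}

/// Population variance of the input features (0.0 for an empty slice).
fn variance(x: &[f32]) -> f32 {
    if x.is_empty() {
        return 0.0;
    }
    let n = x.len() as f32;
    let mean = x.iter().sum::<f32>() / n;
    x.iter().map(|v| (v - mean).powi(2)).sum::<f32>() / n
}

/// Assumed routing rule: high-variance inputs go to the dense path.
fn route(x: &[f32], threshold: f32) -> Expert {
    if variance(x) > threshold {
        Expert::Dense
    } else {
        Expert::Lightweight
    }
}

/// Fused Linear + ReLU: y[j] = max(0, dot(x, w_row_j) + b[j]).
/// `w` holds one contiguous row of `x.len()` weights per output.
fn fused_linear_relu(x: &[f32], w: &[f32], b: &[f32], y: &mut [f32]) {
    let in_dim = x.len();
    assert!(in_dim > 0, "input must not be empty");
    assert_eq!(w.len(), in_dim * b.len());
    assert_eq!(y.len(), b.len());
    for ((out, w_row), bias) in y.iter_mut().zip(w.chunks_exact(in_dim)).zip(b) {
        // The accumulator is a local value, so the pre-activation result
        // is never written to an intermediate buffer.
        let mut acc = *bias;
        for (xi, wi) in x.iter().zip(w_row) {
            acc += xi * wi;
        }
        *out = acc.max(0.0); // ReLU applied before the single store
    }
}

/// Same kernel, with output rows split into fixed chunks across threads.
/// Scoped threads are a stand-in here, not rayon's work-stealing scheduler.
fn fused_linear_relu_parallel(x: &[f32], w: &[f32], b: &[f32], y: &mut [f32], threads: usize) {
    let in_dim = x.len();
    assert!(in_dim > 0, "input must not be empty");
    assert_eq!(w.len(), in_dim * b.len());
    assert_eq!(y.len(), b.len());
    let threads = threads.max(1);
    let rows_per_chunk = ((y.len() + threads - 1) / threads).max(1);
    thread::scope(|s| {
        for ((y_chunk, b_chunk), w_chunk) in y
            .chunks_mut(rows_per_chunk)
            .zip(b.chunks(rows_per_chunk))
            .zip(w.chunks(rows_per_chunk * in_dim))
        {
            s.spawn(move || fused_linear_relu(x, w_chunk, b_chunk, y_chunk));
        }
    });
}

fn main() {
    let x = [1.0_f32, 2.0];
    let w = [1.0_f32, 1.0, -1.0, -1.0]; // two output rows of two weights
    let b = [0.5_f32, 0.5];
    let mut y = [0.0_f32; 2];
    let expert = route(&x, 1.0);
    match expert {
        Expert::Lightweight => fused_linear_relu(&x, &w, &b, &mut y),
        Expert::Dense => fused_linear_relu_parallel(&x, &w, &b, &mut y, 4),
    }
    println!("{expert:?} -> {y:?}");
}
```

With these inputs the variance is 0.25, so the router picks the lightweight path and the program prints `Lightweight -> [3.5, 0.0]`: the second output has a pre-activation of -2.5 and is clamped by the ReLU.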
## Directory Layout
```text
latency-serve-edge/
├── Cargo.toml            # Dependency manifests and aggressive release compiler profiles
├── src/
│   ├── lib.rs            # Root library entry point exposing core submodules
│   ├── memory.rs         # Zero-copy memory management and region window views
│   ├── fusion.rs         # Operator fusion structures and Rayon parallel fusers
│   └── routing.rs        # Scenario-aware MoE routing engine mechanics
├── examples/
│   └── benchmark.rs      # Performance execution suite validating processing times
└── tests/
    └── test_server.rs    # Precision arithmetic and boundary validation integration tests
```
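
The zero-copy region views attributed to `memory.rs` depend on reinterpreting a byte region as `&[f32]` without copying. The sketch below shows one way to do that check with the standard library alone. The function names `as_f32_view` and `f32_bytes` are hypothetical, and a plain in-memory byte slice stands in for a `memmap2::Mmap`, which also dereferences to `[u8]`. It assumes the weight file stores native-endian `f32` values.

```rust
use std::mem::{align_of, size_of, size_of_val};
use std::slice;

/// Views a byte region as `&[f32]` without copying.
/// Returns `None` if the region is misaligned for `f32` or if its length
/// is not a whole number of `f32` values.
fn as_f32_view(bytes: &[u8]) -> Option<&[f32]> {
    let aligned = bytes.as_ptr() as usize % align_of::<f32>() == 0;
    let whole = bytes.len() % size_of::<f32>() == 0;
    if !(aligned && whole) {
        return None;
    }
    // SAFETY: the pointer is aligned for f32, the length covers only whole
    // f32 values, every bit pattern is a valid f32, and the returned slice
    // borrows from `bytes`, so it cannot outlive the underlying region.
    Some(unsafe { slice::from_raw_parts(bytes.as_ptr().cast::<f32>(), bytes.len() / size_of::<f32>()) })
}

/// Helper for the demo: exposes f32 data as bytes, standing in for a mapped file.
fn f32_bytes(values: &[f32]) -> &[u8] {
    // SAFETY: u8 has alignment 1 and f32 contains no padding bytes.
    unsafe { slice::from_raw_parts(values.as_ptr().cast::<u8>(), size_of_val(values)) }
}

fn main() {
    let weights = [1.0_f32, -2.5, 3.25];
    let region = f32_bytes(&weights);

    // Whole region and an aligned sub-window both map to views of the same memory.
    println!("{:?}", as_f32_view(region));
    println!("{:?}", as_f32_view(&region[4..]));
    // A window starting one byte in is misaligned and is rejected.
    println!("{:?}", as_f32_view(&region[1..5]));
}
```

Returning `None` instead of panicking keeps a malformed or truncated weight file a recoverable error, which matters with `panic = "abort"` in the release profile.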
## Compilation Profile
To keep execution times low and consistent, the release profile enables Link-Time Optimization, compiles each crate as a single codegen unit so the optimizer sees whole loops at once, and aborts on panic instead of unwinding:
```toml
[dependencies]
memmap2 = "0.9.10"
rayon = "1.12.0"
thiserror = "1.0.69"
[profile.release]
opt-level = 3
lto = true
codegen-units = 1
panic = "abort"
```
## Performance Benchmark
Measured on local hardware with a matrix of roughly **2.1 million parameters (2048 × 1024 = 2,097,152):**
### Debug Build (`cargo run --example benchmark`)
Unoptimized baseline, shown for comparison with the release build:
| Expert | Execution Mode | Time |
|---|---|---|
| Lightweight Expert | Single-Thread | ~2.01ms |
| Dense Expert | Multi-Thread | ~1.33ms |
### Release Build (`cargo run --example benchmark --release`)
With `opt-level=3`, `lto=true`, and `codegen-units=1`, the compiler can apply function inlining, more effective register allocation, and loop unrolling:
| Expert | Execution Mode | Time |
|---|---|---|
| Lightweight Expert | Single-Thread | **352.8µs** |
| Dense Expert | Multi-Thread Parallel | **1.39ms** |
> Lightweight Expert achieves sub-400µs execution via register-localized fused kernels. Dense Expert distributes multi-million parameter matrices across all available CPU cores via Rayon work-stealing.
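
For readers who want to reproduce timings of this kind, the following is a minimal sketch of a timing harness, not the contents of `examples/benchmark.rs`. The helper `time_best_of`, the constant weights, the row-per-output layout, and the choice of reporting the best of 10 runs are all assumptions made for illustration.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Runs `f` `runs` times and returns the shortest wall-clock duration.
/// Returns `Duration::MAX` if `runs` is zero.
fn time_best_of<F: FnMut()>(runs: usize, mut f: F) -> Duration {
    let mut best = Duration::MAX;
    for _ in 0..runs {
        let start = Instant::now();
        f();
        best = best.min(start.elapsed());
    }
    best
}

fn main() {
    // 2048 outputs x 1024 inputs = 2,097,152 weights (about 8 MiB of f32).
    let (rows, cols) = (2048, 1024);
    let w = vec![0.001_f32; rows * cols];
    let x = vec![1.0_f32; cols];
    let mut y = vec![0.0_f32; rows];

    let best = time_best_of(10, || {
        // black_box keeps the optimizer from treating the inputs as constants.
        let (w, x) = (black_box(&w), black_box(&x));
        for (out, row) in y.iter_mut().zip(w.chunks_exact(cols)) {
            let acc: f32 = row.iter().zip(x).map(|(a, b)| a * b).sum();
            *out = acc.max(0.0); // fused ReLU
        }
        black_box(&y); // keep the result observable
    });
    println!("single-threaded fused pass, best of 10: {best:?}");
}
```

Taking the minimum over several runs reduces noise from page faults and scheduler interference on the first iterations. Results depend on the machine, so they will not necessarily match the table above.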
## Verification and Testing
Automated integration tests cover arithmetic logic, virtual memory slicing, and routing boundaries. All three currently pass without panics.
```bash
cargo test
```
```text
running 3 tests
test test_scenario_aware_routing_boundaries ... ok
test test_register_fused_linear_relu_math ... ok
test test_zero_copy_alignment_and_slicing ... ok
test result: ok. 3 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.01s
```
## Getting Started
**Verify compilation integrity:**
```bash
cargo check
```
**Run optimized performance benchmark:**
```bash
cargo run --example benchmark --release
```
**Run integration test suite:**
```bash
cargo test
```
## License
This project is open-source and available under the MIT License.