{"id":50759391,"url":"https://github.com/debanjan06/latency-serve-edge","last_synced_at":"2026-06-11T08:30:54.221Z","repository":{"id":361607261,"uuid":"1254677664","full_name":"debanjan06/latency-serve-edge","owner":"debanjan06","description":"Native Rust edge inference engine with zero-copy memmap2 tensor loading, register-fused Linear+ReLU kernels, and scenario-aware MoE routing via rayon work-stealing — achieving 352µs lightweight and 1.39ms dense expert execution.","archived":false,"fork":false,"pushed_at":"2026-05-31T12:34:41.000Z","size":12,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-05-31T14:09:07.944Z","etag":null,"topics":["edge-computing","high-performance","inference-engine","memory-mapping","mixture-of-experts","operator-fusion","parallel-computing","rust","systems-programming","zero-copy"],"latest_commit_sha":null,"homepage":"","language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/debanjan06.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-05-30T21:48:24.000Z","updated_at":"2026-05-31T12:34:44.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/debanjan06/latency-serve-edge","commit_stats":null,"previous_names":["debanjan06/latency-serve-edge"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/debanjan06/latency-serve-edge","repository_url":"https://repos.eco
syste.ms/api/v1/hosts/GitHub/repositories/debanjan06%2Flatency-serve-edge","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/debanjan06%2Flatency-serve-edge/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/debanjan06%2Flatency-serve-edge/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/debanjan06%2Flatency-serve-edge/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/debanjan06","download_url":"https://codeload.github.com/debanjan06/latency-serve-edge/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/debanjan06%2Flatency-serve-edge/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34190582,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-11T02:00:06.485Z","response_time":57,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["edge-computing","high-performance","inference-engine","memory-mapping","mixture-of-experts","operator-fusion","parallel-computing","rust","systems-programming","zero-copy"],"created_at":"2026-06-11T08:30:53.110Z","updated_at":"2026-06-11T08:30:54.213Z","avatar_url":"https://github.com/debanjan06.png","language":"Rust","funding_links":[],"categories":[],"sub_categories":[],"readme":"# LatencyServe-Edge\r\n\r\nA high-performance, 
ultra-low-latency inference engine written in native Rust. Designed for hardware-constrained edge deployment environments, LatencyServe-Edge addresses memory-bandwidth bottlenecks and cache starvation by eliminating redundant allocations and localizing memory operations directly within hardware registers.\r\n\r\nThe architecture pairs OS-level virtual memory mapping with a scenario-aware Mixture-of-Experts (MoE) routing engine to dynamically scale computational workloads between power-efficient single-threaded execution and multi-threaded parallel work-stealing loops.\r\n\r\n## Core Architectural Pillars\r\n\r\n- **Zero-Copy Tensor Ingestion:** Leverages virtual memory maps via `memmap2` to project raw binary weight files straight into the application's virtual address space. Tensors are exposed as zero-copy array views (`\u0026[f32]`), achieving O(1) load times and bypassing physical RAM reallocation overhead.\r\n- **Inline Graph Fusion:** Eliminates intermediate memory write-backs by fusing adjacent layers (Linear + ReLU) into a single localized runtime loop `Y = max(0, XW + B)`. 
Accumulation occurs entirely within CPU registers to maintain maximum L1/L2 cache locality.\r\n- **Scenario-Aware MoE Routing:** Inspects incoming feature characteristics (e.g., spatial variance metrics) to intelligently assign inference tasks across execution paths:\r\n  - *Lightweight Expert:* Single-threaded execution track for minimal power draw on simple payloads.\r\n  - *Dense Expert:* Multi-threaded execution track driven by a `rayon` work-stealing parallel engine for massive parameter blocks.\r\n\r\n## Directory Layout\r\n\r\n```text\r\nlatency-serve-edge/\r\n├── Cargo.toml           # Dependency manifests and aggressive release compiler profiles\r\n├── src/\r\n│   ├── lib.rs           # Root library entry point exposing core submodules\r\n│   ├── memory.rs        # Zero-copy memory management and region window views\r\n│   ├── fusion.rs        # Operator fusion structures and Rayon parallel fusers\r\n│   └── routing.rs       # Scenario-aware MoE routing engine mechanics\r\n├── examples/\r\n│   └── benchmark.rs     # Performance execution suite validating processing times\r\n└── tests/\r\n    └── test_server.rs   # Precision arithmetic and boundary validation integration tests\r\n```\r\n\r\n## Compilation Profile\r\n\r\nTo achieve deterministic sub-millisecond execution, the release pipeline relies on strict Link-Time Optimization and aggressive single-codegen-unit loop optimizations:\r\n\r\n```toml\r\n[dependencies]\r\nmemmap2 = \"0.9.10\"\r\nrayon  = \"1.12.0\"\r\nthiserror = \"1.0.69\"\r\n\r\n[profile.release]\r\nopt-level     = 3\r\nlto           = true\r\ncodegen-units = 1\r\npanic         = \"abort\"\r\n```\r\n\r\n## Performance Benchmark\r\n\r\nVerified on local hardware simulating a **2.09 Million Parameter Matrix (2048 × 1024):**\r\n\r\n### Debug Build (`cargo run --example benchmark`)\r\n\r\nUnoptimized instruction loops illustrating raw hardware variance before compiler optimizations:\r\n\r\n| Expert | Execution Mode | Time 
|\r\n|---|---|---|\r\n| Lightweight Expert | Single-Thread | ~2.01ms |\r\n| Dense Expert | Multi-Thread | ~1.33ms |\r\n\r\n### Release Build (`cargo run --example benchmark --release`)\r\n\r\nWith `opt-level=3`, `lto=true`, and `codegen-units=1`, the compiler can apply cross-crate inlining, tighter register allocation, and loop unrolling:\r\n\r\n| Expert | Execution Mode | Time |\r\n|---|---|---|\r\n| Lightweight Expert | Single-Thread | **352.8µs** |\r\n| Dense Expert | Multi-Thread | **1.39ms** |\r\n\r\n\u003e The Lightweight Expert achieves sub-400µs execution via register-localized fused kernels. The Dense Expert distributes multi-million parameter matrices across all available CPU cores via Rayon work-stealing.\r\n\r\n## Verification and Testing\r\n\r\nAutomated integration tests cover arithmetic logic, virtual memory slicing, and routing boundaries. All three pass without panics:\r\n\r\n```bash\r\ncargo test\r\n```\r\n\r\n```text\r\nrunning 3 tests\r\ntest test_scenario_aware_routing_boundaries ... ok\r\ntest test_register_fused_linear_relu_math   ... ok\r\ntest test_zero_copy_alignment_and_slicing   ... ok\r\n\r\ntest result: ok. 3 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.01s\r\n```\r\n\r\n## Getting Started\r\n\r\n**Verify compilation integrity:**\r\n```bash\r\ncargo check\r\n```\r\n\r\n**Run optimized performance benchmark:**\r\n```bash\r\ncargo run --example benchmark --release\r\n```\r\n\r\n**Run integration test suite:**\r\n```bash\r\ncargo test\r\n```\r\n\r\n## License\r\n\r\nThis project is open-source and available under the MIT License.","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdebanjan06%2Flatency-serve-edge","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdebanjan06%2Flatency-serve-edge","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdebanjan06%2Flatency-serve-edge/lists"}