{"id":47865044,"url":"https://github.com/baremetalrt/baremetalrt","last_synced_at":"2026-04-18T08:12:30.756Z","repository":{"id":349046307,"uuid":"1200832129","full_name":"baremetalrt/baremetalrt","owner":"baremetalrt","description":"BareMetalRT — edge GPU compute mesh","archived":false,"fork":false,"pushed_at":"2026-04-14T07:45:09.000Z","size":1053,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-04-14T08:33:55.564Z","etag":null,"topics":["cuda","distributed-computing","gpu","inference","llm","nvidia","tensorrt","windows"],"latest_commit_sha":null,"homepage":"https://baremetalrt.ai","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/baremetalrt.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":"SECURITY.md","support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-04-03T22:01:17.000Z","updated_at":"2026-04-14T07:45:13.000Z","dependencies_parsed_at":"2026-04-06T20:00:23.546Z","dependency_job_id":null,"html_url":"https://github.com/baremetalrt/baremetalrt","commit_stats":null,"previous_names":["baremetalrt/baremetalrt"],"tags_count":7,"template":false,"template_full_name":null,"purl":"pkg:github/baremetalrt/baremetalrt","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/baremetalrt%2Fbaremetalrt","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/baremetalrt%2Fbaremetalrt/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hos
ts/GitHub/repositories/baremetalrt%2Fbaremetalrt/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/baremetalrt%2Fbaremetalrt/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/baremetalrt","download_url":"https://codeload.github.com/baremetalrt/baremetalrt/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/baremetalrt%2Fbaremetalrt/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31861708,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-15T15:24:51.572Z","status":"ssl_error","status_checked_at":"2026-04-15T15:24:39.138Z","response_time":63,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cuda","distributed-computing","gpu","inference","llm","nvidia","tensorrt","windows"],"created_at":"2026-04-04T00:04:05.677Z","updated_at":"2026-04-15T22:01:36.394Z","avatar_url":"https://github.com/baremetalrt.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# BareMetalRT\n\n[![PyPI](https://img.shields.io/pypi/v/baremetalrt)](https://pypi.org/project/baremetalrt/)\n[![Release](https://img.shields.io/github/v/release/baremetalrt/baremetalrt?include_prereleases)](https://github.com/baremetalrt/baremetalrt/releases/latest)\n[![License](https://img.shields.io/badge/license-proprietary-blue)](LICENSE)\n\n**The 
world's first global GPU-native edge compute mesh.**\n\nIntelligence shouldn't be owned by the hyperscalers alone. BareMetalRT turns the 200+ million NVIDIA GPUs running Windows into a distributed compute mesh — using NVIDIA's own TensorRT-LLM CUDA kernels, the same engine that powers cloud inference APIs. Built for the edge, not the cloud.\n\n**[Download Installer](https://github.com/baremetalrt/baremetalrt/releases/latest)** | **[Live Demo](https://baremetalrt.ai/demo)** | **[Documentation](https://baremetalrt.ai/docs)** | **[PyPI](https://pypi.org/project/baremetalrt/)** | **[Technical Paper](paper/main.pdf)**\n\n## How It Works\n\n1. **Install** the BareMetalRT daemon on each Windows machine with an NVIDIA GPU\n2. **Connect** your GPU to your account at [baremetalrt.ai](https://baremetalrt.ai)\n3. **Run inference** — the system automatically shards models across your available GPUs\n\n```\n┌──────────────────────────────────┐\n│  baremetalrt.ai                  │\n│  Auth, routing, OpenAI-compat API│\n└──────────────┬───────────────────┘\n               │ WebSocket\n    ┌──────────┼──────────┐\n    │          │          │\n┌───▼───┐ ┌───▼───┐ ┌───▼───┐\n│Node A │ │Node B │ │Node C │\n│3090   │ │4060   │ │3060   │\n│24GB   │ │8GB    │ │12GB   │\n│Daemon │ │Daemon │ │Daemon │\n└───┬───┘ └───┬───┘ └───┬───┘\n    └── TCP AllReduce ──┘\n```\n\n## Why BareMetalRT\n\n| | Cloud (OpenAI, etc.) 
| Local (Ollama, LM Studio) | Distributed (Petals, Exo) | **BareMetalRT** |\n|---|---|---|---|---|\n| **Kernels** | Optimized (proprietary) | Generic CUDA | Generic CUDA | **TensorRT-LLM (1,500+ optimized .cu)** |\n| **Multi-GPU** | NVLink (datacenter only) | Single GPU only | Pipeline parallelism | **Tensor parallelism over TCP** |\n| **Heterogeneous GPUs** | No | N/A | No | **Yes — different VRAM, different SMs** |\n| **Windows native** | N/A | Yes | Partial | **Yes** |\n| **Cost** | Per-token pricing | Free (your hardware) | Free (your hardware) | **Free (your hardware)** |\n| **Privacy** | Data leaves your machine | Fully local | Weights distributed | **Weights distributed, execution local** |\n\n**The key difference:** every other consumer GPU project uses pipeline parallelism, which leaves GPUs idle 50% of the time. BareMetalRT is the first to achieve tensor parallelism across heterogeneous consumer GPUs — both GPUs compute on every layer, every token.\n\n## Benchmarks\n\nTested with Mistral 7B Instruct (14 GB FP16) across an RTX 4070 Super (12 GB) and an RTX 4060 Laptop (8 GB) — **a model too large for either GPU alone**. The WiFi and Ethernet rows were measured with TinyLlama 1.1B, as noted in the table.\n\n| Configuration | Latency | Throughput | Notes |\n|---|---|---|---|\n| llama.cpp — 4070S single GPU | 3.4 ms/tok | 295 tok/s | Q8 quantized, fits on one card |\n| BareMetalRT TP=2 — WiFi | 277 ms/tok | 3.6 tok/s | TinyLlama 1.1B, 316ms ping |\n| BareMetalRT TP=2 — Ethernet | 276 ms/tok | 3.6 tok/s | TinyLlama 1.1B, 1ms ping |\n| **BareMetalRT TP=2 — Mistral 7B** | **80 ms/tok** | **12.5 tok/s** | **KV cache + overlapped AllReduce** |\n\n\u003e **Key finding:** A roughly 300x reduction in network latency (316 ms ping on WiFi → 1 ms on Ethernet) yielded no measurable throughput improvement (277 vs. 276 ms/tok). GPU synchronization overhead — not network latency — is the dominant bottleneck. The network is not the problem.\n\n12.5 tok/s streams faster than a human reads. 
The throughput is practical for interactive use, and the correctness result — identical output from mismatched GPUs over a commodity network — is what matters.\n\n## System Requirements\n\n- Windows 10/11 (64-bit)\n- NVIDIA GPU (RTX 2000+ recommended)\n- CUDA Toolkit 12.4+\n- TensorRT 10.15+\n\n## Quick Start\n\n### 1. Download and Install\n\nDownload the latest installer from [GitHub Releases](https://github.com/baremetalrt/baremetalrt/releases/latest) and run it. The installer will check for NVIDIA prerequisites and guide you through setup.\n\n### 2. Connect Your GPU\n\nSign in at [baremetalrt.ai/app](https://baremetalrt.ai/app) — the web app automatically detects the daemon running on your machine and links your GPU to your account.\n\n### 3. Chat\n\nUse the web interface at [baremetalrt.ai/app](https://baremetalrt.ai/app) or connect any OpenAI-compatible client:\n\n```bash\ncurl https://baremetalrt.ai/v1/chat/completions \\\n  -H \"Authorization: Bearer bmrt_your_api_key\" \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\"model\": \"mistral-7b\", \"messages\": [{\"role\": \"user\", \"content\": \"Hello!\"}]}'\n```\n\nWorks with any OpenAI-compatible client — Python `openai` SDK, Continue, Cursor, or `curl`.\n\n## CLI\n\n```bash\npip install baremetalrt\nbmrt status\nbmrt models\nbmrt run mistral-7b\n```\n\n## Technical Details\n\n- **FP32 precision correctness** — custom CUDA kernel performs AllReduce on-GPU in FP32, achieving the theoretical floor of IEEE 754 arithmetic. 2,500x more accurate than FP16. Identical to NCCL on NVLink.\n- **Asymmetric-tolerant transport** — GPUs with different VRAM and compute capabilities participate in the same AllReduce without barrier stalls. 
The slower GPU sets the pace; the faster GPU waits on a non-blocking receive.\n- **TensorRT plugin integration** — custom `IPluginV2DynamicExt` plugins intercept every AllReduce/AllGather call at execution time, replacing NCCL with our TCP transport without modifying TRT-LLM's model definitions.\n- **Overlapped AllReduce** — TCP recv runs in a background thread during GPU sync wait. When sync_wait \u003e recv_time, the network transfer adds zero time to the critical path.\n- **Double-buffered pinned memory** — four page-locked host buffers alternate between consecutive AllReduce calls, preventing data races between in-flight transfers.\n- **TensorRT-LLM on Windows** — full native port of NVIDIA's inference engine (Conan profiles, FMHA kernels, nanobind bindings, MSVC/CUDA interop). No WSL, no Docker.\n\nSee [Architecture](docs/ARCHITECTURE.md) for the full system design, or read the [technical paper](paper/main.pdf).\n\n## What's in This Repo\n\nThis is the **public product repo** — the server, web UI, installer, and documentation.\n\n```\nbaremetalrt/\n├── server/        # FastAPI server (auth, chat relay, node management)\n├── web/           # Product web app (chat UI, account, downloads)\n├── site/          # Landing page and demo\n├── installer/     # Windows installer (Inno Setup)\n├── cli/           # bmrt CLI (pip install baremetalrt)\n├── docs/          # Documentation\n│   ├── ARCHITECTURE.md\n│   ├── API.md\n│   ├── QUICKSTART.md\n│   └── MISSION.md\n└── paper/         # Technical paper (arXiv-ready LaTeX)\n```\n\nThe inference engine, transport layer, and daemon are in a separate private repository.\n\n## API\n\nBareMetalRT exposes an OpenAI-compatible API. 
See [API docs](https://baremetalrt.ai/docs) or the [API reference](docs/API.md).\n\n## Current Status\n\n**v0.5.1-beta** — [Changelog](CHANGELOG.md)\n\n- Single-GPU and TP=2 multi-GPU inference on Windows\n- Mistral 7B at 12.5 tok/s across heterogeneous GPUs over TCP\n- Web chat UI with streaming\n- Windows installer with automatic GPU claiming\n- OpenAI-compatible API\n- User accounts with Google OAuth + API key auth\n- Chat history encrypted on-device — never stored on our servers\n\n## Roadmap\n\n- **TP=4+** — scale beyond two GPUs using ring AllReduce (transport implemented, untested beyond TP=2)\n- **Mixture-of-Experts** — replace dense AllReduce with sparse expert routing. 2 of 8 experts per token = 75% of the mesh available for concurrent serving. See [Architecture § Future](docs/ARCHITECTURE.md#future-mixture-of-experts).\n- **Distributed KV cache** — page KV entries to peer GPU VRAM across the mesh, enabling 128K+ context on consumer hardware\n- **Asymmetric weight splitting** — proportional column assignment based on per-GPU VRAM\n- **Continuous batching** — serve multiple users per forward pass\n\n## Security\n\nFound a vulnerability? See [SECURITY.md](SECURITY.md) for our responsible disclosure policy.\n\n## License\n\nBareMetalRT is proprietary software. See [LICENSE](LICENSE) for terms.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbaremetalrt%2Fbaremetalrt","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbaremetalrt%2Fbaremetalrt","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbaremetalrt%2Fbaremetalrt/lists"}