https://github.com/stampby/halo-ai-core
Bare-metal AI platform for AMD Strix Halo. One script. Everything works. Lego blocks β snap in what you need.
https://github.com/stampby/halo-ai-core
agent-framework ai amd arch-linux bare-metal caddy gaia gpu inference lemonade llama-cpp local-ai privacy rocm ryzen-ai self-hosted strix-halo systemd
Last synced: about 2 months ago
JSON representation
Bare-metal AI platform for AMD Strix Halo. One script. Everything works. Lego blocks β snap in what you need.
- Host: GitHub
- URL: https://github.com/stampby/halo-ai-core
- Owner: stampby
- License: mit
- Created: 2026-04-08T02:37:35.000Z (about 2 months ago)
- Default Branch: main
- Last Pushed: 2026-04-15T23:13:13.000Z (about 2 months ago)
- Last Synced: 2026-04-16T00:19:21.200Z (about 2 months ago)
- Topics: agent-framework, ai, amd, arch-linux, bare-metal, caddy, gaia, gpu, inference, lemonade, llama-cpp, local-ai, privacy, rocm, ryzen-ai, self-hosted, strix-halo, systemd
- Language: Shell
- Size: 907 KB
- Stars: 34
- Watchers: 0
- Forks: 4
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE
- Security: docs/SECURITY.md
Awesome Lists containing this project
README
π **English** | [FranΓ§ais](README.fr.md) | [EspaΓ±ol](README.es.md) | [Deutsch](README.de.md) | [PortuguΓͺs](README.pt.md) | [ζ₯ζ¬θͺ](README.ja.md) | [δΈζ](README.zh.md) | [νκ΅μ΄](README.ko.md) | [Π ΡΡΡΠΊΠΈΠΉ](README.ru.md) | [ΰ€Ήΰ€Ώΰ€¨ΰ₯ΰ€¦ΰ₯](README.hi.md) | [Ψ§ΩΨΉΨ±Ψ¨ΩΨ©](README.ar.md)
# halo-ai core
### the 1-bit monster β local ai inference, bare metal, no python at runtime
**rocm c++ Β· ternary weights (.h1b) Β· fused HIP kernels Β· wave32 wmma Β· 17 c++ specialists Β· zero telemetry Β· zero cloud**
*stamped by the architect*
[](LICENSE)
[](https://github.com/ROCm/TheRock)
[](https://github.com/stampby/rocm-cpp)
[](https://github.com/stampby/agent-cpp)
[](https://github.com/stampby/halo-1bit)
[](https://discord.gg/dSyV646eBs)
[](https://www.reddit.com/r/MidlifeCrisisAI/)
[](https://github.com/stampby/halo-ai-core)
---
## what is this
halo-ai core is the **install script for the 1-bit monster** β a full local AI stack that runs entirely in C++ on AMD Strix Halo hardware. no python at runtime. no cloud. no telemetry. no subscriptions.
one script, three engineering repos:
| repo | what it is |
|------|-----------|
| [**rocm-cpp**](https://github.com/stampby/rocm-cpp) | the inference engine. pure HIP, fused ternary kernels, OpenAI-compatible server with SSE streaming. |
| [**agent-cpp**](https://github.com/stampby/agent-cpp) | the agent framework. 17 single-purpose specialists on a message bus, hash-chained audit log, consent-verification gate. |
| [**halo-1bit**](https://github.com/stampby/halo-1bit) | the model format (.h1b) + training pipeline. absmean ternary, QAT with straight-through estimator, distillation from bf16 teachers. |
halo-ai core clones them, builds them from source, wires them into systemd, and points a caddy reverse proxy at the result. one command, you get a running LLM, a voice loop, a discord bot, a CI runner, and an audit trail. everything local.
*"I know kung fu."*
## install
two paths. the script auto-detects your GPU and picks the right one.
```bash
git clone https://github.com/stampby/halo-ai-core.git
cd halo-ai-core
./install.sh # auto-dispatch: strixhalo β fast; else β source
```
| path | who it's for | time | what it does |
|------|--------|------|------|
| [`./install-strixhalo.sh`](install-strixhalo.sh) | **gfx1151** (Strix Halo) | ~5 min | downloads pre-built binaries from GH Releases, verifies SHA256 + GPG, wires systemd |
| [`./install-source.sh`](install-source.sh) | any other AMD GPU | ~4 hrs | builds TheRock + rocm-cpp + agent-cpp + halo-1bit from source for your arch |
why two scripts: every Strix Halo is the same silicon (gfx1151, wave32, 128 GB unified). one build produces a binary that runs bit-identically on every such box β no reason to rebuild from source every time. for anything else (gfx1030, gfx1100, gfx1201, CDNA), the wave32 WMMA kernels don't port 1:1, so source build with arch-specific codegen is the safe option.
**running something other than a strix halo and want the kernels built for your GPU?** see [`release/KERNELS.md`](release/KERNELS.md) for arch coverage, how to build your own, and how to share community builds back.
[](docs/install-rocmpp.cast)
## the stack
```
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β agent-cpp β 17 C++ specialists β
β muse Β· planner Β· forge Β· warden (CVG) Β· scribe β
β sommelier Β· herald Β· sentinel Β· carpenter Β· anvil β
β quartermaster Β· magistrate Β· librarian Β· cartograph β
β echo_ear Β· echo_mouth Β· stdout_sink β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β rocm-cpp server (:8080) β OpenAI-compat, SSE streaming β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β librocm_cpp β HIP kernels Β· WMMA wave32 Β· KV cache β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β ternary model (.h1b v2) Β· halo-1bit tokenizer (.htok) β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β whisper-server (STT) Β· kokoro (TTS) β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β ROCm 7.13.0 Β· gfx1151 wave32 β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β Arch Linux Β· systemd Β· btrfs β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
```
> *every layer is someone else's lego block if they want it. take the whole monster or take one piece.*
## numbers that matter
| metric | value | note |
|---|---|---|
| **decode speed** | 85 tok/s | BitNet-b1.58-2B, greedy, Strix Halo |
| **model size** | 1.1 GiB | TQ1_0 format, 4Γ smaller than F16 |
| **KLD vs F16** | 0.0023 | mean bits/token β indistinguishable in practice |
| **top-1 agreement** | 96.3% | vs F16 reference, same argmax token |
| **agent binary** | 1.3 MB | agent_cpp, statically linked |
| **cold start** | < 2s | bitnet_decode --server |
| **runtime deps** | 0 python | libc, pthreads, httplib, nlohmann-json, OpenSSL |
details and methodology: [docs/benchmark-comparison.md](docs/benchmark-comparison.md) Β· [docs/replicate.md](docs/replicate.md)
## what you get
### the engine
- **bitnet_decode** β OpenAI-compatible HTTP server on :8080 with SSE streaming. chat completions, models list, bearer auth optional.
- **`.h1b` loader** β ternary weights, magic `H1B`, 9 int32 config + 2 float32 params. memory-mapped, zero-copy.
- **HIP kernels** β fused ternary MatMul, RMSNorm, SiLU, RoPE. wave32 WMMA. no CK, no hipBLAS at runtime.
### the agents
- **17 specialists** β each one job, each one thread. message bus with tamper-evident journal.
- **consent-verification gate** β warden enforces policy/intent/consent/bounds. structural, not advisory.
- **hash-chained audit log** β every inbound and outbound message SHA-256 chained, genesis-seeded per session.
- **optional plugins** β Discord read (sentinel) + write (herald), GitHub triage/PR-review/docs (quartermaster/magistrate/librarian), CI runner (anvil), install-help (carpenter).
### the model + training
- **halo-1bit** β absmean quantization, QAT with STE, distillation from Qwen3-32B bf16 teacher.
- **.h1b v2 format** β production artifacts shipped with each release.
## lego blocks
pick what you want. drop the rest.
| block | what it does | status |
|-------|------|--------|
| **bitnet_decode** | inference server | required |
| **agent_cpp** | agent framework | required |
| **agent_cpp β sentinel+herald** | Discord bot | optional (set DISCORD_TOKEN) |
| **agent_cpp β echo_ear+echo_mouth** | voice loop | optional (whisper + kokoro services) |
| **agent_cpp β quartermaster/magistrate/librarian** | GitHub automation | optional (set GH_TOKEN) |
| **agent_cpp β anvil** | CI runner | optional |
| **caddy** | reverse proxy + bearer auth | optional |
| **man-cave TUI** | FTXUI dashboard over SSH | optional (v2) |
| **orchestrator** | systemd unit wiring | included |
## philosophy
> every piece snaps in and snaps out. no hard dependencies. no vendor lock-in. no cloud tethers.
python shipped the LLM era. C++ owns the next one. python at training time is fine; python at runtime is a liability on hardware you own. **halo-ai core has zero python at runtime.**
the AI industry wants you renting someone else's computer. we think you should own the whole stack β the hardware, the models, the weights, the pipeline. when you control your own software, you control your own destiny.
*"they get the kingdom. they forge their own keys."*
## privacy
**zero telemetry. zero tracking. zero data collection.** nothing phones home. your data stays on your machine.
paid API providers (OpenAI, Anthropic, Groq, DeepSeek, xAI, OpenRouter) are supported through sommelier with your own keys β but that's your choice, not our default. local-first means local-first.
*"there is no cloud. there is only zuul."*
## docs
| doc | what it covers |
|-----|---|
| [docs/INTEGRATIONS.md](docs/INTEGRATIONS.md) | **point your apps at the stack β openai sdk, curl, python, node, c++, webui, mobile** |
| [docs/benchmark-comparison.md](docs/benchmark-comparison.md) | reproducible numbers vs llama.cpp / vLLM / MLX |
| [docs/replicate.md](docs/replicate.md) | step-by-step: build the monster on your box |
| [docs/mlx-setup-guide.md](docs/mlx-setup-guide.md) | the MLX path (comparison / optional) |
| [orchestrator/README.md](orchestrator/README.md) | systemd unit wiring |
| [prototypes/](prototypes/) | next-rung experiments (ternary dequant, etc.) |
| [docs/archive/](docs/archive/) | legacy wiki + pre-monster docs |
## options
```
./install.sh --dry-run preview without installing
./install.sh --yes-all install everything
./install.sh --status check what's running
./install.sh --skip- skip any optional block
./install.sh --help all options
```
## requirements
- Arch Linux (bare metal preferred; podman works for headless)
- AMD Ryzen AI hardware β Strix Halo (gfx1151) or Strix Point (gfx1150)
- passwordless sudo
- ~20 GiB free disk (build artifacts, kernels, models)
## credits
this project stands on the shoulders of the people who ship open source.
built on [llama.cpp](https://github.com/ggml-org/llama.cpp) (for eval tooling), [TheRock](https://github.com/ROCm/TheRock) (ROCm distribution), [httplib](https://github.com/yhirose/cpp-httplib), [nlohmann/json](https://github.com/nlohmann/json), [usearch](https://github.com/unum-cloud/usearch), [FTXUI](https://github.com/ArthurSonzogni/FTXUI), [whisper.cpp](https://github.com/ggerganov/whisper.cpp), [Kokoro TTS](https://github.com/remsky/Kokoro-FastAPI), [microsoft/bitnet-b1.58-2B-4T](https://huggingface.co/microsoft/bitnet-b1.58-2B-4T) (reference model).
special thanks to the `r/MidlifeCrisisAI` community for hard questions, especially the ones about PPL and KLD.
---
*"the 1-bit monster is already here. it just had to learn to count."* β **stamped by the architect**
MIT