https://github.com/barrel-platform/barrel_inference
OTP-native LLM inference runtime with token-exact tiered KV cache, plus an OpenAI/Anthropic/Ollama-compatible HTTP daemon. Erlang/OTP + llama.cpp.
https://github.com/barrel-platform/barrel_inference
erlang inference kv-cache llama-cpp llm otp
Last synced: 6 days ago
JSON representation
OTP-native LLM inference runtime with token-exact tiered KV cache, plus an OpenAI/Anthropic/Ollama-compatible HTTP daemon. Erlang/OTP + llama.cpp.
- Host: GitHub
- URL: https://github.com/barrel-platform/barrel_inference
- Owner: barrel-platform
- Created: 2026-05-23T20:07:01.000Z (23 days ago)
- Default Branch: main
- Last Pushed: 2026-06-09T20:42:50.000Z (6 days ago)
- Last Synced: 2026-06-09T21:09:36.512Z (6 days ago)
- Topics: erlang, inference, kv-cache, llama-cpp, llm, otp
- Language: C++
- Size: 7.37 MB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
# Barrel Inference
OTP-native LLM inference for the BEAM: dirty NIFs over `llama.cpp`, supervised
per-model processes, and a byte-exact tiered KV cache, with an
OpenAI/Anthropic/Ollama-compatible HTTP daemon on top.
Inference as a first-class OTP citizen, not a Python sidecar. The wedge is
supervision, per-model queues, and cancel-on-disconnect, with the cache more
warm state than fits in RAM.
## Layout
A rebar3 umbrella; each app is a separately publishable Hex package and the
repo is versioned as a whole.
| App | What it is |
|-----|------------|
| [`apps/barrel_inference`](apps/barrel_inference) | The runtime: dirty NIFs over llama.cpp, supervised model processes, byte-exact tiered KV cache. |
| [`apps/barrel_inference_server`](apps/barrel_inference_server) | The API daemon: OpenAI-, Anthropic-, and Ollama-compatible HTTP, model registry, per-model queues, keep-alive, metrics. |
| [`apps/barrel_inference_cli`](apps/barrel_inference_cli) | The `barrel-inference` CLI: `serve` boots the daemon; `pull`/`run`/`ps`/`rm` drive a running one over HTTP. |
A distributed control plane (`barrel_inference_cluster`: routing, cache-aware
placement, node discovery) is a planned follow-up.
## Build
rebar3 compile # builds the NIF (vendored llama.cpp via cmake)
rebar3 as prod release # the barrel_inference_server daemon release
rebar3 escriptize # the barrel-inference CLI
Requires Erlang/OTP 28 and rebar3 3.25+, plus cmake and a C/C++ toolchain for
the NIF. See each app's README for the public API and configuration.
## Run
barrel-inference serve # start the API server
barrel-inference pull # fetch a model
barrel-inference run "hello" # one-shot completion
barrel-inference ps # list loaded models
Or with Docker:
docker compose up
## Documentation
One site covers the whole project, for both operators and contributors:
. It is a mkdocs build at
the repo root (`mkdocs.yml`, `docs/`) that surfaces each app's guides in place.
Function-level API reference ships per package on hexdocs:
[barrel_inference](https://hexdocs.pm/barrel_inference) and
[barrel_inference_server](https://hexdocs.pm/barrel_inference_server).
Build the site locally:
pip install -r docs-requirements.txt
mkdocs serve
## License
MIT. Part of the [barrel-platform](https://github.com/barrel-platform) project.