https://github.com/avilum/minrlm
Stop forcing LLMs to answer in one pass. Give them a runtime. Recursive Language Model that improves any LLM, while reducing token usage up to 4X.
https://github.com/avilum/minrlm
agent ai-agents cost-optimization latency-optimization llm llm-inference llmops recursive-language-model rlm token-optimization
Last synced: 19 days ago
JSON representation
Stop forcing LLMs to answer in one pass. Give them a runtime. Recursive Language Model that improves any LLM, while reducing token usage up to 4X.
- Host: GitHub
- URL: https://github.com/avilum/minrlm
- Owner: avilum
- License: mit
- Created: 2026-01-31T13:26:42.000Z (3 months ago)
- Default Branch: master
- Last Pushed: 2026-04-05T11:23:09.000Z (21 days ago)
- Last Synced: 2026-04-05T11:23:55.448Z (21 days ago)
- Topics: agent, ai-agents, cost-optimization, latency-optimization, llm, llm-inference, llmops, recursive-language-model, rlm, token-optimization
- Language: Python
- Homepage: https://avilum.github.io/minrlm/
- Size: 14.1 MB
- Stars: 60
- Watchers: 2
- Forks: 3
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
minRLM
Stop forcing LLMs to answer in one pass. Give them a runtime.
Took a base model. Wrapped it in a tiny recursive loop: **generate code - execute - refine - repeat**.
Didn't change the model. Didn't add training. Didn't add data.
Just stopped forcing it to answer in one pass.
The performance jump is not subtle:
| | Vanilla (one-shot) | minRLM (recursive) |
|---|---|---|
| **AIME 2025** | 0% | **96%** |
| **Sudoku Extreme** | 0% | **80%** |
| **Overall (GPT-5.2)** | 48.2% | **78.2%** (+30pp) |
| **Tokens used** | 20,967 | **8,151** (3.6x less) |
| **Cost** | $7.92 | **$2.86** (2.8x cheaper) |
6,600+ evaluations across 4 models and 13 tasks. Full blog post | Detailed results
---
## Try it in 10 seconds
```bash
pip install minrlm
export OPENAI_API_KEY="sk-..."
# Analyze a file - data never enters the prompt
uvx minrlm "How many ERROR lines in the last hour?" ./server.log
# Pure computation - the REPL writes the algorithm
uvx minrlm "Return all primes up to 1,000,000, reversed."
# -> 78,498 primes in 6,258 tokens. Output: 616K chars. 25x savings.
# Pipe anything
cat huge_dataset.csv | uvx minrlm "Which product had the highest return rate?"
# Chain: solve a Sudoku, then pipe the solution to verify it
uvx minrlm -s "Solve this Sudoku:
..3|.1.|...
.4.|...|8..
...|..6|.2.
---+---+---
.8.|.5.|..1
...|...|...
5..|.8.|.6.
---+---+---
.7.|6..|...
..2|...|.5.
...|.3.|9.." \
| uvx minrlm -s 'Verify this sudoku board, is it valid? return {"board":str, "valid": bool}'
```
```python
from minrlm import RLM
rlm = RLM(model="gpt-5-mini")
# 50MB CSV? Same cost as 5KB. Data never enters the prompt.
answer = rlm.completion(
task="Which product had the highest return rate in Q3?",
context=open("q3_returns.csv").read()
)
```
---
## How it works
```
Standard LLM:
[System prompt] + [500K tokens of raw context] + [Question]
= Expensive. Slow. Accuracy degrades with length.
minRLM:
input_0 = "<500K chars in REPL memory>" # never in prompt
LLM writes: errors = [l for l in input_0.splitlines() if "ERROR" in l]
FINAL(len(errors))
= Code runs. Answer returned. ~4K tokens total.
```
The model writes Python to query the data. Attention runs only on the results. A 7M-character document costs the same as a 7K one.
**Not ReAct.** One REPL, 1-2 iterations, no growing context. Every step is Python you can read, rerun, and debug.
### What makes it work
- **Entropy profiling** - zlib compression heatmap of the input. A needle in 7MB shows up as an entropy spike; the model skips straight to it
- **Task routing** - auto-detects structured data, MCQ, code retrieval, math, search & extract. Each gets a specialized code pattern
- **Two-pass search** - if the first pass returns "unknown", a second pass runs with keywords from first-pass evidence
- **Sub-LLM delegation** - outer model gathers evidence via `search()`, passes it to `sub_llm(task, evidence)` for focused reasoning
- **Flat token cost** - context never enters the conversation. Only the entropy map and a head/mid/tail preview do
- **DockerREPL** - every execution in a sandboxed container with seccomp. No network, no filesystem, stdlib only
---
## The scaling story
The REPL isn't a crutch for weak models - it's a lever that better models pull harder.
| Model | minRLM | Vanilla | Gap | Tasks won |
|-------|--------|---------|-----|-----------|
| GPT-5-nano (small) | 53.7% | 63.2% | -9.5 | 4/12 |
| GPT-5-mini (mid) | 72.7% | 69.5% | +3.2 | 7/12 |
| GPT-5.4-mini (mid, newer) | 69.5% | 47.2% | +22.3 | 8/12 |
| GPT-5.2 (frontier) | **78.2%** | 48.2% | **+30.0** | **11/12** |
Small model? Recursion adds overhead. Frontier model? Recursion dominates.
The gap isn't model size. It's the execution model.
| | | |
|---|---|---|
|  |  |  |
|  |  |  |
---
## When to use it (and when not to)
**Use it when:**
- Large context (docs, logs, CSV, JSON) - cost stays flat as data grows
- You want debuggable reasoning - every step is readable Python, not hidden attention
- Token efficiency matters - 3.6x fewer tokens than comparable approaches
**Skip it when:**
- Short context (<8K tokens) - a direct call is simpler
- Code retrieval (RepoQA) - the one task where vanilla wins everywhere
- You need third-party packages - the sandbox is stdlib-only
---
## REPL tools
| Function | What it does |
|----------|--------------|
| `input_0` | Your context data (string, never in the prompt) |
| `search(text, pattern)` | Substring search with context windows |
| `sub_llm(task, context)` | Recursive LLM call on a sub-chunk |
| `FINAL(answer)` | Return answer and stop |
---
## Works with any OpenAI-compatible endpoint
```python
# Local / self-hosted
rlm = RLM(model="llama-3.1-70b", base_url="http://localhost:8000/v1")
# Hugging Face
from openai import OpenAI
hf = OpenAI(base_url="https://router.huggingface.co/v1", api_key="hf_...")
rlm = RLM(model="openai/gpt-oss-120b", client=hf)
```
Works with: OpenAI, Hugging Face, Anthropic (via proxy), vLLM, Ollama, LiteLLM, or anything OpenAI-compatible.
---
## More ways to run
Visualizer (Gradio UI)
```bash
git clone https://github.com/avilum/minrlm && cd minrlm
uv sync --extra visualizer
uv run python examples/visualizer.py # http://localhost:7860
```
OpenCode integration
**1. Start the proxy:**
```bash
uv run --with ".[proxy]" examples/proxy.py
# RLM Proxy initialized | model=gpt-5-mini | docker=False
# Uvicorn running on http://0.0.0.0:8000
```
**2. Config** (`opencode/opencode.json`): set `provider.minrlm.api` to `http://localhost:8000/v1`. See [opencode/opencode.json](opencode/opencode.json).
**3. Run:**
```bash
OPENCODE_CONFIG=opencode.json opencode run "First prime after 1 million"
# > 1000003
```
**[Full tutorial](docs/opencode-minrlm-tutorial.md)**
Docker sandbox
LLM-generated code runs in isolated Docker containers. No network, read-only filesystem, memory-capped, seccomp-filtered.
```python
rlm = RLM(model="gpt-5-mini", use_docker=True, docker_memory="256m")
```
Run the benchmarks yourself
```bash
git clone https://github.com/avilum/minrlm && cd minrlm
uv sync --extra eval
# Smoke test
uv run python eval/quickstart.py
# Full benchmark (reproduces the tables above)
uv run python eval/run.py \
--tasks all \
--runners minrlm-reasoning,vanilla,official \
--runs 50 --parallel 12 --task-parallel 12 \
--output-dir logs/my_eval
```
Full results: [`eval/README.md`](eval/README.md)
Examples
```bash
uv run python examples/minimal.py # vanilla vs RLM side-by-side
uv run python examples/advanced_usage.py # search, sub_llm, callbacks
uv run python examples/visualizer.py # Gradio UI
uv run uvicorn examples.proxy:app --port 8000 # OpenAI-compatible proxy
```
---
## Why this matters
[Context window rot](https://arxiv.org/abs/2509.21361) is real - model accuracy degrades as input grows, even when the answer is right there. Bigger windows aren't the fix. Less input, better targeted, is.
The same pattern is showing up everywhere: Anthropic's [web search tool](https://docs.anthropic.com/en/docs/build-with-claude/tool-use/web-search-tool) writes code to filter results, [MCP](https://modelcontextprotocol.io/) standardizes code execution access, [smolagents](https://huggingface.co/docs/smolagents/en/index) goes further. They all converge on the same idea: let the model use code to work with data instead of attending to all of it.
Feels less like "prompting" and more like giving the model a runtime.
---
## Future work
- **More models** - Claude Opus 4.6, Gemini 2.5, open-weight models. Does the scaling trend hold across providers?
- **Agentic pipelines** - using the RLM pattern as a retrieval step inside multi-step agent workflows
- **More tasks** - stress-testing edge cases and domains where the approach might break
Contributions welcome. Open an issue or PR.
---
## Credits
Built by [Avi Lumelsky](https://github.com/avilum). Independent implementation - not a fork.
The RLM concept comes from [Zhang, Kraska, and Khattab (2025)](https://arxiv.org/abs/2512.24601). Official implementation: [github.com/alexzhang13/rlm](https://github.com/alexzhang13/rlm).
Citation
```
@misc{zhang2026recursivelanguagemodels,
title={Recursive Language Models},
author={Alex L. Zhang and Tim Kraska and Omar Khattab},
year={2026},
eprint={2512.24601},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2512.24601},
}
```
## Star History
## License
MIT