https://github.com/avilum/minrlm

Stop forcing LLMs to answer in one pass. Give them a runtime. Recursive Language Model that improves any LLM, while reducing token usage up to 4X.
agent ai-agents cost-optimization latency-optimization llm llm-inference llmops recursive-language-model rlm token-optimization

Host: GitHub
URL: https://github.com/avilum/minrlm
Owner: avilum
License: mit
Created: 2026-01-31T13:26:42.000Z (3 months ago)
Default Branch: master
Last Pushed: 2026-04-05T11:23:09.000Z (21 days ago)
Last Synced: 2026-04-05T11:23:55.448Z (21 days ago)
Topics: agent, ai-agents, cost-optimization, latency-optimization, llm, llm-inference, llmops, recursive-language-model, rlm, token-optimization
Language: Python
Homepage: https://avilum.github.io/minrlm/
Size: 14.1 MB
Stars: 60
Watchers: 2
Forks: 3
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

README

          


  
# minRLM

Stop forcing LLMs to answer in one pass. Give them a runtime.

Took a base model. Wrapped it in a tiny recursive loop: **generate code - execute - refine - repeat**.

Didn't change the model. Didn't add training. Didn't add data.

Just stopped forcing it to answer in one pass.

The performance jump is not subtle:

| | Vanilla (one-shot) | minRLM (recursive) |
|---|---|---|
| **AIME 2025** | 0% | **96%** |
| **Sudoku Extreme** | 0% | **80%** |
| **Overall (GPT-5.2)** | 48.2% | **78.2%** (+30pp) |
| **Tokens used** | 20,967 | **8,151** (3.6x fewer) |
| **Cost** | $7.92 | **$2.86** (2.8x cheaper) |

_6,600+ evaluations across 4 models and 13 tasks. Full blog post | Detailed results_

---

## Try it in 10 seconds

```bash
pip install minrlm
export OPENAI_API_KEY="sk-..."

# Analyze a file - data never enters the prompt
uvx minrlm "How many ERROR lines in the last hour?" ./server.log

# Pure computation - the REPL writes the algorithm
uvx minrlm "Return all primes up to 1,000,000, reversed."
# -> 78,498 primes in 6,258 tokens. Output: 616K chars. 25x savings.

# Pipe anything
cat huge_dataset.csv | uvx minrlm "Which product had the highest return rate?"

# Chain: solve a Sudoku, then pipe the solution to verify it
uvx minrlm -s "Solve this Sudoku:
  ..3|.1.|...
  .4.|...|8..
  ...|..6|.2.
  ---+---+---
  .8.|.5.|..1
  ...|...|...
  5..|.8.|.6.
  ---+---+---
  .7.|6..|...
  ..2|...|.5.
  ...|.3.|9.." \
  | uvx minrlm -s 'Verify this sudoku board, is it valid? return {"board":str, "valid": bool}'
```

```python
from minrlm import RLM

rlm = RLM(model="gpt-5-mini")

# 50MB CSV? Same cost as 5KB. Data never enters the prompt.
answer = rlm.completion(
    task="Which product had the highest return rate in Q3?",
    context=open("q3_returns.csv").read()
)
```
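
For the "primes up to 1,000,000, reversed" command above, the code the model writes inside the REPL might look something like the sketch below. This is an illustration only, not captured minRLM output: the function name `primes_up_to` is hypothetical, and the real generated code may differ from run to run.

```python
import math


def primes_up_to(limit: int) -> list[int]:
    """Sieve of Eratosthenes: return all primes <= limit in ascending order."""
    if limit < 2:
        return []
    is_prime = bytearray([1]) * (limit + 1)
    is_prime[0] = is_prime[1] = 0
    for n in range(2, math.isqrt(limit) + 1):
        if is_prime[n]:
            # Cross out every multiple of n, starting at n * n.
            count = len(range(n * n, limit + 1, n))
            is_prime[n * n :: n] = bytearray(count)
    return [i for i, flag in enumerate(is_prime) if flag]


result = primes_up_to(1_000_000)[::-1]
print(len(result), "primes, largest first:", result[:3])
```

The point of the pattern is that the long list is produced by running code, so the model only has to emit a few lines of Python rather than hundreds of thousands of output characters.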

---

## How it works

```
Standard LLM:
  [System prompt] + [500K tokens of raw context] + [Question]
  = Expensive. Slow. Accuracy degrades with length.

minRLM:
  input_0 = "<500K chars in REPL memory>"     # never in prompt
  LLM writes: errors = [l for l in input_0.splitlines() if "ERROR" in l]
              FINAL(len(errors))
  = Code runs. Answer returned. ~4K tokens total.
```

The model writes Python to query the data. Attention runs only on the results. A 7M-character document costs the same as a 7K one.

**Not ReAct.** One REPL, 1-2 iterations, no growing context. Every step is Python you can read, rerun, and debug.
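
The generate-execute loop described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not minRLM's implementation: `run_loop`, `Final`, and `fake_model` are hypothetical names, the "model" is a stub that returns the snippet from the diagram, and `exec` runs unsandboxed here.

```python
import contextlib
import io


class Final(Exception):
    """Raised by FINAL() to stop execution and carry the answer out."""

    def __init__(self, answer):
        super().__init__(answer)
        self.answer = answer


def FINAL(answer):
    raise Final(answer)


def run_loop(task, context, generate_code, max_iterations=2):
    """Generate code, execute it, and feed only the output back to the model."""
    # The context lives in the REPL namespace, never in the prompt.
    namespace = {"input_0": context, "FINAL": FINAL}
    feedback = ""
    for _ in range(max_iterations):
        code = generate_code(task, len(context), feedback)
        stdout = io.StringIO()
        try:
            with contextlib.redirect_stdout(stdout):
                exec(code, namespace)  # unsandboxed: illustration only
        except Final as done:
            return done.answer
        except Exception as exc:
            feedback = f"{type(exc).__name__}: {exc}"  # let the model refine
        else:
            feedback = stdout.getvalue()
    return None


def fake_model(task, context_len, feedback):
    # Stand-in for an LLM call: returns the snippet from the diagram above.
    return (
        'errors = [l for l in input_0.splitlines() if "ERROR" in l]\n'
        "FINAL(len(errors))"
    )


log = "INFO boot\nERROR disk full\nINFO retry\nERROR timeout\n"
print(run_loop("How many ERROR lines?", log, fake_model))  # -> 2
```

Note that the stub only ever sees the task, the context length, and the previous step's output, which is why the prompt size does not grow with the data.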

### What makes it work

- **Entropy profiling** - zlib compression heatmap of the input. A needle in 7MB shows up as an entropy spike; the model skips straight to it

- **Task routing** - auto-detects structured data, MCQ, code retrieval, math, search & extract. Each gets a specialized code pattern

- **Two-pass search** - if the first pass returns "unknown", a second pass runs with keywords from first-pass evidence

- **Sub-LLM delegation** - outer model gathers evidence via `search()`, passes it to `sub_llm(task, evidence)` for focused reasoning

- **Flat token cost** - context never enters the conversation. Only the entropy map and a head/mid/tail preview do

- **DockerREPL** - with `use_docker=True`, every execution runs in a sandboxed container with seccomp. No network, read-only filesystem, stdlib only
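
To make the entropy-profiling idea concrete, here is a rough sketch of a zlib-based profile. It assumes fixed-size chunks and uses the compressed-to-raw size ratio as the signal; the function name `entropy_profile`, the 1024-byte chunk size, and the synthetic document are all illustrative choices, not minRLM's actual code.

```python
import random
import string
import zlib


def entropy_profile(text: str, chunk_size: int = 1024) -> list[float]:
    """Compressed/raw size ratio per chunk; higher means less redundant."""
    data = text.encode("utf-8")
    ratios = []
    for start in range(0, len(data), chunk_size):
        chunk = data[start : start + chunk_size]
        ratios.append(len(zlib.compress(chunk)) / len(chunk))
    return ratios


# Synthetic document: repetitive log lines with one random-looking "needle".
filler = "INFO request served in 12ms\n" * 2000
rng = random.Random(0)
alphabet = string.ascii_letters + string.digits
needle = "".join(rng.choice(alphabet) for _ in range(512))
half = len(filler) // 2
doc = filler[:half] + needle + filler[half:]

profile = entropy_profile(doc)
spike = max(range(len(profile)), key=profile.__getitem__)
print(f"highest ratio in chunk {spike} (bytes {spike * 1024}-{(spike + 1) * 1024})")
```

Repetitive chunks compress to a small fraction of their size, while the chunk holding the needle barely compresses, so it stands out without anything having to read the full document.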

---

## The scaling story

The REPL isn't a crutch for weak models - it's a lever that better models pull harder.

| Model | minRLM | Vanilla | Gap | Tasks won |
|-------|--------|---------|-----|-----------|
| GPT-5-nano (small) | 53.7% | 63.2% | -9.5 | 4/12 |
| GPT-5-mini (mid) | 72.7% | 69.5% | +3.2 | 7/12 |
| GPT-5.4-mini (mid, newer) | 69.5% | 47.2% | +22.3 | 8/12 |
| GPT-5.2 (frontier) | **78.2%** | 48.2% | **+30.0** | **11/12** |

Small model? Recursion adds overhead. Frontier model? Recursion dominates.

The gap isn't model size. It's the execution model.

| | | |
|---|---|---|
| ![Summary](docs/summary_dashboard.png) | ![Accuracy](docs/accuracy_per_task.png) | ![Tokens](docs/token_savings.png) |
| ![Cost](docs/accuracy_vs_cost.png) | ![Latency](docs/accuracy_vs_latency.png) | ![Per Task](docs/cost_per_task.png) |

---

## When to use it (and when not to)

**Use it when:**

- Large context (docs, logs, CSV, JSON) - cost stays flat as data grows

- You want debuggable reasoning - every step is readable Python, not hidden attention

- Token efficiency matters - 3.6x fewer tokens than comparable approaches

**Skip it when:**

- Short context (<8K tokens) - a direct call is simpler

- Code retrieval (RepoQA) - the one task where vanilla wins everywhere

- You need third-party packages - the sandbox is stdlib-only

---

## REPL tools

| Function | What it does |
|----------|--------------|
| `input_0` | Your context data (string, never in the prompt) |
| `search(text, pattern)` | Substring search with context windows |
| `sub_llm(task, context)` | Recursive LLM call on a sub-chunk |
| `FINAL(answer)` | Return answer and stop |
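
As a rough picture of what "substring search with context windows" means, the sketch below returns a snippet of surrounding text for each match. It is a hypothetical stand-in, not the library's `search`: the `window` and `max_hits` parameters and the return type are assumptions made for illustration.

```python
def search(text: str, pattern: str, window: int = 40, max_hits: int = 5) -> list[str]:
    """Return up to max_hits snippets of text surrounding each match of pattern."""
    if not pattern:
        return []
    hits = []
    start = text.find(pattern)
    while start != -1 and len(hits) < max_hits:
        lo = max(0, start - window)
        hi = min(len(text), start + len(pattern) + window)
        hits.append(text[lo:hi])
        # Continue after this match so overlapping hits are not repeated.
        start = text.find(pattern, start + len(pattern))
    return hits


doc = "alpha " * 50 + "the launch code is 7421. " + "omega " * 50
for snippet in search(doc, "launch code", window=20):
    print(repr(snippet))
```

Only these short snippets would need to reach the model (or `sub_llm`), rather than the full text.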

---

## Works with any OpenAI-compatible endpoint

```python
from minrlm import RLM

# Local / self-hosted
rlm = RLM(model="llama-3.1-70b", base_url="http://localhost:8000/v1")

# Hugging Face
from openai import OpenAI

hf = OpenAI(base_url="https://router.huggingface.co/v1", api_key="hf_...")
rlm = RLM(model="openai/gpt-oss-120b", client=hf)
```

Works with: OpenAI, Hugging Face, Anthropic (via proxy), vLLM, Ollama, LiteLLM, or anything OpenAI-compatible.

---

## More ways to run

### Visualizer (Gradio UI)

```bash
git clone https://github.com/avilum/minrlm && cd minrlm
uv sync --extra visualizer
uv run python examples/visualizer.py   # http://localhost:7860
```

### OpenCode integration

**1. Start the proxy:**

```bash
uv run --with ".[proxy]" examples/proxy.py
# RLM Proxy initialized | model=gpt-5-mini | docker=False
# Uvicorn running on http://0.0.0.0:8000
```

**2. Config** (`opencode/opencode.json`): set `provider.minrlm.api` to `http://localhost:8000/v1`. See [opencode/opencode.json](opencode/opencode.json).

**3. Run:**

```bash
OPENCODE_CONFIG=opencode.json opencode run "First prime after 1 million"
# > 1000003
```

**[Full tutorial](docs/opencode-minrlm-tutorial.md)**

### Docker sandbox

LLM-generated code runs in isolated Docker containers. No network, read-only filesystem, memory-capped, seccomp-filtered.

```python
from minrlm import RLM

rlm = RLM(model="gpt-5-mini", use_docker=True, docker_memory="256m")
```
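
For readers unfamiliar with these restrictions, the sketch below shows how each one maps onto a standard `docker run` flag. It only builds the command line and does not start a container. The helper name `docker_run_argv`, the image tag, and the `seccomp.json` profile path are assumptions for illustration; minRLM's actual invocation may differ.

```python
import shlex


def docker_run_argv(
    image: str,
    memory: str = "256m",
    seccomp_profile: str = "seccomp.json",  # hypothetical profile path
) -> list[str]:
    """Build a docker run command that reads a Python script from stdin."""
    return [
        "docker", "run", "--rm", "-i",
        "--network", "none",                             # no network
        "--read-only",                                   # read-only root filesystem
        "--memory", memory,                              # memory cap
        "--security-opt", f"seccomp={seccomp_profile}",  # syscall filter
        image, "python", "-",
    ]


argv = docker_run_argv("python:3.12-slim")
print(shlex.join(argv))
```

Passing the generated code on stdin (`python -`) means nothing has to be written into the container's filesystem, which fits the read-only setting.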

### Run the benchmarks yourself

```bash
git clone https://github.com/avilum/minrlm && cd minrlm
uv sync --extra eval

# Smoke test
uv run python eval/quickstart.py

# Full benchmark (reproduces the tables above)
uv run python eval/run.py \
    --tasks all \
    --runners minrlm-reasoning,vanilla,official \
    --runs 50 --parallel 12 --task-parallel 12 \
    --output-dir logs/my_eval
```

Full results: [`eval/README.md`](eval/README.md)

### Examples

```bash
uv run python examples/minimal.py            # vanilla vs RLM side-by-side
uv run python examples/advanced_usage.py     # search, sub_llm, callbacks
uv run python examples/visualizer.py         # Gradio UI
uv run uvicorn examples.proxy:app --port 8000  # OpenAI-compatible proxy
```

---

## Why this matters

[Context window rot](https://arxiv.org/abs/2509.21361) is real - model accuracy degrades as input grows, even when the answer is right there. Bigger windows aren't the fix. Less input, better targeted, is.

The same pattern is showing up everywhere: Anthropic's [web search tool](https://docs.anthropic.com/en/docs/build-with-claude/tool-use/web-search-tool) writes code to filter results, [MCP](https://modelcontextprotocol.io/) standardizes code execution access, [smolagents](https://huggingface.co/docs/smolagents/en/index) goes further. They all converge on the same idea: let the model use code to work with data instead of attending to all of it.

Feels less like "prompting" and more like giving the model a runtime.

---

## Future work

- **More models** - Claude Opus 4.6, Gemini 2.5, open-weight models. Does the scaling trend hold across providers?

- **Agentic pipelines** - using the RLM pattern as a retrieval step inside multi-step agent workflows

- **More tasks** - stress-testing edge cases and domains where the approach might break

Contributions welcome. Open an issue or PR.

---

## Credits

Built by [Avi Lumelsky](https://github.com/avilum). Independent implementation - not a fork.

The RLM concept comes from [Zhang, Kraska, and Khattab (2025)](https://arxiv.org/abs/2512.24601). Official implementation: [github.com/alexzhang13/rlm](https://github.com/alexzhang13/rlm).

### Citation

```
@misc{zhang2026recursivelanguagemodels,
      title={Recursive Language Models},
      author={Alex L. Zhang and Tim Kraska and Omar Khattab},
      year={2026},
      eprint={2512.24601},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2512.24601},
}
```

## License

MIT
