https://github.com/apenab/rlm-runtime
Minimal runtime for Recursive Language Models (RLMs) inspired by the MIT CSAIL paper "Recursive Language Models".
ai anthropic gpt llm openai rlm
- Host: GitHub
- URL: https://github.com/apenab/rlm-runtime
- Owner: apenab
- License: mit
- Created: 2026-01-18T12:15:33.000Z (5 months ago)
- Default Branch: main
- Last Pushed: 2026-01-21T14:42:11.000Z (5 months ago)
- Last Synced: 2026-01-23T16:53:03.868Z (5 months ago)
- Topics: ai, anthropic, gpt, llm, openai, rlm
- Language: Python
- Homepage: https://arxiv.org/html/2512.24601v1
- Size: 6.73 MB
- Stars: 8
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# rlm-runtime
Minimal runtime for **Recursive Language Models (RLMs)** inspired by the [MIT CSAIL paper](docs/rlm-paper-mit.pdf) "Recursive Language Models".
## The Problem
Standard LLM approaches each have drawbacks when the context exceeds the model's window size:
- **Truncation**: Important information gets cut off
- **RAG**: Requires complex retrieval infrastructure and may miss relevant context
- **Long-context models**: Expensive and still have hard limits
## The RLM Solution
RLMs treat the long context as **environment state** instead of direct input:
- Context lives in a Python REPL as variable `P`
- The LLM only sees metadata + REPL outputs (not the full context)
- The LLM writes code to inspect, search, and chunk the context
- The LLM can make **recursive subcalls** to sub-LLMs on small snippets
- Result: contexts far larger than the model's window can be handled with roughly constant token usage per step
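The loop described above can be illustrated with a toy sketch. It is not how rlm-runtime is implemented: `run_in_repl` and `scripted_llm` are hypothetical names, and the "model" is a fixed script rather than a real LLM. The point is only that the long context stays in the REPL namespace as `P`, while the model would see nothing but the short printed outputs.

```python
import contextlib
import io


def run_in_repl(code: str, namespace: dict) -> str:
    """Execute code in the shared namespace and return whatever it printed."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(code, namespace)
    return buffer.getvalue()


def scripted_llm(step: int) -> str:
    """Hypothetical stand-in for a real model: returns the code to run at each step."""
    steps = [
        "print(len(P))",                              # inspect metadata first
        "i = P.find('SECRET:'); print(P[i:i + 20])",  # then search and slice
    ]
    return steps[step]


# The long context is stored as the variable P; it is never placed in a prompt.
P = "filler text. " * 10_000 + "SECRET: heliotrope. " + "more filler. " * 10_000
namespace = {"P": P}

observations = []
for step in range(2):
    code = scripted_llm(step)
    output = run_in_repl(code, namespace)
    # Only this short output would be appended to the model's prompt.
    observations.append(output.strip())

print(observations)  # ['260020', 'SECRET: heliotrope.']
```

Each observation is a few characters long even though `P` holds about 260K characters, which is the sense in which per-step token usage stays roughly constant.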
## Quickstart
```bash
# Install
uv pip install -e .
# Set your API key
export LLM_API_KEY="your-api-key-here"
# Run a simple example
uv run python examples/minimal.py
```
**Basic usage:**
```python
from rlm_runtime import RLM, Context
from rlm_runtime.adapters import OpenAICompatAdapter
# Create context from your long documents
documents = [
    "Document 1: Very long content...",
    "Document 2: More content...",
    # ... could be 100s of documents, millions of tokens
]
context = Context.from_documents(documents)
# Initialize RLM with OpenAI-compatible adapter
rlm = RLM(adapter=OpenAICompatAdapter())
# Ask questions over the entire context
query = "What are the main themes across all documents?"
answer, trace = rlm.run(query, context)
print(answer)
```
**Works with:** OpenAI, Anthropic Claude, local Llama/Ollama servers, or any OpenAI-compatible endpoint.
## Demo: RLM vs Baseline Comparison
The `rlm_vs_baseline.py` example demonstrates the core advantage of RLMs: maintaining accuracy as context grows beyond the LLM's window, while a naive baseline fails due to truncation.
### Running the Demo
```bash
# Quick demo (5 and 30 documents)
RLM_CONTEXT_SIZES=5,30 uv run python examples/rlm_vs_baseline.py
# Full benchmark showing crossover point (5, 20, 50, 120 documents)
RLM_CONTEXT_SIZES=5,20,50,120 uv run python examples/rlm_vs_baseline.py
# Show detailed RLM execution trajectory
SHOW_TRAJECTORY=1 RLM_CONTEXT_SIZES=5,30 uv run python examples/rlm_vs_baseline.py
```
### What the Demo Shows
This benchmark implements a **needle-in-haystack** task (similar to the MIT paper's S-NIAH):
- The context contains N documents, with one containing a hidden key term
- The query asks: "What is the key term?"
- **Baseline approach**: Sends entire context directly to LLM (truncates if too large)
- **RLM approach**: Context lives in REPL, LLM writes code to search and make subcalls
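To make the task setup concrete, here is a minimal sketch of how such a haystack and a truncating baseline could be built. The helper names (`build_haystack`, `truncate_for_baseline`), the filler text, and the character budget are illustrative assumptions; the actual `rlm_vs_baseline.py` script may construct its data differently.

```python
def build_haystack(n_docs: int, needle: str, needle_index: int) -> list[str]:
    """Create n_docs filler documents and hide the needle in one of them."""
    docs = [f"Document {i}: " + "lorem ipsum " * 200 for i in range(n_docs)]
    docs[needle_index] += f"The key term is: {needle}."
    return docs


def truncate_for_baseline(docs: list[str], max_chars: int) -> str:
    """Mimic a naive baseline: concatenate everything, keep only the first max_chars."""
    return "\n".join(docs)[:max_chars]


# Small context: everything fits, so the baseline prompt still contains the needle.
small = build_haystack(n_docs=5, needle="heliotrope", needle_index=3)
print("heliotrope" in truncate_for_baseline(small, max_chars=20_000))  # True

# Large context: the needle sits past the cut-off and is lost by truncation.
large = build_haystack(n_docs=50, needle="heliotrope", needle_index=40)
print("heliotrope" in truncate_for_baseline(large, max_chars=20_000))  # False

# The needle is still present in the full document list that an RLM could search.
print(any("heliotrope" in doc for doc in large))  # True
```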
### The Crossover Point (MIT Paper Figure 1)
The MIT paper demonstrates that RLMs maintain near-perfect accuracy as context grows, while baseline approaches degrade:

*Figure 1: RLM accuracy remains high as distractor documents increase, while baseline accuracy drops due to truncation. This implementation reproduces this behavior.*
### Expected Results
Our benchmark visualizes this **crossover point** where RLM starts outperforming baseline:
```
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
CROSSOVER ANALYSIS
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Plot 1: Success Rate vs Context Size
────────────────────────────────────
5 docs │ B (baseline OK)
20 docs │ B (baseline OK)
50 docs │ b R (baseline FAIL, RLM OK) ← CROSSOVER POINT
120 docs │ b R (baseline FAIL, RLM OK)
Legend: B=baseline success, b=baseline fail, R=RLM success, r=RLM fail
Plot 2: Token Usage Comparison
───────────────────────────────
5 docs │ baseline: ████░░░░░░ (8.8K) 🏆
│ rlm: ████████░░ (17.3K)
20 docs │ baseline: ████████░░ (18.5K) 🏆
│ rlm: ████████░░ (18.0K)
50 docs │ baseline: FAIL (truncated)
│ rlm: █████████░ (20.9K) 🏆
120 docs │ baseline: FAIL (truncated)
│ rlm: ██████████ (23.5K) 🏆
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
RESULTS SUMMARY
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Detailed Comparison:
┌─────────┬──────────┬────────┬───────┬────────┬────────────┬─────────┐
│ Docs │ Tokens │ Time │ OK? │ Answer │ Method │ Winner │
├─────────┼──────────┼────────┼───────┼────────┼────────────┼─────────┤
│ 5 │ 8,831 │ 1.2s │ ✓ │ ✓ │ baseline │ 🏆 base │
│ │ 17,298 │ 2.8s │ ✓ │ ✓ │ rlm │ │
├─────────┼──────────┼────────┼───────┼────────┼────────────┼─────────┤
│ 20 │ 18,454 │ 2.1s │ ✓ │ ✓ │ baseline │ 🏆 base │
│ │ 18,039 │ 3.1s │ ✓ │ ✓ │ rlm │ │
├─────────┼──────────┼────────┼───────┼────────┼────────────┼─────────┤
│ 50 │ TRUNCATED - Answer lost in truncation │
│ │ 20,866 │ 3.8s │ ✓ │ ✓ │ rlm │ 🏆 rlm │
├─────────┼──────────┼────────┼───────┼────────┼────────────┼─────────┤
│ 120 │ TRUNCATED - Answer lost in truncation │
│ │ 23,489 │ 4.5s │ ✓ │ ✓ │ rlm │ 🏆 rlm │
└─────────┴──────────┴────────┴───────┴────────┴────────────┴─────────┘
Summary Statistics:
• Baseline wins: 2 (at small context sizes)
• RLM wins: 2 (at large context sizes where baseline truncates)
• Crossover point: ~50 documents (baseline starts truncating)
RLM Efficiency Metrics:
• Avg subcalls per task: 0 when Phase 0 succeeds, 1+ when semantic search needed
• Phase 0 success rate: ~100% for needle-in-haystack tasks
• Token overhead: ~2x at small contexts (vs baseline), but RLM still wins at large contexts
```
### Key Insights
**When to use RLMs:**
1. **Small contexts (5-20 docs)**: Baseline is faster, and at the smallest sizes it also uses fewer tokens
   - RLM token overhead ranges from ~2x at 5 docs to roughly parity at 20 docs, thanks to the Phase 0 optimization
   - If speed is critical and context always fits, baseline wins
2. **Large contexts (50+ docs)**: RLM wins decisively when baseline truncates
   - In the sample run above, RLM still finds the answer while the truncated baseline loses it
   - RLM token usage grows slowly (about 21K to 23.5K tokens in the sample run) instead of scaling with context size
**How RLMs achieve this:**
- **Phase 0 optimization**: Try deterministic extraction first (`extract_after`) - 0 subcalls, instant
- **Conditional subcalls**: Only uses sub-LLMs when deterministic methods fail
- **Constant overhead**: Token usage stays roughly constant regardless of context size
- **Smart chunking**: When subcalls are needed, processes documents in optimal chunks
**The crossover point**: Around 50 documents (~100K+ characters), where the context exceeds the LLM's effective window and baseline accuracy drops to 0%.
This reproduces the key finding from Figure 1 of the MIT paper: RLMs maintain performance as context grows, while baseline approaches degrade.
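The "Phase 0 first, subcalls only if needed" idea can be sketched as follows. Only the name `extract_after` comes from this README; its signature, the regular expression, and the `answer` wrapper are assumptions made for illustration and may not match the library.

```python
import re
from typing import Callable, Optional


def extract_after(text: str, marker: str) -> Optional[str]:
    """Return the first word-like token following marker, or None if marker is absent."""
    match = re.search(re.escape(marker) + r"\s*([\w-]+)", text)
    return match.group(1) if match else None


def answer(
    text: str,
    marker: str,
    subcall: Callable[[str], str],
    chunk_size: int = 2000,
) -> tuple[str, int]:
    """Try deterministic extraction first; fall back to sub-LLM calls over chunks.

    Returns the answer and the number of subcalls that were made.
    """
    found = extract_after(text, marker)
    if found is not None:
        return found, 0  # Phase 0 succeeded: zero subcalls
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    replies = [subcall(chunk) for chunk in chunks]  # one subcall per chunk
    hits = [reply for reply in replies if reply]
    return (hits[0] if hits else ""), len(chunks)


text = "noise " * 1000 + "The key term is: heliotrope. " + "noise " * 1000
# The subcall is never invoked here because the marker is found deterministically.
print(answer(text, "The key term is:", subcall=lambda chunk: ""))  # ('heliotrope', 0)
```

When the marker is missing, the same function falls back to one subcall per chunk, which is where the sub-LLM cost in the efficiency metrics would come from.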
## Use Cases: When to Use RLMs
### Tasks from the MIT Paper
The MIT paper evaluated RLMs on several categories of long-context tasks:
1. **Deep Research & Multi-hop QA** (BrowseComp-Plus)
- Answering complex questions requiring reasoning across 100s-1000s of documents
- Finding evidence scattered across multiple sources
- Synthesizing information from diverse materials
2. **Code Repository Understanding** (CodeQA)
- Analyzing large codebases (900K+ tokens)
- Finding specific implementations across multiple files
- Understanding architectural decisions
3. **Information Aggregation** (OOLONG)
- Processing datasets with semantic transformations
- Aggregating statistics across thousands of entries
- Computing results that require examining every line
4. **Complex Pairwise Reasoning** (OOLONG-Pairs)
- Finding relationships between pairs of elements
- Quadratic complexity tasks (O(N²) processing)
- Tasks requiring examination of all combinations
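To see why pairwise tasks are quadratic, the following sketch enumerates every unordered pair and groups the pairs into batches small enough for a single subcall. It is an illustration only: `pair_batches` and the batch size are hypothetical and come from neither the paper nor rlm-runtime.

```python
from itertools import combinations, islice
from typing import Iterator


def pair_batches(items: list[str], batch_size: int) -> Iterator[list[tuple[str, str]]]:
    """Yield lists of at most batch_size pairs, covering every unordered pair once."""
    pairs = combinations(items, 2)
    while batch := list(islice(pairs, batch_size)):
        yield batch


entries = [f"entry-{i}" for i in range(100)]
batches = list(pair_batches(entries, batch_size=500))

print(sum(len(batch) for batch in batches))  # 4950 pairs, i.e. 100 * 99 / 2
print(len(batches))                          # 10 batches of at most 500 pairs
```

Doubling the number of entries roughly quadruples the number of pairs, so the work cannot be done in one pass over the context.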
### Practical Applications for rlm-runtime
**1. Document Analysis at Scale**
- Legal contract review across hundreds of agreements
- Academic research: analyzing 50+ papers for literature reviews
- Technical documentation: processing entire API documentation sets
- Medical records: analyzing patient histories across multiple visits
**2. Development & DevOps**
- Code repository audits and security reviews
- Log analysis: finding patterns across millions of log lines
- Configuration management: validating consistency across microservices
- Documentation generation from large codebases
**3. Business Intelligence**
- Customer feedback analysis across thousands of reviews/tickets
- Competitive analysis: processing competitor documentation and materials
- Market research: synthesizing reports from multiple sources
- Compliance audits: checking regulations across documents
**4. Content & Media**
- Transcript analysis: processing hours of meeting recordings
- Book/article summarization and cross-referencing
- Research assistance: finding connections across academic papers
- Content moderation at scale
**5. Integration with Model Context Protocol (MCP)**
RLM-runtime is particularly well-suited as an **MCP server** that provides long-context processing capabilities:
```python
# Example: RLM as an MCP server
# Expose RLM as a tool that other applications can call.
# Assumes the official MCP Python SDK (`mcp` package) and its FastMCP helper.
from mcp.server.fastmcp import FastMCP

from rlm_runtime import RLM, Context
from rlm_runtime.adapters import OpenAICompatAdapter

server = FastMCP("rlm-processor")


@server.tool()
async def process_long_context(query: str, documents: list[str]) -> str:
    """Process arbitrarily long context using RLM."""
    context = Context.from_documents(documents)
    rlm = RLM(adapter=OpenAICompatAdapter())
    output, trace = rlm.run(query, context)
    return output


if __name__ == "__main__":
    server.run()  # serves over stdio by default
```
**MCP Use Cases:**
- **Claude Desktop/Web**: Add RLM as a tool for processing large file sets
- **IDE Extensions**: Analyze entire projects beyond editor context limits
- **Research Tools**: Process multiple papers/books in citation managers
- **Data Analysis**: Query large datasets through natural language
**6. When RLM Wins Over Alternatives**
Use RLM when:
- ✅ Context size > 100K tokens (beyond the effective window of many models)
- ✅ Information is scattered across the entire context
- ✅ Task requires examining most/all of the input
- ✅ Accuracy is more important than speed
- ✅ Context doesn't fit in RAG chunk paradigm
Don't use RLM when:
- ❌ Context always fits in model window (<50K tokens)
- ❌ Simple keyword search would work
- ❌ Information is localized (RAG would be faster)
- ❌ Real-time response required (milliseconds)
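The checklist above can be turned into a rough routing heuristic. This is a sketch under stated assumptions, not the logic of `smart_router_demo.py`: the function names are hypothetical, the 4-characters-per-token estimate is a common rule of thumb rather than a measurement, and the 50K and 100K thresholds are taken from the bullets above.

```python
def estimate_tokens(documents: list[str], chars_per_token: float = 4.0) -> int:
    """Rough token estimate; assumes about 4 characters per token."""
    return int(sum(len(doc) for doc in documents) / chars_per_token)


def choose_strategy(
    documents: list[str],
    needs_full_scan: bool,
    fits_below: int = 50_000,
    rlm_above: int = 100_000,
) -> str:
    """Pick 'baseline', 'rag', or 'rlm' from context size and task shape."""
    tokens = estimate_tokens(documents)
    if tokens < fits_below:
        return "baseline"  # context fits in the window: send it directly
    if tokens > rlm_above or needs_full_scan:
        return "rlm"       # too large, or the answer is scattered across the input
    return "rag"           # mid-sized and localized: retrieval is likely faster


print(choose_strategy(["a" * 4_000], needs_full_scan=False))        # baseline
print(choose_strategy(["a" * 300_000], needs_full_scan=False))      # rag
print(choose_strategy(["a" * 300_000], needs_full_scan=True))       # rlm
print(choose_strategy(["a" * 400_000] * 2, needs_full_scan=False))  # rlm
```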
### Example: Research Assistant
```python
# Analyze 50 academic papers to answer a research question
from rlm_runtime import RLM, Context
from rlm_runtime.adapters import OpenAICompatAdapter
# Load papers (could be 1M+ tokens total).
# read_pdf is a placeholder for your own PDF-to-text helper; it is not part of rlm_runtime.
papers = [read_pdf(f"paper_{i}.pdf") for i in range(50)]
context = Context.from_documents(papers)
rlm = RLM(adapter=OpenAICompatAdapter())
query = """
What are the main methodologies used for evaluating long-context
language models across these papers? Provide a comparison table.
"""
answer, trace = rlm.run(query, context)
print(answer)
```
## Configuration
### Environment Variables
```bash
# API Configuration (OpenAI-compatible endpoints)
export LLM_API_KEY="your-key" # or OPENAI_API_KEY
export LLM_BASE_URL="https://..." # optional, for custom endpoints
# For local models (no auth needed)
export LLM_BASE_URL="http://localhost:11434/v1" # Example: Ollama
```
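As a sketch of how these variables fit together, the function below resolves them in the order the comments suggest: `LLM_API_KEY` first, then `OPENAI_API_KEY`, with `LLM_BASE_URL` optional and sufficient on its own for local servers. The function name and the error behaviour are assumptions; the README does not show how the adapter actually reads its configuration.

```python
import os
from typing import Mapping, Optional


def resolve_llm_config(env: Mapping[str, str] = os.environ) -> dict[str, Optional[str]]:
    """Resolve API key and base URL from environment-style variables."""
    # Prefer LLM_API_KEY, fall back to OPENAI_API_KEY; empty strings count as unset.
    api_key = env.get("LLM_API_KEY") or env.get("OPENAI_API_KEY") or None
    base_url = env.get("LLM_BASE_URL") or None
    if api_key is None and base_url is None:
        raise RuntimeError("Set LLM_API_KEY (or OPENAI_API_KEY), or LLM_BASE_URL for a local server")
    return {"api_key": api_key, "base_url": base_url}


# Hosted endpoint: only a key is needed.
print(resolve_llm_config({"OPENAI_API_KEY": "sk-example"}))
# Local server such as Ollama: a base URL without a key is accepted.
print(resolve_llm_config({"LLM_BASE_URL": "http://localhost:11434/v1"}))
```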
### Supported Providers
- **OpenAI**: GPT-4, GPT-3.5, etc.
- **Anthropic**: Claude Sonnet, Opus (via OpenAI-compatible proxy)
- **Local**: Ollama, LM Studio, vLLM, or any OpenAI-compatible server
- **Custom**: Implement your own adapter by extending `BaseAdapter`
## Examples
- **[minimal.py](examples/minimal.py)**: Simplest possible RLM example
- **[rlm_vs_baseline.py](examples/rlm_vs_baseline.py)**: Full benchmark showing crossover point
- **[complex_reasoning.py](examples/complex_reasoning.py)**: Multi-step reasoning over long documents
- **[hybrid_audit.py](examples/hybrid_audit.py)**: Trajectory visualization
- **[smart_router_demo.py](examples/smart_router_demo.py)**: Auto baseline/RLM selection
- **[ollama_example.py](examples/ollama_example.py)**: Using local Ollama models
- **[cloud_example.py](examples/cloud_example.py)**: Cloud provider integration
## Development
```bash
# Linting and formatting
uv run ruff check .
uv run ruff format .
# Type checking
uv run ty check
# Tests
uv run pytest
```
## References
- [MIT CSAIL Paper: Recursive Language Models](docs/rlm-paper-mit.pdf)
- Original paper authors: Zhang et al.
- This implementation is not affiliated with MIT
## License
MIT License - see LICENSE file for details