{"id":41244376,"url":"https://github.com/apenab/rlm-runtime","last_synced_at":"2026-01-25T03:00:33.770Z","repository":{"id":333441063,"uuid":"1136816928","full_name":"apenab/rlm-runtime","owner":"apenab","description":"Minimal runtime for Recursive Language Models (RLMs) inspired by the MIT CSAIL paper \"Recursive Language Models\".","archived":false,"fork":false,"pushed_at":"2026-01-21T14:42:11.000Z","size":7062,"stargazers_count":8,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"main","last_synced_at":"2026-01-23T16:53:03.868Z","etag":null,"topics":["ai","anthropic","gpt","llm","openai","rlm"],"latest_commit_sha":null,"homepage":"https://arxiv.org/html/2512.24601v1","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/apenab.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-01-18T12:15:33.000Z","updated_at":"2026-01-22T17:02:01.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/apenab/rlm-runtime","commit_stats":null,"previous_names":["apenab/rlm-runtime"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/apenab/rlm-runtime","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/apenab%2Frlm-runtime","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/apenab%2Frlm-runtime/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/apenab%2Frlm-runtime/relea
ses","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/apenab%2Frlm-runtime/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/apenab","download_url":"https://codeload.github.com/apenab/rlm-runtime/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/apenab%2Frlm-runtime/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28742973,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-25T02:46:29.005Z","status":"ssl_error","status_checked_at":"2026-01-25T02:44:29.968Z","response_time":113,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai","anthropic","gpt","llm","openai","rlm"],"created_at":"2026-01-23T01:24:33.520Z","updated_at":"2026-01-25T03:00:33.745Z","avatar_url":"https://github.com/apenab.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# rlm-runtime\n\nMinimal runtime for **Recursive Language Models (RLMs)** inspired by the [MIT CSAIL paper](docs/rlm-paper-mit.pdf) \"Recursive Language Models\".\n\n## The Problem\n\nStandard LLM approaches fail when context exceeds the model's window size:\n- **Truncation**: Important information gets cut off\n- **RAG**: Requires complex retrieval infrastructure and may miss relevant context\n- **Long-context 
models**: Expensive and still have hard limits\n\n## The RLM Solution\n\nRLMs treat the long context as **environment state** instead of direct input:\n- Context lives in a Python REPL as variable `P`\n- The LLM only sees metadata + REPL outputs (not the full context)\n- The LLM writes code to inspect, search, and chunk the context\n- The LLM can make **recursive subcalls** to sub-LLMs on small snippets\n- Result: Handle arbitrarily large contexts with constant token usage per step\n\n## Quickstart\n\n```bash\n# Install\nuv pip install -e .\n\n# Set your API key\nexport LLM_API_KEY=\"your-api-key-here\"\n\n# Run a simple example\nuv run python examples/minimal.py\n```\n\n**Basic usage:**\n\n```python\nfrom rlm_runtime import RLM, Context\nfrom rlm_runtime.adapters import OpenAICompatAdapter\n\n# Create context from your long documents\ndocuments = [\n    \"Document 1: Very long content...\",\n    \"Document 2: More content...\",\n    # ... could be 100s of documents, millions of tokens\n]\ncontext = Context.from_documents(documents)\n\n# Initialize RLM with OpenAI-compatible adapter\nrlm = RLM(adapter=OpenAICompatAdapter())\n\n# Ask questions over the entire context\nquery = \"What are the main themes across all documents?\"\nanswer, trace = rlm.run(query, context)\nprint(answer)\n```\n\n**Works with:** OpenAI, Anthropic Claude, local Llama/Ollama servers, or any OpenAI-compatible endpoint.\n\n## Demo: RLM vs Baseline Comparison\n\nThe `rlm_vs_baseline.py` example demonstrates the core advantage of RLMs: maintaining accuracy as context grows beyond the LLM's window, while a naive baseline fails due to truncation.\n\n### Running the Demo\n\n```bash\n# Quick demo (5 and 30 documents)\nRLM_CONTEXT_SIZES=5,30 uv run python examples/rlm_vs_baseline.py\n\n# Full benchmark showing crossover point (5, 20, 50, 120 documents)\nRLM_CONTEXT_SIZES=5,20,50,120 uv run python examples/rlm_vs_baseline.py\n\n# Show detailed RLM execution trajectory\nSHOW_TRAJECTORY=1 
RLM_CONTEXT_SIZES=5,30 uv run python examples/rlm_vs_baseline.py\n```\n\n### What the Demo Shows\n\nThis benchmark implements a **needle-in-haystack** task (similar to the MIT paper's S-NIAH):\n- The context contains N documents, with one containing a hidden key term\n- The query asks: \"What is the key term?\"\n- **Baseline approach**: Sends entire context directly to LLM (truncates if too large)\n- **RLM approach**: Context lives in REPL, LLM writes code to search and make subcalls\n\n### The Crossover Point (MIT Paper Figure 1)\n\nThe MIT paper demonstrates that RLMs maintain near-perfect accuracy as context grows, while baseline approaches degrade:\n\n![Figure 1 from MIT Paper](docs/figure1-mit-rlm.png)\n\n*Figure 1: RLM accuracy remains high as distractor documents increase, while baseline accuracy drops due to truncation. This implementation reproduces this behavior.*\n\n### Expected Results\n\nOur benchmark visualizes this **crossover point** where RLM starts outperforming baseline:\n\n```\n━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\nCROSSOVER ANALYSIS\n━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\n\nPlot 1: Success Rate vs Context Size\n────────────────────────────────────\n  5 docs │ B (baseline OK)\n 20 docs │ B (baseline OK)\n 50 docs │ b R (baseline FAIL, RLM OK) ← CROSSOVER POINT\n120 docs │ b R (baseline FAIL, RLM OK)\n\nLegend: B=baseline success, b=baseline fail, R=RLM success, r=RLM fail\n\n\nPlot 2: Token Usage Comparison\n───────────────────────────────\n  5 docs │ baseline: ████░░░░░░ (8.8K)  🏆\n         │      rlm: ████████░░ (17.3K)\n\n 20 docs │ baseline: ████████░░ (18.5K) 🏆\n         │      rlm: ████████░░ (18.0K)\n\n 50 docs │ baseline: FAIL (truncated)\n         │      rlm: █████████░ (20.9K) 🏆\n\n120 docs │ baseline: FAIL (truncated)\n         │      rlm: ██████████ (23.5K) 🏆\n\n━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\nRESULTS 
SUMMARY\n━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\n\nDetailed Comparison:\n┌─────────┬──────────┬────────┬───────┬────────┬────────────┬─────────┐\n│   Docs  │  Tokens  │  Time  │ OK?   │ Answer │   Method   │ Winner  │\n├─────────┼──────────┼────────┼───────┼────────┼────────────┼─────────┤\n│    5    │   8,831  │  1.2s  │  ✓    │  ✓     │  baseline  │ 🏆 base │\n│         │  17,298  │  2.8s  │  ✓    │  ✓     │     rlm    │         │\n├─────────┼──────────┼────────┼───────┼────────┼────────────┼─────────┤\n│   20    │  18,454  │  2.1s  │  ✓    │  ✓     │  baseline  │ 🏆 base │\n│         │  18,039  │  3.1s  │  ✓    │  ✓     │     rlm    │         │\n├─────────┼──────────┼────────┼───────┼────────┼────────────┼─────────┤\n│   50    │  TRUNCATED - Answer lost in truncation                    │\n│         │  20,866  │  3.8s  │  ✓    │  ✓     │     rlm    │ 🏆 rlm  │\n├─────────┼──────────┼────────┼───────┼────────┼────────────┼─────────┤\n│  120    │  TRUNCATED - Answer lost in truncation                    │\n│         │  23,489  │  4.5s  │  ✓    │  ✓     │     rlm    │ 🏆 rlm  │\n└─────────┴──────────┴────────┴───────┴────────┴────────────┴─────────┘\n\nSummary Statistics:\n  • Baseline wins: 2 (at small context sizes)\n  • RLM wins: 2 (at large context sizes where baseline truncates)\n  • Crossover point: ~50 documents (baseline starts truncating)\n\nRLM Efficiency Metrics:\n  • Avg subcalls per task: 0 when Phase 0 succeeds, 1+ when semantic search needed\n  • Phase 0 success rate: ~100% for needle-in-haystack tasks\n  • Token overhead: ~2x at small contexts (vs baseline), but RLM still wins at large contexts\n```\n\n### Key Insights\n\n**When to use RLMs:**\n1. **Small contexts (5-20 docs)**: Baseline is slightly more efficient (fewer tokens, faster)\n   - RLM overhead is minimal (~2x tokens) due to Phase 0 optimization\n   - If speed is critical and context always fits, baseline wins\n2. 
**Large contexts (50+ docs)**: RLM wins decisively when baseline truncates\n   - RLM maintains 100% accuracy while baseline fails completely\n   - Token usage grows slowly with context size: about 17K tokens at 5 docs versus about 23K at 120 docs in the sample run above\n\n**How RLMs achieve this:**\n- **Phase 0 optimization**: Try deterministic extraction first (`extract_after`) - 0 subcalls, instant\n- **Conditional subcalls**: Only uses sub-LLMs when deterministic methods fail\n- **Near-constant overhead**: Token usage grows only slightly as context size increases\n- **Smart chunking**: When subcalls are needed, processes documents in optimal chunks\n\n**The crossover point**: Around 50 documents (~100K+ characters), where the context exceeds the LLM's effective window and baseline accuracy drops to 0%.\n\nThis reproduces the key finding from Figure 1 of the MIT paper: RLMs maintain performance as context grows, while baseline approaches degrade.\n\n## Use Cases: When to Use RLMs\n\n### Tasks from the MIT Paper\n\nThe MIT paper evaluated RLMs on several categories of long-context tasks:\n\n1. **Deep Research \u0026 Multi-hop QA** (BrowseComp-Plus)\n   - Answering complex questions requiring reasoning across 100s-1000s of documents\n   - Finding evidence scattered across multiple sources\n   - Synthesizing information from diverse materials\n\n2. **Code Repository Understanding** (CodeQA)\n   - Analyzing large codebases (900K+ tokens)\n   - Finding specific implementations across multiple files\n   - Understanding architectural decisions\n\n3. **Information Aggregation** (OOLONG)\n   - Processing datasets with semantic transformations\n   - Aggregating statistics across thousands of entries\n   - Computing results that require examining every line\n\n4. 
**Complex Pairwise Reasoning** (OOLONG-Pairs)\n   - Finding relationships between pairs of elements\n   - Quadratic complexity tasks (O(N²) processing)\n   - Tasks requiring examination of all combinations\n\n### Practical Applications for rlm-runtime\n\n**1. Document Analysis at Scale**\n- Legal contract review across hundreds of agreements\n- Academic research: analyzing 50+ papers for literature reviews\n- Technical documentation: processing entire API documentation sets\n- Medical records: analyzing patient histories across multiple visits\n\n**2. Development \u0026 DevOps**\n- Code repository audits and security reviews\n- Log analysis: finding patterns across millions of log lines\n- Configuration management: validating consistency across microservices\n- Documentation generation from large codebases\n\n**3. Business Intelligence**\n- Customer feedback analysis across thousands of reviews/tickets\n- Competitive analysis: processing competitor documentation and materials\n- Market research: synthesizing reports from multiple sources\n- Compliance audits: checking regulations across documents\n\n**4. Content \u0026 Media**\n- Transcript analysis: processing hours of meeting recordings\n- Book/article summarization and cross-referencing\n- Research assistance: finding connections across academic papers\n- Content moderation at scale\n\n**5. 
Integration with Model Context Protocol (MCP)**\n\nrlm-runtime is particularly well-suited as an **MCP server** that provides long-context processing capabilities:\n\n```python\n# Example: RLM as an MCP server (sketch; assumes the official `mcp` Python SDK)\n# Expose RLM as a tool that other applications can call\n\nfrom mcp.server.fastmcp import FastMCP\nfrom rlm_runtime import RLM, Context\nfrom rlm_runtime.adapters import OpenAICompatAdapter\n\n# FastMCP provides the @tool() decorator used below\nserver = FastMCP(\"rlm-processor\")\n\n@server.tool()\nasync def process_long_context(query: str, documents: list[str]) -\u003e str:\n    \"\"\"Process arbitrarily long context using RLM\"\"\"\n    context = Context.from_documents(documents)\n    rlm = RLM(adapter=OpenAICompatAdapter())\n    output, trace = rlm.run(query, context)\n    return output\n```\n\n**MCP Use Cases:**\n- **Claude Desktop/Web**: Add RLM as a tool for processing large file sets\n- **IDE Extensions**: Analyze entire projects beyond editor context limits\n- **Research Tools**: Process multiple papers/books in citation managers\n- **Data Analysis**: Query large datasets through natural language\n\n**6. 
When RLM Wins Over Alternatives**\n\nUse RLM when:\n- ✅ Context size \u003e 100K tokens (beyond most model windows)\n- ✅ Information is scattered across the entire context\n- ✅ Task requires examining most/all of the input\n- ✅ Accuracy is more important than speed\n- ✅ Context doesn't fit the RAG chunking paradigm\n\nDon't use RLM when:\n- ❌ Context always fits in model window (\u003c50K tokens)\n- ❌ Simple keyword search would work\n- ❌ Information is localized (RAG would be faster)\n- ❌ Real-time response required (milliseconds)\n\n### Example: Research Assistant\n\n```python\n# Analyze 50 academic papers to answer a research question\nfrom rlm_runtime import RLM, Context\nfrom rlm_runtime.adapters import OpenAICompatAdapter\n\n# Load papers (could be 1M+ tokens total)\n# read_pdf is a placeholder: supply your own PDF-to-text helper\npapers = [read_pdf(f\"paper_{i}.pdf\") for i in range(50)]\ncontext = Context.from_documents(papers)\n\nrlm = RLM(adapter=OpenAICompatAdapter())\nquery = \"\"\"\nWhat are the main methodologies used for evaluating long-context\nlanguage models across these papers? 
Provide a comparison table.\n\"\"\"\n\nanswer, trace = rlm.run(query, context)\nprint(answer)\n```\n\n## Configuration\n\n### Environment Variables\n\n```bash\n# API Configuration (OpenAI-compatible endpoints)\nexport LLM_API_KEY=\"your-key\"          # or OPENAI_API_KEY\nexport LLM_BASE_URL=\"https://...\"     # optional, for custom endpoints\n\n# For local models (no auth needed)\nexport LLM_BASE_URL=\"http://localhost:11434/v1\"  # Example: Ollama\n```\n\n### Supported Providers\n\n- **OpenAI**: GPT-4, GPT-3.5, etc.\n- **Anthropic**: Claude Sonnet, Opus (via OpenAI-compatible proxy)\n- **Local**: Ollama, LM Studio, vLLM, or any OpenAI-compatible server\n- **Custom**: Implement your own adapter by extending `BaseAdapter`\n\n## Examples\n\n- **[minimal.py](examples/minimal.py)**: Simplest possible RLM example\n- **[rlm_vs_baseline.py](examples/rlm_vs_baseline.py)**: Full benchmark showing crossover point\n- **[complex_reasoning.py](examples/complex_reasoning.py)**: Multi-step reasoning over long documents\n- **[hybrid_audit.py](examples/hybrid_audit.py)**: Trajectory visualization\n- **[smart_router_demo.py](examples/smart_router_demo.py)**: Auto baseline/RLM selection\n- **[ollama_example.py](examples/ollama_example.py)**: Using local Ollama models\n- **[cloud_example.py](examples/cloud_example.py)**: Cloud provider integration\n\n## Development\n\n```bash\n# Linting and formatting\nuv run ruff check .\nuv run ruff format .\n\n# Type checking\nuv run ty check\n\n# Tests\nuv run pytest\n```\n\n## References\n\n- [MIT CSAIL Paper: Recursive Language Models](docs/rlm-paper-mit.pdf)\n- Original paper authors: Zhang et al.\n- This implementation is not affiliated with MIT\n\n## License\n\nMIT License - see LICENSE file for 
details\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fapenab%2Frlm-runtime","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fapenab%2Frlm-runtime","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fapenab%2Frlm-runtime/lists"}