{"id":44193355,"url":"https://github.com/userfrm/rpg-encoder","last_synced_at":"2026-02-24T20:02:42.183Z","repository":{"id":336820759,"uuid":"1150562246","full_name":"userFRM/rpg-encoder","owner":"userFRM","description":"[Independent] - Repository Program Graph Encoder — semantic code understanding via LLM-powered graph construction (arXiv:2602.02084)","archived":false,"fork":false,"pushed_at":"2026-02-09T17:55:47.000Z","size":824,"stargazers_count":2,"open_issues_count":0,"forks_count":1,"subscribers_count":1,"default_branch":"main","last_synced_at":"2026-02-09T20:37:35.884Z","etag":null,"topics":["agentic-tools","code-analysis","code-understanding","embeddings","llm","mcp","program-graph","rust","semantic-search","tree-sitter"],"latest_commit_sha":null,"homepage":"","language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/userFRM.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-02-05T12:30:55.000Z","updated_at":"2026-02-09T16:30:49.000Z","dependencies_parsed_at":"2026-02-06T17:15:32.011Z","dependency_job_id":null,"html_url":"https://github.com/userFRM/rpg-encoder","commit_stats":null,"previous_names":["userfrm/rpg-encoder"],"tags_count":10,"template":false,"template_full_name":null,"purl":"pkg:github/userFRM/rpg-encoder","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/userFRM%2Frpg-encoder","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/userFR
M%2Frpg-encoder/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/userFRM%2Frpg-encoder/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/userFRM%2Frpg-encoder/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/userFRM","download_url":"https://codeload.github.com/userFRM/rpg-encoder/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/userFRM%2Frpg-encoder/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29312968,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-02-10T17:48:59.043Z","status":"ssl_error","status_checked_at":"2026-02-10T17:45:37.240Z","response_time":65,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["agentic-tools","code-analysis","code-understanding","embeddings","llm","mcp","program-graph","rust","semantic-search","tree-sitter"],"created_at":"2026-02-09T18:09:09.694Z","updated_at":"2026-02-24T20:02:42.168Z","avatar_url":"https://github.com/userFRM.png","language":"Rust","funding_links":[],"categories":[],"sub_categories":[],"readme":"# rpg-encoder\n\n[![CI](https://github.com/userFRM/rpg-encoder/workflows/CI/badge.svg)](https://github.com/userFRM/rpg-encoder/actions)\n[![License: 
MIT](https://img.shields.io/badge/license-MIT-blue.svg)](LICENSE)\n[![Rust](https://img.shields.io/badge/rust-1.85%2B-orange.svg)](https://www.rust-lang.org)\n\n\u003e [!NOTE]\n\u003e This is an **independent, community-driven implementation** inspired by the\n\u003e [RPG-Encoder paper](https://arxiv.org/abs/2602.02084) from Microsoft Research. It is **not**\n\u003e affiliated with, endorsed by, or connected to Microsoft in any way. For the official\n\u003e implementation, see [microsoft/RPG-ZeroRepo](https://github.com/microsoft/RPG-ZeroRepo).\n\u003e\n\u003e Microsoft announced *\"We are in the process of preparing a full public release of the codebase,\n\u003e and all code will be released within the next two weeks.\"* — that was too long to wait.\n\u003e This project was built with Claude by reading the publicly available research papers and\n\u003e implementing the described algorithms from scratch in Rust. All code is original work.\n\u003e The papers are cited for attribution.\n\n---\n\n**Coding agent toolkit for semantic code understanding.**\n\nrpg-encoder builds a semantic graph of your codebase. Your coding agent (Claude Code, Cursor,\netc.) 
analyzes the code and adds intent-level features via the MCP interactive protocol.\nSearch by what code *does*, not what it's named.\n\n\u003e [!TIP]\n\u003e **New to RPG?** See [How RPG Compares](docs/comparison.md) to understand where it fits\n\u003e alongside Claude Code, Serena, and other tools.\n\u003e For a detailed algorithm-by-algorithm comparison with the research paper, see\n\u003e [Paper Fidelity](docs/paper_fidelity.md).\n\n## Install\n\nAdd to your MCP config (Claude Code `~/.claude.json`, Cursor settings, etc.):\n\n```json\n{\n  \"mcpServers\": {\n    \"rpg\": {\n      \"command\": \"npx\",\n      \"args\": [\"-y\", \"-p\", \"rpg-encoder\", \"rpg-mcp-server\", \"/path/to/your/project\"]\n    }\n  }\n}\n```\n\n\u003cdetails\u003e\n\u003csummary\u003eAlternative: build from source\u003c/summary\u003e\n\n```bash\ngit clone https://github.com/userFRM/rpg-encoder.git\ncd rpg-encoder \u0026\u0026 cargo build --release\n```\n\nThen use the binary path directly:\n\n```json\n{\n  \"mcpServers\": {\n    \"rpg\": {\n      \"command\": \"/path/to/rpg-encoder/target/release/rpg-mcp-server\",\n      \"args\": [\"/path/to/your/project\"]\n    }\n  }\n}\n```\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003cstrong\u003eMulti-repo setup\u003c/strong\u003e\u003c/summary\u003e\n\nThe MCP server operates on the directory passed as its first argument. 
For multi-repo usage:\n\n**Option 1: Global config (single primary repo)**\n\nSet your main development repo in `~/.claude.json`:\n\n```json\n{\n  \"mcpServers\": {\n    \"rpg\": {\n      \"command\": \"npx\",\n      \"args\": [\"-y\", \"-p\", \"rpg-encoder\", \"rpg-mcp-server\", \"/path/to/primary/repo\"]\n    }\n  }\n}\n```\n\n**Option 2: Per-project override**\n\nCreate `.claude/mcp_servers.json` in each repo that needs RPG:\n\n```json\n{\n  \"rpg\": {\n    \"type\": \"stdio\",\n    \"command\": \"npx\",\n    \"args\": [\"-y\", \"-p\", \"rpg-encoder\", \"rpg-mcp-server\", \"/path/to/this/repo\"],\n    \"env\": {}\n  }\n}\n```\n\nThe project-level config overrides the global one. Restart Claude Code after creating/modifying configs.\n\n\u003c/details\u003e\n\n## Lifecycle\n\n```mermaid\ngraph LR\n    A[Install] --\u003e B[Build]\n    B --\u003e C[Lift]\n    C --\u003e D[Use]\n    D --\u003e E[Update]\n    E --\u003e C\n```\n\nYou install it. Your agent does the rest.\n\n## Getting Started\n\nTell your coding agent:\n\n\u003e \"Build and lift the RPG for this repo\"\n\nThat's it. The agent handles everything. Here's what happens:\n\n1. **Build** — Indexes all code entities and dependencies (~5 seconds)\n2. **Lift** — Agent analyzes each function/class and adds semantic features (~2 min per 100 entities)\n3. **Organize** — Agent discovers functional domains and builds a semantic hierarchy (~30 seconds)\n4. **Save** — Graph is written to `.rpg/graph.json` — commit it so everyone benefits\n\nOnce lifted, try queries like:\n\n- *\"What handles authentication?\"*\n- *\"Show me everything that depends on the database connection\"*\n- *\"Plan a change to add rate limiting to API endpoints\"*\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003cstrong\u003eHow it works under the hood\u003c/strong\u003e\u003c/summary\u003e\n\nThe RPG (Repository Planning Graph) is a hierarchical, dual-view representation from the\nresearch papers cited below:\n\n1. 
**Parse** — Extract entities (functions, classes, methods) and dependency edges (imports,\n   invocations, inheritance) using tree-sitter. Build a file-path hierarchy.\n2. **Lift** — Your coding agent analyzes entity source code and adds verb-object semantic\n   features (e.g., \"validate user credentials\", \"serialize config to disk\") via the MCP\n   interactive protocol (`get_entities_for_lifting` → `submit_lift_results`).\n3. **Hierarchy** — Your agent discovers functional domains and assigns entities to a 3-level\n   semantic hierarchy (`build_semantic_hierarchy` → `submit_hierarchy`).\n4. **Ground** — Anchor hierarchy nodes to directories via LCA algorithm, resolve cross-file\n   dependency edges.\n\nThe graph is saved to `.rpg/graph.json` and **should be committed to your repo** — this way\nall collaborators and AI tools get instant semantic search without rebuilding.\n\n\u003c/details\u003e\n\n## MCP Tools\n\n**Build \u0026 Maintain**\n\n| Tool | Description |\n|------|-------------|\n| `build_rpg` | Index the codebase (run once, instant) |\n| `update_rpg` | Incremental update from git changes |\n| `reload_rpg` | Reload graph from disk after external changes |\n| `rpg_info` | Graph statistics, hierarchy overview, per-area lifting coverage |\n\n**Semantic Lifting**\n\n| Tool | Description |\n|------|-------------|\n| `lifting_status` | Dashboard — coverage, per-area progress, NEXT STEP |\n| `get_entities_for_lifting` | Get entity source code for your agent to analyze |\n| `submit_lift_results` | Submit the agent's semantic features back to the graph |\n| `finalize_lifting` | Aggregate file-level features, rebuild hierarchy metadata |\n| `get_files_for_synthesis` | Get file-level entity features for holistic synthesis |\n| `submit_file_syntheses` | Submit holistic file-level summaries |\n| `build_semantic_hierarchy` | Get domain discovery + hierarchy assignment prompts |\n| `submit_hierarchy` | Apply hierarchy assignments to the graph |\n| 
`get_routing_candidates` | Get entities needing semantic routing (drifted or newly lifted) |\n| `submit_routing_decisions` | Submit routing decisions (hierarchy path or \"keep\") |\n\n**Navigate \u0026 Search**\n\n| Tool | Description |\n|------|-------------|\n| `search_node` | Search entities by intent or keywords (hybrid embedding + lexical scoring) |\n| `fetch_node` | Get entity metadata, source code, dependencies, and hierarchy context |\n| `explore_rpg` | Traverse dependency graph (upstream, downstream, or both) |\n| `context_pack` | Single-call search+fetch+explore with token budget |\n\n**Plan \u0026 Analyze**\n\n| Tool | Description |\n|------|-------------|\n| `impact_radius` | BFS reachability analysis — \"what depends on X?\" |\n| `plan_change` | Change planning — find relevant entities, modification order, blast radius |\n| `find_paths` | K-shortest dependency paths between two entities |\n| `slice_between` | Extract minimal connecting subgraph between entities |\n| `reconstruct_plan` | Dependency-safe reconstruction execution plan |\n\n### Lifting: What It Is\n\nLifting is the process where your coding agent reads each function, class, and method in your\ncodebase and describes what it does in plain English — verb-object features like \"validate user\ncredentials\" or \"serialize config to disk\". These features power semantic search: find code by\nwhat it *does*, not what it's named.\n\n- **No API keys needed** — your connected coding agent (Claude Code, Cursor, etc.) 
*is* the LLM\n- **One-time cost** — lift once, commit `.rpg/`, and every future session starts instantly\n- **Resumable** — if interrupted, `lifting_status` picks up exactly where you left off\n- **Incremental** — after code changes, `update_rpg` detects what moved and only re-lifts those entities\n- **Scoped** — lift the whole repo or just a subdirectory (`\"src/auth/**\"`)\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003cstrong\u003eLifting protocol details (for tool builders)\u003c/strong\u003e\u003c/summary\u003e\n\n1. Ask your agent to \"lift the code\" (or call `get_entities_for_lifting` with a scope)\n2. The tool returns entity source code with analysis instructions\n3. Your agent analyzes the code and calls `submit_lift_results` with semantic features\n4. The agent continues through all batches automatically, dispatching subagents for large repos\n5. After lifting, `finalize_lifting` → `build_semantic_hierarchy` → `submit_hierarchy`\n\n\u003c/details\u003e\n\n## Supported Languages\n\n| Language | Entity Extraction | Dependency Resolution |\n|----------|------------------|----------------------|\n| Python | Functions, classes, methods | imports, calls, inheritance |\n| Rust | Functions, structs, traits, impl methods | use statements, calls, trait impls |\n| TypeScript | Functions, classes, methods, interfaces | imports, calls, inheritance |\n| JavaScript | Functions, classes, methods | imports, calls, inheritance |\n| Go | Functions, structs, methods, interfaces | imports, calls |\n| Java | Classes, methods, interfaces | imports, calls, inheritance |\n| C | Functions, structs | includes, calls |\n| C++ | Functions, classes, methods, structs | includes, calls, inheritance |\n| C# | Classes, methods, interfaces | using, calls, inheritance |\n| PHP | Functions, classes, methods | use, calls, inheritance |\n| Ruby | Classes, methods, modules | require, calls, inheritance |\n| Kotlin | Functions, classes, methods | imports, calls, inheritance |\n| Swift | 
Functions, classes, structs, protocols | imports, calls, inheritance |\n| Scala | Functions, classes, objects, traits | imports, calls, inheritance |\n| Bash | Functions | source, calls |\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003cstrong\u003eCLI\u003c/strong\u003e\u003c/summary\u003e\n\nThe CLI provides structural operations (no semantic lifting — use the MCP server for that).\n\n```bash\n# Install\nnpm install -g rpg-encoder\n\n# Build a graph\nrpg-encoder build\nrpg-encoder build --include \"src/**/*.py\" --exclude \"tests/**\"\n\n# Query\nrpg-encoder search \"parse entities from source code\"\nrpg-encoder fetch \"src/parser.rs:extract_entities\"\nrpg-encoder explore \"src/parser.rs:extract_entities\" --direction both --depth 2\nrpg-encoder info\n\n# Incremental update\nrpg-encoder update\nrpg-encoder update --since abc1234\n\n# Paper-style reconstruction schedule (topological + coherent batches)\nrpg-encoder reconstruct-plan --max-batch-size 8 --format text\nrpg-encoder reconstruct-plan --format json\n\n# Pre-commit hook (auto-updates graph on every commit)\nrpg-encoder hook install\n```\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003cstrong\u003eConfiguration\u003c/strong\u003e\u003c/summary\u003e\n\nCreate `.rpg/config.toml` in your project root (all fields optional):\n\n```toml\n[encoding]\nbatch_size = 50             # Entities per lifting batch\nmax_batch_tokens = 8000     # Token budget per batch\ndrift_threshold = 0.5       # Jaccard distance midpoint reference\ndrift_ignore_threshold = 0.3  # Below: minor edit, in-place update\ndrift_auto_threshold = 0.7    # Above: auto-queue for re-routing\n\n[navigation]\nsearch_result_limit = 10\n```\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003cstrong\u003eArchitecture\u003c/strong\u003e\u003c/summary\u003e\n\n```\nrpg-encoder/\n├── rpg-core        Core graph types (RPGraph, Entity, HierarchyNode), storage, LCA\n├── rpg-parser      Tree-sitter entity + 
dependency extraction (15 languages)\n├── rpg-encoder     Encoding pipeline, semantic lifting utilities, incremental evolution\n│   └── prompts/        Prompt templates (embedded via include_str!)\n├── rpg-nav         Search, fetch, explore, TOON serialization\n├── rpg-cli         CLI binary (rpg-encoder)\n└── rpg-mcp         MCP server binary (rpg-mcp-server)\n```\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003cstrong\u003eHow It Compares\u003c/strong\u003e\u003c/summary\u003e\n\n| Aspect | Paper (Microsoft) | This Repo |\n|--------|-------------------|-----------|\n| Implementation | Python (unreleased) | Rust (available now) |\n| Lifting strategy | Full upfront via API | Progressive — your coding agent lifts via MCP |\n| Semantic routing | LLM-based | LLM-based (via MCP routing protocol) |\n| Feature search | Embedding-based | Hybrid embedding + lexical (BGE-small-en-v1.5) |\n| MCP server | Described, not shipped | Working, with 23 tools |\n| SWE-bench evaluation | 93.7% Acc@5 | Self-eval: MRR 0.59, Acc@10 85% ([benchmark](benchmarks/README.md)) |\n| Languages | Python-focused | 15 languages |\n| TOON format | Not described | Implemented for token efficiency |\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003cstrong\u003eFAQ\u003c/strong\u003e\u003c/summary\u003e\n\n**Do I need an API key or a local LLM?**\n\nNo. Your connected coding agent (Claude Code, Cursor, etc.) *is* the LLM. rpg-encoder sends\nsource code to the agent via MCP tools, the agent analyzes it and sends back semantic features.\nNo API keys, no external services, no local model downloads.\n\n**How long does lifting take?**\n\nRoughly 2 minutes per 100 entities. A small project (50 files, ~200 entities) takes about\n5 minutes. A large project (500+ files) should use parallel subagents — your agent handles\nthis automatically. 
Build and hierarchy steps are near-instant.\n\n**What happens when I delete or rename files?**\n\nRun `update_rpg` (or use the pre-commit hook). It diffs against the last indexed commit,\nremoves deleted entities, re-extracts renamed/modified files, and marks changed entities\nfor re-lifting. The graph stays consistent without a full rebuild.\n\n**Can I lift only part of the codebase?**\n\nYes. Pass a file glob to `get_entities_for_lifting`: `\"src/auth/**\"`, `\"crates/rpg-core/**\"`,\netc. You can also use `.rpgignore` (gitignore syntax) to permanently exclude files like\nvendored dependencies or generated code.\n\n**What if lifting gets interrupted?**\n\nThe graph is saved to disk after every `submit_lift_results` call. Start a new session,\ncall `lifting_status`, and it picks up exactly where you left off — only unlifted entities\nare returned.\n\n**How does semantic search work?**\n\n`search_node` uses hybrid scoring: BGE-small-en-v1.5 embeddings for semantic similarity\nplus lexical matching for exact names and paths. Query with intent (\"handle authentication\")\nor exact identifiers (\"AuthService::validate\") — both work.\n\n**Should I commit `.rpg/` to the repo?**\n\nYes. The `.rpg/graph.json` file contains the full semantic graph. Committing it means\ncollaborators and CI agents get instant semantic search without re-lifting. The graph\nis deterministic (sorted maps, stable serialization), so diffs are meaningful.\n\n**What about monorepos or very large codebases?**\n\nUse scoped lifting to process one area at a time (`\"packages/api/**\"`, `\"services/auth/**\"`).\nYour coding agent will automatically dispatch parallel subagents for large scopes. The\nincremental update system (`update_rpg`) keeps the graph current without full rebuilds.\nFor very large repos, use `.rpgignore` to exclude vendored code, generated files, and\ntest fixtures.\n\n\u003c/details\u003e\n\n## References\n\nThis project is based on the following research papers. 
All credit for the theoretical\nframework, algorithms, and evaluation methodology belongs to the original authors.\n\n- **RPG-Encoder**: Luo, J., Yin, C., Zhang, X., et al. \"Closing the Loop: Universal\n  Repository Representation with RPG-Encoder.\" arXiv:2602.02084, 2026.\n  [[Paper]](https://arxiv.org/abs/2602.02084)\n  [[Project Page]](https://ayanami2003.github.io/RPG-Encoder/)\n  [[Official Code]](https://github.com/microsoft/RPG-ZeroRepo)\n\n- **RPG (ZeroRepo)**: Luo, J., Yin, C., et al. \"RPG: A Repository Planning Graph for Unified\n  and Scalable Codebase Generation.\" arXiv:2509.16198, 2025.\n  [[Paper]](https://arxiv.org/abs/2509.16198)\n\n- **TOON**: Token-Oriented Object Notation — an LLM-optimized data format used for MCP\n  tool output and LLM response parsing.\n  [[Spec]](https://github.com/toon-format/toon)\n\n## License\n\nLicensed under the [MIT License](LICENSE).\n\nThis is an independent implementation. The RPG-Encoder paper and its associated intellectual\nproperty belong to Microsoft Research and the paper's authors. This project implements the\npublicly described algorithms and does not contain any code from Microsoft.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fuserfrm%2Frpg-encoder","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fuserfrm%2Frpg-encoder","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fuserfrm%2Frpg-encoder/lists"}