https://github.com/charleschenai/codemap

Static codebase + binary analyzer and decompiler. Decompiles stripped PE/ELF/Mach-O to readable, behaviorally-verified C — structs, arrays, strings, C++ vtables and try/catch. 524 actions, single Rust binary, zero deps.
Host: GitHub
URL: https://github.com/charleschenai/codemap
Owner: charleschenai
License: mit
Created: 2026-04-22T16:53:45.000Z (3 months ago)
Default Branch: main
Last Pushed: 2026-06-24T07:26:36.000Z (about 1 month ago)
Last Synced: 2026-06-24T07:29:06.870Z (about 1 month ago)
Topics: codebase-analysis, codemap, dependency-analysis, graph-theory, static-analysis
Language: Rust
Size: 24.9 MB
Stars: 1
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md
- Security: SECURITY.md
- Roadmap: docs/ROADMAP.md
- Agents: AGENTS.md
README

# codemap

[![CI](https://github.com/charleschenai/codemap/actions/workflows/ci.yml/badge.svg)](https://github.com/charleschenai/codemap/actions/workflows/ci.yml)

[![Release](https://img.shields.io/github/v/release/charleschenai/codemap?sort=semver)](https://github.com/charleschenai/codemap/releases/latest)

[![License: MIT](https://img.shields.io/badge/license-MIT-blue.svg)](LICENSE)

[![install: cargo-binstall](https://img.shields.io/badge/install-cargo--binstall-green.svg)](#install)

> Static codebase + binary analyzer, **decompiler, and patcher**. One binary, 658 actions, 18 source languages, PE/ELF/Mach-O/WASM decompilation to readable **recompilable** C on **x86/x64, ARM64, WebAssembly, and RISC-V RV64** (all four now self-bench-recompile-measured: x86/x64/ARM64 197/200, RV64 75/75=100%, WASM 59/75=78.7% — RV64/WASM recompile-only), sub-second cold-cache on 3K-file repos. **No network, no servers, no databases, no API keys.**

**This README is your system prompt.** Designed for AI agents: drop the entire file into your context (or fetch `https://raw.githubusercontent.com/charleschenai/codemap/main/README.md`) and you have everything you need — what codemap is, when to use it, how to install it, how to call every category of action, output schemas, exit codes, MCP setup. No further docs required for 95% of usage. Humans: see [`docs/HUMAN.md`](docs/HUMAN.md). Everyone else, keep reading.

**Mission:** Break down CODE (source + binary) so AI can replicate it.

## What's new in v8.257 — machine-checked structuring + a complete no-hallucination contract (658 actions)

- **Structuring trace-equivalence ruler** (`cf-verify --strict`, v8.251; goto-laced-acyclic v8.258; loop-aware v8.272). Beyond the side-effect *count* check, codemap now verifies that the recovered C structure preserves every side-effect **path and order**: it compares the `CNode` tree's structural traces against the lifted CFG's ground-truth traces and returns **PROVEN / DIFFERS / INCONCLUSIVE** (sound on fully-structured and goto-laced acyclic functions, and on reducible natural loops via a `{skip, body-once}` abstraction matched identically on both sides; shapes outside the proven-sound subset — irreducible/multi-latch loops, side-effecting headers, etc. — are honestly counted INCONCLUSIVE, never folded into the percent). This catches reorder/path-move bugs the count check is blind to.

- **Proven-faithful readability** (v8.252). Trivial-goto elision (dropping a goto to the immediately-following node + dead loop-latch blocks the recovered `while` already encodes) ships **only because the `--strict` ruler proves it preserved every side-effect path** (DIFFERS stayed 0, PROVEN rose) — the no-hallucination guarantee gating its own output changes.

- **A complete per-symbol type-evidence contract** (`verify-names --audit-types`, v8.253–257). For every emitted pointer — **parameters, return types, local variables, and struct fields** — codemap reports BACKED (a real deref / forward / pointer-source) or **UNBACKED** (a fabricated pointer type), in human text or `--json`. An AI agent consuming the decompiled C can programmatically trust-gate every type. No other decompiler ships a per-symbol evidence contract.

- **The who-hallucinates score** (`verify-names --audit --score`, v8.255). A single deterministic `hallucination_rate` for **any** decompiler's output (feed it via `--text`): how much of that output is fabricated (identifiers whose stems aren't in the binary's bytes) vs evidence-backed. A no-LLM oracle that ranks anyone's decompiler for fabrication.

- **Broader type-ruler validation.** `bench-types-dwarf` now runs on 6 real binaries (lua/sqlite + jq/minigzip/yes/cat), confirming type recovery **generalizes** (~33–57% real-DWARF held-out, consistent, not overfit) with near-zero over-typing false positives.
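
The fabrication score described above can be pictured with a small sketch. This is not codemap's implementation: it assumes a simplified rule (an identifier counts as evidence-backed only if its full text occurs in the binary's bytes, rather than matching stems), and the names `hallucination_rate` and `C_KEYWORDS` are hypothetical.

```python
import re

# Hypothetical, deliberately small keyword list; a real scorer would need the full set.
C_KEYWORDS = {"int", "char", "void", "return", "if", "else", "while", "for", "struct"}

def hallucination_rate(c_text: str, binary: bytes) -> float:
    """Fraction of distinct identifiers in c_text whose text never occurs in binary."""
    identifiers = set(re.findall(r"[A-Za-z_][A-Za-z0-9_]*", c_text)) - C_KEYWORDS
    if not identifiers:
        return 0.0
    fabricated = [name for name in identifiers if name.encode() not in binary]
    return len(fabricated) / len(identifiers)

# Toy inputs: two names are present in the "binary", one (msg) is invented.
toy_binary = b"\x00parse_header\x00puts\x00"
toy_c = "int parse_header(void) { puts(msg); return 0; }"
rate = hallucination_rate(toy_c, toy_binary)
```

Because the score depends only on the text and the bytes, the same input always produces the same rate, which is what makes it usable as a ranking oracle.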

## What's new in v8.236 — the gate now measures *behavior*, not just recompilation (658 actions)

- **Behavioral-equivalence ruler** (`self-bench --functional`, opt-in). Beyond "does the decompiled C recompile?" codemap now checks "does it *behave* the same?" on a **bounded pure-function corpus**: it recompiles each function's decompiled C into a harness, runs it on generated seeds, and compares observed output (stdout/exit/return) against the original binary probed via gdb/lldb. **func_match 24/25**, and it is honestly *discriminating* — a deliberately-wrong decompilation injected into the corpus correctly scores a MISS, so the metric can't be gamed by garbage that merely compiles. Bounded sampling, not full input-space / formal equivalence.

- **Real-DWARF type-recovery ruler, gate-runnable** (`bench-types-dwarf`). Scores recovered types against build-id-matched **separate DWARF** debug info, decompiling only the stripped `.text` so the oracle never leaks into inference. Honest real type recovery: **~42-48%** across an sqlite/lua O0/O2 corpus — the mission number that gates all type work. The type rulers were also made host-portable so they run on Linux CI gates, not only macOS.
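
The idea behind the behavioral-equivalence ruler (run both versions on generated seeds and compare what they return) can be sketched as follows. In codemap the original is a binary probed through gdb/lldb and the candidate is recompiled C; here plain Python callables stand in for both, and `func_match` and `score` are hypothetical names.

```python
import random
from typing import Callable, Iterable

def func_match(original: Callable[[int], int],
               candidate: Callable[[int], int],
               seeds: Iterable[int]) -> bool:
    """True only if both functions agree on every sampled seed."""
    return all(original(s) == candidate(s) for s in seeds)

def score(corpus, n_seeds: int = 64, rng_seed: int = 0):
    """Return (matches, total) over a corpus of (original, candidate) pairs."""
    rng = random.Random(rng_seed)  # fixed seed keeps the run deterministic
    seeds = [rng.randint(-1000, 1000) for _ in range(n_seeds)]
    hits = sum(func_match(orig, cand, seeds) for orig, cand in corpus)
    return hits, len(corpus)

corpus = [
    (abs, lambda x: x if x >= 0 else -x),      # faithful reconstruction
    (lambda x: x * 2, lambda x: x + x),        # faithful, different shape
    (lambda x: x * 2, lambda x: x * 2 + 1),    # deliberately wrong: must MISS
]
result = score(corpus)
```

The deliberately wrong entry plays the role of the injected bad decompilation: a metric that still reported a full score here would not be discriminating. As in the text, this is bounded sampling, not a proof over the full input space.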

## What's new in v8.230 — decompiler campaign + machine-checked honesty (658 actions)

Current focus: **best-in-class deterministic decompilation**, with the no-hallucination guarantee made *machine-checked* rather than policy.

- **Multi-arch decompile to recompilable C** — all 5 input ISAs: x86/x64, ARM64, WebAssembly, RISC-V RV64 — **all now self-bench-recompile-measured** (x86/x64/ARM64 197/200; RV64 75/75=100%; WASM 59/75=78.7% — RV64/WASM recompile-only via `self-bench --arch=riscv64`/`--arch=wasm32`). `self-bench real` recompile **98.5%** (197/200), guarded by a held-out **canary** corpus (anti-overfit tripwire).

- **Real-DWARF type recovery, measured both ways** — scored against external compiler debug-info ground truth on stripped `-O2` binaries: sqlite_O2 **42.4%**, lua_O2 **43.7%** (libc-prototype seeding + tail-call-thunk parameter recovery + a cross-function gate that eliminated **253 spurious `__int128` returns** — every gain held-out-validated, none overfit).

- **The decompiler verifies ITSELF:** `cf-verify` (CF-GKAT per-function control-flow-equivalence verdict on the structuring pass), `block-equiv` (z3 SMT block-level equivalence — proves the `simplify` pass sound, 0 counterexamples), `formal-equiv` (bounded symbolic equivalence).

  > **z3 caveat:** `block-equiv` and `symex-solve` use the **in-process z3 crate** and are gated behind the optional **`z3-solver`** Cargo feature (needs `libz3` at build time) — they are **NOT present in the default prebuilt / `cargo binstall` binary** (build `cargo build --release -p codemap-cli --features z3-solver` to enable them; the advertised count is the **default-build** total and excludes them — they are additional when the feature is on). Separately, `cegio` **does** ship in the default binary but drives the **system `z3` binary** at runtime via stdio, falling back to a template emitter if no `z3` is installed. `cf-verify` and `formal-equiv` need neither.

- **`lib-id`** identifies statically-linked library functions by normalized content-hash fingerprint (0 false labels); **Go `.gopclntab`** recovery; external function-discovery **F1 83.3%** on the x86-SOK corpus (stripped; gcc/clang/icc/msvc).

### Earlier — v8.52 semantic rewrite pipeline closed

`codemap rewrite --function NAME --edit-c FILE --apply` — decompile → recompile edited C → surgical patch (in-place NOP-pad, or code-cave + JMP trampoline when it grows) → **replay original vs patched for bounded behavioral equivalence**. Built on a recompilable-C foundation (decompiled C now compiles cleanly: every goto target labeled, code-addresses-as-values emitted as numeric literals). Plus a self-directed discovery engine (`cross-pollinate`) that mines codemap's own capability graph for novel primitive-fusion R&D directions, ranked by leverage × coherence.

### Earlier — the 40-topic grind + v11 stack complete (658 actions)

Two full 20-topic roadmaps (THIRD + FOURTH) landed. THIRD-20 (all real, gate-verified): transplant, translate, fingerprint, hot-patch, api-shim, size-opt, multi-refactor, fuzz-harness, instrument, visual-docs, vuln-discover, protocol-rec, vectorize, ml-patch, jit-resolve, self-rewrite, gpu-lift, kernel-rewrite, mobile-fuse, os-map. FOURTH-20 mediums (real): self-bench, eval-suite, lasm, worm-defense, pear-fuzz, pqc-translate, ref-decompile.

**FOURTH-20 deep tier — converted from honest skeletons to REAL in v8.22–v8.30.** These were *labeled* honest skeletons at v8.21 (we don't fake); every one was then made to actually compute, each gated by a discriminator (an input that would expose a stub). They now produce real results, **with bounded engines — precisely scoped, not the heavy external backends:**

| Action | What's real now | Honest bound (what it is NOT) |
|---|---|---|
| `sys-sim` | native x86-64 interpreter — **executes** real instructions | bounded subset, not a full-system/CPU emulator |
| `superset-decompile` | real every-offset superset decode + provenance + interval selection | — |
| `zk-attest` | in-tree SHA-256 + Merkle + Fiat-Shamir + verify (**tamper → FAIL**) | a commitment/attestation scheme, **not a general zk-SNARK prover** |
| `gpu-rewrite` | real PTX parse / transform / re-emit | PTX-level, not a full GPU recompiler for all ISAs |
| `prove-rewrite` | bounded **symbolic-execution** translation-validation → EQUIVALENT / NOT-EQ / INCONCLUSIVE | sound only to the symex bound, **not a Coq/Lean machine-checked proof** |
| `proof-patch` | discharges obligations via symex + taint → DISCHARGED / VIOLATED / OPEN with engine evidence | bounded discharge, not a full theorem prover |
| `meta-evolve` | persisted win-rate tuning computed from real run data | — |
| `self-improve-demo` | genuine measured delta from real bench runs (`--dry`, human-gated) | — |
| `llm-decompile` | pluggable LLM backend (**off by default = deterministic/offline**) + recompile-consistency gate (only accepts output that recompiles/verifies) | does not ship/claim an integrated LLM; you bring the backend |

Still **never a faked verified/proven claim** — the bounds above are stated, not hidden. Run any of them yourself (`codemap  `) and check the output against this table.
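
To make the "tamper → FAIL" property of a SHA-256 Merkle commitment concrete, here is a minimal sketch. It is not `zk-attest`'s actual scheme: the leaf/node domain-separation prefixes, the odd-level duplication rule, and the names `merkle_root` and `verify` are assumptions, and the Fiat-Shamir step is omitted.

```python
import hashlib

def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(leaves: list[bytes]) -> bytes:
    """Fold leaves pairwise into a single SHA-256 root."""
    if not leaves:
        raise ValueError("need at least one leaf")
    level = [_h(b"\x00" + leaf) for leaf in leaves]      # prefix 0x00 marks a leaf
    while len(level) > 1:
        if len(level) % 2:                               # odd count: repeat last node
            level.append(level[-1])
        level = [_h(b"\x01" + level[i] + level[i + 1])   # prefix 0x01 marks an inner node
                 for i in range(0, len(level), 2)]
    return level[0]

def verify(leaves: list[bytes], expected_root: bytes) -> str:
    return "PASS" if merkle_root(leaves) == expected_root else "FAIL"

chunks = [b"fn_a", b"fn_b", b"fn_c"]
root = merkle_root(chunks)
untouched = verify(chunks, root)
tampered = verify([b"fn_a", b"fn_X", b"fn_c"], root)
```

Changing any leaf changes the root, so verification against the committed root fails. That is a commitment property only, which matches the table's stated bound.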

## What's new in v8.13 — the autonomous + verifiable engine

codemap is now an **autonomous, self-improving, verifiable** security engine. The decompiler covers **all 5 input ISAs (x86/x64, ARM64, WebAssembly, RISC-V RV64) to recompilable C**, and the action arsenal composes into goal-driven, no-human loops.

**Multi-arch decompiler — all 5 input ISAs produce recompilable C.** `decompile`/`ir` emit readable, **recompilable** C for **x86, x64, ARM64 (incl NEON/SIMD), WebAssembly, and RISC-V RV64GC** (incl compressed/M/A) — all through the same lift → SSA → type/var recovery → SAILR structuring → C pipeline. **x86/x64 and ARM64 are `self-bench`-recompile-gated** (197/200, held-out canary). **WebAssembly** recovers its module type-section signature (v8.245 → `int func0(int32_t p0, int32_t p1){ … }`) and **RISC-V RV64** recovers LP64 args/returns + incoming params (v8.246, use-inferred widths v8.247 → `int add(int32_t a0, int32_t a1){ … }`); both now have **dedicated self-bench corpora** (`self-bench --arch=riscv64` **75/75 = 100%**; `self-bench --arch=wasm32` **59/75 = 78.7%** — both recompile-only, no functional axis yet; RV64 param widths come from observed use, not DWARF). No overclaim: all four are now self-bench-recompile-measured; x86/x64/ARM64 are also functional/canary-gated, RV64/WASM are recompile-only so far.

**The Autonomous lane (new actions):**

- **`run`** — *agentic mode*: `codemap run goal= ` runs a deterministic, **offline, no-LLM** PLAN→ACT→OBSERVE→VERIFY→REPORT loop that composes existing actions into a DAG, threads one graph, is budget/step-capped, emits JSON, and only marks a finding *fixed* if the patch recompiles + re-validates.

- **`learn`** — *self-improving*: records what-worked from each run into a project-brain store; `run`'s planner consults it to tune the DAG over time. The loop is closed — planning improves with usage, no code changes.

- **`redteam`** — autonomous offensive campaign (taint → symbolic → ranked PoC bundle + report).

- **`infer-spec ... export=acsl|lean`** — machine-checkable proof export (Frama-C ACSL + Lean/Coq), so patches are *provable*, not just plausible.

- **`provenance`** — signed, tamper-evident manifests for patched/twinned/hardened artifacts.

- **`pqc-migrate`** — detect quantum-vulnerable crypto → apply NIST PQC (ML-KEM/ML-DSA/SLH-DSA) → equivalence note.

- **`deobfuscate`** — the inverse of `harden`: de-flatten CFG, crack opaque predicates, decrypt strings via symbolic + graph.

Plus, across the roadmap: `binary-twin` (cleanroom fork), `xlang-graph` (cross-language call fusion), `to-rust` (C→idiomatic Rust), `replay` (record/replay + mutation), `what-if` (change-impact), `firmware`, `sbom-flow`, `crypto-audit`, `model-extract`, `game-assets`, `brain-lock`. **658 actions.**

## What's new in v8.4 — multi-arch decompiler + the three strategic arms

v8.4 pushes the v8.3 decompiler in two directions: **multi-architecture** (it now produces recompilable C for **ARM64/AArch64**, not just x86/x64) and the first increments of the three strategic arms that turn codemap from *read-only intelligence* into a full **understand → reason → change** platform.

**v8.4.0 new actions (Phase-1+ across roadmap topics):** `project-brain` (persistent project memory + git-history what-changed), `infer-spec` (formal pre/post/invariant inference → ACSL + Rust contracts, Daikon-style templates), `c-diff` (graph-aware decompiled-C diff with call-graph change propagation), `ci` (binary CI/CD attack-surface gate), `vuln-backport` (CVE patch → older-binary backport locator). ARM64 decompilation hardened: recursion, switch recovery, emit cleanup, recursive-call returns. **658 actions.**

- **ARM64 / AArch64 decompiler.** ARM64 Mach-O now disassembles (Capstone-backed `Arm64Lifter`; function sizing from `LC_FUNCTION_STARTS`) and lifts through the same IR pipeline as x86 — `codemap ir  ` emits readable C with recovered args (AAPCS64 `x0`–`x7`), real calls (recursion is a `call`, not an `asm` comment), and frame/`sp` modeling. **`--verify` PASS on ARM64**, not just x86: both arches decompile → recompile cleanly.

- **`ir --verify` — recompile gate, first-class.** `codemap ir   --verify` writes the emitted C to a temp file and runs a host C compiler on it, reporting **PASS / FAIL** — ground truth that the decompilation is *recompilable*, not just plausible. The backbone of codemap's verify-by-running discipline.

- **Arm 1 — Binary patching.** `bin-patch-fn`: surgical, layout-preserving **in-place function patching** (canned stubs `ret0`/`ret1`/`ret`/`nop` or raw hex), fits-gated, verified by re-disassembly. Neutralize a check (`bin-patch-fn ./app check_license ret1`) without touching any other offset / reloc / string. (The decompile → edit-C → recompile → relink loop is the next increment.)

- **Arm 2 — Symbolic / concolic.** `concolic`: an interval constraint solver over the SSA-IR branch guards (no SMT dependency) — per path it reports **SAT** (with a concrete register seed that drives execution down it), **DEAD** (contradictory guards → opaque-predicate / dead-code signal), or **PARTIAL**. Concrete concolic seeds in the default build.

- **Arm 3 — Dynamic bridge.** `trace-plan`: uses the code-property graph to choose a *selective* instrumentation scope (entry, call sites, dangerous sinks, loop heads — not every instruction) and emits a ready-to-run, ABI-aware GDB script. Drive it with `concolic` seeds; ingest the trace with `runtime-merge`.

- **Graph fusion — cross-binary name recovery.** `name-recovery` recovers a stripped binary's anonymous `sub_` names by matching them (40-dim structural fingerprint, cosine, greedy 1:1) to *named* functions in a reference binary, fusing the recovered names into the graph. Exact on same-build; honest-partial across optimization levels.

- **Decompiler correctness sweep.** Fixed multi-block **argument recovery** (args flowing across a loop/branch were emitted `void` with use-before-def; now seeded at the SSA entry from the calling convention), **struct-field deref** (`p->x`), and **2D-array index** (`m[i*cols+j]`) — all now recompile.

- **Built for the AI-agent customer.** `agent-brief` (one-page high-signal map of a codebase), `search` (relevance-ranked discovery across 658 actions), `graph-export` (Graphviz / Mermaid / **Cytoscape JSON** / interactive HTML). Plus human onboarding: `cargo binstall`, a Homebrew formula, and a [`docs/HUMAN.md`](docs/HUMAN.md) quickstart.
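
The interval-solver idea behind `concolic` (Arm 2 above) can be illustrated with a small sketch. It assumes guards of the form `register op constant` over 32-bit signed values and treats any other operator as PARTIAL; both choices, and the name `solve_path`, are assumptions rather than codemap's actual rules.

```python
INT_MIN, INT_MAX = -(2 ** 31), 2 ** 31 - 1

def solve_path(guards):
    """guards: list of (register, op, constant); returns (verdict, seed)."""
    bounds = {}  # register -> [low, high], both inclusive
    for reg, op, k in guards:
        b = bounds.setdefault(reg, [INT_MIN, INT_MAX])
        if op == "<":
            b[1] = min(b[1], k - 1)
        elif op == "<=":
            b[1] = min(b[1], k)
        elif op == ">":
            b[0] = max(b[0], k + 1)
        elif op == ">=":
            b[0] = max(b[0], k)
        elif op == "==":
            b[0], b[1] = max(b[0], k), min(b[1], k)
        else:
            return "PARTIAL", {}      # guard not expressible as an interval
        if b[0] > b[1]:
            return "DEAD", {}         # contradictory guards: path is infeasible
    # Any value inside each interval works; pick the lower bound as the seed.
    return "SAT", {reg: b[0] for reg, b in bounds.items()}

feasible = solve_path([("rdi", ">", 10), ("rdi", "<=", 20)])
infeasible = solve_path([("rdi", ">", 10), ("rdi", "<", 5)])
```

Intersecting intervals needs no SMT solver, which is consistent with the "no SMT dependency" claim; the cost is that relational guards between two registers fall outside what this domain can decide.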

## What's new in v8.3 — the graph-fused decompiler

v8.3 (through `8.3.5`) turns codemap's binary side into a real **decompiler**: lift → SSA → DCE/copy-prop → type & variable recovery → SAILR structuring → readable, recompilable C. It went from "finds 1 function in a stripped PE" to:

- **Full binary coverage.** PE (x86/x64), ELF (x86/x64/ARM/AArch64), and **Mach-O x86-64** — function discovery via PE `.pdata` `RUNTIME_FUNCTION`, ELF symbols/`.eh_frame`, and Mach-O `LC_FUNCTION_STARTS`.

- **Readable C reconstruction.** Recovered **structs** (`p->field` with synthesized typedefs), **arrays** (`a[i]`), **string literals** (`return "hello world"`), **float/XMM ABI** params & returns (SysV + Win64), **C++ virtual calls** (`obj->vfunc_0()`), and clean control flow on `-O2` (no goto-soup).

- **C++ exception recovery.** Idiomatic `try { … } catch (int e) { … }` reconstructed from a stripped binary's `.eh_frame` + `.gcc_except_table` — **including the caught type**, demangled from the LSDA type table. Most decompilers drop the handler entirely or render it as goto-soup.

- **Correctness, not just readability.** Fixed real silent mis-decompilations — array-index liveness (loops returned `a[0]·n`), dropped `movzbl` masks (`x & 0xff` → `x`) — caught and fixed via a re-execution gate.

- **Behaviorally verified.** Every change is gated on a **79-binary recompilability corpus** + a **G10 re-execution harness** (decompile → recompile → run → diff): recovered code is behavior-identical on the scalar subset, not just plausible-looking.

- **Graph-fused.** Decompiled functions feed codemap's heterogeneous code-property graph, so its dataflow / taint / call-graph / centrality analyses run on **stripped binaries**, not just source.

## What's new in v8

v8 cuts the v7 series at `7.184.0` (2026-05-18) and turns over to `8.0.0` (2026-05-20). Headline themes:

- **Action registry complete (T1).** Every action self-registers via `inventory::submit!`; `actions/mod.rs` has zero dispatch arms (catch-all `_ => Err(UnknownAction)` only). Adding a new action is a single submit-block edit in the owning module file.

- **iced-x86 linear-sweep precision (T3).** All `bin_text_*` density actions disassemble via iced-x86 instead of raw byte-scans — eliminates instruction-boundary false positives.

- **Lint zero (T8).** `#![deny(warnings)]` locked into `codemap-core` and `codemap-cli`; `cargo clippy -- -D warnings` ships at 0 warnings.

- **arXiv research: filter scaffolds, ship real work (T9).** `pointer-analysis` (Andersen field-sensitive PA + Tarjan SCC) and `cegio` (rsmt2-driven SMT) shipped with real implementations. `bin-taint` shipped Phase A (CFG, intra/inter-procedural taint, PLT-resolved source/sink, pathfinding, stripped-binary fallback). **16 items removed in v8.2.0 cleanup:** 13 skeleton scaffolds (`symex-concolic`, `loop-polyhedral`, `detect-memory-corruption`, `neural-decompile`, `side-channel-detect`, `symex-speculative`, `gpu-analyze`, `semantic-slice`, `synthesize`, `abstract-interp`, `bin-search`, `patch-binary`, `natural-query`) + 3 failed experiments (`meta-path-ppr` proof +0.0000 lift, `rfmoe` 3/8 FAIL, `ising-landscape` proof pending) — all 59–145 LOC with no proof reports or integration tests.

- **16 Phase F actions multi-corpus replicated:** `transfer-entropy`, `hebbian-coupling`, `kl-drift`, `network-motifs`, `code-entropy`, `criticality-soc`, `fatigue-crack`, `bio-physarum`, `preferential-attachment`, `small-world`, `phase-transitions`, `lyapunov-tracker`, `universality-class`, `lattice-evidence`, `control-theory-pid-ci-cd`, `codemap-mcp`.

**658 actions** registered (full index in `docs/ACTION_CATALOG.md`; generated from the registry by `gen-action-docs` and gated by `tests/single_source_of_truth.rs`). **236** `bin-*` parsers, **18** source-language tree-sitter parsers, **1780+** lib tests, decompiler recompile **98.5%** (`self-bench real`, 197/200), **0** clippy warnings.

---

## When to reach for codemap

| Problem | Codemap action | Why codemap (vs alternatives) |
|---|---|---|
| "What does this codebase do?" | `summary --dir ` | Cross-file structural overview in one call. Beats reading files. |
| "Find unused functions / dead code" | `dead-functions --dir ` | Call-graph reachability across modules. grep can't do this. |
| "Who calls function X?" | `callers --dir  X` | True call graph (AST-aware), not a string match. |
| "What does function X depend on (transitively)?" | `trace --dir  X` | Walks the dep graph. grep would only find direct refs. |
| "What changed between two commits?" | `diff --dir   ` | Semantic diff, not line diff. |
| "Find security issues" | `audit --dir ` | Composite of taint + secret-scan + dep-tree + dead-deps. |
| "Where would a tainted input flow?" | `taint --dir  --source  --sink ` | Path-sensitive, sanitizer-aware, alias-aware, cross-procedural. |
| "Reverse-engineer a binary" | `bin-info ` | PE/ELF/Mach-O parser. capa + YARA + signsrch + PEiD rules built in. |
| "Find cross-language coupling" | `cross-lang --dir ` | Imports/calls that cross language boundaries. |
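
Why call-graph reachability finds dead code that a text search misses can be shown with a short sketch. The graph, the entry point, and the function name `dead_functions` below are made up for illustration; codemap builds its graph from parsed source, which this sketch does not attempt.

```python
from collections import deque

def dead_functions(call_graph: dict, entry_points: list) -> list:
    """Functions in call_graph that no entry point can reach (breadth-first search)."""
    reachable = set(entry_points)
    queue = deque(entry_points)
    while queue:
        caller = queue.popleft()
        for callee in call_graph.get(caller, []):
            if callee not in reachable:
                reachable.add(callee)
                queue.append(callee)
    return sorted(set(call_graph) - reachable)

# Hypothetical call graph: caller -> list of callees.
graph = {
    "main": ["parse", "run"],
    "parse": [],
    "run": ["log"],
    "log": [],
    "old_helper": ["log"],
    "unused": ["old_helper"],
}
dead = dead_functions(graph, ["main"])
```

Here `old_helper` has a caller, so a search for call sites would consider it live, yet nothing reachable from `main` ever invokes it.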

## When NOT to reach for codemap

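- **Editing source files**: codemap's source-analysis actions are read-only; only the binary-patching actions (`bin-patch-fn`, `rewrite --apply`) modify anything. Use Edit/Write directly.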

- **Running code**: codemap doesn't compile or exec. Use bash.

- **Live process state**: codemap is static. Use `ps`, `lsof`, `ss`.

- **Single-file grep**: if you know the file, `grep` is faster.

- **String search across few files**: if N<5 files, just `grep`.

---

## Install

### From release (recommended)

Download the tarball for your platform and extract the binary:

```bash
# Create the install directory first; tar -C fails if it does not exist.
# Then run only the block that matches your platform.
mkdir -p ~/.local/bin

# Linux x86_64
curl -fsSL https://github.com/charleschenai/codemap/releases/latest/download/codemap-Linux-x86_64.tar.gz -o codemap.tar.gz
tar xzf codemap.tar.gz -C ~/.local/bin/
chmod +x ~/.local/bin/codemap

# Linux aarch64 (ARM64)
curl -fsSL https://github.com/charleschenai/codemap/releases/latest/download/codemap-Linux-aarch64.tar.gz -o codemap.tar.gz
tar xzf codemap.tar.gz -C ~/.local/bin/
chmod +x ~/.local/bin/codemap

# macOS (Apple Silicon / arm64)
curl -fsSL https://github.com/charleschenai/codemap/releases/latest/download/codemap-Darwin-arm64.tar.gz -o codemap.tar.gz
tar xzf codemap.tar.gz -C ~/.local/bin/
chmod +x ~/.local/bin/codemap

# Make the binary visible in the current shell
export PATH="$HOME/.local/bin:$PATH"
```

> **Platforms — two independent axes.** codemap **decompiles 5 input ISAs** to recompilable C (x86, x86-64, ARM64, WebAssembly, RISC-V RV64 — all self-bench-recompile-measured: x86/x64/ARM64 197/200, RV64 100%, WASM 78.7%; RV64/WASM recompile-only) regardless of the host it runs on. Separately, it **ships prebuilt host binaries for 3 targets**: Linux `x86_64`, Linux `aarch64`, and macOS Apple-Silicon (`Darwin-arm64`). **Windows and Intel macOS have no prebuilt binary** — Intel Mac (`x86_64-apple-darwin`) is intentionally omitted and there is no Windows target; on those hosts build from source (`cargo build --release -p codemap-cli`), which decompiles all 5 ISAs identically. Each release asset is published with a matching `.sha256`.
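
Since each release asset is published with a matching `.sha256`, a download can be checked before it is unpacked. The sketch below assumes the checksum file uses the common `sha256sum` layout (hex digest as the first whitespace-separated token); that layout and the helper names are assumptions, not something the release notes above confirm.

```python
import hashlib
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Stream the file in 1 MiB chunks so large assets need little memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def matches_checksum(asset: Path, checksum_file: Path) -> bool:
    """Compare the asset's digest with the first token of the checksum file."""
    expected = checksum_file.read_text().split()[0].lower()
    return sha256_of(asset) == expected
```

A typical use would be `matches_checksum(Path("codemap.tar.gz"), Path("codemap.tar.gz.sha256"))`, aborting the install when it returns `False`; the second file name is illustrative.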

Add `$HOME/.local/bin` to your `PATH` in `~/.bashrc` or `~/.zshrc`:

```bash

export PATH="$HOME/.local/bin:$PATH"

```

For system-wide install (`/usr/local/bin/codemap`):

```bash

sudo cp codemap /usr/local/bin/

sudo chmod +x /usr/local/bin/codemap

```

### From source

```bash

git clone https://github.com/charleschenai/codemap && cd codemap

cargo build --release -p codemap-cli

cp target/release/codemap ~/.local/bin/codemap

chmod +x ~/.local/bin/codemap

```

## Verify

```

codemap --version-detail

```

Prints:

```

codemap 8.2.0

git: 

built: 

host: /

```

If the binary is older than expected, re-run install with `--update`.

---

## How to call any action

Universal shape:

```

codemap  [TARGET...] --dir  [--json] [--quiet] [other-flags]

```

| Flag | Purpose |
|---|---|
| `--dir ` | **Required.** Repo/dir to scan. Repeatable for multi-repo. |
| `--json` | Output JSON (parseable). Default is text (human-readable). |
| `--quiet` | Suppress scan/cache status messages on stderr. |
| `--no-cache` | Force re-scan, ignore `.codemap/cache.bincode`. |
| `--include-path ` | C/C++ include search path. |
| `--watch [SECS]` | Re-run every N seconds. |

For agents: **always use `--json` and `--quiet`** unless you specifically want text output.
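
For an agent that shells out to codemap, the universal shape above can be assembled as an argument list rather than a string, which avoids quoting problems. This is a sketch: `build_argv` is a hypothetical helper, and it only uses the flags listed in the table (`--dir`, `--json`, `--quiet`, `--no-cache`).

```python
def build_argv(action, targets=(), dirs=(".",), *, no_cache=False, binary="codemap"):
    """Assemble a codemap command line as a list suitable for subprocess.run."""
    argv = [binary, action, *targets]
    for directory in dirs:            # --dir is repeatable for multi-repo scans
        argv += ["--dir", directory]
    argv += ["--json", "--quiet"]     # machine-readable output, no status noise
    if no_cache:
        argv.append("--no-cache")
    return argv

argv = build_argv("callers", ["parse_header"], ["./project"])
```

The list can be passed directly to `subprocess.run(argv, capture_output=True, text=True)`; the target name `parse_header` is only an example.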

## Discover actions

```

codemap --help                                       # full action list

codemap  --help                              # action-specific flags

```

---

## Action categories

658 actions (a curated subset advertised in `--help`, 236 fine-grained `bin-*` parsers, plus the rest) grouped by purpose. Full catalog at [`docs/ACTION_CATALOG.md`](docs/ACTION_CATALOG.md). High-level groups:

| Category | Action count | Examples |
|---|---|---|
| **Analysis** | ~20 | `summary`, `stats`, `trace`, `callers`, `hotspots`, `layers`, `health`, `decorators` |
| **Code intelligence** | ~30 | `complexity`, `import-cost`, `churn`, `api-diff`, `clones`, `entry-points`, `dead-functions` |
| **Dataflow / security** | ~16 | `data-flow`, `taint`, `bin-taint`, `slice`, `trace-value`, `sinks`, `secret-scan`, `audit`, `dep-tree` |
| **Graph theory** | ~40 | `pagerank`, `hubs`, `bridges`, `centrality` (17 measures), `community` (Leiden), `bellman-ford` |
| **Binary / RE** | ~235 | `elf-info`, `pe-imports`, `macho-info`, `bin-anti-debug`, `bin-disasm`, `bin-strings`, `bin-relocs` |
| **Schemas** | ~10 | `proto-schema`, `openapi-schema`, `graphql-schema`, `sql-extract`, `dbf-schema` |
| **Supply chain** | ~10 | `osv-scan`, `sbom-diff`, `license-check`, `cve-scan` |
| **Config-as-code** | ~10 | `k8s-scan`, `iac-scan`, `dockerfile-scan`, `ci-scan`, `oci-scan` |
| **ML / AI** | ~10 | `gguf-info`, `safetensors-info`, `onnx-info`, `cuda-info`, `pyc-info` |
| **LSP bridge** | ~5 | `lsp-symbols`, `lsp-references`, `lsp-calls`, `lsp-diagnostics`, `lsp-types` |
| **Web** | ~5 | `web-sitemap`, `js-api-extract` (HAR/HTML input required) |
| **Cross-language** | ~5 | `lang-bridges`, `gpu-functions`, `monkey-patches` |
| **Composite** | ~10 | `audit`, `compare`, `validate`, `changeset`, `handoff`, `pipeline` |
| **arXiv-derived** | 2 | `pointer-analysis` (Andersen PA), `cegio` (SMT optimizer) |

---

## Output schema

All `--json` outputs follow:

```

{

  "ok": ,

  "action": "",

  "dir": "",

  "result": ,

  "stats": { "files_scanned": N, "duration_ms": M, "cache_hits": K }

}

```

`result` shape varies per action. Action-specific schemas in [`docs/SCHEMAS.md`](docs/SCHEMAS.md).

## Exit codes

| Code | Meaning | Agent response |
|---|---|---|
| 0 | Success | Parse `--json` output |
| 1 | Usage error (bad flag, missing --dir) | Re-read `--help`, fix args, retry |
| 2 | I/O error (path not found, no read perm) | Verify path, retry |
| 101 | Panic | **Do not retry.** File a bug at https://github.com/charleschenai/codemap/issues |

Other non-zero codes: action-specific. See ` --help`.
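
An agent can turn the exit-code table into a small dispatch function. The codes and their meanings come from the table above; the response labels and the function names are hypothetical.

```python
# Exit codes are from the table above; the string labels are illustrative only.
RESPONSES = {
    0: "parse-json",
    1: "fix-args-and-retry",
    2: "verify-path-and-retry",
    101: "file-bug-no-retry",
}

def agent_response(exit_code: int) -> str:
    """Map a codemap exit code to the next step; unknown codes are action-specific."""
    return RESPONSES.get(exit_code, "see-action-help")

def may_retry(exit_code: int) -> bool:
    """Only usage and I/O errors are worth retrying after a fix."""
    return exit_code in (1, 2)
```

Keeping the panic case (`101`) out of `may_retry` encodes the table's "Do not retry" instruction, so a retry loop cannot spin on a crash.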

## AI agent usage guide

codemap is designed for AI agents as its primary customer. Below is the canonical walkthrough for integrating codemap into agent workflows.

### Why use codemap instead of grep/read?

| Scenario | grep / raw edits | codemap |
|---|---|---|
| "What does this codebase do?" | Read every file sequentially | `summary` — structural overview in one call |
| "Find dead / unused code" | Manual reachability tracing | `dead-functions` — true call-graph reachability |
| "Who calls function X?" | String match across files | `callers` — AST-aware call graph |
| "What does function X depend on?" | Direct import grep | `trace` — transitive dep graph walk |
| "What changed between two commits?" | Line-level diff | `diff` — semantic diff (AST-aware) |
| "Find security issues" | YARA / pattern match | `audit` — composite: taint + secret-scan + dep-tree + dead-deps |
| "Where does tainted input flow?" | No tool | `taint` — path-sensitive, sanitizer-aware, cross-procedural |
| "Analyze a compiled binary" | `strings` + `hexdump` + manual | `bin-info` + `bin-taint` — PE/ELF/Mach-O parsers + taint analysis |
| "Graph metrics on code" | Custom scripts | 500+ built-in actions (graph theory, entropy, ML, physics-inspired) |

codemap is **read-only**, **no network**, **no servers**, **no databases**, **no API keys**. It scans your local filesystem, builds ASTs + CFGs + graphs in memory, and returns structured JSON output.

### Canonical call pattern

Every action follows this pattern:

```bash

codemap  [TARGET] --dir  --json --quiet [OPTIONS]

```

| Flag | Purpose |
|---|---|
| `--json` | JSON output (machine-readable) |
| `--quiet` | Suppress progress bars and logs |
| `--dir` | Directory to analyze (required) |

**Output schema** (for actions that return results):

```json

{

  "ok": true,

  "result": { ... },

  "metrics": {

    "time_ms": 42,

    "files_scanned": 1501,

    "edges": 100219

  }

}

```

On failure:

```json

{

  "ok": false,

  "error": "error message"

}

```

Exit codes:

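- `0`: success
- non-zero: error (check the `--json` output for details). The [Exit codes](#exit-codes) table distinguishes `1` (usage error), `2` (I/O error), and `101` (panic).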

### Worked examples

**Example 1: "What does this repo do?"**

```bash

codemap summary --json --quiet --dir ./project

# → Cross-file structural overview: top-level modules, key dependencies, entry points

```

**Example 2: "Find unused functions"**

```bash

codemap dead-functions --json --quiet --dir ./project

# → Functions with zero callers across the module graph. Includes call-chain depth.

```

**Example 3: "Security audit"**

```bash

codemap audit --json --quiet --dir ./project

# → Composite: taint analysis + secret detection + dependency tree + dead deps

#   Returns findings ranked by confidence with source→sink paths

```

**Example 4: "Taint analysis — find injection paths"**

```bash

codemap taint --json --quiet --dir ./project --source read --sink system

# → Path-sensitive taint from `read` to `system` with confidence scoring

#   Reports ranked source→sink paths with alias resolution

```

**Example 5: "Binary analysis — what is this executable?"**

```bash
codemap bin-info --json --quiet ./target/release/my-binary
# → PE/ELF/Mach-O parser: sections, imports, exports, symbols,
#   capa-rules detection, YARA signatures, anti-debug indicators
```

### MCP: the recommended adoption path

For agents that use MCP-compatible clients (Claude Code, Cursor, Windsurf), add codemap as an MCP tool server. All 658 actions become available as MCP tools with proper input schemas:

```json
// ~/.claude/settings.json
{
  "mcpServers": {
    "codemap": {
      "command": "python3",
      "args": ["/path/to/codemap/docs/codemap-mcp-server.py"]
    }
  }
}
```

This is the recommended path because:

1. **No CLI parsing needed** — tools have structured input schemas

2. **Self-documenting** — `tools/list` returns every action name, description, and schema

3. **Executable via JSON-RPC** — `tools/call` with `{name, arguments}` dispatches any action

4. **Zero config for AI** — the agent discovers capabilities automatically
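For clients that speak JSON-RPC directly, a `tools/call` request can be built as sketched below. The message shape follows the MCP convention of `method: "tools/call"` with `{name, arguments}` params. The tool name `summary` and the `dir` argument key are assumptions for illustration; the real names and schemas are whatever `tools/list` reports.

```python
import json


def tools_call_request(request_id: int, name: str, arguments: dict) -> str:
    """Serialize a JSON-RPC 2.0 `tools/call` request for an MCP server."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    })


# Hypothetical tool name and argument key; check `tools/list` for the real schema.
print(tools_call_request(1, "summary", {"dir": "./project"}))
```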

Set `CODEMAP_BIN` if your codemap binary is not on PATH:

```bash

export CODEMAP_BIN=~/.local/bin/codemap

```

### Environment variables

| Variable | Purpose | Default |
|---|---|---|
| `CODEMAP_BIN` | Path to codemap binary | `codemap` (from PATH) |
| `CODEMAP_CACHE` | Custom cache directory | `.codemap/cache.bincode` (next to scanned dir) |

### Error handling

Always check `--json` output for error details:

```bash
# <action> is a placeholder for any action name, e.g. summary
result=$(codemap <action> --json --quiet --dir ./project)
if echo "$result" | python3 -c "import sys,json; d=json.load(sys.stdin); sys.exit(0 if d['ok'] else 1)"; then
  echo "Success: $(echo "$result" | python3 -c "import sys,json; print(json.load(sys.stdin)['result'])")"
else
  echo "Error: $(echo "$result" | python3 -c "import sys,json; print(json.load(sys.stdin)['error'])")"
fi
```

### Performance notes

- Cold cache: sub-second on repos up to 3K files

- Warm cache: near-instant (reads `.codemap/cache.bincode`)

- Large repos (10K+ files): 5-30 seconds for full analysis

- All analysis is in-memory. No disk writes except the cache file.

- No network calls during analysis.

---

## Recipes — when the agent has a specific job to do

Each recipe: **what the action does** → **command** → **sample output** → **when to use it**.

For the complete flat list of action names see [`docs/ACTION_CATALOG.md`](docs/ACTION_CATALOG.md).

---

### Codebase understanding (first-look on an unknown repo)

#### `summary` — one-page structural overview

Reports file count, languages, entry points, top modules, dispatch density. Single-call onboarding.

```
$ codemap summary --dir ./my-repo --json --quiet
{"ok":true,"result":{"files":2824,"languages":["rust","python","typescript"],
  "entry_points":["src/main.rs","src/lib.rs"],"top_modules":["analysis","insights","cpg"]}}
```

**Use when:** new repo, "tell me what this does" before diving deeper.

#### `stats` — quantitative metrics

Per-language LOC + file counts, function/class density, fan-in/fan-out distribution.

```

$ codemap stats --dir ./my-repo --json --quiet

{"ok":true,"result":{"rust":{"files":341,"loc":89432,"fns":2104},"python":{"files":52,"loc":4108}}}

```

**Use when:** comparing repos by size, reporting metrics, sanity-checking parse coverage.

#### `layers` — architectural layer detection

Infers boundaries (web / service / data / infra) from import patterns + naming conventions.

```
$ codemap layers --dir ./my-repo --json --quiet
{"ok":true,"result":{"layers":[{"name":"web","modules":["routes","handlers"]},
  {"name":"data","modules":["models","repo"]}],"violations":[...]}}
```

**Use when:** validating that "web shouldn't import from data" type architectural rules hold.

#### `hotspots` — files with most churn × complexity

Surfaces "danger zone" code (high git churn + high cyclomatic complexity).

```

$ codemap hotspots --dir ./my-repo --json --quiet --top 10

{"ok":true,"result":{"hotspots":[{"file":"src/parser.rs","churn":48,"complexity":92,"score":4416}]}}

```

**Use when:** prioritizing refactor work, finding "where bugs live."
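The sample output is consistent with a score equal to churn times complexity (48 × 92 = 4416). The sketch below ranks files under that assumption; the formula is inferred from the sample, not confirmed from codemap's source, and `hotspot_scores` is a hypothetical helper name.

```python
def hotspot_scores(churn: dict, complexity: dict, top: int = 10) -> list:
    """Rank files by churn * complexity (assumed scoring), highest first."""
    rows = [
        {
            "file": path,
            "churn": churn[path],
            "complexity": complexity[path],
            "score": churn[path] * complexity[path],
        }
        # Only files present in both inputs can be scored.
        for path in churn.keys() & complexity.keys()
    ]
    rows.sort(key=lambda row: (-row["score"], row["file"]))
    return rows[:top]


print(hotspot_scores(
    {"src/parser.rs": 48, "src/util.rs": 3},
    {"src/parser.rs": 92, "src/util.rs": 5},
))
```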

#### `entry-points` — public API surface

Lists exported functions/classes that other code can call from outside.

```

$ codemap entry-points --dir ./my-repo --json --quiet

{"ok":true,"result":{"entries":[{"name":"create_user","file":"api/users.rs","kind":"public_fn"}]}}

```

**Use when:** API documentation, understanding what's a stable contract.

#### `health` — overall quality summary

Composite: dead code % + clippy/lint count + circular deps + missing tests. Single "is this repo healthy?" score.

```

$ codemap health --dir ./my-repo --json --quiet

{"ok":true,"result":{"score":78,"dead_code_pct":3.2,"circular_deps":2,"missing_tests":["api/users.rs::delete"]}}

```

**Use when:** quick "should we touch this codebase or not" gut-check.

---

### Code quality & cleanup

#### `dead-functions` — unreachable code

Functions never called by any other function in the workspace.

```

$ codemap dead-functions --dir ./my-repo --json --quiet

{"ok":true,"result":{"dead":[{"file":"src/old.rs","function":"legacy_helper","line":42}]}}

```

**Use when:** cleanup PR, removing tech debt. **Don't use for:** identifying entry points (they're "dead" by call-graph but intentionally public).

#### `dead-files` — files imported nowhere

Files no other file imports / uses.

```

$ codemap dead-files --dir ./my-repo --json --quiet

{"ok":true,"result":{"dead_files":["src/experimental/old_impl.rs","tools/debug.py"]}}

```

**Use when:** cleaning up orphaned files that nothing imports.

#### `dead-deps` — declared deps never imported

Packages in `Cargo.toml`/`package.json`/`pyproject.toml` that no source file imports.

```

$ codemap dead-deps --dir ./my-repo --json --quiet

{"ok":true,"result":{"dead":["serde_json (Cargo.toml)","lodash (package.json)"]}}

```

**Use when:** dep cleanup, reducing build time + attack surface.

#### `complexity` — cyclomatic complexity per function

McCabe complexity (branches+1). Catches "this function should be split."

```

$ codemap complexity --dir ./my-repo --json --quiet --top 10

{"ok":true,"result":{"top":[{"fn":"parse_expression","file":"parser.rs","cyclomatic":34,"lines":280}]}}

```

**Use when:** finding refactor candidates, code review automation.
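To make the "branches + 1" definition concrete, here is a small sketch that computes it for Python functions using the standard `ast` module. It illustrates the metric only and is not how codemap computes it. Which node types count as decision points is a choice made here, and nested functions are counted toward their enclosing function for simplicity.

```python
import ast

# Node types treated as decision points in this sketch.
DECISION_NODES = (ast.If, ast.For, ast.While, ast.IfExp, ast.ExceptHandler)


def cyclomatic(source: str) -> dict:
    """Map each function name to (decision points + 1)."""
    result = {}
    for fn in ast.walk(ast.parse(source)):
        if isinstance(fn, (ast.FunctionDef, ast.AsyncFunctionDef)):
            count = 1
            for node in ast.walk(fn):
                if isinstance(node, DECISION_NODES):
                    count += 1
                elif isinstance(node, ast.BoolOp):
                    # `a and b and c` adds two short-circuit branches.
                    count += len(node.values) - 1
            result[fn.name] = count
    return result


SAMPLE = """
def f(x):
    if x > 0 and x < 10:
        return 1
    for i in range(x):
        pass
    return 0
"""
print(cyclomatic(SAMPLE))
```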

#### `churn` — git change frequency per file

Commits-touching-file count over a window.

```

$ codemap churn --dir ./my-repo --json --quiet --top 10

{"ok":true,"result":{"top":[{"file":"src/parser.rs","commits":78,"authors":12}]}}

```

**Use when:** combined with complexity for hotspots, ownership analysis.

#### `clones` — duplicated code blocks

Detects near-identical token sequences across files (copy-paste detection).

```

$ codemap clones --dir ./my-repo --json --quiet --min-tokens 50

{"ok":true,"result":{"clones":[{"size":120,"locations":[["a.rs:14","b.rs:22"]],"similarity":0.94}]}}

```

**Use when:** finding extraction candidates for shared functions.

#### `circular` — circular import detection

Reports module cycles (a → b → c → a).

```

$ codemap circular --dir ./my-repo --json --quiet

{"ok":true,"result":{"cycles":[["src/a.rs","src/b.rs","src/a.rs"]]}}

```

**Use when:** untangling architecture before a refactor.
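Cycle detection on an import graph is commonly done with a depth-first search that tracks which nodes are on the current path. A minimal sketch, assuming the graph is given as a dict from file to imported files (`find_cycle` is a hypothetical helper, and it returns only the first cycle found):

```python
def find_cycle(graph: dict) -> list:
    """Return one import cycle as [a, b, ..., a], or [] if the graph is acyclic."""
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current DFS path, finished
    color = {}
    stack = []

    def visit(node):
        color[node] = GRAY
        stack.append(node)
        for nxt in graph.get(node, ()):
            state = color.get(nxt, WHITE)
            if state == GRAY:
                # Back edge: the cycle is the path from nxt to here, closed on nxt.
                return stack[stack.index(nxt):] + [nxt]
            if state == WHITE:
                found = visit(nxt)
                if found:
                    return found
        stack.pop()
        color[node] = BLACK
        return []

    for node in list(graph):
        if color.get(node, WHITE) == WHITE:
            found = visit(node)
            if found:
                return found
    return []


print(find_cycle({"src/a.rs": ["src/b.rs"], "src/b.rs": ["src/a.rs"]}))
```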

---

### Impact tracing & change analysis

#### `trace` — transitive callees (what does X depend on?)

Walks the call graph forward from a function/symbol, returns full dep tree.

```
$ codemap trace --dir ./my-repo --json --quiet RecalcInvoiceTotals
{"ok":true,"result":{"node":"RecalcInvoiceTotals","calls":[
  {"name":"ship_chg_sum","file":"backend/invoices.go:120","depth":1},
  {"name":"format_money","file":"util/money.go:8","depth":2}]}}
```

**Use when:** impact analysis before changing a function, generating context for an LLM.
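The forward walk can be pictured as a breadth-first traversal that records the depth at which each callee is first reached. This sketch assumes a call graph given as a dict from caller to callees; the `trace` function is an illustration, not codemap's implementation, and it omits the `file` field.

```python
from collections import deque


def trace(call_graph: dict, root: str) -> list:
    """Breadth-first walk of callees; each entry records its minimum depth."""
    seen = {root}
    queue = deque([(root, 0)])
    calls = []
    while queue:
        node, depth = queue.popleft()
        for callee in call_graph.get(node, ()):
            if callee not in seen:
                seen.add(callee)
                calls.append({"name": callee, "depth": depth + 1})
                queue.append((callee, depth + 1))
    return calls


GRAPH = {
    "RecalcInvoiceTotals": ["ship_chg_sum"],
    "ship_chg_sum": ["format_money"],
}
print(trace(GRAPH, "RecalcInvoiceTotals"))
```

Running the same traversal over the reversed graph gives the shape of a `callers` query.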

#### `callers` — transitive callers (who calls X?)

Reverse of `trace`. Returns the function's call sites + their callers.

```

$ codemap callers --dir ./my-repo --json --quiet validate_user

{"ok":true,"result":{"callers":[{"caller":"login","file":"auth.py:88","depth":1}]}}

```

**Use when:** "if I change this signature, what breaks?"

#### `blast-radius` — affected entities from a change

Combines callers + dataflow + tests touched. Most pessimistic estimate.

```

$ codemap blast-radius --dir ./my-repo --json --quiet --target User.id

{"ok":true,"result":{"functions":42,"tests":7,"endpoints":3,"db_columns":2}}

```

**Use when:** "what's the size of changing this thing?"

#### `diff` — semantic diff between two refs

Function-level diff: added, removed, signature-changed, body-changed.

```
$ codemap diff --dir ./my-repo --json --quiet HEAD~5 HEAD
{"ok":true,"result":{"added":["validate_email"],"removed":["old_validator"],
  "signature_changed":[{"fn":"create","before":"(name)","after":"(name,email)"}]}}
```

**Use when:** generating PR descriptions, understanding code review scope.

#### `api-diff` — breaking-change classifier

Like `diff` but specifically flags BREAKING vs additive changes to public API.

```
$ codemap api-diff --dir ./my-repo --json --quiet HEAD~5 HEAD
{"ok":true,"result":{"breaking":[
  {"kind":"removed","fn":"OldAPI::v1_login"},
  {"kind":"signature_change","fn":"create_user","before":"(name)","after":"(name,email)"}]}}
```

**Use when:** versioning decisions (semver minor vs major), CHANGELOG generation.

#### `diff-impact` — functions affected by a commit range

Maps the diff to every transitively-affected caller.

```

$ codemap diff-impact --dir ./my-repo --json --quiet HEAD~5 HEAD

{"ok":true,"result":{"impacted_fns":127,"impacted_files":34,"high_risk":["payment::charge"]}}

```

**Use when:** deciding test scope for a PR.

#### `churn-vs-complexity` (via `hotspots`) — see Codebase understanding above

---

### Data flow & security

#### `audit` — composite security report

Runs taint + secret-scan + dead-deps + dep-tree + license-check in one pass.

```
$ codemap audit --dir ./my-repo --json --quiet
{"ok":true,"result":{"findings":[
  {"kind":"secret","file":".env.sample","line":3,"pattern":"AWS_KEY"},
  {"kind":"taint","source":"req.body","sink":"db.execute","path":[...]},
  {"kind":"dep-vuln","package":"lodash","version":"4.17.20","cve":"CVE-2021-23337"}]}}
```

**Use when:** first-pass security review of an unfamiliar repo.

#### `taint` — path-sensitive taint flow

Tracks tainted values from source(s) to sink(s). Sanitizer-aware, alias-aware (e.g. `safe = sanitize(x)`), and interprocedural (parses wrapper bodies to detect hidden sanitizers).

```
$ codemap taint --dir ./my-repo --json --quiet --source 'req.query' --sink 'db.execute'
{"ok":true,"result":{"paths":[{"source":"req.query.id","sink":"db.execute(sql)",
  "hops":["params.id","userId","query"],"sanitized":false}]}}
```

**Use when:** SQLi/XSS/SSRF detection, "is user input reaching this sink?"

#### `slice` — backward program slice

Given a target variable/sink, return only the code that influences it.

```

$ codemap slice --dir ./my-repo --json --quiet --var 'password' --file auth.py

{"ok":true,"result":{"slice_lines":[12,15,22,30,42],"file":"auth.py"}}

```

**Use when:** narrowing what to read when chasing a bug.

#### `sinks` — list all dangerous sinks

Enumerates every `db.execute`, `eval`, `exec`, `Runtime.exec`, `subprocess.shell=True`, `innerHTML=`, etc.

```

$ codemap sinks --dir ./my-repo --json --quiet

{"ok":true,"result":{"sinks":[{"kind":"sql","file":"api/users.rs","line":88,"expr":"db.execute(query)"}]}}

```

**Use when:** building taint queries, audit checklist generation.

#### `secret-scan` — credentials in source

20+ patterns (AWS key, GitHub PAT, Slack token, Stripe live key, private keys, JWT, DB conn strings, etc.). Redacted output.

```

$ codemap secret-scan --dir ./my-repo --json --quiet

{"ok":true,"result":{"findings":[{"file":".env.sample","line":3,"kind":"aws_access_key","masked":"AKIA****REDACTED"}]}}

```

**Use when:** pre-commit hook, pre-publish audit.
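As an illustration of pattern-based detection with redacted output, the sketch below matches one well-known pattern: AWS access key IDs, which start with `AKIA` followed by 16 uppercase alphanumerics. The masking format mirrors the sample output above. The regex and the `scan_text` helper are assumptions for illustration, not codemap's actual rule set.

```python
import re

# AWS access key IDs: "AKIA" plus 16 uppercase letters or digits.
AWS_ACCESS_KEY = re.compile(r"\bAKIA[0-9A-Z]{16}\b")


def scan_text(path: str, text: str) -> list:
    """Report AWS access key IDs with 1-based line numbers and a masked value."""
    findings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in AWS_ACCESS_KEY.finditer(line):
            findings.append({
                "file": path,
                "line": lineno,
                "kind": "aws_access_key",
                # Keep only the 4-character prefix so the secret never reaches logs.
                "masked": match.group()[:4] + "****REDACTED",
            })
    return findings


# AKIAIOSFODNN7EXAMPLE is the placeholder key used in AWS documentation.
print(scan_text(".env.sample", "# sample\n\nAWS_KEY=AKIAIOSFODNN7EXAMPLE\n"))
```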

#### `data-flow` — value origin tracing

Where does this variable's value come from? (def-use chain)

```

$ codemap data-flow --dir ./my-repo --json --quiet --target 'user_id'

{"ok":true,"result":{"origins":[{"file":"auth.py:88","expr":"req.cookies['session']"}]}}

```

**Use when:** "where does this magic value come from?"

#### `api-surface` — every exported HTTP endpoint

Detects Flask/Express/Axum/FastAPI/Spring/Rocket route handlers. Lists path + method + handler.

```

$ codemap api-surface --dir ./my-repo --json --quiet

{"ok":true,"result":{"endpoints":[{"method":"POST","path":"/users","handler":"create_user","auth_required":false}]}}

```

**Use when:** generating OpenAPI from existing code, finding unauthenticated endpoints.

---

### Graph algorithms (heterogeneous-graph queries)

These run on codemap's internal call graph + import graph + AST graph.

#### `pagerank` — most-important nodes

NetworkX-style PageRank. High score = central + many incoming refs.

```

$ codemap pagerank --dir ./my-repo --json --quiet --top 10

{"ok":true,"result":{"ranked":[{"fn":"handle_request","score":0.082}]}}

```

**Use when:** finding "load-bearing" functions, prioritizing code review.
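For readers unfamiliar with the algorithm, PageRank can be sketched as a power iteration over the call graph. The damping factor 0.85 and the fixed iteration count are conventional defaults chosen here, not values confirmed from codemap, and rank from nodes with no outgoing edges is redistributed uniformly.

```python
def pagerank(graph: dict, damping: float = 0.85, iterations: int = 50) -> dict:
    """Power-iteration PageRank on a dict of node -> list of outgoing targets."""
    nodes = set(graph) | {dst for targets in graph.values() for dst in targets}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing edges is spread over all nodes.
        dangling = sum(rank[node] for node in nodes if not graph.get(node))
        nxt = {node: (1.0 - damping) / n + damping * dangling / n for node in nodes}
        for src, targets in graph.items():
            if targets:
                share = damping * rank[src] / len(targets)
                for dst in targets:
                    nxt[dst] += share
        rank = nxt
    return rank


CALLS = {"login": ["handle_request"], "signup": ["handle_request"], "handle_request": []}
ranks = pagerank(CALLS)
print(max(ranks, key=ranks.get))
```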

#### `hubs` — high-out-degree nodes

Functions/modules that depend on many others. Different from PageRank (which is about incoming).

```

$ codemap hubs --dir ./my-repo --json --quiet

{"ok":true,"result":{"hubs":[{"fn":"orchestrator","out_degree":47}]}}

```

**Use when:** finding god-objects, refactor targets.

#### `bridges` — single-edge cut points

Edges whose removal disconnects the graph (cut edges). Each one is a critical coupling: the only link between two parts of the codebase.

```

$ codemap bridges --dir ./my-repo --json --quiet

{"ok":true,"result":{"bridges":[{"from":"auth","to":"db","modules":["auth.rs","db.rs"]}]}}

```

**Use when:** identifying single points of failure in module coupling.
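Bridge detection is classically done with Tarjan's low-link DFS on the undirected view of the graph. The sketch below assumes module dependencies are given as a list of `(from, to)` pairs and ignores edge direction; whether codemap treats the graph as undirected here is an assumption.

```python
def find_bridges(edges: list) -> list:
    """Return edges whose removal disconnects the undirected graph."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    disc, low, bridges = {}, {}, []
    counter = [0]

    def dfs(node, parent):
        disc[node] = low[node] = counter[0]
        counter[0] += 1
        skipped_parent = False
        for nxt in adj[node]:
            if nxt == parent and not skipped_parent:
                # Skip the tree edge once; a parallel edge still counts as a back edge.
                skipped_parent = True
                continue
            if nxt in disc:
                low[node] = min(low[node], disc[nxt])
            else:
                dfs(nxt, node)
                low[node] = min(low[node], low[nxt])
                if low[nxt] > disc[node]:
                    # Nothing below nxt reaches node or above: the edge is a bridge.
                    bridges.append((node, nxt))

    for node in adj:
        if node not in disc:
            dfs(node, None)
    return bridges


# "db", "cache", and "store" form a triangle; "auth" hangs off it by one edge.
EDGES = [("auth", "db"), ("db", "cache"), ("cache", "store"), ("store", "db")]
print(find_bridges(EDGES))
```

The recursive DFS is fine for small graphs; a very deep graph would need an iterative version to stay within Python's recursion limit.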

#### `centrality` (17 measures) — broker / connector detection

Run with a specific measure: `betweenness`, `eigenvector`, `katz`, `closeness`, `harmonic`, `load`, `structural-holes` (brokers), `voterank`, etc. All NetworkX standards.

```

$ codemap betweenness --dir ./my-repo --json --quiet --top 5

{"ok":true,"result":{"top":[{"node":"db_session","betweenness":0.34}]}}

```

**Use when:** finding modules that connect otherwise-separate subsystems.

#### `clusters` — community detection (Leiden default)

Partitions the graph into densely-connected sub-communities.

```

$ codemap clusters --dir ./my-repo --json --quiet leiden

{"ok":true,"result":{"clusters":[{"id":0,"size":34,"members":["auth.rs","users.rs"]}]}}

```

**Use when:** discovering implicit module boundaries.

#### `paths` — shortest path between two nodes

Returns the chain of imports/calls connecting source → target.

```

$ codemap paths --dir ./my-repo --json --quiet user_input db_write

{"ok":true,"result":{"path":["user_input","sanitize","query_builder","db_write"],"length":4}}

```

**Use when:** "how does X reach Y?"

#### `subgraph` — extract a focused subgraph

Returns nodes within N hops of a target. Useful before deep analysis.

```

$ codemap subgraph --dir ./my-repo --json --quiet --target login --depth 2

{"ok":true,"result":{"nodes":[...],"edges":[...]}}

```

**Use when:** narrowing scope before more expensive analysis.

#### `bellman-ford ` / `astar  ` / `floyd-warshall` / etc.

Classical shortest-path algorithms exposed for graph queries. See ACTION_CATALOG.md for full list.

---

### Binary analysis & reverse engineering

#### Decompiler (`ir` / `decompile`) — full lift → SSA → simplify → type-recovery → variable-recovery → calling-convention → SAILR structuring → C++ RTTI → readable-C emit pipeline

**This is a real decompiler.** A 14-stage pipeline reconstructs expressions, variables, types, and `if` / `while` / `switch` syntax (including jump tables, computed branches, and string-literal returns) from compiled binaries.

Current status: full G10 fidelity (10/10) and a 79/79 pass on the protected-bin decompilation tests (bugbins-verify + reexec_harness), with switch_dispatch special-case recovery (`const char*` + "zero".."seven"/"unknown" map, a1 scrutinee, correct default VA). Emitted C is gcc-recompilable; the current 79/79 state supersedes earlier ~48/60 notes. Also included: cross-binary type propagation, RTTI, stack slots, and confidence scores.

See CHANGELOG and docs/COMMIT_LEDGER.md for the individual fixes:

- **G10 fixes + Job 3 consolidation**
- **GAP3-6**: F-4 -O2 dangling-goto/continue, C++ vcall via RTTI, XMM/float ABI + libc-extern recompile fix, Mach-O x86-64 thin + FAT
- **GAP7 Part A**: array element type from access width (`int32_t*` for 4-byte loads)
- **GAP8**: struct field recovery (`ptr->field_0xN` with synthesized typedefs for recompile)
- **GAP9**: no more rsp/rbp/rbx/r12-r15 "lifter gap =0" noise declarations in every function; frame uses elided to 0
- **Mach-O x86-64 support**: function discovery via LC_FUNCTION_STARTS + symtab + sections, feeding iced-x86/IR

**Known limitation (gap 11, deferred)**: Array indexing inside loops can decompile with an incorrect (use-before-def) index (e.g. ghost reg instead of loop counter v), producing behaviorally-wrong recompiled output (sum may return a[0]*n instead of 10); element type is correct. Root: copy-prop drops the index register's def on register reuse inside the loop. Tracked as gap 11.

Remaining gaps documented in DECOMPILER.md. (New direction: user-driven decompiler quality per Ghidra issues etc.)

```bash

# Decompile a single function (full pipeline)

codemap ir  [ | ]

# Decompile entry point

codemap ir 

# Batch call-tree walk with structural hints

codemap decompile  [max-depth=N] [max-children=N] [deep]

```

**Pipeline stages:**

1. **Lift** — iced-x86 decode → IRCFG (three-address IR with explicit BitWidth)

2. **SSA construction** — Cytron et al. (1991): iterated-dominance-frontier phi placement + pre-order DFS renaming

3. **Simplify** — 42 peephole rules (Miasm / angr reference-FIRST): constant folding, identity elimination, SSA-aware simplification, signed-div-by-power-of-2, ROL/ROR detection, byte-swap, etc.

4. **Calling-convention recovery** — SysV AMD64 ABI: populate Call.args from rdi/rsi/rdx/rcx/r8/r9

5. **Dead-code elimination** — backward dataflow liveness (~80% flag computations pruned)

6. **Copy/constant propagation** — 4 alternating iterations of copy-prop + simplify + DCE

7. **Dead-block removal** — reachability from entry; prunes linker padding

8. **Block coalescing** — merge linear Goto-chains

9. **SAILR structuring** — CFG + IRCFG → C-shaped AST (Sequence / IfThen / IfThenElse / While / For / Switch / Call / Goto)

10. **Variable recovery** — classifies variables: Register, Stack, Memory, Temporary, Constant

11. **Type inference** — Phase 2 seeded from widths + Mem-loads/Stores; iterated-meet solver infers Int / Pointer / struct types

12. **Stack-slot analysis** — rsp-relative offsets for `*(rsp_N)` → `stack[]`

13. **C++ RTTI analysis** — vtable references → class declarations (base classes, virtual methods, fields)

14. **C emission** — structured AST → readable C source with type annotations, stack-slot names, symbol resolution
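To give a feel for stage 3, here is a toy peephole simplifier that performs constant folding and identity elimination on a tiny expression IR. The tuple-based IR, the operator names, and the `simplify` function are invented for this sketch. It ignores bit widths, which the real IR tracks explicitly, and covers only a handful of rules.

```python
def simplify(expr):
    """Bottom-up constant folding and identity elimination on ('op', lhs, rhs) tuples."""
    if not isinstance(expr, tuple):
        return expr  # leaf: an int constant or a register name
    op, lhs, rhs = expr[0], simplify(expr[1]), simplify(expr[2])
    # Constant folding: both operands are known integers.
    if isinstance(lhs, int) and isinstance(rhs, int):
        if op == "add":
            return lhs + rhs
        if op == "sub":
            return lhs - rhs
        if op == "mul":
            return lhs * rhs
    # Identity elimination: x + 0, x - 0, 0 + x, x * 1, 1 * x, x * 0.
    if op in ("add", "sub") and rhs == 0:
        return lhs
    if op == "add" and lhs == 0:
        return rhs
    if op == "mul" and rhs == 1:
        return lhs
    if op == "mul" and lhs == 1:
        return rhs
    if op == "mul" and (lhs == 0 or rhs == 0):
        return 0
    return (op, lhs, rhs)


# (rax * 1) + (5 - 5) folds to just rax.
print(simplify(("add", ("mul", "rax", 1), ("sub", 5, 5))))
```

Simplifying children before the parent lets one rule expose another, which is why real pipelines alternate simplification with copy propagation and DCE.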

**Differentiators:**

- **Cross-binary type/name propagation** — types from one binary's RTTI flow into another's

- **Graph-as-validator** — heterogeneous code graph cross-checks decompilation output

- **Recompilable-C target** — structured, typed, symbol-resolved C suitable for recompilation

**Example output:**

```text
=== codemap ir ===
Binary:        ./target/release/codemap
Format:        ELF64 (64-bit, arch=x64)
Function:      main @ 0x401000 (234 bytes, 78 insns)
CFG blocks:    12
CFG edges:     18 (pre-enrich) → 18 (post-enrich)
Jump tables:   0 resolved indirect-JMPs
SSA phis:      3 inserted
Variables:     45 total (12 reg, 20 stack[-0x10..+0x18], 10 mem, 3 const, 0 tmp)
Types:         30 bound (15 int, 10 ptr, 3 top, 2 bot, 0 other)
CC args:       5 call sites populated (SysV AMD64)
DCE removed:   62 dead stmts (pre-prop) + 8 (post-prop)
Copy-prop:     15 stmts inlined
Dead blocks:   2 removed (unreachable)
Coalesced:     4 blocks merged

--- structured AST ---
Sequence {
  Let { rbp_0 = rbp }
  Let { rsp_0 = (rsp - 0x10) }
  IfThen {
    Cond: (rax_0 == 0)
    Then: Sequence { Call { printf("usage\n") } }
  }
  While {
    Cond: (argc_0 > 0)
    Body: Sequence { ... }
  }
  Ret { rax_0 }
}

--- C-shaped output ---
int main(int argc, char *argv[]) {
    uint64_t rbp_0 = rbp;
    uint64_t rsp_0 = (rsp - 0x10);
    if (rax_0 == 0) {
        printf("usage\n");
    }
    while (argc_0 > 0) {
        // ... loop body ...
        argc_0 = argc_0 - 1;
    }
    return rax_0;
}
```

**Use when:** binary reverse engineering, understanding compiled code, patch generation, static analysis of binaries. See [`docs/DECOMPILER.md`](docs/DECOMPILER.md) for full pipeline reference.

---

#### `bin-info` / `elf-info` / `macho-info` / `pe-info` — binary fingerprint

Format detection, arch, sections, strip state, language hints (Rust/Go/C++), anti-debug rules, packer detection.

```
$ codemap bin-info /usr/local/bin/codemap --json --quiet
{"ok":true,"result":{"format":"ELF64","arch":"aarch64","rust":true,"strip":false,
  "sections":34,"anti_debug":[],"packed":false}}
```

**Use when:** triage step 1 — "what is this binary?"

#### `pe-imports` / `pe-exports` — Windows PE import/export tables

Lists every DLL imported + every function exported.

```

$ codemap pe-imports ./sample.exe --json --quiet

{"ok":true,"result":{"imports":[{"dll":"kernel32.dll","functions":["VirtualAlloc","CreateProcessA"]}]}}

```

**Use when:** static behavioral profiling — what APIs does this binary depend on?

#### `pe-strings` / `bin-strings` — string extraction

Extracts ASCII and UTF-16LE strings, filtered by entropy.

```

$ codemap pe-strings ./sample.exe --json --quiet --min-len 8

{"ok":true,"result":{"strings":["http://c2.example.com","cmd.exe /c"]}}

```

**Use when:** triaging unknown binaries — strings often reveal C2 URLs, command lines, paths.
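The core of string extraction is a scan for runs of printable characters in both encodings. A minimal sketch using regular expressions over raw bytes follows; it omits entropy filtering, and `extract_strings` is a hypothetical helper, not codemap's implementation.

```python
import re


def extract_strings(data: bytes, min_len: int = 8) -> list:
    """Find printable ASCII runs and UTF-16LE runs of at least min_len characters."""
    # Printable ASCII is 0x20..0x7e; UTF-16LE encodes each such char as <byte> 0x00.
    ascii_re = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    utf16_re = re.compile(rb"(?:[\x20-\x7e]\x00){%d,}" % min_len)
    found = [m.group().decode("ascii") for m in ascii_re.finditer(data)]
    found += [m.group().decode("utf-16-le") for m in utf16_re.finditer(data)]
    return found


BLOB = (
    b"\x00\x01http://c2.example.com\x00\xff"
    + "cmd.exe /c".encode("utf-16-le")
    + b"\x00\x00"
)
print(extract_strings(BLOB))
```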

#### `binary-diff` — semantic binary diff

Functions added / removed / modified between two builds.

```

$ codemap binary-diff --json --quiet --left v1.exe --right v2.exe

{"ok":true,"result":{"added":["new_handler"],"removed":["legacy_proc"],"modified":["main"]}}

```

**Use when:** patch analysis, regression hunting in firmware.

#### `dotnet-meta` — .NET assembly metadata

PE that contains CLI/.NET — reads the metadata streams, lists types + methods.

```

$ codemap dotnet-meta ./sample.dll --json --quiet

{"ok":true,"result":{"assembly":"Sample.Dll","types":["Foo","Bar"],"methods_count":42}}

```

**Use when:** analyzing .NET malware or .NET 3rd-party libs.

#### `java-class` — JVM class file

Constant pool, method signatures, bytecode summaries.

#### `wasm-info` — WebAssembly module

Imports, exports, function table, memory layout.

---

### Schemas & config-as-code

#### `openapi-schema` / `graphql-schema` / `proto-schema` — extract API schemas

Parses spec files and reports endpoints/types/operations.

```

$ codemap openapi-schema --dir ./api --json --quiet

{"ok":true,"result":{"paths":[{"method":"GET","path":"/users","operationId":"listUsers"}]}}

```

**Use when:** generating client code, checking spec consistency.

#### `k8s-scan` — Kubernetes CIS audit (16 rules)

Checks privileged containers, hostNetwork, missing resource limits, etc.

```

$ codemap k8s-scan --dir ./k8s/ --json --quiet

{"ok":true,"result":{"findings":[{"rule":"K8S-001","resource":"Deployment/api","severity":"high","msg":"privileged=true"}]}}

```

**Use when:** auditing manifests before apply.

#### `iac-scan` — Terraform/CloudFormation/Pulumi audit (12 rules)

```

$ codemap iac-scan --dir ./infra/ --json --quiet

{"ok":true,"result":{"findings":[{"rule":"IAC-007","file":"main.tf","msg":"S3 bucket public-read ACL"}]}}

```

#### `dockerfile-scan` — Dockerfile audit (10 rules)

```

$ codemap dockerfile-scan --dir ./ --json --quiet

{"ok":true,"result":{"findings":[{"rule":"DKR-002","msg":"running as root","line":18}]}}

```

#### `ci-scan` — CI/CD pipeline audit (37 rules across 6 ecosystems)

GitHub Actions, GitLab CI, Jenkinsfile, CircleCI, Azure Pipelines, Travis. Catches injection, unpinned actions, secret literals, `pull_request_target` misuse.

```

$ codemap ci-scan --dir ./.github/ --json --quiet

{"ok":true,"result":{"findings":[{"rule":"GH-003","file":"deploy.yml","msg":"unpinned action ref"}]}}

```

#### `oci-scan` — OCI image / docker save tarball audit

Per-layer manifest, layer-resident secrets (11 patterns), licenses, file/dir/symlink counts.

```

$ codemap oci-scan --dir ./image.tar --json --quiet --mode all

{"ok":true,"result":{"layers":[...],"secrets":[...],"licenses":[...]}}

```

#### `sql-extract` — SQL DDL/DML extraction

Pulls SQL out of source code or .sql files. Schema + queries.

```

$ codemap sql-extract --dir ./my-repo --json --quiet

{"ok":true,"result":{"tables":[{"name":"users","columns":[...]}],"queries":[...]}}

```

---

### Supply chain

#### `osv-scan` — match deps against OSV.dev advisories (offline)

Semver-range-aware.

```

$ codemap osv-scan --dir ./my-repo --json --quiet

{"ok":true,"result":{"vulns":[{"package":"lodash","version":"4.17.20","cve":"CVE-2021-23337"}]}}

```

#### `sbom-diff` — CycloneDX/SPDX diff

Added, removed, upgraded, downgraded packages between two SBOMs.

```

$ codemap sbom-diff --left ./sbom-1.spdx.json --right ./sbom-2.spdx.json --json --quiet

{"ok":true,"result":{"added":[...],"removed":[...],"upgraded":[...]}}

```

#### `license-check` — SPDX compatibility

Per-package license + compatibility verdict.

```

$ codemap license-check --dir ./my-repo --json --quiet

{"ok":true,"result":{"deps":[{"name":"foo","license":"GPL-3.0","compatible":false}]}}

```

#### `cve-scan` — same as osv-scan but specifically against MITRE CVE corpus

---

### ML / AI model files

#### `gguf-info` — llama.cpp GGUF inspection

Architecture, layer count, head count, quant level, vocab size.

```

$ codemap gguf-info ./model.gguf --json --quiet

{"ok":true,"result":{"arch":"llama","n_layers":32,"n_heads":32,"vocab_size":32000,"quant":"Q4_K_M"}}

```

**Use when:** "what model is this file?" Pre-load sanity check.

#### `safetensors-info` — HuggingFace safetensors inspection

Tensor shapes, dtypes, total params.

```

$ codemap safetensors-info ./model.safetensors --json --quiet

{"ok":true,"result":{"tensors":291,"total_params":7240000000,"dtype":"float16"}}

```

#### `onnx-info` — ONNX model graph

Operators, inputs, outputs, opset.

```

$ codemap onnx-info ./model.onnx --json --quiet

{"ok":true,"result":{"opset":17,"ops":["Conv","Relu","MaxPool"],"inputs":[{"name":"x","shape":[1,3,224,224]}]}}

```

#### `cuda-info` — CUDA fatbin/cubin inspection

SM versions present, kernel symbols.

#### `pyc-info` — Python bytecode inspection

Magic number, marshalled code object, imports.

---

### Cross-language & web

#### `lang-bridges` — FFI/binding detection

Detects PyO3 / napi / wasm-bindgen / JNI etc. — where languages interop.

```

$ codemap lang-bridges --dir ./my-repo --json --quiet

{"ok":true,"result":{"bridges":[{"kind":"pyo3","rust_fn":"create_user","py_module":"my_lib"}]}}

```

#### `gpu-functions` — GPU kernels in source

CUDA `__global__`, OpenCL kernels, Metal compute kernels, ROCm/HIP.

```

$ codemap gpu-functions --dir ./my-repo --json --quiet

{"ok":true,"result":{"kernels":[{"name":"matmul_kernel","framework":"cuda","file":"kernels.cu"}]}}

```

#### `monkey-patches` — runtime mutation detection

`obj.method = new_fn`, `setattr`, `prototype` patching.

#### `dispatch-map` — generic dispatch tables

Routers, registries, plugin maps. Finds the "switch statement that controls behavior."

#### `web-sitemap` — sitemap.xml + crawled link graph

#### `js-api-extract` — extract API calls from HAR / JS source

---

### LSP bridge (requires a running language server)

#### `lsp-symbols` — workspace symbol table from LSP

Real symbol info, not AST-inferred. More accurate for typed languages.

#### `lsp-references` — every reference to a symbol (LSP-grade)

#### `lsp-calls` — call hierarchy from LSP

#### `lsp-diagnostics` — current LSP diagnostics across the workspace

```

$ codemap lsp-diagnostics --dir ./my-repo --json --quiet

{"ok":true,"result":{"diagnostics":[{"file":"src/main.rs","line":42,"severity":"error","msg":"E0308: mismatched types"}]}}

```

**Use when:** programmatic access to compiler/type-checker errors.

#### `lsp-types` — type info on hover for a position

---

### arXiv-derived research actions (advanced)

These implement specific research papers. `cegio` and `pointer-analysis` have real implementations with proof reports; `bin-taint` Phase A shipped with empirical proof (P@10 target, achieved P=1.00/R=0.80).

#### `pointer-analysis` — Andersen field-sensitive PA

Computes points-to sets (which pointers can alias which memory). Field-sensitive + flow-insensitive + Tarjan SCC pre-pass for performance.

```
$ codemap pointer-analysis --dir ./my-repo --json --quiet
{"ok":true,"result":{"scope_vars":102000,"copy_constraints":132000,
  "aliases":[{"ptr":"p","may_alias":["a","b"]}]}}
```

**Use when:** understanding aliasing for refactoring (rename a field safely), upstream of taint analysis.

#### `cegio` — counterexample-guided inductive optimization

**arXiv 1704.03738**. Given taint paths, synthesizes the minimum input that triggers a vulnerability.

```

$ codemap cegio --dir ./my-repo --json --quiet --taint-result 

{"ok":true,"result":{"trigger":{"input":"' OR 1=1--","reaches_sink":true}}}

```

**Use when:** turning a taint finding into a proof-of-concept exploit input.

#### `bin-taint` — binary taint analysis (Phase A)

Lifts x86-64 ELF executable sections to a taint IR, builds CFG, propagates forward may-taint dataflow from PLT-resolved sources (read/recv/fread/getenv/strcpy/memcpy) to sinks (system/popen/exec/sprintf/dlopen), reports ranked source→sink paths. Stripped-binary fallback via bounded `.text` pathfinding. Proof: precision 1.00, recall 0.80 on 8-binary corpus (4 vuln classes detected, 0 false positives on 3 safe programs).

```

$ codemap bin-taint ./vulnerable-binary --json --quiet

{"ok":true,"result":{"findings":[{"source":"getenv","sink":"system","hops":["env","cmd","system"],"confidence":0.9},{"source":"read","sink":"sprintf","hops":["buf","format","sprintf"],"confidence":0.7}]}}

```

**Use when:** binary taint analysis on stripped ELF, finding command injection / format string / exec injection paths in compiled code.
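The reported precision and recall follow from the standard definitions. The worked example below assumes one reading of the corpus: 5 vulnerable binaries plus 3 safe ones, with 4 true positives, 1 missed, and 0 false positives. That split is an inference from the numbers quoted above, not a breakdown stated by the project.

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple:
    """precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Assumed split of the 8-binary corpus: 4 detected, 1 missed, 0 false alarms.
print(precision_recall(tp=4, fp=0, fn=1))  # (1.0, 0.8)
```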

---

### Composite workflows

#### `audit` — kitchen-sink security report

See "Data flow & security" section above.

#### `validate` — sanity check (build + lint + tests + audit summary)

Single composite for "is this repo broken?"

#### `changeset` — file-grouped diff summary

```

$ codemap changeset --dir ./my-repo --json --quiet HEAD~10 HEAD

{"ok":true,"result":{"changes":{"feat":[...],"fix":[...],"refactor":[...]}}}

```

#### `handoff` — generate handoff document for a project

Distills repo state into a single MD doc (status + open issues + recent work + next-steps).

#### `pipeline` — multi-action pipeline runner

Run several actions in sequence, accumulate results.

```

$ codemap pipeline --dir ./my-repo --json --quiet --target 'audit:./,trace:main,hotspots:'

{"ok":true,"result":{"audit":{...},"trace":{...},"hotspots":{...}}}

```

**Use when:** scripted multi-step analysis.
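The `--target` value in the example reads as a comma-separated list of `action:argument` pairs, where the argument may be empty. A small parser under that assumption is sketched below. The grammar is inferred from the single example above, so treat it as a guess rather than codemap's documented format.

```python
def parse_pipeline_spec(spec: str) -> list:
    """Split 'action:arg,action:arg' into (action, arg) tuples; arg may be empty."""
    steps = []
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue  # tolerate a trailing comma
        # Split on the first colon only, so arguments may contain colons.
        action, _, arg = item.partition(":")
        steps.append((action, arg))
    return steps


print(parse_pipeline_spec("audit:./,trace:main,hotspots:"))
```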

---

## Architecture (1-paragraph)

codemap walks `--dir`, parses with tree-sitter, builds a file-level import graph and a function-level call graph, layers PE/ELF/Mach-O/WASM/Java binary parsers + x86/x64 disassembly, and exposes 658 actions through a uniform CLI registry (`inventory::submit!`). Cache: `.codemap/cache.bincode` next to the scanned dir. Pure static. No daemons, no network access at analysis time.

## Repo layout

- `codemap-core/` — parsing, graph, algorithms, actions

- `codemap-cli/` — the `codemap` binary

- `codemap-napi/` — Node.js bindings (optional)

- `docs/` — REFERENCE.md, ACTION_CATALOG.md, SCHEMAS.md, HUMAN.md

- `install.sh` — single install entry

## License

MIT. See [`LICENSE`](LICENSE).