https://github.com/iliaal/codesage
Code intelligence engine for AI coding agents. Structural graph queries plus semantic search, exposed via CLI and MCP.
ai-agents code-intelligence code-search embeddings mcp rust semantic-search tree-sitter
Last synced: about 1 month ago
- Host: GitHub
- URL: https://github.com/iliaal/codesage
- Owner: iliaal
- License: mit
- Created: 2026-04-15T19:22:21.000Z (about 1 month ago)
- Default Branch: master
- Last Pushed: 2026-04-15T20:15:25.000Z (about 1 month ago)
- Last Synced: 2026-04-15T21:27:02.752Z (about 1 month ago)
- Topics: ai-agents, code-intelligence, code-search, embeddings, mcp, rust, semantic-search, tree-sitter
- Language: C
- Size: 455 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Agents: AGENTS.md
README
# CodeSage
CodeSage is a code intelligence engine for AI coding agents. It combines structural graph queries (symbols, references, dependencies) and semantic search (embedding retrieval with cross-encoder reranking) in a single Rust binary, usable as a CLI or over MCP.
## What you can do with it
- Find code by natural-language query: "where does auth happen?", "error handling in the GC".
- Look up symbol definitions by name across a codebase.
- Trace imports, calls, and inheritance for any symbol.
- Map import and include relationships between files.
- Estimate which files a change breaks (change impact analysis).
- Build curated code bundles for LLM consumption in JSON, markdown, or flat-text (gitingest-style) form.
- Read per-file git history: churn, fix ratio, historical co-change, risk score.
- Expose all of the above over MCP so Claude Code, Codex, or Cursor can call them.
## Supported languages
PHP, Python, C, Rust, JavaScript, TypeScript.
## Getting started
```bash
# Build with GPU support
cargo build --release -p codesage --features cuda
# Initialize and index a project
cd /path/to/your/project
codesage init
codesage index
# Search
codesage search "authentication handler"
codesage search --json --limit 20 "database connection pooling"
# Structural queries
codesage find-symbol MyClass
codesage find-references some_function --kind call
codesage dependencies src/main.py
# Change impact analysis (who breaks if you touch this?)
codesage impact DocumentRepository --depth 2 --source-only
codesage impact src/auth/session.ts --json
# Context bundle for LLM consumption
codesage export "authentication flow" --limit 5 --callers
codesage export MyClass --symbol --format md
codesage export "auth flow" --format ingest # gitingest-style flat-text bundle
# Git history: churn, fix ratio, co-change, risk score
codesage git-index # initial populate; hooks keep it fresh
codesage git-index --full # force full rescan (weekly hygiene)
codesage coupling src/auth/session.ts --limit 5 # files that historically change with this
codesage risk src/auth/session.ts # score with decomposition
# MCP server for Claude Code / Codex / Cursor (one global server, every onboarded project)
claude mcp add --scope user codesage -- codesage mcp
# Auto-reindex on git operations
codesage install-hooks
# Diagnose installation
codesage doctor
```
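As a rough mental model for `codesage impact ... --depth 2`, change impact can be pictured as a bounded breadth-first walk over reverse dependencies. The sketch below is illustrative only: the graph shape, the `impacted` function, and the file names are hypothetical, and the README does not say how CodeSage implements the traversal.

```python
from collections import deque

def impacted(reverse_deps: dict, target: str, depth: int) -> dict:
    """Return {dependent: hop distance} for everything within `depth` hops of `target`."""
    seen = {target: 0}
    queue = deque([target])
    while queue:
        node = queue.popleft()
        if seen[node] == depth:
            continue  # do not expand past the requested depth
        for dependent in reverse_deps.get(node, ()):
            if dependent not in seen:
                seen[dependent] = seen[node] + 1
                queue.append(dependent)
    del seen[target]  # the target itself is not a dependent
    return seen

# Hypothetical reverse-dependency graph: each key is used by the files in its value.
graph = {
    "session.ts": {"login.ts", "middleware.ts"},
    "login.ts": {"routes.ts"},
    "routes.ts": {"app.ts"},
}
print(impacted(graph, "session.ts", depth=2))
```

With `depth=2`, `app.ts` is three hops away and is left out, which is the kind of cut-off a depth limit gives you.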
## Recipes
Common pipelines using `codesage` with `git`. Each is a short shell snippet plus what the output tells you.
### Risk check before committing
```bash
git diff --cached --name-only | codesage risk-diff
```
Pipes the staged file list through `assess_risk_diff`. Output shows the max risk score, files in each risk bucket (hotspot, fix-heavy, test-gap, wide blast radius), and paste-ready summary notes for the commit message or PR description. If `max_score >= 0.6` or any `test_gap_files` appear, consider adding tests, splitting the patch, or flagging concerns.
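The threshold rule above can be turned into a small gate, for example in a pre-commit script. This is a minimal sketch: the field names `max_score` and `test_gap_files` come from the description above, but their placement at the top level of the JSON report is an assumption, and `needs_attention` is a hypothetical helper, not part of CodeSage.

```python
import json

def needs_attention(report: dict, threshold: float = 0.6) -> bool:
    """True when the report crosses the risk threshold or lists any test-gap files."""
    too_risky = report.get("max_score", 0.0) >= threshold
    has_test_gaps = bool(report.get("test_gap_files"))
    return too_risky or has_test_gaps

# Hand-written sample; real output would come from a JSON risk report.
sample = json.loads('{"max_score": 0.42, "test_gap_files": ["src/auth/session.ts"]}')
print(needs_attention(sample))
```

Here the score is under the threshold, but the test gap alone is enough to flag the change.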
### Tests to run after editing
```bash
git diff --cached --name-only | codesage tests-for
```
Returns sibling tests (resolved by language convention) plus tests that historically change with the edited files (from co-change history). Replaces "I'll run all tests" with a focused list.
### Audit a feature branch before opening a PR
```bash
git diff origin/main...HEAD --name-only | codesage risk-diff
```
Same as the pre-commit check, but scoped to everything on the branch instead of just the staged diff. Useful as the last step before `gh pr create`.
### What changed in the last week, ranked by risk
```bash
git log --since='1 week ago' --name-only --pretty='' | sort -u | codesage risk-diff --json | jq '.files[] | select(.score >= 0.5) | .file'
```
Lists high-risk files touched in recent history. Good signal during a retrospective or a "where should we focus refactoring?" discussion.
### Trifecta for one file
```bash
codesage risk path/to/file.rs
codesage tests-for path/to/file.rs
codesage coupling path/to/file.rs --limit 5
```
Use this when you're about to dive into one specific file. Risk score, suggested tests, and historical co-changes together calibrate caution before you start editing.
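Historical co-change can be pictured as counting how often other files appear in the same commits as the target. The sketch below uses one common definition (share of the target's commits that also touch the other file). The README does not specify CodeSage's formula, so treat `co_changes` and its scoring as assumptions.

```python
from collections import Counter

def co_changes(commits: list, target: str, limit: int = 5) -> list:
    """Rank files by the share of `target`'s commits they also appear in."""
    touching = [files for files in commits if target in files]
    if not touching:
        return []
    counts = Counter(f for files in touching for f in files if f != target)
    # Most frequent first; ties broken by path so the order is stable.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(path, n / len(touching)) for path, n in ranked[:limit]]

# Hypothetical history: each set is the file list of one commit.
history = [
    {"session.ts", "login.ts"},
    {"session.ts", "login.ts", "routes.ts"},
    {"session.ts"},
    {"routes.ts"},
]
print(co_changes(history, "session.ts"))
```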
## Claude Code plugin
`plugins/codesage-tools/` wraps everything above into one command per task. The marketplace manifest lives at the repo root.
```bash
claude plugin marketplace add /path/to/codesage
claude plugin install codesage-tools@codesage
/codesage-onboard /path/to/project
```
Slash commands: `/codesage-onboard`, `/codesage-reset`, `/codesage-reindex`, `/codesage-bench`, `/codesage-eval`. The plugin handles global MCP registration, per-project init, indexing, git hook install (Husky-aware), and writes a `.claude/CLAUDE.md` hint teaching the agent how to route MCP calls.
## Indexing pipeline
`codesage index` walks the project, parses every supported file, extracts structural data and embeddings, and writes both into the same SQLite database.
```mermaid
flowchart LR
    A[Project files] --> B[Discover<br/>walk + excludes]
    B --> C[Tree-sitter parse]
    C --> D[Extract symbols<br/>and references]
    C --> E[Chunk text<br/>recursive splitter]
    D --> F[(SQLite<br/>files, symbols, refs)]
    E --> G[Embed via ONNX<br/>MiniLM-L6-v2]
    G --> H[(sqlite-vec<br/>chunks_minilm_384)]
```
Parsing happens in parallel via Rayon; SQLite writes are batched. Re-running `codesage index` is incremental: only files whose content hash changed are re-parsed and re-embedded.
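Incremental indexing by content hash boils down to comparing a fresh hash of each file against the one stored at the last run. A minimal sketch, assuming SHA-256 and an in-memory map of stored hashes; the hash function CodeSage actually uses is not stated here, and both function names are hypothetical.

```python
import hashlib

def content_hash(data: bytes) -> str:
    """Hex digest of the file content (SHA-256 is an assumption for this sketch)."""
    return hashlib.sha256(data).hexdigest()

def files_to_reindex(current: dict, stored_hashes: dict) -> list:
    """Paths that are new or whose content hash differs from the stored one."""
    return sorted(
        path for path, data in current.items()
        if stored_hashes.get(path) != content_hash(data)
    )

stored = {"a.py": content_hash(b"print('a')"), "b.py": content_hash(b"old")}
current = {"a.py": b"print('a')", "b.py": b"new", "c.py": b"added"}
print(files_to_reindex(current, stored))
```

Unchanged `a.py` is skipped, so neither parsing nor embedding is repeated for it.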
## Search pipeline
A query flows through five stages:
```mermaid
flowchart LR
    Q[Query string] --> E[Embed<br/>MiniLM-L6-v2]
    E --> K[KNN retrieval<br/>sqlite-vec<br/>overfetch 5x]
    K --> B[Symbol boost<br/>+0.1 per token match]
    B --> R[Cross-encoder rerank<br/>ms-marco<br/>blend 50/50]
    R --> A[Symbol annotation]
    A --> T[Top-N results]
```
1. Embed the query with MiniLM-L6-v2 (22M params, 384d) via ONNX Runtime.
2. Retrieve the nearest chunks by KNN from sqlite-vec, overfetching 5x. Chunks are embedded with their file path and symbol context prepended.
3. Boost chunks whose content matches known symbol names.
4. Re-score the top candidates with ms-marco-MiniLM-L6-v2 and blend 50/50 with the semantic score.
5. Annotate each result with overlapping function and class names.
The reranker is optional. Set or remove it in `config.toml`; stages 1-3 and the annotation still run without it.
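The scoring arithmetic in the boost and rerank stages can be sketched in a few lines. The `+0.1` per match and the 50/50 blend come from the pipeline above. Everything else is an assumption for illustration: whitespace tokenization, case-insensitive matching, applying the boost before the blend, and both scores living on a comparable scale.

```python
from typing import Optional

def symbol_boost(query: str, chunk_symbols: set, per_match: float = 0.1) -> float:
    """Add `per_match` for each query token that equals a symbol name in the chunk."""
    tokens = {t.lower() for t in query.split()}
    names = {s.lower() for s in chunk_symbols}
    return per_match * len(tokens & names)

def final_score(semantic: float, boost: float, rerank: Optional[float]) -> float:
    """Blend the boosted semantic score 50/50 with the reranker score, if one exists."""
    boosted = semantic + boost
    if rerank is None:  # reranker disabled in config
        return boosted
    return 0.5 * boosted + 0.5 * rerank

boost = symbol_boost("parse config file", {"parse", "Config", "main"})
print(final_score(0.5, boost, rerank=None))
print(final_score(0.5, boost, rerank=0.9))
```

With the reranker removed, the boosted semantic score is used as-is, which matches the note that the earlier stages still run without it.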
## Configuration
`codesage init` generates `.codesage/config.toml`:
```toml
[project]
name = "my-project"
[embedding]
model = "sentence-transformers/all-MiniLM-L6-v2"
device = "gpu" # "gpu" or "cpu"
reranker = "cross-encoder/ms-marco-MiniLM-L6-v2" # optional, remove to disable
[index]
exclude_patterns = [
"**/tests/**", "**/vendor/**", "**/node_modules/**",
"**/*.test.ts", "**/*Test.php", "**/*.phpt",
]
```
Models download from HuggingFace the first time you use them.
## Architecture
A Rust workspace with six crates:
```mermaid
flowchart TD
    cli[cli<br/>binary + MCP server]
    gr[graph<br/>indexing + query pipeline]
    parser[parser<br/>tree-sitter + discovery]
    storage[storage<br/>SQLite + sqlite-vec + FTS5]
    embed[embed<br/>ONNX + reranker + chunking]
    protocol[protocol<br/>shared types]
    cli --> gr
    gr --> parser
    gr --> storage
    gr --> embed
    parser --> protocol
    storage --> protocol
    embed --> protocol
    gr --> protocol
```
| Crate | Role |
|-------|------|
| `protocol` | Shared types (Symbol, Reference, SearchResult) |
| `parser` | File discovery, tree-sitter parsing, symbol and reference extraction |
| `storage` | SQLite with sqlite-vec KNN and FTS5 |
| `embed` | ONNX embedding inference, cross-encoder reranking, chunking |
| `graph` | Indexing orchestration and search pipeline |
| `cli` | Binary with CLI subcommands and MCP server |
Storage is a single SQLite database per project at `.codesage/index.db`: structural tables (symbols, refs, files) plus model-specific vector tables for embeddings.
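To make the structural side concrete, here is a toy version of such a database with a references lookup filtered by kind. The schema, column names, and `find_references` helper are all hypothetical: the real layout of `.codesage/index.db` is not documented here, and the vector tables are left out because they need the sqlite-vec extension.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema loosely following the table names mentioned above.
conn.executescript("""
CREATE TABLE files   (id INTEGER PRIMARY KEY, path TEXT UNIQUE, content_hash TEXT);
CREATE TABLE symbols (id INTEGER PRIMARY KEY, file_id INTEGER, name TEXT, kind TEXT, line INTEGER);
CREATE TABLE refs    (id INTEGER PRIMARY KEY, file_id INTEGER, symbol_name TEXT, kind TEXT, line INTEGER);
""")
conn.executemany("INSERT INTO files (id, path, content_hash) VALUES (?, ?, ?)", [
    (1, "src/auth/session.ts", "abc"),
    (2, "src/routes.ts", "def"),
])
conn.execute("INSERT INTO symbols (file_id, name, kind, line) VALUES (1, 'createSession', 'function', 10)")
conn.executemany("INSERT INTO refs (file_id, symbol_name, kind, line) VALUES (?, ?, ?, ?)", [
    (2, "createSession", "import", 1),
    (2, "createSession", "call", 12),
])

def find_references(conn, name, kind=None):
    """Return (path, line) pairs referencing `name`, optionally filtered by kind."""
    sql = ("SELECT f.path, r.line FROM refs r JOIN files f ON f.id = r.file_id "
           "WHERE r.symbol_name = ?")
    params = [name]
    if kind is not None:
        sql += " AND r.kind = ?"
        params.append(kind)
    sql += " ORDER BY f.path, r.line"
    return conn.execute(sql, params).fetchall()

print(find_references(conn, "createSession", kind="call"))
```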
## Retrieval benchmarks
`bench/` holds the harness:
- `codesage-bench-runner` runs a YAML corpus of ground-truth cases through `codesage search` and reports miss rate, median first-hit, recall@5, and recall@10.
- `extract-eval-cases.py` mines eval cases from Claude Code session transcripts and git commit history.
Corpora aren't bundled. Bring your own, or point the plugin at `$CODESAGE_BENCH_CORPUS_DIR`.
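For orientation, the four reported metrics can be computed from ranked results like this. The definitions are assumptions, not taken from the harness: a miss is a case where no expected file appears in the results, and recall@k is the share of cases whose first hit lands in the top k. The function names are hypothetical.

```python
from statistics import median

def first_hit_rank(results: list, expected: set):
    """1-based rank of the first expected path in `results`, or None on a miss."""
    for rank, path in enumerate(results, start=1):
        if path in expected:
            return rank
    return None

def summarize(cases: list) -> dict:
    """Aggregate miss rate, median first-hit, recall@5 and recall@10 over all cases."""
    if not cases:
        raise ValueError("need at least one case")
    ranks = [first_hit_rank(results, expected) for results, expected in cases]
    hits = [r for r in ranks if r is not None]
    total = len(ranks)
    return {
        "miss_rate": (total - len(hits)) / total,
        "median_first_hit": median(hits) if hits else None,
        "recall@5": sum(r <= 5 for r in hits) / total,
        "recall@10": sum(r <= 10 for r in hits) / total,
    }

# Hypothetical cases: (ranked result paths, ground-truth paths).
cases = [
    (["a.py", "b.py"], {"a.py"}),                             # first hit at rank 1
    (["x1", "x2", "x3", "x4", "x5", "x6", "b.py"], {"b.py"}), # first hit at rank 7
    (["x1", "x2"], {"c.py"}),                                 # miss
    (["x1", "x2", "d.py"], {"d.py"}),                         # first hit at rank 3
]
print(summarize(cases))
```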
## Contributing
See [CONTRIBUTING.md](CONTRIBUTING.md). In short: file an issue first, add a test, update `CHANGELOG.md` under `[Unreleased]` for user-visible changes.
## License
MIT