# CodeSage

[![CI](https://github.com/iliaal/codesage/actions/workflows/ci.yml/badge.svg)](https://github.com/iliaal/codesage/actions/workflows/ci.yml)
[![Tests](https://github.com/iliaal/codesage/actions/workflows/tests.yml/badge.svg)](https://github.com/iliaal/codesage/actions/workflows/tests.yml)
[![Secret scan](https://github.com/iliaal/codesage/actions/workflows/secret-scan.yml/badge.svg)](https://github.com/iliaal/codesage/actions/workflows/secret-scan.yml)
[![Version](https://img.shields.io/github/v/release/iliaal/codesage)](https://github.com/iliaal/codesage/releases)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Follow @iliaa](https://img.shields.io/badge/Follow-@iliaa-000000?style=flat&logo=x&logoColor=white)](https://x.com/intent/follow?screen_name=iliaa)

CodeSage is a code intelligence engine for AI coding agents. It combines structural graph queries (symbols, references, dependencies) and semantic search (embedding retrieval with cross-encoder reranking) in a single Rust binary, usable as a CLI or over MCP.

## What you can do with it

- Find code by natural-language query: "where does auth happen?", "error handling in the GC".
- Look up symbol definitions by name across a codebase.
- Trace imports, calls, and inheritance for any symbol.
- Map import and include relationships between files.
- Estimate which files a change breaks (change impact analysis).
- Build curated code bundles for LLM consumption in JSON, markdown, or flat-text (gitingest-style) form.
- Read per-file git history: churn, fix ratio, historical co-change, risk score.
- Expose all of the above over MCP so Claude Code, Codex, or Cursor can call them.

## Supported languages

PHP, Python, C, Rust, JavaScript, TypeScript.

## Getting started

```bash
# Build with GPU support
cargo build --release -p codesage --features cuda

# Initialize and index a project
cd /path/to/your/project
codesage init
codesage index

# Search
codesage search "authentication handler"
codesage search --json --limit 20 "database connection pooling"

# Structural queries
codesage find-symbol MyClass
codesage find-references some_function --kind call
codesage dependencies src/main.py

# Change impact analysis (who breaks if you touch this?)
codesage impact DocumentRepository --depth 2 --source-only
codesage impact src/auth/session.ts --json

# Context bundle for LLM consumption
codesage export "authentication flow" --limit 5 --callers
codesage export MyClass --symbol --format md
codesage export "auth flow" --format ingest # gitingest-style flat-text bundle

# Git history: churn, fix ratio, co-change, risk score
codesage git-index # initial populate; hooks keep it fresh
codesage git-index --full # force full rescan (weekly hygiene)
codesage coupling src/auth/session.ts --limit 5 # files that historically change with this
codesage risk src/auth/session.ts # score with decomposition

# MCP server for Claude Code / Codex / Cursor (one global server, every onboarded project)
claude mcp add --scope user codesage -- codesage mcp

# Auto-reindex on git operations
codesage install-hooks

# Diagnose installation
codesage doctor
```

## Recipes

Common pipelines using `codesage` with `git`. Each recipe is a short shell snippet followed by what the output tells you.

### Risk check before committing

```bash
git diff --cached --name-only | codesage risk-diff
```

Pipes the staged file list through `assess_risk_diff`. Output shows the max risk score, files in each risk bucket (hotspot, fix-heavy, test-gap, wide blast radius), and paste-ready summary notes for the commit message or PR description. If `max_score >= 0.6` or any `test_gap_files` appear, consider adding tests, splitting the patch, or flagging concerns.
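The decision rule in the last sentence can be sketched in a few lines of Python. This is a minimal illustration, assuming the JSON report is an object with a numeric `max_score` and a `test_gap_files` list. Only those two names come from the text above; the helper `needs_attention` and the sample report are hypothetical.

```python
def needs_attention(report: dict, threshold: float = 0.6) -> bool:
    """Return True when a risk report crosses the suggested threshold.

    Assumes `report` has a numeric `max_score` and a list `test_gap_files`;
    the real output may carry more fields or a different layout.
    """
    high_score = report.get("max_score", 0.0) >= threshold
    has_test_gaps = bool(report.get("test_gap_files"))
    return high_score or has_test_gaps


# Hypothetical report, for illustration only.
example = {"max_score": 0.42, "test_gap_files": ["src/auth/session.ts"]}
print(needs_attention(example))  # True: a test gap is enough on its own
```

A check like this could gate a pre-commit hook, but the threshold is a suggestion rather than a hard limit.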

### Tests to run after editing

```bash
git diff --cached --name-only | codesage tests-for
```

Returns sibling tests (resolved by language convention) plus tests that historically change with the edited files (from co-change history). Replaces "I'll run all tests" with a focused list.

### Audit a feature branch before opening a PR

```bash
git diff origin/main...HEAD --name-only | codesage risk-diff
```

Same as the pre-commit check, but scoped to everything on the branch instead of just the staged diff. Useful as the last step before `gh pr create`.

### What changed in the last week, ranked by risk

```bash
git log --since='1 week ago' --name-only --pretty='' | sort -u | codesage risk-diff --json | jq '.files[] | select(.score >= 0.5) | .file'
```

Lists high-risk files touched in recent history. Good signal during a retrospective or a "where should we focus refactoring?" discussion.
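The `jq` filter above selects files but does not order them. Here is a minimal Python sketch of the same selection with an explicit sort, assuming the `--json` output has the shape the filter implies (a `files` array whose entries carry `file` and `score`). The helper name and the sample data are hypothetical.

```python
import json
from typing import List


def high_risk_files(report_json: str, threshold: float = 0.5) -> List[str]:
    """Return paths with score >= threshold, highest score first."""
    report = json.loads(report_json)
    risky = [f for f in report["files"] if f["score"] >= threshold]
    risky.sort(key=lambda f: f["score"], reverse=True)
    return [f["file"] for f in risky]


# Made-up data in the assumed shape.
sample = json.dumps({"files": [
    {"file": "src/a.rs", "score": 0.55},
    {"file": "src/b.rs", "score": 0.20},
    {"file": "src/c.rs", "score": 0.81},
]})
print(high_risk_files(sample))  # ['src/c.rs', 'src/a.rs']
```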

### Trifecta for one file

```bash
codesage risk path/to/file.rs
codesage tests-for path/to/file.rs
codesage coupling path/to/file.rs --limit 5
```

Use this when you're about to dive into one specific file. Risk score, suggested tests, and historical co-change together calibrate caution before you start editing.

## Claude Code plugin

`plugins/codesage-tools/` wraps everything above into one command per task. The marketplace manifest lives at the repo root.

```bash
claude plugin marketplace add /path/to/codesage
claude plugin install codesage-tools@codesage
/codesage-onboard /path/to/project
```

Slash commands: `/codesage-onboard`, `/codesage-reset`, `/codesage-reindex`, `/codesage-bench`, `/codesage-eval`. The plugin handles global MCP registration, per-project init, indexing, and git hook install (Husky-aware). It also writes a `.claude/CLAUDE.md` hint teaching the agent how to route MCP calls.

## Indexing pipeline

`codesage index` walks the project, parses every supported file, extracts structural data and embeddings, and writes both into the same SQLite database.

```mermaid
flowchart LR
A[Project files] --> B[Discover<br/>walk + excludes]
B --> C[Tree-sitter parse]
C --> D[Extract symbols<br/>and references]
C --> E[Chunk text<br/>recursive splitter]
D --> F[(SQLite<br/>files, symbols, refs)]
E --> G[Embed via ONNX<br/>MiniLM-L6-v2]
G --> H[(sqlite-vec<br/>chunks_minilm_384)]
```

Parsing happens in parallel via Rayon; SQLite writes are batched. Re-running `codesage index` is incremental: only files whose content hash changed are re-parsed and re-embedded.
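The incremental rule can be illustrated with a short sketch: hash each file's content and reprocess only paths whose hash is new or different. The README does not say which hash function CodeSage uses or how hashes are stored, so SHA-256 and the in-memory dictionaries below are assumptions chosen for illustration.

```python
import hashlib
from typing import Dict, List


def content_hash(data: bytes) -> str:
    """Hash file contents; SHA-256 is an illustrative choice."""
    return hashlib.sha256(data).hexdigest()


def files_to_reindex(current: Dict[str, bytes], stored: Dict[str, str]) -> List[str]:
    """Return paths that are new or whose content hash differs from the stored one.

    `current` maps path -> file bytes; `stored` maps path -> hash recorded
    by the previous run.
    """
    return sorted(
        path for path, data in current.items()
        if stored.get(path) != content_hash(data)
    )


previous = {"a.py": content_hash(b"print(1)\n"), "b.py": content_hash(b"x = 2\n")}
now = {"a.py": b"print(1)\n", "b.py": b"x = 3\n", "c.py": b"pass\n"}
print(files_to_reindex(now, previous))  # ['b.py', 'c.py']
```

Hashing content rather than comparing modification times means a touched but unchanged file is skipped.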

## Search pipeline

A query flows through five stages:

```mermaid
flowchart LR
Q[Query string] --> E[Embed<br/>MiniLM-L6-v2]
E --> K[KNN retrieval<br/>sqlite-vec<br/>overfetch 5x]
K --> B[Symbol boost<br/>+0.1 per token match]
B --> R[Cross-encoder rerank<br/>ms-marco<br/>blend 50/50]
R --> A[Symbol annotation]
A --> T[Top-N results]
```

1. Embed the query with MiniLM-L6-v2 (22M params, 384d) via ONNX Runtime.
2. Retrieve nearest-neighbor chunks from sqlite-vec, overfetching 5x. Chunks are embedded at index time with file path and symbol context prepended.
3. Boost chunks whose content matches known symbol names.
4. Re-score the top candidates with ms-marco-MiniLM-L6-v2 and blend 50/50 with the semantic score.
5. Annotate each result with overlapping function and class names.

The reranker is optional. Set or remove it in `config.toml`; stages 1-3 and the annotation still run without it.
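The scoring arithmetic in the boost and rerank stages can be sketched as follows. This is a rough illustration, not CodeSage's implementation: it assumes semantic and reranker scores are already on a comparable 0 to 1 scale, that a "token match" means a query token equal to a symbol name in the chunk, and that overfetch multiplies the requested limit. All function names and sample candidates are hypothetical.

```python
from typing import List, Optional, Set


def knn_fetch_size(limit: int, overfetch: int = 5) -> int:
    """Number of candidates to pull from KNN before boosting and reranking."""
    return limit * overfetch


def symbol_boost(query: str, symbols: Set[str], per_match: float = 0.1) -> float:
    """Add per_match for each query token equal to a symbol name (case-insensitive)."""
    known = {s.lower() for s in symbols}
    return per_match * sum(1 for tok in query.lower().split() if tok in known)


def blend(semantic: float, rerank: Optional[float], weight: float = 0.5) -> float:
    """Blend 50/50 with the reranker score; pass through if the reranker is off."""
    if rerank is None:
        return semantic
    return weight * semantic + (1 - weight) * rerank


def rank(query: str, candidates: List[dict], limit: int) -> List[str]:
    """Score every candidate and return the top `limit` paths."""
    scored = []
    for c in candidates:
        boosted = c["semantic"] + symbol_boost(query, c["symbols"])
        scored.append((blend(boosted, c.get("rerank")), c["path"]))
    scored.sort(reverse=True)
    return [path for _, path in scored[:limit]]


# Made-up candidates for illustration.
candidates = [
    {"path": "a.rs", "semantic": 0.60, "symbols": {"parse", "Config"}, "rerank": 0.40},
    {"path": "b.rs", "semantic": 0.70, "symbols": {"render"}, "rerank": 0.30},
    {"path": "c.rs", "semantic": 0.65, "symbols": set(), "rerank": 0.50},
]
print(rank("parse config", candidates, limit=2))  # ['a.rs', 'c.rs']
```

In this example `a.rs` starts with the lowest semantic score but gains 0.2 from two symbol matches, which is enough to put it first after blending.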

## Configuration

`codesage init` generates `.codesage/config.toml`:

```toml
[project]
name = "my-project"

[embedding]
model = "sentence-transformers/all-MiniLM-L6-v2"
device = "gpu" # "gpu" or "cpu"
reranker = "cross-encoder/ms-marco-MiniLM-L6-v2" # optional, remove to disable

[index]
exclude_patterns = [
"**/tests/**", "**/vendor/**", "**/node_modules/**",
"**/*.test.ts", "**/*Test.php", "**/*.phpt",
]
```

Models download from HuggingFace the first time you use them.

## Architecture

A Rust workspace with six crates:

```mermaid
flowchart TD
cli[cli<br/>binary + MCP server]
gr[graph<br/>indexing + query pipeline]
parser[parser<br/>tree-sitter + discovery]
storage[storage<br/>SQLite + sqlite-vec + FTS5]
embed[embed<br/>ONNX + reranker + chunking]
protocol[protocol<br/>shared types]

cli --> gr
gr --> parser
gr --> storage
gr --> embed
parser --> protocol
storage --> protocol
embed --> protocol
gr --> protocol
```

| Crate | Role |
|-------|------|
| `protocol` | Shared types (Symbol, Reference, SearchResult) |
| `parser` | File discovery, tree-sitter parsing, symbol and reference extraction |
| `storage` | SQLite with sqlite-vec KNN and FTS5 |
| `embed` | ONNX embedding inference, cross-encoder reranking, chunking |
| `graph` | Indexing orchestration and search pipeline |
| `cli` | Binary with CLI subcommands and MCP server |

Storage is a single SQLite database per project at `.codesage/index.db`: structural tables (symbols, refs, files) plus model-specific vector tables for embeddings.

## Retrieval benchmarks

`bench/` holds the harness:

- `codesage-bench-runner` runs a YAML corpus of ground-truth cases through `codesage search` and reports miss rate, median first-hit, recall@5, and recall@10.
- `extract-eval-cases.py` mines eval cases from Claude Code session transcripts and git commit history.

Corpora aren't bundled. Bring your own, or point the plugin at `$CODESAGE_BENCH_CORPUS_DIR`.
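For readers unfamiliar with the reported metrics, here is a minimal sketch of how they could be computed from ranked results. The exact definitions used by `codesage-bench-runner` are not given above, so this assumes each case has a set of expected paths, that "first-hit" is the 1-based rank of the first expected path, and that recall@k is the share of cases with a hit in the top k. The function names and sample cases are hypothetical.

```python
from statistics import median
from typing import List, Optional, Set, Tuple


def first_hit_rank(results: List[str], expected: Set[str]) -> Optional[int]:
    """1-based rank of the first expected path in results, or None on a miss."""
    for rank, path in enumerate(results, start=1):
        if path in expected:
            return rank
    return None


def summarize(cases: List[Tuple[List[str], Set[str]]]) -> dict:
    """Aggregate miss rate, median first-hit, recall@5, and recall@10."""
    if not cases:
        raise ValueError("need at least one case")
    ranks = [first_hit_rank(results, expected) for results, expected in cases]
    hits = [r for r in ranks if r is not None]
    total = len(ranks)
    return {
        "miss_rate": (total - len(hits)) / total,
        "median_first_hit": median(hits) if hits else None,
        "recall@5": sum(1 for r in hits if r <= 5) / total,
        "recall@10": sum(1 for r in hits if r <= 10) / total,
    }


# Made-up cases: first hits at ranks 1, 3, and 7, plus one miss.
cases = [
    (["a", "b"], {"a"}),
    (["x", "y", "b"], {"b"}),
    (["f0", "f1", "f2", "f3", "f4", "f5", "c"], {"c"}),
    (["x"], {"z"}),
]
print(summarize(cases))
```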

## Contributing

See [CONTRIBUTING.md](CONTRIBUTING.md). In short: file an issue first, add a test, update `CHANGELOG.md` under `[Unreleased]` for user-visible changes.

## License

MIT