https://github.com/querymt/qndx
index your code and query it fast
- Host: GitHub
- URL: https://github.com/querymt/qndx
- Owner: querymt
- Created: 2026-03-27T12:24:24.000Z (about 2 months ago)
- Default Branch: main
- Last Pushed: 2026-04-01T23:14:45.000Z (about 2 months ago)
- Last Synced: 2026-04-02T09:28:35.724Z (about 2 months ago)
- Topics: code-search, index, indexer, indexing-engine
- Language: Rust
- Homepage:
- Size: 208 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 12
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
# qndx
Fast regex search indexer for large repositories.
qndx builds a local n-gram index over source files and uses it to narrow the search space before running the actual regex. For selective queries on large codebases, this is significantly faster than scanning every file -- while guaranteeing no false negatives.
## How it works
1. **Index**: Extract overlapping trigrams (and sparse n-grams) from every file. Store them in a sorted lookup table (`ngrams.tbl`) with postings lists (`postings.dat`) that map each n-gram to the files containing it.
2. **Search**: Decompose the regex into required literal fragments, look up their n-gram hashes in the index, intersect the posting lists to get a small candidate set, then run the full regex only against those candidates.
3. **Freshness**: Track Git working tree state. Modified, added, and untracked files are re-indexed into a lightweight overlay that merges with the baseline index at query time, giving read-your-writes semantics without a full rebuild.
Every match returned by the index path is verified against the actual file content. The index only eliminates files that provably cannot match -- it never introduces false negatives.
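The index, search, and verify steps above can be sketched in a few lines. This is a toy in-memory illustration, not qndx's implementation: `TrigramIndex`, `trigrams`, `candidates`, and `search` are hypothetical names, raw byte trigrams stand in for hashed n-grams, a `BTreeSet` stands in for the on-disk posting lists, and a substring check stands in for the full regex.

```rust
use std::collections::{BTreeSet, HashMap};

/// Toy in-memory index: trigram -> sorted set of file IDs (a posting list).
/// Hypothetical type for illustration only.
pub struct TrigramIndex {
    postings: HashMap<[u8; 3], BTreeSet<usize>>,
    files: Vec<(String, String)>, // (path, content)
}

/// Overlapping byte trigrams of `text` ("abcd" -> "abc", "bcd").
pub fn trigrams(text: &str) -> impl Iterator<Item = [u8; 3]> + '_ {
    text.as_bytes().windows(3).map(|w| [w[0], w[1], w[2]])
}

impl TrigramIndex {
    /// Index step: record, for every trigram, which files contain it.
    pub fn build(files: &[(&str, &str)]) -> Self {
        let mut postings: HashMap<[u8; 3], BTreeSet<usize>> = HashMap::new();
        for (id, (_, content)) in files.iter().enumerate() {
            for gram in trigrams(content) {
                postings.entry(gram).or_default().insert(id);
            }
        }
        let files = files
            .iter()
            .map(|(path, content)| (path.to_string(), content.to_string()))
            .collect();
        Self { postings, files }
    }

    /// Search step: intersect the posting lists of every trigram in `literal`.
    /// The result is a superset of the files that really contain the literal.
    pub fn candidates(&self, literal: &str) -> BTreeSet<usize> {
        let mut result: Option<BTreeSet<usize>> = None;
        for gram in trigrams(literal) {
            let list = self.postings.get(&gram).cloned().unwrap_or_default();
            result = Some(match result {
                None => list,
                Some(acc) => acc.intersection(&list).copied().collect(),
            });
        }
        // A literal shorter than three bytes has no trigrams, so no file can be ruled out.
        result.unwrap_or_else(|| (0..self.files.len()).collect())
    }

    /// Verify step: check each candidate against its actual content.
    pub fn search(&self, literal: &str) -> Vec<&str> {
        self.candidates(literal)
            .into_iter()
            .filter(|&id| self.files[id].1.contains(literal))
            .map(|id| self.files[id].0.as_str())
            .collect()
    }
}
```

A file that contains the literal necessarily contains all of its trigrams, so the intersection can only discard files that cannot match. The verify step then removes candidates that contain the trigrams but not the literal itself.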
## Quick start
```bash
# Build
cargo build --release
# Index a repository
qndx index -r /path/to/repo
# Search
qndx search -r /path/to/repo "fn main"
qndx search -r /path/to/repo "TODO|FIXME|HACK" --stats
qndx search -r /path/to/repo "impl.*Iterator" --strategy trigram --stats
# Inspect the query plan without searching
qndx plan "DatabaseConnection"
```
The index is stored in `.qndx/index/v1/` under the indexed root and is reused automatically on subsequent searches. If no index exists, `search` falls back to a full scan.
## Performance
Measured on a 722-file / 8.6 MB Rust codebase (querymt):
| Query | Strategy | Candidates | Scan | Indexed | Speedup |
|-------|----------|-----------|------|---------|---------|
| `enum AgentMode` | trigram | 8 / 722 | 27 ms | 0.008 ms | 3375x |
| `TODO` | trigram | 45 / 722 | 27 ms | 1.4 ms | 19x |
| `self\.\w+` | trigram | 214 / 722 | 28 ms | 6.9 ms | 4x |
| `pub fn` | trigram | 240 / 722 | 27 ms | 6.7 ms | 4x |
| `impl.*for` | trigram | 427 / 722 | 28 ms | 9.9 ms | 3x |
The index reader uses memory-mapped I/O (`memmap2`), so query-time resident memory is proportional to pages touched during the search, not the index size. A 2 GB index (Linux kernel) requires only ~100 KB of resident memory for a typical query.
## CLI reference
### `qndx index`
Build the search index for a directory.
```
qndx index [OPTIONS]

Options:
  -r, --root           Root directory to index [default: .]
  -i, --index-dir      Index output directory
      --max-file-size  Maximum file size in bytes [default: 1048576]
      --hidden         Include hidden files
      --binary         Include binary files
```
### `qndx search`
Search using regex, with optional index acceleration.
```
qndx search [OPTIONS] <PATTERN>

Options:
  -r, --root        Root directory to search [default: .]
  -i, --index-dir   Index directory
  -l, --files-only  Show only file names
      --stats       Show timing and candidate statistics
      --scan        Force scan-only mode (ignore index)
      --strategy    N-gram strategy: auto, trigram, sparse [default: auto]
```
Output format: `path:line:column: matched_text`
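For scripts that consume this output, a line can be split back into its fields. The following is a minimal sketch, assuming the path itself contains no `:` (which would not hold for Windows drive letters); `Match` and `parse_match_line` are hypothetical names and not part of qndx.

```rust
/// One parsed result line. Hypothetical type for illustration.
#[derive(Debug, PartialEq)]
pub struct Match {
    pub path: String,
    pub line: u32,
    pub column: u32,
    pub text: String,
}

/// Parse `path:line:column: matched_text`.
/// Only the first three colons are treated as separators, so colons inside
/// the matched text are preserved. Assumes the path contains no colon.
pub fn parse_match_line(input: &str) -> Option<Match> {
    let mut parts = input.splitn(4, ':');
    let path = parts.next()?;
    let line = parts.next()?.parse().ok()?;
    let column = parts.next()?.parse().ok()?;
    let rest = parts.next()?;
    // Drop the single space that follows the last separator, if present.
    let text = rest.strip_prefix(' ').unwrap_or(rest);
    Some(Match {
        path: path.to_string(),
        line,
        column,
        text: text.to_string(),
    })
}
```

Lines that do not fit the format, such as the `--stats` summary, yield `None` and can be skipped.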
When `--stats` is enabled for indexed search, qndx collects and prints a summary plus stage timings:
```
3 matches in 8 files (174185 bytes, 8 candidates / 722 total, 12 lookups, strategy: trigram) in 0.008s [indexed]
timing: open=3.412ms, plan=0.071ms, candidates=0.204ms, verify=4.033ms
```
### `qndx plan`
Show the query plan for a pattern without running a search.
```
qndx plan [OPTIONS] <PATTERN>

Options:
      --strategy  Force a specific strategy [default: auto]
```
Example output:
```
Pattern: enum AgentMode
Literals: ["enum AgentMode"]
Trigram plan:
lookups: 12
cost: 12.00
Sparse plan: unavailable (13 sparse grams >= 12 trigrams, no reduction)
Selected: trigram
Lookups: 12
Cost: 12.00
```
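The 12 lookups in this example are consistent with one lookup per overlapping trigram of the 14-byte literal `enum AgentMode` (14 - 2 = 12). The sketch below reproduces that count. It assumes the planner issues one lookup per distinct trigram, which the output suggests but this README does not state; `distinct_trigrams` is a hypothetical helper.

```rust
use std::collections::BTreeSet;

/// Distinct overlapping byte trigrams of a literal.
/// A literal of n >= 3 bytes has n - 2 windows, fewer if some repeat.
pub fn distinct_trigrams(literal: &str) -> BTreeSet<&[u8]> {
    literal.as_bytes().windows(3).collect()
}
```

For `enum AgentMode` all 12 windows are distinct, which matches the `lookups: 12` line in the plan.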
### `qndx bench`
Benchmark reporting and budget checking (see [Benchmarking](#benchmarking)).
```
qndx bench report [--format human|json] [--criterion-dir target/criterion]
qndx bench check-budgets [--budgets benchmarks/budgets.toml] [--fail-on-critical]
```
## Architecture
```
crates/
  qndx-core/    Shared types, file format, hashing, file walk, scan-only search
  qndx-index/   Index builder, memory-mapped reader, postings (Vec/Roaring/hybrid)
  qndx-query/   Regex decomposition, query planner, candidate resolution, verification
  qndx-git/     Git integration via gix (dirty detection, HEAD commit)
  qndx-cli/     CLI entrypoints
  qndx-bench/   Benchmark fixtures, report generation, budget checking
```
### Data flow
```
build                                                       search
-----                                                       ------
source files ──> walk + extract trigrams ──> ngrams.tbl     pattern
                 extract sparse n-grams  ──> postings.dat      |
                 collect metadata        ──> manifest.bin      v
                                                            decompose regex
                                                               |
                                                            plan (trigram vs sparse)
                                                               |
                                                            lookup n-gram hashes
                                                            intersect posting lists
                                                               |
                                                            candidate files
                                                               |
                                                            read + verify (full regex)
                                                               |
                                                            verified matches
```
### Index files
The index is stored in three files under `.qndx/index/v1/`:
| File | Magic | Contents |
|------|-------|----------|
| `ngrams.tbl` | `QXNG` | Sorted n-gram hash table (20 bytes per entry: hash, offset, length, flags) |
| `postings.dat` | `QXPO` | Concatenated posting blocks (tagged: varint-delta for small lists, Roaring for large) |
| `manifest.bin` | `QXMF` | Metadata and file paths (postcard-serialized) |
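To illustrate why varint-delta suits small posting lists, here is a generic sketch of the technique using LEB128-style varints. It is not qndx's actual block layout: the tag byte, integer widths, and byte order are defined in the format specification, and `encode_postings` and `decode_postings` are hypothetical names.

```rust
/// Encode a sorted (ascending) list of file IDs as varints of the gaps
/// between consecutive IDs. The first gap is the first ID itself.
/// Input must be sorted; an unsorted list would underflow the subtraction.
pub fn encode_postings(ids: &[u32]) -> Vec<u8> {
    let mut out = Vec::new();
    let mut prev = 0u32;
    for &id in ids {
        let mut delta = id - prev;
        prev = id;
        loop {
            let byte = (delta & 0x7f) as u8; // low 7 bits
            delta >>= 7;
            if delta == 0 {
                out.push(byte); // last byte: continuation bit clear
                break;
            }
            out.push(byte | 0x80); // more bytes follow
        }
    }
    out
}

/// Inverse of `encode_postings`. Expects well-formed input.
pub fn decode_postings(bytes: &[u8]) -> Vec<u32> {
    let mut ids = Vec::new();
    let (mut prev, mut delta, mut shift) = (0u32, 0u32, 0u32);
    for &b in bytes {
        delta |= u32::from(b & 0x7f) << shift;
        if b & 0x80 == 0 {
            prev += delta; // undo the delta step
            ids.push(prev);
            delta = 0;
            shift = 0;
        } else {
            shift += 7;
        }
    }
    ids
}
```

In this sketch the five IDs `[3, 7, 8, 300, 70000]` take 8 bytes instead of the 20 bytes that raw `u32` values would need, because small gaps fit in a single byte.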
Each file has a 20-byte header: 4-byte magic, u32 version, u64 payload length, u32 CRC32 checksum.
See [docs/file-format.md](docs/file-format.md) for the full specification.
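The 20-byte header adds up as 4 + 4 + 8 + 4 bytes. A reader for it might look like the sketch below, which assumes the fields are packed in the order listed and stored little-endian. Neither assumption is confirmed here, so check the specification before relying on it. `Header` and `parse_header` are hypothetical names.

```rust
use std::convert::TryInto;

/// Size of the fixed header shared by all three index files.
pub const HEADER_LEN: usize = 20;

/// Parsed file header. Hypothetical type; field order follows the README.
#[derive(Debug, PartialEq)]
pub struct Header {
    pub magic: [u8; 4],   // e.g. b"QXNG", b"QXPO", b"QXMF"
    pub version: u32,
    pub payload_len: u64,
    pub crc32: u32,
}

/// Read the header from the start of `bytes`.
/// Returns `None` if fewer than 20 bytes are available.
/// ASSUMPTION: little-endian integers with no padding between fields.
pub fn parse_header(bytes: &[u8]) -> Option<Header> {
    let h = bytes.get(..HEADER_LEN)?;
    Some(Header {
        magic: h[0..4].try_into().ok()?,
        version: u32::from_le_bytes(h[4..8].try_into().ok()?),
        payload_len: u64::from_le_bytes(h[8..16].try_into().ok()?),
        crc32: u32::from_le_bytes(h[16..20].try_into().ok()?),
    })
}
```

A real reader would also compare `magic` against the expected value for the file and validate the checksum. This sketch omits both because the README does not say which bytes the CRC32 covers.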
## Benchmarking
### Synthetic benchmarks
```bash
# Run all benchmarks
cargo bench
# Run a specific benchmark group
cargo bench -- end_to_end_search
cargo bench -- postings_choice
```
Benchmark targets: `serializer_choice`, `postings_choice`, `ngram_extract`, `query_planner`, `end_to_end_search`, `git_overlay`.
### Real corpus benchmarks
Benchmark against an actual codebase:
```bash
# Basic
QNDX_BENCH_CORPUS=~/src/linux cargo bench --bench real_corpus
# With corpus-specific patterns
QNDX_BENCH_CORPUS=~/src/linux \
QNDX_BENCH_PATTERNS=benchmarks/patterns/linux.txt \
cargo bench --bench real_corpus
# Quick validation (no Criterion iterations)
QNDX_BENCH_CORPUS=~/myproject cargo bench --bench real_corpus -- --test
# Limit files for large repos
QNDX_BENCH_MAX_FILES=5000 \
QNDX_BENCH_CORPUS=~/src/linux cargo bench --bench real_corpus
```
Environment variables:
| Variable | Required | Description |
|----------|----------|-------------|
| `QNDX_BENCH_CORPUS` | Yes | Path to the codebase |
| `QNDX_BENCH_PATTERNS` | No | Path to patterns file (tab-separated `name\tpattern` or one pattern per line) |
| `QNDX_BENCH_NAME` | No | Override corpus name in reports |
| `QNDX_BENCH_MAX_FILES` | No | Limit number of files |
| `QNDX_BENCH_MAX_FILE_SIZE` | No | Override max file size (default: 1 MB) |
HTML reports are generated by Criterion at `target/criterion/real_{name}/report/index.html`.
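The two accepted shapes of a `QNDX_BENCH_PATTERNS` file (tab-separated `name\tpattern`, or one pattern per line) can be read with a single pass. This is a sketch of one plausible reading, not the benchmark's actual loader: skipping blank lines and using the pattern as its own name when no tab is present are both assumptions, and `parse_patterns` is a hypothetical name.

```rust
/// Parse a patterns file into (name, pattern) pairs.
/// ASSUMPTIONS: blank lines are skipped, and a line without a tab is
/// treated as an unnamed pattern whose name is the pattern text itself.
pub fn parse_patterns(contents: &str) -> Vec<(String, String)> {
    contents
        .lines()
        .filter(|line| !line.trim().is_empty())
        .map(|line| match line.split_once('\t') {
            // Named form: everything before the first tab is the name.
            Some((name, pattern)) => (name.to_string(), pattern.to_string()),
            // Bare form: the whole line is the pattern.
            None => (line.to_string(), line.to_string()),
        })
        .collect()
}
```

Splitting on the first tab only keeps any later tabs as part of the pattern.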
### Regression tracking
Performance budgets are defined in [`benchmarks/budgets.toml`](benchmarks/budgets.toml). Critical budgets (end-to-end search, postings intersection) fail CI on violation. See [docs/performance-budgets.md](docs/performance-budgets.md) for details.
```bash
# Save a baseline
cargo bench -- --save-baseline main
# Compare against baseline
cargo bench -- --baseline main
# Check budgets
cargo run -p qndx-cli -- bench check-budgets
```
## Development
### Build
```bash
cargo build
cargo build --release
```
### Test
```bash
# All tests (202 tests across all crates)
cargo test --all-features
# Specific crate
cargo test -p qndx-index
cargo test -p qndx-query
# Differential tests (index results == scan results)
cargo test differential
# Regex edge cases
cargo test regex_edge_cases
```
### Lint
```bash
cargo clippy --all-features --all-targets -- -D warnings
cargo fmt --all -- --check
```
## Documentation
| Document | Description |
|----------|-------------|
| [docs/architecture.md](docs/architecture.md) | Crate structure, data flow, design decisions |
| [docs/file-format.md](docs/file-format.md) | On-disk index format specification |
| [docs/decision-gates.md](docs/decision-gates.md) | Benchmark-backed architecture decisions (serializer, postings, n-gram strategy) |
| [docs/performance-budgets.md](docs/performance-budgets.md) | Per-benchmark-group regression thresholds |
| [docs/regression-triage.md](docs/regression-triage.md) | Six-step process for investigating performance regressions |
| [docs/release-gate.md](docs/release-gate.md) | Release criteria and MVP definition of done |
## License
MIT