{"id":47791141,"url":"https://github.com/querymt/qndx","last_synced_at":"2026-04-03T15:39:17.509Z","repository":{"id":348600922,"uuid":"1193624909","full_name":"querymt/qndx","owner":"querymt","description":"index your code and query it fast","archived":false,"fork":false,"pushed_at":"2026-04-01T23:14:45.000Z","size":213,"stargazers_count":0,"open_issues_count":12,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-04-02T09:28:35.724Z","etag":null,"topics":["code-search","index","indexer","indexing-engine"],"latest_commit_sha":null,"homepage":"","language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/querymt.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-03-27T12:24:24.000Z","updated_at":"2026-04-01T23:14:22.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/querymt/qndx","commit_stats":null,"previous_names":["querymt/qndx"],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/querymt/qndx","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/querymt%2Fqndx","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/querymt%2Fqndx/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/querymt%2Fqndx/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/querymt%2Fqndx/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/querymt","download_url":"https://codeload.github.com/querymt/qndx/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/querymt%2Fqndx/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31360807,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-03T15:19:21.178Z","status":"ssl_error","status_checked_at":"2026-04-03T15:19:20.670Z","response_time":107,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["code-search","index","indexer","indexing-engine"],"created_at":"2026-04-03T15:39:16.039Z","updated_at":"2026-04-03T15:39:17.503Z","avatar_url":"https://github.com/querymt.png","language":"Rust","funding_links":[],"categories":[],"sub_categories":[],"readme":"# qndx\n\nFast regex search indexer for large repositories.\n\nqndx builds a local n-gram index over source files and uses it to narrow the search space before running the actual regex. For selective queries on large codebases, this is significantly faster than scanning every file -- while guaranteeing no false negatives.\n\n## How it works\n\n1. **Index**: Extract overlapping trigrams (and sparse n-grams) from every file. Store them in a sorted lookup table (`ngrams.tbl`) with postings lists (`postings.dat`) that map each n-gram to the files containing it.\n\n2. **Search**: Decompose the regex into required literal fragments, look up their n-gram hashes in the index, intersect the posting lists to get a small candidate set, then run the full regex only against those candidates.\n\n3. **Freshness**: Track Git working tree state. Modified, added, and untracked files are re-indexed into a lightweight overlay that merges with the baseline index at query time, giving read-your-writes semantics without a full rebuild.\n\nEvery match returned by the index path is verified against the actual file content. The index only eliminates files that provably cannot match -- it never introduces false negatives.\n\n## Quick start\n\n```bash\n# Build\ncargo build --release\n\n# Index a repository\nqndx index -r /path/to/repo\n\n# Search\nqndx search -r /path/to/repo \"fn main\"\nqndx search -r /path/to/repo \"TODO|FIXME|HACK\" --stats\nqndx search -r /path/to/repo \"impl.*Iterator\" --strategy trigram --stats\n\n# Inspect the query plan without searching\nqndx plan \"DatabaseConnection\"\n```\n\nThe index is stored in `\u003croot\u003e/.qndx/index/v1/` and reused automatically on subsequent searches. If no index exists, `search` falls back to a full scan.\n\n## Performance\n\nMeasured on a 722-file / 8.6 MB Rust codebase (querymt):\n\n| Query | Strategy | Candidates | Scan | Indexed | Speedup |\n|-------|----------|-----------|------|---------|---------|\n| `enum AgentMode` | trigram | 8 / 722 | 27 ms | 0.008 ms | 3375x |\n| `TODO` | trigram | 45 / 722 | 27 ms | 1.4 ms | 19x |\n| `self\\.\\w+` | trigram | 214 / 722 | 28 ms | 6.9 ms | 4x |\n| `pub fn` | trigram | 240 / 722 | 27 ms | 6.7 ms | 4x |\n| `impl.*for` | trigram | 427 / 722 | 28 ms | 9.9 ms | 3x |\n\nThe index reader uses memory-mapped I/O (`memmap2`), so query-time resident memory is proportional to pages touched during the search, not the index size. A 2 GB index (Linux kernel) requires only ~100 KB of resident memory for a typical query.\n\n## CLI reference\n\n### `qndx index`\n\nBuild the search index for a directory.\n\n```\nqndx index [OPTIONS]\n\nOptions:\n  -r, --root \u003cROOT\u003e                    Root directory to index [default: .]\n  -i, --index-dir \u003cINDEX_DIR\u003e          Index output directory\n      --max-file-size \u003cMAX_FILE_SIZE\u003e  Maximum file size in bytes [default: 1048576]\n      --hidden                         Include hidden files\n      --binary                         Include binary files\n```\n\n### `qndx search`\n\nSearch using regex, with optional index acceleration.\n\n```\nqndx search [OPTIONS] \u003cPATTERN\u003e\n\nOptions:\n  -r, --root \u003cROOT\u003e            Root directory to search [default: .]\n  -i, --index-dir \u003cINDEX_DIR\u003e  Index directory\n  -l, --files-only             Show only file names\n      --stats                  Show timing and candidate statistics\n      --scan                   Force scan-only mode (ignore index)\n      --strategy \u003cSTRATEGY\u003e    N-gram strategy: auto, trigram, sparse [default: auto]\n```\n\nOutput format: `path:line:column: matched_text`\n\nWhen `--stats` is enabled for indexed search, qndx collects and prints a summary plus stage timings:\n\n```\n3 matches in 8 files (174185 bytes, 8 candidates / 722 total, 12 lookups, strategy: trigram) in 0.008s [indexed]\n  timing: open=3.412ms, plan=0.071ms, candidates=0.204ms, verify=4.033ms\n```\n\n### `qndx plan`\n\nShow the query plan for a pattern without running a search.\n\n```\nqndx plan [OPTIONS] \u003cPATTERN\u003e\n\nOptions:\n      --strategy \u003cSTRATEGY\u003e  Force a specific strategy [default: auto]\n```\n\nExample output:\n\n```\nPattern: enum AgentMode\n\nLiterals: [\"enum AgentMode\"]\n\nTrigram plan:\n  lookups: 12\n  cost:    12.00\n\nSparse plan: unavailable (13 sparse grams \u003e= 12 trigrams, no reduction)\n\nSelected:  trigram\nLookups:   12\nCost:      12.00\n```\n\n### `qndx bench`\n\nBenchmark reporting and budget checking (see [Benchmarking](#benchmarking)).\n\n```\nqndx bench report [--format human|json] [--criterion-dir target/criterion]\nqndx bench check-budgets [--budgets benchmarks/budgets.toml] [--fail-on-critical]\n```\n\n## Architecture\n\n```\ncrates/\n  qndx-core/     Shared types, file format, hashing, file walk, scan-only search\n  qndx-index/    Index builder, memory-mapped reader, postings (Vec/Roaring/hybrid)\n  qndx-query/    Regex decomposition, query planner, candidate resolution, verification\n  qndx-git/      Git integration via gix (dirty detection, HEAD commit)\n  qndx-cli/      CLI entrypoints\n  qndx-bench/    Benchmark fixtures, report generation, budget checking\n```\n\n### Data flow\n\n```\n                     build                              search\n                     -----                              ------\n\n  source files ──\u003e walk + extract trigrams ──\u003e ngrams.tbl      pattern\n                   extract sparse n-grams ──\u003e postings.dat        |\n                   collect metadata       ──\u003e manifest.bin        v\n                                                           decompose regex\n                                                                  |\n                                                           plan (trigram vs sparse)\n                                                                  |\n                                                           lookup n-gram hashes\n                                                           intersect posting lists\n                                                                  |\n                                                           candidate files\n                                                                  |\n                                                           read + verify (full regex)\n                                                                  |\n                                                           verified matches\n```\n\n### Index files\n\nThe index is stored in three files under `.qndx/index/v1/`:\n\n| File | Magic | Contents |\n|------|-------|----------|\n| `ngrams.tbl` | `QXNG` | Sorted n-gram hash table (20 bytes per entry: hash, offset, length, flags) |\n| `postings.dat` | `QXPO` | Concatenated posting blocks (tagged: varint-delta for small lists, Roaring for large) |\n| `manifest.bin` | `QXMF` | Metadata and file paths (postcard-serialized) |\n\nEach file has a 20-byte header: 4-byte magic, u32 version, u64 payload length, u32 CRC32 checksum.\n\nSee [docs/file-format.md](docs/file-format.md) for the full specification.\n\n## Benchmarking\n\n### Synthetic benchmarks\n\n```bash\n# Run all benchmarks\ncargo bench\n\n# Run a specific benchmark group\ncargo bench -- end_to_end_search\ncargo bench -- postings_choice\n```\n\nBenchmark targets: `serializer_choice`, `postings_choice`, `ngram_extract`, `query_planner`, `end_to_end_search`, `git_overlay`.\n\n### Real corpus benchmarks\n\nBenchmark against an actual codebase:\n\n```bash\n# Basic\nQNDX_BENCH_CORPUS=~/src/linux cargo bench --bench real_corpus\n\n# With corpus-specific patterns\nQNDX_BENCH_CORPUS=~/src/linux \\\nQNDX_BENCH_PATTERNS=benchmarks/patterns/linux.txt \\\ncargo bench --bench real_corpus\n\n# Quick validation (no Criterion iterations)\nQNDX_BENCH_CORPUS=~/myproject cargo bench --bench real_corpus -- --test\n\n# Limit files for large repos\nQNDX_BENCH_MAX_FILES=5000 \\\nQNDX_BENCH_CORPUS=~/src/linux cargo bench --bench real_corpus\n```\n\nEnvironment variables:\n\n| Variable | Required | Description |\n|----------|----------|-------------|\n| `QNDX_BENCH_CORPUS` | Yes | Path to the codebase |\n| `QNDX_BENCH_PATTERNS` | No | Path to patterns file (tab-separated `name\\tpattern` or one pattern per line) |\n| `QNDX_BENCH_NAME` | No | Override corpus name in reports |\n| `QNDX_BENCH_MAX_FILES` | No | Limit number of files |\n| `QNDX_BENCH_MAX_FILE_SIZE` | No | Override max file size (default: 1 MB) |\n\nHTML reports are generated by Criterion at `target/criterion/real_{name}/report/index.html`.\n\n### Regression tracking\n\nPerformance budgets are defined in [`benchmarks/budgets.toml`](benchmarks/budgets.toml). Critical budgets (end-to-end search, postings intersection) fail CI on violation. See [docs/performance-budgets.md](docs/performance-budgets.md) for details.\n\n```bash\n# Save a baseline\ncargo bench -- --save-baseline main\n\n# Compare against baseline\ncargo bench -- --baseline main\n\n# Check budgets\ncargo run -p qndx-cli -- bench check-budgets\n```\n\n## Development\n\n### Build\n\n```bash\ncargo build\ncargo build --release\n```\n\n### Test\n\n```bash\n# All tests (202 tests across all crates)\ncargo test --all-features\n\n# Specific crate\ncargo test -p qndx-index\ncargo test -p qndx-query\n\n# Differential tests (index results == scan results)\ncargo test differential\n\n# Regex edge cases\ncargo test regex_edge_cases\n```\n\n### Lint\n\n```bash\ncargo clippy --all-features --all-targets -- -D warnings\ncargo fmt --all -- --check\n```\n\n## Documentation\n\n| Document | Description |\n|----------|-------------|\n| [docs/architecture.md](docs/architecture.md) | Crate structure, data flow, design decisions |\n| [docs/file-format.md](docs/file-format.md) | On-disk index format specification |\n| [docs/decision-gates.md](docs/decision-gates.md) | Benchmark-backed architecture decisions (serializer, postings, n-gram strategy) |\n| [docs/performance-budgets.md](docs/performance-budgets.md) | Per-benchmark-group regression thresholds |\n| [docs/regression-triage.md](docs/regression-triage.md) | Six-step process for investigating performance regressions |\n| [docs/release-gate.md](docs/release-gate.md) | Release criteria and MVP definition of done |\n\n## License\n\nMIT\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fquerymt%2Fqndx","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fquerymt%2Fqndx","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fquerymt%2Fqndx/lists"}