{"id":32622803,"url":"https://github.com/ria-19/reporag","last_synced_at":"2026-04-11T07:37:58.182Z","repository":{"id":321567888,"uuid":"1082693320","full_name":"ria-19/reporag","owner":"ria-19","description":"The Open-Source Repository Intelligence System. A resilient RAG platform for code and documentation. Converse naturally with any codebase, wiki, or issue tracker to accelerate understanding and onboarding 10x","archived":false,"fork":false,"pushed_at":"2025-10-30T09:13:45.000Z","size":78,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"master","last_synced_at":"2025-10-30T11:26:01.938Z","etag":null,"topics":["ai-assistant","codeanalysis","faiss-vector-database","github","langchain","llm-application","python","rag"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ria-19.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-10-24T16:10:05.000Z","updated_at":"2025-10-30T09:13:48.000Z","dependencies_parsed_at":"2025-10-30T11:27:03.719Z","dependency_job_id":"1e955ace-a104-4e24-b568-4deb7da93ff5","html_url":"https://github.com/ria-19/reporag","commit_stats":null,"previous_names":["ria-19/reporag"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/ria-19/reporag","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ria-19%2Freporag","tags_url":"https://repos.ecosyste.ms/api/v1/hos
ts/GitHub/repositories/ria-19%2Freporag/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ria-19%2Freporag/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ria-19%2Freporag/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ria-19","download_url":"https://codeload.github.com/ria-19/reporag/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ria-19%2Freporag/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":281873520,"owners_count":26576262,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-30T02:00:06.501Z","response_time":61,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai-assistant","codeanalysis","faiss-vector-database","github","langchain","llm-application","python","rag"],"created_at":"2025-10-30T19:51:24.700Z","updated_at":"2026-04-11T07:37:58.175Z","avatar_url":"https://github.com/ria-19.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# RepoRAG\n\nLocal-first codebase intelligence. Ask questions about any GitHub repository in natural language and get answers grounded in the actual source code.\n\n**Not a code editor. 
Not a copilot.** A system for understanding massive codebases — onboarding, open-source contribution, structural questions.\n\n```\nPOST /query\n{\n  \"question\": \"how does dependency injection work?\",\n  \"repo_name\": \"fastapi\"\n}\n\n→ {\n  \"answer\": \"FastAPI's DI system works by...\",\n  \"sources\": [{\"file_path\": \"fastapi/dependencies/utils.py\", ...}],\n  \"metrics\": {\"total_latency_ms\": 840, \"chunks_retrieved\": 12, ...}\n}\n```\n\nEvery response includes per-step latency, token counts, and retrieval scores. No magic.\n\n---\n\n## What It Does\n\n1. **Indexes a GitHub repo** — clones (shallow), parses with tree-sitter, embeds with nomic-embed-text-v1.5, writes to LanceDB (vector) + KuzuDB (graph)\n2. **Answers questions** — hybrid retrieval (vector + BM25) → graph expansion → LLM generation → faithfulness validation\n3. **Measures itself** — Precision@5, Context Recall, Answer Relevance, Faithfulness score on a golden dataset\n\n---\n\n## Stack\n\n| Layer | Choice | Why |\n|---|---|---|\n| Parser | tree-sitter \u003e= 0.22 | 40+ languages, single API, binary wheels |\n| Embedding | nomic-embed-text-v1.5 (768d MRL) | Matryoshka dims — two vectors, one forward pass |\n| Vector + BM25 | LanceDB | Embedded Rust, hybrid search, IVF-PQ, no server |\n| Graph | KuzuDB | Embedded, columnar, SIMD, Cypher |\n| Merge | RRF (Reciprocal Rank Fusion) | Rank-based, no score normalization needed |\n| LLM | Ollama (local) or Gemini (cloud) | Config switch, no code change |\n| API | FastAPI (sync handlers) | Sync for CPU-heavy paths, ThreadPool managed |\n\n---\n\n## Setup\n\n### Prerequisites\n- Python 3.11+\n- [uv](https://github.com/astral-sh/uv) package manager\n- [Ollama](https://ollama.com) installed and running (for local LLM)\n- `git` on PATH\n\n### Install\n\n```bash\ngit clone \u003cthis-repo\u003e\ncd reporag\n\n# Create virtualenv and install all dependencies\nuv venv\nsource .venv/bin/activate\nuv sync\n```\n\n### Configure\n\n```bash\ncp 
.env.example .env\n# Edit .env — defaults work for local dev with Ollama\n```\n\nKey settings:\n```bash\nLLM_PROVIDER=ollama       # or: gemini\nLLM_MODEL=llama3.2        # any model pulled in Ollama\nEMBEDDING_DEVICE=cpu      # or: cuda, mps\n```\n\nFor Gemini:\n```bash\nLLM_PROVIDER=gemini\nGEMINI_API_KEY=your_key_here\nGEMINI_MODEL=gemini-1.5-flash\n```\n\n### Pull Ollama model (if using local LLM)\n\n```bash\nollama pull llama3.2\n```\n\nThe API will auto-pull on first startup if the model is missing — but pulling manually avoids a long wait on first request.\n\n---\n\n## Running\n\n```bash\n# Start the API\nuv run uvicorn api:app --reload --port 8000\n\n# Verify it's alive\ncurl http://localhost:8000/health\n```\n\n---\n\n## Usage\n\n### Index a repository\n\n```bash\ncurl -X POST http://localhost:8000/index \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\n    \"github_url\": \"https://github.com/tiangolo/fastapi\",\n    \"repo_name\": \"fastapi\",\n    \"force\": false\n  }'\n```\n\nResponse:\n```json\n{\n  \"repo\": \"fastapi\",\n  \"files_processed\": 87,\n  \"files_skipped\": 0,\n  \"chunks_indexed\": 1243,\n  \"elapsed_seconds\": 142.3\n}\n```\n\n`force: true` clears the WAL and re-indexes from scratch.\n\n### Check what's indexed\n\n```bash\ncurl http://localhost:8000/health\n```\n\n```json\n{\n  \"status\": \"ok\",\n  \"lance\": {\"chunks_indexed\": 1243, \"table_name\": \"chunks\"},\n  \"kuzu\":  {\"nodes\": 1243, \"edges_calls\": 847, \"edges_imports\": 0}\n}\n```\n\n### Query\n\n```bash\ncurl -X POST http://localhost:8000/query \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\n    \"question\": \"how does dependency injection work?\",\n    \"repo_name\": \"fastapi\"\n  }'\n```\n\nWith filters:\n```bash\n# Exact symbol lookup\ncurl -X POST http://localhost:8000/query \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\n    \"question\": \"what does solve_shared do?\",\n    \"repo_name\": \"fastapi\",\n    \"symbol\": \"solve_shared\"\n  }'\n\n# File filter\ncurl -X POST 
http://localhost:8000/query \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\n    \"question\": \"how is routing set up?\",\n    \"repo_name\": \"fastapi\",\n    \"filename\": \"routing\"\n  }'\n\n# Language filter\ncurl -X POST http://localhost:8000/query \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\n    \"question\": \"how are requests validated?\",\n    \"repo_name\": \"fastapi\",\n    \"language\": \"python\"\n  }'\n```\n\nResponse:\n```json\n{\n  \"question\": \"how does dependency injection work?\",\n  \"answer\": \"FastAPI's DI system inspects function signatures...\",\n  \"route\": \"conceptual\",\n  \"sources\": [\n    {\n      \"file_path\": \"fastapi/dependencies/utils.py\",\n      \"symbol_name\": \"solve_dependencies\",\n      \"chunk_type\": \"function\",\n      \"start_line\": 142,\n      \"end_line\": 198,\n      \"snippet\": \"async def solve_dependencies(...\",\n      \"rrf_score\": 0.043,\n      \"source\": \"hybrid\"\n    }\n  ],\n  \"metrics\": {\n    \"total_latency_ms\": 840,\n    \"chunks_retrieved\": 8,\n    \"tokens_used\": 1240,\n    \"embed_query\": {\"latency_ms\": 45, \"output_count\": 2},\n    \"hybrid_search\": {\"latency_ms\": 120, \"output_count\": 5},\n    \"graph_expand\": {\"latency_ms\": 18, \"output_count\": 12},\n    \"llm_generate\": {\"latency_ms\": 620, \"tokens_used\": 1240}\n  },\n  \"faithfulness\": 0.82,\n  \"answer_relevance\": null\n}\n```\n\n`answer_relevance` is `null` when the LLM judge was not sampled for this request (10% sample rate in production).\n\n### Run evaluation\n\n```bash\n# Generate golden dataset first\nuv run python scripts/generate_golden_dataset.py \\\n  --repo fastapi \\\n  --n 20 \\\n  --output data/qa_pairs.jsonl\n\n# Run eval\nuv run python eval.py \\\n  --dataset data/qa_pairs.jsonl \\\n  --output data/eval_results.json\n```\n\nOr via HTTP:\n```bash\ncurl http://localhost:8000/eval\n```\n\n---\n\n## Architecture\n\n### Write Path (Indexing)\n\n```\nGitHub URL\n  → git clone --depth=1 (temp dir, auto-cleaned)\n  → LocalRepoLoader.stream_files() → RawFile (one at a time, 
constant RAM)\n  → WAL check: skip if already indexed (resume on crash)\n  → CodeParser.parse(raw_file) → CodeChunk[], RawEdge[]\n      Python: tree-sitter recursive walk, extracts functions/classes/imports\n      JS/TS:  same, plus JSDoc extraction\n      Other:  FallbackStrategy (line-based chunking, no symbol extraction)\n  → Batch accumulate (batch_size=32 by default)\n  → NomicEmbedder.embed_batch() → (vector_768[], vector_128[])\n      One forward pass, two vectors via MRL truncation\n  → LanceStore.add_chunks() — merge_insert on chunk_id (idempotent)\n  → KuzuStore.add_nodes() — MERGE on chunk_id (idempotent)\n  → WAL.record(\"file_indexed\", repo=repo_name, path=...)\n  \n  [Pass 2 — after all files]\n  → resolve_edges(repo_name, all_raw_edges, name_map)\n      Matches call targets to known chunk_ids\n      Unresolved = external library = safely dropped\n  → KuzuStore.add_edges() — MERGE (idempotent, silent skip if node missing)\n  → WAL.record(\"edges_written\", repo=repo_name)\n```\n\n### Read Path (Query)\n\n```\nQueryRequest (question, repo_name, symbol?, filename?, language?, k=5)\n  → Step 1: Parse filters (already validated by Pydantic)\n  → Step 2: Exact symbol lookup (if --symbol provided)\n      If \u003e 10 results: too ambiguous, fall through to hybrid search\n  → Step 3: Route query → prompt template\n      BoW keyword matching: debugging \u003e setup \u003e code_search \u003e conceptual\n      Default on zero matches: conceptual\n  → Step 4: Embed query (768d + 128d, QUERY_PREFIX)\n  → Step 5: Hybrid search (if symbol lookup didn't find enough)\n      Stage 1: vector_128 IVF-PQ ANN → top 1000 candidates (fast)\n      Stage 2: vector_768 flat numpy cosine on 1000 candidates (precise)\n      BM25: FTS on search_content column\n      Merge: RRF(stage2_ranks, bm25_ranks) → top k\n  → Step 6: Graph expand (single-hop, max 20 neighbors)\n      MATCH (a)-[:CALLS|IMPORTS]-\u003e(b) WHERE a.chunk_id IN [seeds]\n  → Step 7: Fetch neighbor text from 
LanceDB by chunk_ids\n  → Step 8: Merge + deduplicate (seeds + neighbors)\n  → Step 9: Assemble context\n      Interleave: retrieved chunks at attention peaks (start/end)\n      Graph neighbors in middle\n      Cap at 32,000 chars\n  → Step 10: Generate\n      prompt_template.format(context=..., question=...)\n      OllamaLLM or GeminiLLM\n  → Step 11: Validate\n      CosineValidator: always (cosine(embed(answer), embed(context)))\n      LLMJudge: 10% sample (faithfulness + answer relevance)\n  → QueryResult (answer, sources, metrics, faithfulness, answer_relevance)\n```\n\n### Multi-Tenant Isolation\n\nAll data is namespaced by `repo_name`:\n\n| Layer | Isolation mechanism |\n|---|---|\n| chunk_id | `{repo}::{safe_path}::{class}.{func}::{hash[:8]}` |\n| WAL keys | `op::repo::path=value` |\n| LanceDB | `repo_name` column + `WHERE repo_name = $repo` on all searches |\n| KuzuDB | `repo_name` property on every Chunk node + enforced in all MATCH patterns |\n\n### MRL Cascade (Two-Stage Retrieval)\n\n```\nIndex time:\n  Store vector_768 (768d, precise) + vector_128 (128d, fast) per chunk\n  One forward pass — truncate + re-normalize for 128d\n\nQuery time:\n  Stage 1: IVF-PQ ANN on vector_128 → top 1000 candidates (fast)\n  Stage 2: flat numpy cosine on vector_768 for candidates only (exact, ~microseconds)\n  \nCost: quality close to full 768d, speed close to 128d\n```\n\n---\n\n## Repo Structure\n\n```\nreporag/\n├── src/\n│   ├── core/\n│   │   ├── models.py          Domain models: CodeChunk, GraphEdge, QueryResult...\n│   │   └── ports.py           Write-path Protocols\n│   ├── ingestion/\n│   │   └── loaders.py         LocalRepoLoader, GitHubRepoLoader\n│   ├── chunking/\n│   │   ├── parser.py          CodeParser, PythonStrategy, JavaScriptStrategy, FallbackStrategy\n│   │   └── graph.py           resolve_edges()\n│   ├── embedding/\n│   │   └── embedder.py        NomicEmbedder, MRL cascade\n│   ├── storage/\n│   │   ├── lance_store.py     LanceStore — hybrid 
search, upsert, FTS\n│   │   └── kuzu_store.py      KuzuStore — graph nodes, edges, expansion\n│   ├── query/\n│   │   ├── pipeline.py        QueryPipeline — 11-step read path\n│   │   ├── router.py          QueryRouter (BoW), ContextAssembler\n│   │   └── validator.py       CosineValidator, LLMJudge\n│   ├── observability/\n│   │   └── metrics.py         Timer (context manager), StepMetrics\n│   ├── llm.py                 OllamaLLM, GeminiLLM, build_llm()\n│   ├── indexer.py             Indexer, WAL\n│   ├── config.py              Settings (pydantic-settings, all env vars)\n│   └── logger.py              setup_logging(), get_logger()\n├── scripts/\n│   └── generate_golden_dataset.py  Golden QA pair generation\n├── data/\n│   ├── qa_pairs.jsonl         Generated golden dataset\n│   └── eval_results.json      Eval output\n├── api.py                     FastAPI app, lifespan, 4 endpoints\n├── eval.py                    Evaluation harness + CLI entry point\n├── DECISIONS.md               All major architectural decisions\n├── FAILURES.md                All bugs found and case studies\n├── RESULTS.md                 Eval results (fill after running eval)\n├── pyproject.toml\n└── .env.example\n```\n\n---\n\n## Eval Metrics\n\n| Metric | Formula | What it catches |\n|---|---|---|\n| Precision@5 | `\\|retrieved[:5] ∩ expected\\| / 5` | Retrieval noise |\n| Context Recall | `\\|retrieved ∩ expected\\| / \\|expected\\|` | Missing relevant chunks |\n| Answer Relevance | `cosine(embed(answer), embed(ground_truth))` | Off-topic answers |\n| Faithfulness | LLM judge: is answer grounded in context? 
| Hallucination |\n\nDiagnosing failures:\n- **Low Precision + Low Recall** → retrieval problem (embedding quality, BM25 config, chunking)\n- **High Precision + Low Faithfulness** → generation problem (prompt, model, context assembly)\n- **Low Answer Relevance** → routing problem (wrong prompt template selected)\n\n---\n\n## Limitations\n\n- **IMPORTS graph not implemented** — only CALLS edges. File-level dependency traversal is V2.\n- **Languages:** Python, JS, TypeScript only. Other languages chunked as plaintext (no symbol extraction).\n- **Query router ~90% accuracy** — BoW fails on ambiguous queries. LLM-based routing would add +500ms.\n- **Reranker not implemented** — graph neighbors ranked only by retrieval order, not query relevance.\n- **WAL is single-process** — `threading.Lock()` protects threads; not safe for `uvicorn --workers N`.\n\nSee `FAILURES.md` for full details on each limitation and the V2 upgrade path.\n\n---\n\n## Development\n\n```bash\n# Lint\nuv run ruff check .\n\n# Format\nuv run ruff format .\n\n# Type check\nuv run mypy src/\n\n# Tests (when written)\nuv run pytest\n```\n\n---\n\n## Design Philosophy\n\n**Transparent over magical.** Every query shows exactly what was retrieved, why, and how long each step took.\n\n**Measurable over assumed.** Nothing is added unless eval shows it helps. The reranker is not in v1 — it will be added only if faithfulness score proves it's needed.\n\n**Embedded over networked.** LanceDB and KuzuDB run in-process. No servers to manage, no network latency, no auth.\n\n**Local-first.** Works fully offline with Ollama. Cloud LLM is opt-in, not required.","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fria-19%2Freporag","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fria-19%2Freporag","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fria-19%2Freporag/lists"}