{"id":50829490,"url":"https://github.com/moazmo/pr-sentinel","last_synced_at":"2026-06-13T22:00:35.847Z","repository":{"id":363926974,"uuid":"1265322755","full_name":"moazmo/pr-sentinel","owner":"moazmo","description":"Multi-agent code review for your pull requests — runs in your CI, bring your own key, shows which agent found what.","archived":false,"fork":false,"pushed_at":"2026-06-10T23:52:59.000Z","size":1630,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-06-11T01:10:39.278Z","etag":null,"topics":["ai","code-review","developer-tools","github-actions","langgraph","llm","multi-agent"],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/moazmo.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":"SECURITY.md","support":null,"governance":null,"roadmap":"ROADMAP.md","authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-06-10T17:07:04.000Z","updated_at":"2026-06-10T23:53:03.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/moazmo/pr-sentinel","commit_stats":null,"previous_names":["moazmo/pr-sentinel"],"tags_count":5,"template":false,"template_full_name":null,"purl":"pkg:github/moazmo/pr-sentinel","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/moazmo%2Fpr-sentinel","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/moazmo%2Fpr-sentinel/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/moazmo%2Fpr-sentinel/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/moazmo%2Fpr-sentinel/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/moazmo","download_url":"https://codeload.github.com/moazmo/pr-sentinel/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/moazmo%2Fpr-sentinel/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34301732,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-13T02:00:06.617Z","response_time":62,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai","code-review","developer-tools","github-actions","langgraph","llm","multi-agent"],"created_at":"2026-06-13T22:00:17.138Z","updated_at":"2026-06-13T22:00:35.832Z","avatar_url":"https://github.com/moazmo.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# 🛡️ PR Sentinel\n\n**Multi-agent code review for your pull requests — runs in your CI, brings your own key, and shows you which agent found what.**\n\n[![CI](https://github.com/moazmo/pr-sentinel/actions/workflows/ci.yml/badge.svg)](https://github.com/moazmo/pr-sentinel/actions/workflows/ci.yml)\n[![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](LICENSE)\n[![Python 3.12](https://img.shields.io/badge/python-3.12-blue.svg)](https://www.python.org/)\n[![evals 49/51](https://img.shields.io/badge/evals-49%2F51%20%C2%B7%200%20false%20positives-brightgreen.svg)](#accuracy-is-a-systems-problem-not-a-model-size-problem)\n[![cost ~$0.005/review](https://img.shields.io/badge/cost-~%240.005%2Freview-blue.svg)](#what-it-costs)\n\n![PR Sentinel reviewing a pull request](assets/demo.gif)\n\nFive specialized LLM agents — **Architect, Security, Performance, Test, and Reviewer** — each examine your PR diff from a different angle, then merge into **one prioritized, deduplicated comment**. No walls of noise, no black box: every finding is attributed to the agent that raised it, and every agent prompt is [readable in this repo](src/pr_sentinel/prompts/).\n\n## The problem\n\nCode review is the most expensive bottleneck in most teams. Senior engineers burn hours reviewing PRs; under-reviewed code ships bugs; and most AI tools help *write* code, not critically *review* it. The AI reviewers that do exist are usually noisy black boxes — and false positives are why people uninstall them.\n\n## 30-second install\n\nAdd `.github/workflows/pr-sentinel.yml` to your repo:\n\n```yaml\nname: PR Sentinel\non:\n  pull_request:\n    types: [opened, synchronize, reopened]\npermissions:\n  contents: read\n  pull-requests: write\njobs:\n  review:\n    runs-on: ubuntu-latest\n    steps:\n      - uses: moazmo/pr-sentinel@v2\n        with:\n          api_key: ${{ secrets.PR_SENTINEL_API_KEY }}\n```\n\nThen add one repository secret: **Settings → Secrets and variables → Actions → New repository secret**, name it `PR_SENTINEL_API_KEY`, paste your LLM provider key. Done — no checkout step, no other configuration required.\n\n\u003e This workflow is the hardened version on purpose: `pull_request` trigger (never `pull_request_target`) and minimal permissions. See [Security model](#security-model).\n\n## What it costs\n\nPR Sentinel speaks the **OpenAI-compatible protocol with a configurable `base_url`** — one integration reaches OpenAI, OpenRouter, Groq, DeepSeek, Mistral, and local Ollama. A typical PR (~3k diff tokens × 4 analysts + reviewer) costs:\n\n| Route | Model | $/1M in / out | Typical PR |\n|---|---|---|---|\n| Zero-config default | OpenAI `gpt-5-mini` | $0.25 / $2.00 | **≈ $0.01** |\n| Cheapest strong option | DeepSeek V4 Flash | $0.14 / $0.28 | **≈ $0.004** |\n| Best cheap closed-model | Claude Haiku 4.5 (via OpenRouter) | $1.00 / $5.00 | ≈ $0.03 |\n| Free | OpenRouter free models | $0 | $0 (rate-limited) |\n| Fully private | Ollama on a self-hosted runner | $0 | $0 — code never leaves your infra |\n\nTo use the cheapest option, drop this in `.pr-sentinel.yml`:\n\n```yaml\nprovider:\n  base_url: https://api.deepseek.com/v1\n  model: deepseek-v4-flash\n```\n\nEvery review comment shows its own token count and estimated cost in the footer. There's also a `dry_run: true` mode that posts a cost estimate **without making any LLM calls** — try PR Sentinel before spending a cent.\n\n## The agents\n\n| Agent | Looks for |\n|---|---|\n| 🏛️ **Architect** | Separation-of-concerns violations, leaky abstractions, coupling, misleading naming |\n| 🔒 **Security** | Injection (SQL/shell/XSS), exposed secrets, authz/authn gaps, unsafe deserialization |\n| ⚡ **Performance** | O(n²) patterns, N+1 queries, blocking calls in async paths, unnecessary allocations |\n| 🧪 **Test** | New behavior without tests, untested error paths, assertions removed, broad mocks |\n| 🔎 **Verifier** | Adjudicates every surviving finding against the diff — confirm / reject / downgrade — before anything posts |\n| 🧠 **Reviewer** | The aggregator: resolves semantic duplicates, cuts noise, writes the final review |\n\nThe Verifier + Reviewer are the difference between \"multi-agent\" and \"a wall of noise\": the Reviewer's prompt is explicitly biased — *when in doubt, drop the finding; three real issues beat thirty maybes* — and the Verifier fact-checks each finding against the code first. All prompts live in [`src/pr_sentinel/prompts/`](src/pr_sentinel/prompts/) — read them, tune them, PR them.\n\n## Architecture\n\n```mermaid\ngraph LR\n    A[ingest\u003cbr/\u003e\u003ci\u003efiles, skip rules,\u003cbr/\u003enumbered hunks, PR map\u003c/i\u003e] --\u003e B[🏛️ architect ×3]\n    A --\u003e C[🔒 security ×3]\n    A --\u003e D[⚡ performance ×3]\n    A --\u003e E[🧪 test ×3]\n    B --\u003e F[merge\u003cbr/\u003e\u003ci\u003evote · anchor evidence\u003cbr/\u003e· dedup · cluster\u003c/i\u003e]\n    C --\u003e F\n    D --\u003e F\n    E --\u003e F\n    F --\u003e V[🔎 verifier\u003cbr/\u003e\u003ci\u003econfirm / reject /\u003cbr/\u003edowngrade vs code\u003c/i\u003e]\n    V --\u003e G[🧠 reviewer\u003cbr/\u003e\u003ci\u003esemantic dedup,\u003cbr/\u003enoise cut, prose\u003c/i\u003e]\n    G --\u003e H[publish\u003cbr/\u003e\u003ci\u003einline + summary,\u003cbr/\u003escrub, sticky\u003c/i\u003e]\n```\n\nA fan-out/fan-in [LangGraph](https://github.com/langchain-ai/langgraph) graph, no loops. Each analyst runs **3 samples in parallel** and majority-votes (self-consistency); the merge pass anchors every finding's quoted evidence to a real diff line (dropping hallucinations) and deterministically clusters duplicates; the Verifier adjudicates; the Reviewer resolves semantic duplicates and writes prose. If an agent fails, the others still report — partial review beats no review. Findings anchored to an added line post as **inline review comments**; the rest stay in one sticky summary comment.\n\nLarge PRs: files are fetched via the paginated files API (the only endpoint that doesn't fall over past 3,000 lines), reviewed per-file within token budgets with a shared \"PR map\" for cross-file context (plus ±8 lines of head-ref context per hunk), and anything truncated or skipped is **disclosed in the comment**, never silently dropped. When the file cap bites, the highest-review-priority files (source over docs, by churn) are kept.\n\n## Configuration\n\nOptional `.pr-sentinel.yml` at the repo root — zero config works out of the box. All fields and their defaults:\n\n```yaml\nmode: \"\"                      # preset: fast | balanced | thorough (overrides the accuracy block)\nprovider:\n  base_url: https://api.openai.com/v1     # any OpenAI-compatible endpoint\n  model: gpt-5-mini\n  api_key_env: PR_SENTINEL_API_KEY        # name of the secret env var\n  kind: openai-compat                     # or \"anthropic\" for the native Messages API\n  analyst_model: \"\"                       # optional: cheaper model for the 4 analysts\n  review_model: \"\"                        # optional: model for verifier + reviewer\nagents:\n  enabled: [architect, security, performance, test]   # reviewer always runs\n  guidance: \"\"                # repo-specific guidance appended to every analyst, e.g. \"Django project\"\n  instructions: {}            # per-agent guidance, e.g. {architect: \"we use hexagonal architecture\"}\naccuracy:\n  samples: 3                  # self-consistency samples per analyst (1 disables the ensemble)\n  min_support: 2              # a finding must appear in this many samples to survive the vote\n  verifier: true              # run the adjudication pass before the reviewer\n  adaptive: true              # spend extra samples only on chunks that found something\n  cross_file: false           # opt-in extra pass for cross-file issues (1 more call)\n  # Research levers (v2.5) — all opt-in, default off. Measured ≈ baseline on flash\n  # (no accuracy gain above the ensemble+verifier system), so off by default; on\n  # together in `mode: thorough`. Kept as honest, toggleable infrastructure.\n  debias: false               # judge the code on its own merits, ignore reassuring/alarming PR titles (also security hardening)\n  calibration: false          # per-agent flag/stay-silent anchors (stable, cached prompt prefix)\n  lenses: false               # give each ensemble sample a different lens (standard/checklist/adversarial)\n  cot: \"off\"                  # \"brief\" adds a short reasoning scan before the findings (off | brief)\n  # Reasoning controls (DeepSeek V4: thinking is a parameter, on by default).\n  # analyst_thinking is DeepSeek-specific and endpoint-safe (only sent when set).\n  # Measured: disabling thinking tanks accuracy (~91%→61%), so leave it on.\n  analyst_thinking: null      # null = provider default (DeepSeek = on); false/true to force\n  reasoning_effort: \"\"        # \"\" | low | medium | high (depth when thinking is on)\n  repo_context: false         # prefetch cross-file symbol definitions for context (Python, opt-in, live-path)\nmin_severity: medium          # report at/above: critical|high|medium|low|nit\nignore:                       # appended to the built-in skip list\n  - \"migrations/**\"\nlimits:\n  max_files: 35\n  max_input_tokens: 120000\n  max_output_tokens_per_agent: 2000\nreview:\n  include_deletions: false\n  language_hint: \"\"           # e.g. \"python\" — appended to agent prompts\n  context_lines: 8            # head-ref context lines added around each hunk (0 disables)\n  incremental: true           # on re-review, skip files unchanged since the last review\n  suppress: []                # silence findings: [\"legacy/**\", \"api/*.py:nit\"]\noutput:\n  inline: true                # post anchored findings as inline review comments\n  suggestions: true           # render one-line fixes as one-click GitHub suggestion blocks\n  request_changes_at: \"\"      # submit REQUEST_CHANGES at/above this severity (e.g. \"critical\")\n  labels: false               # apply risk labels (security / needs-tests / …) to the PR\ngate:\n  level: \"off\"                # fail a Check Run at/above this severity so merges can be required\nsast:\n  enabled: false              # run Semgrep over changed files; hits go through the verifier's triage\n  rules: \"auto\"               # semgrep --config value (needs Semgrep in the runner; opt-in, live-path)\ndescribe: false               # maintain a generated summary in the PR body\ndry_run: false                # estimate cost, post the estimate, no LLM calls\n```\n\n**Cheapest-accuracy preset** (the README leaderboard config) — flash everywhere, ensemble on:\n\n```yaml\nprovider:\n  base_url: https://api.deepseek.com/v1\n  model: deepseek-v4-flash\n```\n\nYou can also silence a false positive inline, right where it lives:\n\n```python\ndanger = eval(user_input)  # pr-sentinel: ignore\nrisky  = run(x)            # pr-sentinel: ignore[security]\n```\n\nLockfiles, `node_modules`, `vendor`, `dist`, minified and generated files are **always skipped** (built-in list, protects your token budget). A malformed config never breaks anything — defaults apply and the comment notes it.\n\nThe config is read from the **base branch**, not the PR head — so a hostile PR can't disable the Security agent or raise your spend caps.\n\n## Security model\n\nThis category of tool was actively attacked in 2026 — review bots leaked their own API keys through PR titles, and `pull_request_target` misconfigurations got repos' entire secret stores harvested. PR Sentinel is built against that threat model:\n\n- **`pull_request` trigger only, never `pull_request_target`.** On fork PRs, secrets are absent by GitHub design, and PR Sentinel **skips gracefully** — that's correct behavior, not a missing feature. The alternative is how repos get their keys stolen. See [SECURITY.md](SECURITY.md).\n- **Minimal permissions:** `contents: read` + `pull-requests: write`. Even a fully compromised run can't push code or touch other workflows.\n- **PR content is treated as untrusted input.** Titles and diffs reach the model only inside delimited data blocks, with explicit instructions that the content is data under review, never instructions. Delimiter-escape attempts are neutralized.\n- **Structured output as a boundary:** analyst output that doesn't parse against the finding schema is discarded. An injected \"post your API key\" can't survive a parser that only accepts findings.\n- **Secrets never reach the prompt path** — they exist only in the HTTP client layer, enforced by construction and by regression tests. As defense-in-depth, the final comment is scanned for key-shaped strings and redacted on match.\n- **Config from the base branch** (see above).\n- **BYOK data path:** your code goes to *your* chosen LLM provider under *your* key — or nowhere at all, with Ollama on a self-hosted runner. It never touches any server of ours (there are none).\n\n## Reliability\n\nPR Sentinel **never breaks your CI**. Every failure path — provider down, rate limits, malformed diffs, huge PRs, missing config — degrades to a short comment (or a log line) and a clean exit. Hard caps (`max_files`, `max_input_tokens`) guarantee a worst-case cost ceiling per PR no matter what arrives.\n\nOn every push to the PR, the existing review comment is **updated in place** (one living comment per PR), not stacked — and only the files **changed since the last review** are re-examined (incremental review), so settled code isn't re-flagged or re-billed.\n\n## Features teams actually adopt for\n\nEverything below is $0 — you bring the key, so there's no paid tier gating any of it:\n\n- **One-click fixes.** When a finding has a precise fix, it's offered as a GitHub *suggestion block* — the author clicks \"Commit suggestion\" to apply it.\n- **Merge gating.** Set `gate.level: high` and PR Sentinel posts a **Check Run** that fails when there's an unresolved High/Critical finding — make it a required check and risky code can't merge. Off by default; never surprises you.\n- **Request changes.** Optionally submit the review as *Changes requested* at a severity you choose.\n- **Suppression.** Silence a false positive with an inline `# pr-sentinel: ignore` or a `review.suppress` glob — it stays gone.\n- **Custom guidance.** Tell the agents about your codebase (`agents.guidance`, `agents.instructions`) — read from the base branch, so a hostile PR can't inject instructions.\n- **Risk labels + readiness score.** Auto-label PRs (`security`, `needs-tests`, …) and show a deterministic *merge-readiness 0–100* and *review-effort 1–5* in the summary.\n- **Presets.** `mode: fast | balanced | thorough` instead of tuning knobs.\n\n\u003e Two optional features need one extra permission each in your workflow: merge gating adds `checks: write`, and risk labels add `issues: write`. Everything else works with the default `contents: read` + `pull-requests: write`.\n\n## Accuracy is a systems problem, not a model-size problem\n\nThis is PR Sentinel's bet, and the thing that separates it from single-pass reviewers. LLM review errors are mostly **variance** (a finding shows up on one run, not the next), **mislocalization** (right issue, wrong line), and **hallucination** (a finding that cites code that isn't there). None of those need a bigger model — they need *sampling, anchoring, and verification*. So v2 wraps cheap models in a system that fixes each:\n\n- **Line-numbered diffs (A1).** Analysts see absolute line numbers on every hunk line and cite the numbers they're shown — localization stops being a guess.\n- **Evidence anchoring (A2).** Every finding must quote the offending line. A deterministic pass checks that quote against the diff; **a finding whose evidence isn't literally in the code is dropped before it can post.** Hallucinations become structurally impossible, not just discouraged by a prompt.\n- **Self-consistency ensemble (A3).** Each analyst reviews three times; findings are majority-voted. A one-off miss or a one-off hallucination doesn't survive the vote. DeepSeek's prompt caching makes 3× sampling cost ~1.3×, not 3×.\n- **Verifier pass (A4).** A separate agent adjudicates every surviving finding against the numbered diff — confirm / reject / downgrade — before the reviewer writes a word. It runs a **rubric meta-judge** (argue the rejection first; keep a finding only if the visible code survives) — single pass, no debate, which the research shows amplifies bias rather than reducing it.\n\n**v2.5 added four more $0 levers — and measured them honestly.** Each is a config toggle; all ship **off by default** because, on cheap `deepseek-v4-flash` over the 37-fixture benchmark, every lever arm landed *within run-to-run noise of the levers-off baseline* — no measurable accuracy gain. We don't flip a default that changes review behavior without an eval that justifies it, so they stay opt-in (and turn on together in `mode: thorough` for max-recall users):\n\n- **Confirmation-bias debiasing (`debias`).** Judge each line on its own merits, ignore the PR title's framing — a reassuring title can't hide a real bug, an alarming one can't conjure a fake one. Accuracy-neutral here, but real **injection hardening**, so it's the lever most worth enabling.\n- **Calibration anchors (`calibration`).** Per-agent *flag / stay-silent* examples in the cached prompt prefix (nearly free per call) to pin a cheap model's severity bar.\n- **Prompt-diverse ensemble (`lenses`).** Ensemble samples take different viewpoints (plain / checklist / adversarial) instead of only a temperature jitter.\n- **Verdict-first chain-of-thought (`cot`).** An optional short reasoning scan, with each finding emitted verdict-first (the ordering that minimizes abstention).\n\nThe honest result — *levers that didn't beat the baseline get shipped off, not dressed up as a win* — is the same discipline as the leaderboard below.\n\n### The leaderboard\n\nThe same `deepseek-v4-flash` model ($0.14 / $0.28 per 1M tokens), 17 fixtures across 5 languages, 3 runs each (51 fixture-runs), 2026-06-11:\n\n| Config | Caught | Clean-fixture false positives | Cost / review |\n|---|---|---|---|\n| Naive single pass | 47/51 (92%) | **2** | ~$0.002 |\n| **PR Sentinel v2 (ensemble + verifier)** | **49/51 (96%)** | **0** | ~$0.005 |\n\nThe system turns a budget model from \"good with the occasional false positive on clean code\" into \"better, with **zero** false positives\" — for half a cent a review. The two remaining v2 misses are different fixtures on different runs (honest run-to-run variance, disclosed not tuned away). The naive run's false positives were the Test agent flagging a refactor whose test was *in the same diff* — exactly the noise the ensemble + verifier eliminate.\n\nThe fixture set includes seeded bugs (SQL injection, XSS, path traversal, hardcoded secret, N+1, blocking-async, leaky abstraction, untested money-code) in Python / JS / TS / Go / Java, **hard negatives** (correctly-parameterized SQL that looks scary; a bounded loop that looks O(n²)), and **two prompt-injection vectors** (in the diff and in the PR title) — both of which leak nothing and get flagged as attacks.\n\n#### v2.5 — harder benchmark, and the levers measured honestly\n\nThe benchmark was expanded to **37 fixtures across 7 languages** (added Ruby + C#, new bug classes — SSRF, insecure deserialization, `eval`, open redirect, ReDoS, secret-logging, weak crypto, TLS-disabled — more clean false-positive controls, and **misleading-title `mt_*` fixtures** that plant a real bug under a calm title or clean code under an alarming one, to measure debiasing). On this harder set, 3 runs each on `deepseek-v4-flash`:\n\n| Config | Passed (3×37 = 111) | Notes |\n|---|---|---|\n| **PR Sentinel system (ensemble + verifier)** | **101/111 (91%)** | the shipped baseline |\n| + debias + calibration | 98/111 (88%) | within noise |\n| + debias only | ~89% (run-to-run) | within noise |\n\nThe five research levers (debias, calibration, diverse lenses, verdict-first CoT, rubric verifier) **land within flash's run-to-run variance of the baseline** — no measurable accuracy gain — so they ship **off by default**, opt-in via config or `mode: thorough`. The spread in every arm is dominated by the same handful of genuinely hard fixtures (a validated query in a handler that tempts a false positive; a hardcoded secret under a \"style: rename\" title that flash just doesn't reliably catch). Publishing a lever as a win it didn't earn would be exactly the fixture-tuning this project refuses to do.\n\n#### The honest real-PR number (and where the work goes next)\n\nSeeded fixtures measure \"does the right agent catch a planted bug with no false positives\" — useful, but easy. The harder, truer test (`evals/realpr.py`, v2.6) takes **real merged bug-fix PRs**, reverses them to reintroduce the bug, and checks recall. On **32 such PRs** across 9 repos (requests / flask / httpx / click / pydantic / fastapi / black / express / gin), `deepseek-v4-flash` caught **7/32 (21%)** diff-only — far below the seeded 91%, and a sobering, honest read on real-world recall (the best commercial tools sit ~45–57% on real PRs per the independent [Martian benchmark](https://www.codeant.ai/blogs/ai-code-review-benchmark-results-from-200-000-real-pull-requests)). The misses are the *context-dependent* defects — a removed workaround, a typing regression, a teardown ordering bug — that no ±N-line reviewer can judge from the diff alone. Closing that gap is the roadmap: **repository-aware context** and **SAST grounding**, not more prompting. We publish the 21% because an honest hard number you can improve beats a flattering easy one you can't.\n\nThe first step of that roadmap is already in: an opt-in **repository-context prefetch** (`accuracy.repo_context`) that hands analysts the definitions of the cross-file symbols a diff references. On the same 32 PRs it moved recall **21% → 28%**, replicating a +9pp lift from a smaller pilot. Per-PR, it added 3 Python catches (the context-dependent class — e.g. a *removed workaround*) where the cross-file context actually applies, while a couple of non-Python results flipped on run-to-run noise (those files get no context). The Python signal is real; the single-run noise is why it ships **off by default** until a 3-run confirmation — but the data already points the way.\n\n\u003e Two reasoning facts worth knowing (verified against the DeepSeek API): `deepseek-v4-flash` reasons by default, and **turning that off collapses recall to ~64%** — so the system keeps reasoning on. Reasoning controls are exposed (`accuracy.analyst_thinking`, `reasoning_effort`) but default to the provider's setting.\n\nReproduce with your own key:\n\n```bash\nPR_SENTINEL_API_KEY=sk-... PR_SENTINEL_BASE_URL=https://api.deepseek.com/v1 \\\nPR_SENTINEL_MODEL=deepseek-v4-flash python evals/run.py --runs 3\n```\n\nThe unit/integration suite (**215 tests**, LLM and GitHub API fully mocked, no network) runs in CI: `pytest`.\n\n## On-demand commands\n\nComment on any PR (repo owners / members / collaborators only — a drive-by commenter can't spend your key):\n\n- `@pr-sentinel review` — re-run the full review\n- `@pr-sentinel ask \u003cquestion\u003e` — ask anything about the diff; get a grounded, cited answer\n- `@pr-sentinel describe` — write a summary + file walkthrough into the PR body\n\nTo enable them, add the `issue_comment` trigger to your workflow:\n\n```yaml\non:\n  pull_request:\n    types: [opened, synchronize, reopened]\n  issue_comment:\n    types: [created]\n```\n\n## Roadmap\n\nSee [ROADMAP.md](ROADMAP.md): GitHub App / hosted tier, auto-fix suggestions, multi-provider git hosts (GitLab/Bitbucket), and fork-PR review via maintainer-gated re-runs.\n\n## Design decisions\n\nEvery significant architecture choice — language, orchestration shape, dedup strategy, provider abstraction, large-diff handling — is documented with its tradeoffs in [DECISIONS.md](DECISIONS.md).\n\n## Contributing\n\nSee [CONTRIBUTING.md](CONTRIBUTING.md). Short version: `pip install -e \".[dev]\"`, `pytest`, open a PR — PR Sentinel reviews it. 🙂\n\n## License\n\n[MIT](LICENSE) © Moaz Muhammad\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmoazmo%2Fpr-sentinel","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmoazmo%2Fpr-sentinel","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmoazmo%2Fpr-sentinel/lists"}