{"id":49404548,"url":"https://github.com/burakkose/copilot-research","last_synced_at":"2026-04-28T20:05:15.531Z","repository":{"id":353496214,"uuid":"1218702597","full_name":"burakkose/copilot-research","owner":"burakkose","description":"Multi-agent research extension for GitHub Copilot CLI with parallel specialists, depth floors, falsification, cross-session memory.","archived":false,"fork":false,"pushed_at":"2026-04-24T05:45:21.000Z","size":120,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-04-24T07:25:01.212Z","etag":null,"topics":["agents","analysis","github-copilot","research"],"latest_commit_sha":null,"homepage":"","language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/burakkose.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-04-23T06:12:32.000Z","updated_at":"2026-04-24T05:45:25.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/burakkose/copilot-research","commit_stats":null,"previous_names":["burakkose/copilot-research"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/burakkose/copilot-research","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/burakkose%2Fcopilot-research","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/burakkose%2Fcopilot-research/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/
repositories/burakkose%2Fcopilot-research/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/burakkose%2Fcopilot-research/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/burakkose","download_url":"https://codeload.github.com/burakkose/copilot-research/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/burakkose%2Fcopilot-research/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32396812,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-28T19:38:08.556Z","status":"ssl_error","status_checked_at":"2026-04-28T19:37:55.688Z","response_time":56,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["agents","analysis","github-copilot","research"],"created_at":"2026-04-28T20:05:09.775Z","updated_at":"2026-04-28T20:05:15.515Z","avatar_url":"https://github.com/burakkose.png","language":null,"funding_links":[],"categories":[],"sub_categories":[],"readme":"# copilot-research\n\nA multi-agent research and brainstorming extension for GitHub Copilot\nCLI. 
Give it a topic; it plans the work, runs specialist sub-agents in\nparallel across web / papers / market / social / funding sources,\nsynthesizes a report, critiques it with a different model, verifies\nthe citations, and indexes the result for future sessions.\n\nDesign notes:\n\n- **Depth floors** — each specialist has minimum word count, distinct\n  URLs opened, inline quotes, and adversarial queries; failures get\n  respawned automatically.\n- **Falsification first** — counter-evidence searches are required,\n  claims need verbatim quotes, citations are fetched and checked.\n- **Cross-session memory** — past reports are indexed with TF-IDF +\n  topic tags and a relevance threshold, so prior work surfaces only\n  when it's actually related.\n\nSome of the architecture choices are informed by recent research-agent\npapers (MIA, HiRAS, CoSearch, SeekerGym) — see [References](#references-the-literature-behind-the-design).\n\n---\n\n## Table of contents\n\n- [Quick start](#quick-start)\n- [Features](#features)\n- [The 20 tools](#the-20-tools)\n- [Usage — every common workflow](#usage--every-common-workflow)\n- [Technical concept research (for engineers)](#technical-concept-research-for-engineers)\n- [Quality discipline (Gartner-grade)](#quality-discipline-gartner-grade)\n- [Pipeline diagram](#pipeline-diagram)\n- [Anti-laziness depth floors](#anti-laziness-depth-floors)\n- [Configuration](#configuration)\n- [Project layout](#project-layout)\n- [Why this works (epistemic design)](#why-this-works-epistemic-design)\n- [Extending](#extending)\n- [References](#references-the-literature-behind-the-design)\n- [License](#license)\n\n---\n\n## Quick start\n\n```bash\n# 1. (Optional but recommended) — add API keys for higher-quality search\ncp .env.example .env\n$EDITOR .env             # add TAVILY_API_KEY at minimum (free 1k/mo)\nsource .env\n\n# 2. Launch Copilot CLI in this directory\ncopilot --experimental\n\n# 3. 
In the session, just talk to it\n\u003e Run deep research on \"AI coding agents in 2026\"\n```\n\nThat's it. Outputs land in `research-output/`. Every other key listed in\n`.env.example` is optional — the system falls back to public APIs and\nkeyless surfaces (HackerNews, Reddit, SEC EDGAR, arXiv, Crunchbase RSS)\nwhen an MCP server isn't configured.\n\n---\n\n## Features\n\n### 🧠 Multi-agent orchestration\n- **Hybrid orchestrator-workers pattern** (Anthropic's playbook) — one lead agent decomposes the task, spawns N specialists in parallel, and synthesizes\n- **Up to 5 specialists run truly concurrently**, each with its own context window — no lost-in-the-middle from one giant prompt\n- **8 specialist scopes** out of the box, each with prescriptive (not suggestive) briefs\n\n### 🚦 Anti-laziness gate (the headline)\n- **Programmatic depth floors** per specialist (word count, distinct URLs opened, inline verbatim quotes, adversarial query pairs) — checked by code, not vibes\n- **Auto-respawn loop**: any specialist below floor is sent back with `\"⚠️ INSUFFICIENT — append, don't delete\"` until it complies\n- **Mandatory self-audit checklist** at the end of every specialist note\n- Cannot proceed to synthesis while any floor is unmet\n\n### 🔍 Falsification-first methodology\n- Every important claim runs an **adversarial query pair** before it's allowed in the report\n- **Multi-query reformulation** (CoSearch-inspired) — 3+ rephrasings per critical claim, union of results\n- **Inline-quoted evidence** required — paraphrase-and-cite is rejected (that's where hallucinations hide)\n- **Confidence tags** on every claim (✅ Verified / 🔵 Likely / 🟠 Speculative / ⚡ Contested)\n\n### 🎯 Adaptive supervision\n- **Completeness audit** runs after specialists return; detects coverage gaps, contradictions, thin sections (SeekerGym-inspired)\n- **Dynamic fill-in spawning** — supervisor decides specialist scope at runtime, not just upfront (HiRAS-inspired)\n- **Confidence-based 
escalation**: any 🟠/⚡ on a decision-relevant claim auto-triggers a focused dig-deeper specialist\n\n### 🛰️ Real-time data sources\n- **Funding pulse**: SEC EDGAR Form D RSS (primary source!) + Crunchbase News RSS + TechCrunch venture RSS + HN Algolia + layoffs.fyi\n- **Social listening**: Reddit JSON endpoints (no key needed), HackerNews Algolia API, X/Twitter via site filters, Substack/dev.to/Lobsters\n- **Academic**: arXiv + Semantic Scholar (citation graph) + Google Scholar + connectedpapers + paperswithcode + OpenReview\n- **Market**: Gartner / Forrester / IDC / CB Insights / PitchBook / Crunchbase / SEC EDGAR (10-K, S-1, 8-K)\n- **Engineering**: blog filters for Netflix/Uber/Stripe/Airbnb/Spotify/Pinterest/LinkedIn/Shopify + QCon/KubeCon archives\n\n### 🧪 Code-validated numerics\n- **`validate_with_code` tool** writes Python that pulls raw data and recomputes claimed numbers\n- Supports trend fits, Monte Carlo, survey CIs, benchmark recomputation\n- Validated claims get a `[code-verified](./\u003cartifact\u003e.md)` link in the final report\n\n### 🪞 Multi-model red-team critique\n- Critic runs on a **different model family** from the orchestrator (default: `gpt-5.4` when orchestrator is Claude)\n- Prevents same-family blind spots\n- Configurable per-call via `critic_model`\n\n### ✅ Citation verification\n- After draft is written, **every cited URL is fetched** and the claim is checked against the source\n- Broken / paywalled / unsupported citations are flagged in `research-output/\u003cid\u003e-\u003cslug\u003e-citations.md`\n- Unsupported claims must be removed or downgraded before the report is finalized\n\n### 💾 Cross-session memory\n- All past reports are auto-indexed at `research-output/_memory-index.md`\n- **TF-IDF retrieval** with length normalization, title-token boosting (5×), minimum relevance threshold\n- **Auto-extracted topic tags** per report (top distinctive terms)\n- **Slim session-start digest**: shows topic clusters, not titles → no off-topic 
bleed-through when you're working on something unrelated\n- Unrelated queries return `no_matches` (verified — coffee research won't surface AI-agents work)\n\n### 💡 Brainstorm with stress-tests\n- `brainstorm_from_research` generates project ideas grounded in a report\n- Each idea includes a **pre-mortem** (\"This fails because ___\"), a **validation plan cheaper than the MVP**, a **specific \"why now\" data point**, and an **honest difficulty rating**\n- Optional `validate_top_ideas: true` runs `validate_with_code` on the top-3 ideas' market-fit numbers\n\n---\n\n## The 20 tools\n\n| Tool | Phase | What it does |\n|---|---|---|\n| `recall_prior_research` | 0 | Query cross-session memory of past reports (TF-IDF, threshold-filtered) |\n| `personal_context_profile` | Pre | Save your role/stack/constraints once; auto-injected into every run so findings get weighted against your reality |\n| `pre_register_research` | 0.5 | Lock dimensions, exclusion criteria, glossary, and falsifiability test BEFORE specialists run (clinical-trial-style pre-registration) |\n| `plan_research` | 1 | Decompose topic → specialist scopes + adversarial searches + code-validation candidates |\n| `run_deep_research` | All | Full pipeline: plan → specialists → audit → synth → critique → verify |\n| `enforce_depth_floors` | 2.4 | Anti-laziness gate — counts words/URLs/quotes per specialist; returns respawn directive for any below floor |\n| `completeness_audit` | 2.5 | Gap detection on specialist notes; recommends adaptive fill-ins |\n| `deep_paper_search` | Specialist | arXiv + Semantic Scholar with citation-graph traversal |\n| `trend_quantifier` | Specialist | GitHub/npm/PyPI/Trends/jobs — code-validated curves with slope + R² |\n| `funding_pulse` | Standalone | **Real-time funding feed** — SEC EDGAR Form D + Crunchbase RSS + TechCrunch RSS + HN Algolia + layoffs.fyi, cross-referenced |\n| `concept_explainer` | Standalone | Layered breakdown (intuition → mechanics → math → code → comparison → 
pitfalls) |\n| `red_team_critique` | 4 | Adversarial review on a *different model family* |\n| `ensemble_critique` | 4 | 3-critic ensemble (Skeptic + Regulator + Practitioner) on different models — only findings surviving all three are publication-ready |\n| `bias_audit` | 5 | Explicit checklist (survivorship/hype/recency/vendor-narrative/echo-chamber/...) with documented pass/fail per bias |\n| `validate_with_code` | 3.5 | Python validation: Monte Carlo, trend fit, CI, recompute, benchmark |\n| `citation_verifier` | 6 | Fetch every URL, check claim support — requires **verbatim quotes** for ✅; flags 🔁 echo-chamber sources |\n| `calibration_log` | Post | Append probabilistic claims, resolve outcomes over time, compute Brier score (measures whether your ✅/🔵/🟠 tags actually match reality) |\n| `eval_harness` | Quality | Run the orchestrator against a reference benchmark; scores accuracy/citation/hallucination/calibration; tracks regressions across runs |\n| `brainstorm_from_research` | Output | Stress-tested project ideas with optional code-validated market fit |\n| `list_research_reports` | Browse | Index of everything in `research-output/` |\n\n---\n\n## Usage — every common workflow\n\nYou don't invoke these as functions; just describe what you want in\nplain English in the Copilot CLI session and the orchestrator routes\nto the right tool. 
Examples below show what to type.\n\n### 🚀 The 90% case — full deep research\n\n```\n\u003e Run deep research on \"browser-based AI agents\"\n```\n\nDefaults: `depth=standard`, `autonomy=auto`, all 5 default specialists\n(`web_trends`, `academic_papers`, `market_analysis`, `developer_sentiment`,\n`social_pulse`), code validation on, multi-model critic.\n\n### 🔥 Maximum-depth research (cost-no-object)\n\n```\n\u003e Run deep research on \"vector databases for RAG\", depth \"deep\",\n  focus_areas [\"web_trends\", \"academic_papers\", \"market_analysis\",\n  \"competitor_analysis\", \"tech_landscape\", \"developer_sentiment\",\n  \"funding_activity\", \"social_pulse\"]\n```\n\nThis triggers the **deep tier floors** per specialist: 3,000 words, 30\ndistinct URLs opened, 18 inline quotes, 6 adversarial pairs — auto-respawn\nif any specialist falls short. With all 8 specialists, expect 20–60 minutes\nof background work and a 10–20K-word grounded report.\n\n### 🛑 Plan-then-execute (interactive)\n\n```\n\u003e Run deep research on \"post-quantum cryptography\", autonomy \"interactive\"\n```\n\nPauses after planning so you can read/edit the plan before specialists\nfire. Useful for novel topics where you want to steer the scope.\n\n### 💰 Just the latest funding rounds\n\n```\n\u003e funding_pulse subject \"AI agents\" window_days 60\n\u003e funding_pulse subject \"vector database\" window_days 90 investors [\"a16z\", \"Sequoia\"]\n\u003e funding_pulse subject \"climate tech\" companies [\"Climeworks\", \"Heirloom\"]\n```\n\nPulls from SEC EDGAR Form D + Crunchbase News + TechCrunch + HN Algolia\n+ layoffs.fyi. Cross-references every round; flags single-source claims.\nNo API keys required.\n\n### 📚 Just the academic literature\n\n```\n\u003e deep_paper_search topic \"constitutional AI\" paper_count 15 traverse_depth 2\n```\n\nReturns a paper-graph report: relevant papers + papers that cite them\ncritically + papers that those papers cite. 
arXiv + Semantic Scholar +\noptional Google Scholar.\n\n### 📈 Just the adoption metrics\n\n```\n\u003e trend_quantifier subject \"LangChain\" github_repos [\"langchain-ai/langchain\"]\n  pypi_packages [\"langchain\"] job_keywords [\"langchain\", \"LLM agent\"]\n```\n\nPulls GitHub stars-over-time, contributor bus factor, npm/PyPI downloads,\nGoogle Trends, HN \"Who's hiring?\" mention counts. Fits trend lines and\nreports slope + R².\n\n### 🎓 Quick concept explainer (no full pipeline)\n\n```\n\u003e Explain \"speculative decoding\" at practitioner level with runnable code\n```\n\nLayered breakdown: intuition → mechanics → math → code → comparisons →\npitfalls. No memory recall, no critique loop — just a focused explainer.\n\n### 🧠 Use prior work\n\n```\n\u003e recall_prior_research query \"vector database pgvector benchmarks\"\n```\n\nReturns ranked excerpts from past reports (or `no_matches` if your topic\nis genuinely unrelated to anything on file).\n\n### 🪞 Critique an existing draft (cross-model)\n\n```\n\u003e red_team_critique target_path \"research-output/2026-04-22-abc123-vector-databases-report.md\"\n  focus \"market-size and growth-rate claims\"\n```\n\nSpawns a critic on a different model family. 
Saves critique alongside\nthe report.\n\n### 🔢 Sanity-check a number\n\n```\n\u003e validate_with_code claim \"Vector-DB market is growing 60% YoY\"\n  method \"trend_fit\"\n  data_source_hint \"Pinecone funding rounds + npm downloads of pgvector\"\n```\n\nWrites Python that pulls real data, fits a trend, and writes a verdict.\nSaved to `research-output/.../artifacts/`.\n\n### 💡 Brainstorm projects from a report\n\n```\n\u003e Brainstorm from research-output/2026-04-22-abc123-vector-databases-report.md\n  with constraints \"solo dev, TypeScript+Python, $0 budget, 8-week timeline\"\n  validate_top_ideas true\n```\n\nIdeas come with pre-mortems, cheap validation plans, and (with the flag)\ncode-validated market-fit numbers for the top 3.\n\n### 📋 What have I researched?\n\n```\n\u003e list_research_reports\n```\n\nOr just look in `research-output/` directly.\n\n---\n\n## Technical concept research (for engineers)\n\nThe orchestrator was designed for market/competitive research, but it\nworks equally well for \"should I adopt X?\" and \"what's the SOTA in Y?\"\nquestions. The trick is encoding **your constraints** so the agent can\nweight findings against your reality, not just \"what does the world\nthink.\"\n\n### Step 1 — Set your profile once\n\n```\n\u003e Set my personal_context_profile to:\n  Role: distributed-systems engineer at a 200-person SaaS\n  Stack: Go, Postgres 15, Kafka 3.x, K8s, Datadog\n  Scale: 30k req/s p50, P99 budget 200ms, 4 regions\n  Non-goals: no Rust rewrite, no managed-only solutions, no NVLink-required tech\n  Adoption appetite: moderate — willing to adopt 2-year-old tech with active community,\n    not bleeding edge\n  Pain points: cross-region transactions, schema evolution, cold-start on autoscale\n```\n\nThis writes `research-output/_profile.md`. It's auto-injected into every\nsubsequent research run, so findings get filtered against your reality.\nGitignored — your profile stays local. 
Update with the same tool any time.\n\n### Step 2 — Use one of these proven patterns\n\n#### Pattern A — \"Should I adopt X?\"\n\n```\n\u003e pre_register_research with:\n    topic: \"Adopting Raft for our metadata service (replacing ZooKeeper)\"\n    dimensions:\n      - correctness under network partition\n      - operational cost vs current ZooKeeper\n      - Go library maturity (etcd-io/raft, hashicorp/raft)\n      - observed failure modes in production at our scale\n      - migration cost from current setup\n    exclusion_criteria:\n      - vendor blogs without postmortem links\n      - papers older than 2022 unless seminal\n    falsifiability: \"If 3+ teams of comparable size report Raft caused outages\n                     they wouldn't have had with ZooKeeper, abandon.\"\n\n\u003e Then run_deep_research at depth=standard with focus_areas=[academic_papers,\n  tech_landscape, developer_sentiment]\n\n\u003e Then ensemble_critique on the report\n\u003e Then bias_audit\n```\n\nThe Practitioner critic in the ensemble will catch operational debt the\nSkeptic won't. 
The Regulator will surface compliance/disclosure issues\nyou might not have considered.\n\n#### Pattern B — \"What's the SOTA in X?\"\n\n```\n\u003e run_deep_research:\n    topic: \"State of distributed transactions in 2026: 2PC alternatives that\n           actually shipped in production at \u003e10k tx/s\"\n    focus_areas: [academic_papers, tech_landscape, developer_sentiment]\n    depth: deep\n\n\u003e Then concept_explainer for the top 2-3 contenders\n\u003e Then validate_with_code (method=benchmark) for any \"X is N× faster\" claim\n```\n\nThe depth=deep tier forces 30+ URLs and 18+ inline quotes per specialist —\nnecessary for technical questions where the difference between papers\nand production is the whole point.\n\n#### Pattern C — Quick \"is this real?\" sanity check\n\n```\n\u003e Run quick research on \"is QUIC actually winning over HTTP/2 for\n  internal microservice traffic in 2026\"\n```\n\n`depth=quick` does ~1 hour of research. Good for litmus tests before\ninvesting in a full investigation.\n\n### What technical-research-specific signal you'll get\n\n- **Production failure modes** (not just \"the technique works\"): the\n  bias audit + Practitioner critic surface war stories.\n- **Compatibility flags** against your stack: profile-aware findings\n  (e.g. \"this requires NVLink — your rig is PCIe-3, marked inapplicable\").\n- **Adoption-curve quantification**: `trend_quantifier` pulls GitHub /\n  npm / job-posting trends with R² fits, distinguishing real momentum\n  from temporary hype.\n- **Code-validated benchmarks**: claims like \"Foo is 3× faster than\n  Bar\" trigger `validate_with_code(method=benchmark)` — actual\n  measurement, not just citation of someone else's number.\n\n---\n\n## Quality discipline (Gartner-grade)\n\nThe orchestrator can be run lightly or with full enterprise discipline.\nFor decision-grade reports (the kind you'd hand to a VP or stake your\nquarter on), use this workflow:\n\n```\n1. 
personal_context_profile (set once, then forget)\n2. recall_prior_research        — don't re-do work you already did\n3. pre_register_research        — lock the rubric BEFORE looking at sources\n4. run_deep_research            — full hybrid pipeline\n5. completeness_audit           — surface gaps; spawn fill-ins as needed\n6. ensemble_critique            — 3 critics on different model families\n7. bias_audit                   — explicit checklist; failed checks block ship\n8. citation_verifier            — verbatim quotes for every ✅\n9. calibration_log (action=log) — record probabilistic claims for later scoring\n10. (months later)\n    calibration_log (action=resolve) — mark outcomes\n    calibration_log (action=score)   — Brier score across all resolved claims\n11. eval_harness                — quarterly run against reference benchmark\n                                  to catch quality regressions\n```\n\n**Why each step matters:**\n\n- **Pre-registration** kills motivated reasoning. If you let evidence\n  reshape your evaluation rubric mid-research, you're cherry-picking\n  retroactively.\n- **Ensemble critique** → 3 priors on different models means findings\n  that survive all three are genuinely robust. Single-model critique\n  has correlated blind spots.\n- **Verbatim quotes** kill paraphrase drift — the #1 hallucination\n  vector for research agents.\n- **Calibration log** is how you tell if your \"✅ Verified\" tags\n  actually correspond to ~95% accuracy in reality. If your Brier\n  score is 0.30, your confidence labels are theater.\n- **Eval harness** is the regression suite for the agent itself.\n  Without it, every \"improvement\" you make to prompts/orchestration\n  is unfalsifiable. 
Bundled sample at\n  `.github/extensions/research-orchestrator/eval-benchmark/sample.json`\n  — extend with your own reference questions over time.\n\n---\n\n## Pipeline diagram\n\n```\nrecall_prior_research          MIA-inspired memory (TF-IDF, no off-topic noise)\n        ↓\nplan_research                  decompose into specialist scopes\n        ↓\nparallel specialists           CoSearch-inspired multi-query reformulation\n  ├── web_trends               ↳ inline-quote evidence\n  ├── academic_papers          ↳ arXiv + Semantic Scholar + Google Scholar +\n  │                              connectedpapers + paperswithcode + OpenReview\n  ├── market_analysis          ↳ Gartner/Forrester/IDC/CB Insights/PitchBook/\n  │                              SEC EDGAR (10-K, S-1)\n  ├── competitor_analysis      ↳ G2 / Capterra / Product Hunt / AlternativeTo\n  ├── tech_landscape           ↳ engineering blogs (Netflix/Uber/Stripe…) + QCon/KubeCon\n  ├── developer_sentiment      ↳ GitHub + HN Algolia + named subs + StackOverflow + Lobsters\n  ├── funding_activity         ↳ Crunchbase + TechCrunch + investor blogs + layoffs.fyi\n  │                              (calls funding_pulse tool for real-time rounds)\n  └── social_pulse             ↳ Reddit (multiple subs, .json endpoints) +\n                                  HackerNews (Algolia) + X/Twitter + Substack\n        ↓\nenforce_depth_floors           ⚠️ ANTI-LAZINESS GATE — programmatic\n        ↓                       auto-respawn if any specialist is below floor\ncompleteness_audit             SeekerGym-inspired gap detection\n        ↓                       ↳ adaptive fill-in spawn (HiRAS-inspired)\nvalidate_with_code             PAL-style quantitative validation\n        ↓\ncitation-grounded synthesis    every claim has an inline quote\n        ↓\nred_team_critique              different model family (variance reduction)\n        ↓\nrevise + escalation            🟠/⚡ on key claims → spawn dig-deeper specialists\n        
↓\ncitation_verifier              fetch every URL, check claim support\n        ↓\nmemory update                  index report (with topic tags) for future recall\n```\n\n---\n\n## Anti-laziness depth floors\n\nThe single most common failure mode of \"deep research\" agents is stopping\nearly. This orchestrator enforces hard, programmatically-checked floors\nper specialist before allowing the pipeline to advance to synthesis:\n\n| Depth     | Words   | Distinct URLs | Inline quotes | Adversarial pairs |\n|-----------|---------|---------------|---------------|-------------------|\n| `quick`   |   800   |  8            |  4            | 2                 |\n| `standard`| 1,800   | 18            | 10            | 4                 |\n| `deep`    | 3,000   | 30            | 18            | 6                 |\n\nIf any specialist's notes fall below the floor, the orchestrator calls\n`enforce_depth_floors` → gets back a `respawn_directive` → respawns the\nspecialist with `\"⚠️ INSUFFICIENT — keep digging, APPEND don't replace\"`\nand re-checks. 
**The agent cannot exit early.**\n\nFor the `social_pulse` specialist, additional per-platform floors apply:\n\n| Depth     | Reddit threads / subs | HN comments / stories | X threads | Long-form blogs |\n|-----------|------------------------|------------------------|-----------|-----------------|\n| `quick`   | 3 / 2                  | 3 / 2                  | 2         | 2               |\n| `standard`| 6 / 3                  | 5 / 3                  | 4         | 4               |\n| `deep`    | 10 / 4                 | 8 / 4                  | 6         | 6               |\n\nCurl templates for the public HackerNews Algolia API and Reddit `.json`\nendpoints are baked into the brief so specialists never have an \"I\ncouldn't search there\" excuse.\n\n---\n\n## Configuration\n\n### MCP servers (all optional)\n\nIn Copilot CLI, run `/mcp` and copy server entries from `mcp-servers.json`.\nRecommended priority order:\n\n| Service | Free tier | Used by | Required? |\n|---|---|---|---|\n| **Tavily** | 1000 searches/mo | All web research | Highly recommended |\n| **Brave Search** | 2000/mo | Falsification cross-check | Optional |\n| **ArXiv MCP** | unlimited | `deep_paper_search` | No (built-in fallback) |\n| **Semantic Scholar** | unlimited (key for higher rate) | Citation graph | Optional |\n| **Firecrawl** | 500 pages/mo | JS-heavy fetches, citation_verifier | Optional |\n| **GitHub** | Public API | `trend_quantifier` | Optional |\n| **Reddit** | Free OAuth | `social_pulse` | No (public JSON works) |\n| **HackerNews** | unlimited | `social_pulse`, `developer_sentiment` | No (Algolia is public) |\n| **SerpAPI (Scholar)** | 100/mo | `academic_papers` | No (use Tavily site filter) |\n| **Perplexity** | paid | Optional cross-check | Optional |\n\n### Knobs you'll actually use\n\n```\ndepth: \"quick\" | \"standard\" | \"deep\"           # default: standard\nautonomy: \"auto\" | \"interactive\"               # default: auto\nfocus_areas: [...]                             
# default: 5-pack\nenable_code_validation: true                   # default: true\noutput_format: \"markdown\" | \"executive_brief\"  # default: markdown\ncritic_model: \"gpt-5.4\" | …                    # default: gpt-5.4\n```\n\n---\n\n## Project layout\n\n```\n.\n├── .github/\n│   ├── copilot-instructions.md\n│   ├── extensions/research-orchestrator/extension.mjs   # The orchestrator (~2000 LOC)\n│   └── instructions/\n│       ├── research.instructions.md         # Falsification, confidence tags, source tiers\n│       ├── orchestration.instructions.md    # When/how to spawn subagents (all 7 phases)\n│       ├── code-validation.instructions.md  # When/how to validate with code\n│       └── memory.instructions.md           # Cross-session memory recall discipline\n├── mcp-servers.json                         # MCP server reference config\n├── .env.example                             # API key template\n├── README.md                                # this file\n└── research-output/                         # All artifacts saved here\n    ├── _memory-index.md                     # Auto-built TF-IDF memory index\n    ├── \u003cid\u003e-\u003cslug\u003e-plan.md\n    ├── \u003cid\u003e-\u003cslug\u003e-notes/\n    │   ├── \u003carea\u003e.md                        # Per-specialist findings\n    │   ├── fillin-\u003cslug\u003e.md                 # Adaptive gap-fill outputs\n    │   └── _audit.md                        # Completeness audit\n    ├── \u003cid\u003e-\u003cslug\u003e-artifacts/               # Code, data, charts\n    ├── \u003cid\u003e-\u003cslug\u003e-funding-pulse.md         # If funding_pulse was run\n    ├── \u003cid\u003e-\u003cslug\u003e-trend.md                 # If trend_quantifier was run\n    ├── \u003cid\u003e-\u003cslug\u003e-papers.md                # If deep_paper_search was run\n    ├── \u003cid\u003e-\u003cslug\u003e-critique.md\n    ├── \u003cid\u003e-\u003cslug\u003e-citations.md\n    ├── \u003cid\u003e-\u003cslug\u003e-report.md              
  # ⭐ Final report, auto-indexed\n    └── \u003cid\u003e-\u003cslug\u003e-ideas.md                 # If brainstorm was run\n```\n\n---\n\n## Why this works (epistemic design)\n\n- **Memory recall first** — avoids redundant work; later reports build on earlier\n- **Multi-agent decomposition** — each specialist focuses on one slice; better signal than one big prompt\n- **Parallel execution** — N specialists in roughly the wall-clock of one\n- **Multi-query reformulation** — 3 rephrasings per claim recovers retrieval recall lost by treating search as fixed (CoSearch showed up to 26.8% F1 left on the table)\n- **Falsification by default** — every important claim runs an adversarial search pair before it's allowed in the report\n- **Programmatic depth floors** — the agent cannot lie to itself about how thoroughly it searched\n- **Completeness audit** — explicit gap detection prevents silent omissions (SeekerGym showed SOTA agents miss \u003e50% of relevant info)\n- **Adaptive fill-in** — supervisor spawns more specialists dynamically, not just upfront (HiRAS pattern)\n- **Confidence tagging** — calibration matters more than confidence\n- **Code validation** — numbers get recomputed, not just quoted (PAL pattern)\n- **Multi-model red-team** — critic on different model family = independent error\n- **Confidence escalation** — low-confidence on key claims triggers more research\n- **Citation verification** — every URL is opened and the claim is checked\n- **Memory update** — final report becomes input for next investigation\n\n---\n\n## Extending\n\n- **Add MCP servers** via `/mcp` in Copilot CLI\n- **Add focus areas**: edit `SPECIALISTS` in `.github/extensions/research-orchestrator/extension.mjs`\n- **Adjust depth floors**: edit `DEPTH_FLOORS` and `SOCIAL_PLATFORM_FLOORS` constants\n- **Change critic model**: edit `CRITIC_MODEL` constant (or pass `critic_model` per call)\n- **Tighten/relax methodology**: edit files in `.github/instructions/`\n- **After editing the 
extension**, run `extensions_reload` in CLI (no restart needed)\n\n---\n\n## References (the literature behind the design)\n\n- **MIA — Memory Intelligence Agent** (arXiv 2604.04503, Apr 2026) — Manager-Planner-Executor with non-parametric memory + on-the-fly test-time learning\n- **HiRAS — Hierarchical Research Agent System** (arXiv 2604.17745, Apr 2026) — supervisory managers coordinating specialists across stages\n- **CoSearch** (arXiv 2604.17555, Apr 2026) — joint training of reasoner + ranker; showed fixed retrieval leaves 26.8% F1 on the table\n- **SeekerGym** (arXiv 2604.17143, Apr 2026) — completeness-of-retrieval benchmark; best agents retrieve only 42.5% of relevant Wikipedia passages\n- **LiteResearcher** (arXiv 2604.17931, Apr 2026) — agentic-RL training framework (not adopted; out of scope for prompt-orchestration)\n- Anthropic *\"How we built our multi-agent research system\"* — orchestrator-workers\n- **STORM** (Shao et al. 2024) — outline-first writing\n- **Reflexion** (Shinn et al. 2023), **Self-Refine** (Madaan et al. 2023) — critique loops\n- **ReAct** (Yao et al. 2022) — reason + act with tools\n- **PAL** (Gao et al. 2022) — program-aided validation\n\n---\n\n## License\n\nMIT — see [LICENSE](./LICENSE).\n\nPRs welcome if you've added a specialist or data surface that's worth sharing.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fburakkose%2Fcopilot-research","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fburakkose%2Fcopilot-research","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fburakkose%2Fcopilot-research/lists"}