{"id":51281981,"url":"https://github.com/subratamondal1/argus","last_synced_at":"2026-06-30T02:30:28.654Z","repository":{"id":368011793,"uuid":"1256754784","full_name":"subratamondal1/argus","owner":"subratamondal1","description":"Framework-free, horizontally-autoscaled multi-agent deep-research engine. Own the loop, not the framework.","archived":false,"fork":false,"pushed_at":"2026-06-28T17:18:05.000Z","size":9343,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-06-28T18:10:25.482Z","etag":null,"topics":["agent","agentic-ai","agentic-rag","ai-agents","deep-research","fastapi","keda","kubernetes","langchain-alternative","litellm","llm","llm-evaluation","llmops","mcp","multi-agent","pgvector","postgresql","python","rag","retrieval-augmented-generation"],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/subratamondal1.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":"AGENTS.md","dco":null,"cla":null}},"created_at":"2026-06-02T04:08:40.000Z","updated_at":"2026-06-28T17:18:09.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/subratamondal1/argus","commit_stats":null,"previous_names":["subratamondal1/argus"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/subratamondal1/argus","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/subratamondal1%2Fargus","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/subratamondal1%2Fargus/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/subratamondal1%2Fargus/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/subratamondal1%2Fargus/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/subratamondal1","download_url":"https://codeload.github.com/subratamondal1/argus/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/subratamondal1%2Fargus/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34950328,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-30T02:00:05.919Z","response_time":92,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["agent","agentic-ai","agentic-rag","ai-agents","deep-research","fastapi","keda","kubernetes","langchain-alternative","litellm","llm","llm-evaluation","llmops","mcp","multi-agent","pgvector","postgresql","python","rag","retrieval-augmented-generation"],"created_at":"2026-06-30T02:30:27.308Z","updated_at":"2026-06-30T02:30:28.644Z","avatar_url":"https://github.com/subratamondal1.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cp align=\"center\"\u003e\n  \u003cimg src=\"https://em-content.zobj.net/source/apple/391/eye_1f441-fe0f.png\" width=\"110\" alt=\"Argus\" /\u003e\n\u003c/p\u003e\n\n\u003ch1 align=\"center\"\u003eArgus\u003c/h1\u003e\n\n\u003cp align=\"center\"\u003e\n  \u003cstrong\u003eOwn the agent loop, not the framework.\u003c/strong\u003e\n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n  A framework-free, multi-agent deep-research engine — a planner fans out parallel\u003cbr/\u003e\n  agents over the web and your documents, and a synthesizer writes a cited answer.\n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n  \u003ca href=\"LICENSE\"\u003e\u003cimg src=\"https://img.shields.io/badge/license-MIT-blue?style=flat\" alt=\"License: MIT\"\u003e\u003c/a\u003e\n  \u003cimg src=\"https://img.shields.io/badge/python-3.12%2B-blue?style=flat\" alt=\"Python 3.12+\"\u003e\n  \u003cimg src=\"https://img.shields.io/badge/tests-150%20passing-brightgreen?style=flat\" alt=\"Tests: 150 passing\"\u003e\n  \u003cimg src=\"https://img.shields.io/badge/local--first-Ollama-orange?style=flat\" alt=\"Local-first\"\u003e\n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n  \u003ca href=\"#benchmark\"\u003eBenchmark\u003c/a\u003e •\n  \u003ca href=\"#evaluation-methodology\"\u003eEval methodology\u003c/a\u003e •\n  \u003ca href=\"#features\"\u003eFeatures\u003c/a\u003e •\n  \u003ca href=\"#quickstart\"\u003eQuickstart\u003c/a\u003e •\n  \u003ca href=\"#how-it-works\"\u003eHow it works\u003c/a\u003e •\n  \u003ca href=\"#configuration\"\u003eConfiguration\u003c/a\u003e •\n  \u003ca href=\"#development\"\u003eDevelopment\u003c/a\u003e\n\u003c/p\u003e\n\n---\n\nArgus answers a hard question the way a research team would: a **planner** breaks it into sub-questions, a fan-out of hand-written **searcher agents** researches them in parallel over the live web and a local document corpus, and a **synthesizer** writes a cited answer. Retrieval quality is enforced by an **eval gate** that blocks regressions, the searcher fan-out **autoscales from zero** on Kubernetes, and a run can be made **crash-resumable**.\n\nBuilt on Python 3.12 · LiteLLM · PostgreSQL + pgvector · FastAPI · Next.js — with **no agent framework** (no LangChain/LangGraph): the loop, the budget, and the failure handling are owned directly. **Local-first** — it runs at zero cost on Ollama; OpenAI and Anthropic are optional drop-ins.\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"docs/assets/argus-demo.gif\" width=\"100%\" alt=\"Argus deep-research run: a planner decomposes the question into four sub-questions, four searcher agents run in parallel, and a synthesizer writes a cited answer — streamed live.\" /\u003e\n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n  \u003cem\u003eOne \u003cstrong\u003eDeep research\u003c/strong\u003e run, end to end: the planner decomposes the question, four searcher agents fan out in parallel, and the synthesizer streams a cited answer — running locally on Ollama at zero cost.\u003c/em\u003e\n\u003c/p\u003e\n\n---\n\n## Benchmark\n\n\u003e **Retrieval quality is a number, not a vibe.**\n\n`argus eval` ingests a curated RAG corpus ([`eval/corpus/`](eval/corpus/)), runs a committed golden set ([`eval/golden.jsonl`](eval/golden.jsonl)) — **48 questions including negative/unanswerable cases** — through real retrieval and a judged agent answer, and exits non-zero when any metric falls below [`eval/thresholds.json`](eval/thresholds.json).\n\n### Latest results (`make eval`, 48-item benchmark, corpus-only)\n\n| Metric (RAGAS vocabulary) | gpt-5.4-mini | qwen3.5:4b (local, $0) | qwen2.5:3b (local, $0) | threshold |\n|---|:---:|:---:|:---:|:---:|\n| `context_recall` (hit@k) | **1.000** | 0.976 | 0.976 | ≥ 0.80 |\n| `context_precision` | 0.662 | **0.676** | 0.624 | ≥ 0.20 |\n| `mrr` | 0.948 | **0.952** | 0.952 | ≥ 0.60 |\n| `faithfulness` | 0.976 | 0.976 | **1.000** | ≥ 0.70 |\n| `answer_relevancy` | **1.000** | 0.976 | 0.690 | — |\n| `judge_pass_rate` | **0.976** | 0.643 | 0.000 | ≥ 0.70 |\n| `keyword_pass_rate` | **0.881** | 0.833 | 0.190 | ≥ 0.60 |\n| `abstention_rate` (negatives declined) | **1.000** | 0.833 | 0.000 | ≥ 0.70 |\n| **Gate** | ✅ **PASS** | ❌ FAIL (judge by 0.057) | ❌ FAIL | |\n\n**Reading the table:**\n- The hosted `gpt-5.4-mini` clears all 8 gates. The fully-local `qwen3.5:4b` ($0 stack) passes 7 of 8 — one notch of reasoning short on `judge_pass_rate`, with retrieval, faithfulness, and abstention all green.\n- `qwen2.5:3b` collapses on negatives (`abstention_rate = 0.000`): it fabricates answers for every unanswerable question. This is where smaller models fail in production RAG.\n- Retrieval and generation signals are gated **independently**, so a retrieval regression never hides behind a good answer.\n\n```bash\nmake eval             # run the eval gate (reports the table above)\nmake eval-calibrate   # prove the judge agrees with humans (Cohen's κ ≥ floor)\n```\n\n---\n\n## Evaluation methodology\n\n### Why these metrics?\n\nArgus implements the RAGAS vocabulary **in-repo** ([`eval/`](src/argus/eval/)) without the RAGAS library, to stay dependency-light and fully offline. Each metric targets a distinct failure mode in a RAG + agentic system:\n\n| Metric | What failure it catches |\n|---|---|\n| `context_recall` | Retriever misses relevant chunks entirely |\n| `context_precision` | Retriever floods context with noise, diluting signal |\n| `mrr` | Relevant chunk exists but ranks low — hurts synthesis quality |\n| `faithfulness` | Synthesizer hallucinates facts not grounded in retrieved context |\n| `answer_relevancy` | Synthesizer answers a different question than was asked |\n| `judge_pass_rate` | Holistic answer quality, as judged by a calibrated LLM judge |\n| `keyword_pass_rate` | Answer covers the key factual entities from the golden reference |\n| `abstention_rate` | System fabricates on unanswerable queries instead of declining |\n\n### How the golden set was constructed\n\nThe 48-item golden set (`eval/golden.jsonl`) was built to stress every failure mode:\n\n- **Positive cases** — questions with clear, corpus-grounded answers. Each has a reference answer and a set of required keywords.\n- **Negative / unanswerable cases** — questions whose answers are not in the corpus. A faithful system must decline (output a refusal or \"I don't know\"). Any non-refusal on a negative case is scored as `abstention_rate = 0`, the harshest possible penalty.\n- **Near-miss cases** — questions with partial corpus support, designed to expose `context_precision` failures where the retriever returns related but not sufficient chunks.\n\n### How the LLM judge is calibrated\n\nThe judge is a prompted LLM that scores each (question, retrieved_context, answer) triple as pass/fail. Calibration works as follows:\n\n1. A human-annotated sample of 20 triples is rated pass/fail by a human.\n2. The judge is run on the same 20 triples.\n3. **Cohen's κ** is computed between human and judge labels. κ ≥ 0.80 is required before the judge is trusted as a gate. If κ falls below the floor, the judge prompt is revised and recalibrated (`make eval-calibrate`).\n4. The calibrated judge then scores the remaining benchmark items. This prevents the judge from becoming a rubber stamp — κ \u003c 0.80 means the judge is not reliably capturing human quality signals.\n\n### What the `qwen2.5:3b` failure reveals\n\nThe smallest local model scores `abstention_rate = 0.000` — it fabricates an answer for every unanswerable question in the benchmark. This is the canonical failure mode of RAG systems deployed without abstention testing: the model confidently answers questions the corpus cannot support. The benchmark's negative cases exist specifically to catch this before it reaches production.\n\nGetting `qwen3.5:4b` through the gate required disabling Qwen3's default chain-of-thought (`reasoning_effort=\"disable\"` → Ollama `think:false`), which reduced latency from ~84s to ~2s per call, while keeping thinking enabled for structured judge output — which Ollama drops when thinking is off. This asymmetry (thinking off for answers, on for judging) is documented in [`docs/adr/`](docs/adr/).\n\n---\n\n## Features\n\n| Capability | Detail |\n|---|---|\n| **Framework-free agent loop** | Hand-written tool-use loop over LiteLLM with a 3-axis budget (turns / tokens / wall-clock) + hard cost cap, a retry/fallback ladder, and a self-registering, permission-gated tool registry. |\n| **Multi-agent orchestration** | Planner → parallel searcher agents (isolated context each) → synthesizer → reflect/replan. |\n| **Contextual RAG** | Anthropic-style contextual chunking; hybrid dense-HNSW + lexical-FTS retrieval fused with Reciprocal Rank Fusion; optional `bge-reranker-v2-m3` cross-encoder — all on a single pgvector store. |\n| **Eval gate** | RAGAS-style metrics + Cohen's-κ-calibrated LLM judge; fails the build below committed thresholds. Full methodology above. |\n| **Horizontal scale** | Searcher fan-out on an ARQ-on-Redis queue; Kubernetes + KEDA scale searcher pods from zero on queue depth. |\n| **Durable execution** | Opt-in DBOS workflows — a crashed research run resumes from its last checkpointed step (Postgres-backed). |\n| **MCP server** | The tool registry exposed over the Model Context Protocol (`argus mcp`) for any MCP host. |\n| **Multi-tenant + auth** | Email/password → argon2id + HS256 JWT in an httpOnly cookie with signed double-submit CSRF; per-tenant data isolation. |\n| **Streaming UI** | FastAPI Server-Sent Events streaming live multi-agent progress to a Next.js 16 / React 19 client. |\n| **Sandboxed code execution** | `execute_python` runs model-generated code in a subprocess sandbox (rlimits, timeout, no network) behind a permission gate. |\n\n---\n\n## Quickstart\n\n```bash\n# 1. Install (uv manages the Python 3.12 toolchain and the venv).\nuv sync\n\n# 2. Start the local backing stack (Postgres + pgvector, SearXNG).\nmake up\n\n# 3. Run the LLM and embeddings locally on Ollama — zero cost, no API key.\nollama pull qwen2.5:3b \u0026\u0026 ollama pull nomic-embed-text\n\n# 4. Ask.\nuv run argus \"What changed in the EU AI Act timeline in 2026?\"\n```\n\n`cp .env.example .env` first if you want to override defaults. To use a hosted model, set `OPENAI_API_KEY` and `ARGUS_MODEL=openai/...` in `.env`. Stack controls: `make status` / `make down`.\n\n---\n\n## How it works\n\n```\nquestion\n   │  planner (LLM)\n   ▼\nsub-questions ──► searcher agent ─┐   each: own tool-use loop + budget,\n              ──► searcher agent ─┤   rag_search over corpus + web_search,\n              ──► searcher agent ─┘   run in parallel (asyncio / ARQ + KEDA)\n                       │ findings\n                       ▼\n                  synthesizer (LLM) ──► reflect/replan ──► cited answer\n```\n\nEvery LLM call is structured-logged and cost-attributed; the agent loop stops on the first of its turn/token/wall-clock/cost limits. The RAG path ingests documents with LLM-written contextual prefixes, embeds them locally on Ollama, and indexes for both dense (HNSW) and lexical (full-text) search; queries fuse the two with Reciprocal Rank Fusion.\n\nDesign decisions are recorded as ADRs in [`docs/adr/`](docs/adr/).\n\n### Why no framework?\n\nThe loop is a **stateless reducer** over an explicit `messages: list[dict]`. That one decision pays three ways:\n\n- **Testability** — feed a canned `messages` list (or a fake `CompletionClient`), assert. No live LLM needed for 150 tests.\n- **Durability** — the list is serializable, so checkpoint it and resume after a crash (DBOS opt-in).\n- **Debuggability** — every prompt is in plain sight. There is no metaclass, DAG executor, or hidden state to peel back when something fails.\n\n---\n\n## Document ingestion (RAG)\n\n```bash\nollama pull nomic-embed-text                       # one-time, local embeddings\nuv run argus ingest ./notes/architecture.md        # a file\nuv run argus ingest https://example.com/post       # or a URL\nuv run argus --deep \"How does our system handle retries?\"\n```\n\nEmbeddings and rerank run locally — document text never leaves the machine. PDF/DOCX/PPTX ingest: `uv sync --extra parse`. Cross-encoder rerank: `uv sync --extra rerank` + `ARGUS_RERANK_ENABLED=true`.\n\n---\n\n## Web UI\n\n```bash\nmake web-install   # first time only: install frontend deps (bun)\nmake web           # FastAPI on :8000 + Next.js on :3000 → http://localhost:3000\n```\n\nThe UI streams the multi-agent flow live (plan → parallel search → tool calls → synthesize → reflect) and renders a cited Markdown answer, with document upload and a deep-research toggle. The backend is standalone — `make serve` runs the API alone, and the CLI works without the UI.\n\n---\n\n## Configuration\n\n| Variable | Default | Purpose |\n|---|---|---|\n| `ARGUS_MODEL` | `ollama_chat/qwen2.5:3b` | Agent / contextualization / judge LLM (`openai/...` for hosted). |\n| `ARGUS_EMBEDDING_MODEL` | `ollama/nomic-embed-text` | Embedding model (768-d; must match the column). |\n| `ARGUS_USE_QUEUE` | `false` | Fan searchers onto the ARQ-on-Redis queue (KEDA-autoscalable). |\n| `ARGUS_USE_DURABLE` | `false` | Run deep research as a crash-resumable DBOS workflow (`--extra durable`). |\n| `ARGUS_RERANK_ENABLED` | `false` | Enable the cross-encoder rerank stage (`--extra rerank`). |\n\nOptional extras: `parse` (document parsing), `rerank` (cross-encoder), `otel` (OpenTelemetry), `durable` (DBOS), `mcp` (MCP server).\n\n---\n\n## Development\n\n```bash\nmake ci          # format-check + lint (ruff) + typecheck (ty) + tests (pytest)\nmake test        # tests only — hermetic (LLM, DB, and search are faked/marker-gated)\nmake eval        # run the eval gate\nmake eval-calibrate   # judge calibration (Cohen's κ)\nmake mcp         # run the tool registry as an MCP server over stdio\n```\n\nCI runs the hermetic suite plus a Postgres + Redis integration job and a kind + KEDA autoscaling smoke on every push. Integration tests are behind a `pytest -m integration` marker so the default suite needs no services.\n\n---\n\n## License\n\nMIT — see [LICENSE](LICENSE).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsubratamondal1%2Fargus","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsubratamondal1%2Fargus","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsubratamondal1%2Fargus/lists"}