https://github.com/varaddurge/argus

CLI observability and debugging for agent workflows — catch silent/semantic failures, trace root causes, and replay from any step. 98.8% root cause accuracy across 100 controlled scenarios
https://github.com/varaddurge/argus

agent-workflows ai-agents cli debugging observability replay

Last synced: 19 days ago
JSON representation

CLI observability and debugging for agent workflows — catch silent/semantic failures, trace root causes, and replay from any step. 98.8% root cause accuracy across 100 controlled scenarios

Host: GitHub
URL: https://github.com/varaddurge/argus
Owner: VaradDurge
License: other
Created: 2026-04-02T04:23:10.000Z (3 months ago)
Default Branch: master
Last Pushed: 2026-06-11T09:32:00.000Z (24 days ago)
Last Synced: 2026-06-11T10:08:28.061Z (24 days ago)
Topics: agent-workflows, ai-agents, cli, debugging, observability, replay
Language: TypeScript
Homepage:
Size: 6.3 MB
Stars: 4
Watchers: 0
Forks: 0
Open Issues: 1
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- Contributing: CONTRIBUTING.md
- License: LICENSE

Awesome Lists containing this project

README

          


  


  

  

  

  



---

**Production readiness platform for AI agent pipelines.**

Your LangGraph pipeline runs. No exception. But three nodes later something crashes with a `KeyError`. The node that crashed didn't cause it — some node upstream returned a dict with a missing field, and nothing caught it.

ARGUS sits between your nodes and catches silent failures, semantic degradation, and contract violations before they reach production.



---

## Install

```bash

pip install argus-agents

```

## Setup — pick whichever fits your code

**Option A — pass graph to constructor (recommended):**

```python

from argus import ArgusWatcher

watcher = ArgusWatcher(graph)      # attaches monitoring automatically

app = graph.compile()

result = app.invoke(initial_state) # run auto-saves when the last node finishes

print(watcher.run_id)              # access the run ID directly

```

**Option B — separate watch call:**

```python

from argus import ArgusWatcher

watcher = ArgusWatcher()

watcher.watch(graph)       # before graph.compile()

app = graph.compile()

result = app.invoke(initial_state)

```

**Option C — after compile (new in v0.5.0):**

```python

from argus import ArgusWatcher

watcher = ArgusWatcher()

app = graph.compile(checkpointer=memory)

app = watcher.watch_compiled(app)   # works on already-compiled graphs

result = app.invoke(initial_state)

```

All three work. No changes to your node functions. Runs are saved automatically for linear and fan-out/fan-in graphs. Only cyclic graphs (with back-edges) need a manual `watcher.finalize()` call.

### ArgusWatcher parameters

| Parameter | Type | Default | Description |

|-----------|------|---------|-------------|

| `graph` | `StateGraph` | `None` | LangGraph graph to monitor. If passed, `watch()` is called automatically. |

| `max_field_size` | `int` | `50_000` | Max characters per field before truncation in stored outputs. |

| `validators` | `dict` | `None` | Per-node semantic validators. Use `"*"` as key to run on every node. Each validator is a `(bool, str)` callable. |

| `strict` | `bool` | `False` | Enable extra checks: nested error keys, rate-limit responses, empty lists, type mismatches. Recommended for CI/staging. |

| `investigate` | `bool \| str` | `True` | LLM root-cause investigation. `True` = on failure only, `"always"` = every node, `False` = off. |

| `redact_keys` | `set[str]` | `None` | Field names to redact from stored outputs (e.g. `{"password", "api_key"}`). |

| `persist_state` | `bool` | `True` | Save run records to `.argus/runs/`. Set `False` for ephemeral monitoring. |

| `record_http` | `bool` | `False` | Record all HTTP calls for deterministic replay. Saved to disk per run. |

| `semantic_judge` | `bool` | `False` | Enable LLM-powered quality judge on every node output. Requires `OPENAI_API_KEY`. |

| `judge_model` | `str` | `"gpt-4o"` | Model for the semantic judge and investigation. |

```python

# Example with multiple options

watcher = ArgusWatcher(

    graph,

    semantic_judge=True,

    judge_model="gpt-4o-mini",

    strict=True,

    record_http=True,

    redact_keys={"api_key", "token"},

    validators={

        "summarize": lambda o: (len(o.get("summary", "")) > 10, "Summary too short"),

    },

)

```

---

## What it catches

**Silent failures** — a node returns `{}` or drops a required field. No exception, pipeline keeps running. ARGUS compares each node's output against the next node's type annotations and flags it before the crash happens downstream.

**Semantic failures** — structure is fine but the value is wrong. Pass a validator:

```python

watcher = ArgusWatcher(graph, validators={

    "classify": lambda o: (o.get("label") in ["yes", "no"], "unexpected label"),

    "*":        lambda o: ("error" not in o, "error key present"),

})

```

`"*"` runs on every node.

**Crashes** — full traceback captured per node, with a one-line root cause:

```

└─  KeyError: 'score'

└─  at pipeline.py:47  →  result = state["score"] * weight

└─  Field 'score' was absent from the incoming state

```

**Strict mode** — additional patterns: nested error keys, rate limit responses, empty required lists, `list[int]` vs `list[str]` type mismatches. Use in staging/CI:

```python

watcher = ArgusWatcher(graph, strict=True)

```

---

## Output

```

argus  run-abc12345  ·  2024-04-05 12:30  ·  1243 ms

status  ●  silent_failure

   1  fetch       43 ms    ✓  pass

   2  validate    12 ms    ⚠  silent failure

      └─  Field "score" is missing

      └─  process received bad state

   3  process    891 ms    ✗  crashed

      └─  KeyError: 'score'

      └─  Field 'score' was absent from the incoming state

root cause   validate

```

Parallel nodes shown as a grouped panel. Cyclic graphs show each iteration separately. Human interrupt chains stitched into one trace on resume.

---

## Rerun

A 10-node pipeline fails at node 7. You fix the bug. Instead of re-running nodes 1–6 and burning API credits:

```bash

argus replay  node_7

```

ARGUS restores the exact state at node 7 from disk and runs from there. Upstream outputs stay frozen. Only node 7 onward re-executes with your fixed code.

From the web UI — hover any step, click `↺ Rerun From Here`. After rerun, the diff view opens automatically.

```bash

argus diff     # compare rerun vs original

```

### What about external API calls?

By default, reruns call external APIs live (OpenAI, search tools, databases). Results may differ from the original run.

For **fully deterministic** reruns, record HTTP calls during the original run:

```python

watcher = ArgusWatcher(graph, record_http=True)

```

Every API response is saved to disk. During rerun, the recorded responses are served back — same data, zero extra cost, fully reproducible.

---

## Semantic Judge (LLM-powered)

Deterministic checks catch ~80% of production failures (missing fields, empty results, type mismatches, placeholder outputs). For the remaining 20% — subtle quality issues like wrong tone, unhelpful responses, or outdated information — enable the semantic judge:

```python

watcher = ArgusWatcher(graph, semantic_judge=True)

```

The LLM judge runs **after** deterministic checks on every node. It evaluates output quality, generates causal hypotheses, and suggests debugging steps.

```python

# With a specific model

watcher = ArgusWatcher(graph, semantic_judge=True, judge_model="gpt-4o")

# Combined with HTTP recording for deterministic + intelligent monitoring

watcher = ArgusWatcher(graph, semantic_judge=True, record_http=True)

```

Requires `OPENAI_API_KEY` in your environment. Uses GPT-4o by default.

**When to use:** complex multi-agent pipelines, customer-facing outputs, LLM-generated content where quality matters.

**When to skip:** simple pipelines, CI/CD speed runs, zero-cost monitoring.

---

## Adaptive Learning (v0.6)

ARGUS learns from your runs. When the semantic judge discovers a new failure pattern, it proposes a candidate signature. You review it in the **Approvals** page (`argus ui`) and choose:

- **Private** — adds to your local heuristic engine only

- **Shared** — pushes to the cloud so every ARGUS user benefits

The heuristic engine loads from three tiers: **bundled** (ships with ARGUS) → **private** (your local patterns) → **shared** (community-contributed, synced from cloud). All three are merged and deduplicated at startup.

```bash

argus ui          # open Approvals page to review candidates

argus login       # required for cloud sync

```

The semantic judge also overrides heuristic false positives. If a node failed *only* due to a heuristic pattern match (no structural issues, no validator failures), the LLM reviews context and can clear the flag.

---

## Diagnose setup issues

```bash

argus doctor

```

```

✓  python           Python 3.9.6

✓  langgraph        langgraph 0.6.11

✓  storage          312 runs stored, all healthy

✓  replay           all 7 node functions importable for rerun

✓  optional deps    openai (key set), dotenv

```

5 seconds to know if something is wrong — Python version, LangGraph compatibility, storage health, rerun readiness.

---

## CLI

```

argus list                          # all runs

argus show last                     # most recent run

argus show run                  # by full id or 8-char prefix

argus replay              # re-run from a node

argus replay   --only     # re-run just that one node

argus inspect  --step     # raw input/output for a node

argus diff                      # rerun vs original

argus diff              # any two runs

argus ui                            # open web dashboard

argus doctor                        # check your setup

argus login                         # sync runs to cloud

```

---

## Web UI

```bash

argus ui

```

Opens at `http://localhost:7842`. Serves runs from `.argus/runs/` in your current directory — no account needed.

Run detail, rerun tree, side-by-side diff, LLM cost per node, AI root cause investigation.

---

## Node statuses

| | |

|---|---|

| `✓` | pass |

| `~` | pass with warnings (empty optional fields) |

| `⚠` | silent failure (missing required fields) |

| `⊗` | semantic fail (validator returned False) |

| `⏸` | interrupted (human-in-the-loop pause) |

| `✗` | crashed |

---

## Without LangGraph

```python

from argus import ArgusSession

session = ArgusSession()

session.set_edges({"fetch": ["classify"], "classify": ["process"]})

fetch    = session.wrap("fetch",    fetch_fn)

classify = session.wrap("classify", classify_fn)

process  = session.wrap("process",  process_fn)

state = fetch(initial_state)

state = classify(state)

state = process(state)

session.finalize()

```

Works with Prefect, Temporal, or plain Python functions.

---

Requires Python 3.9+. LangGraph 0.2+ only needed for `ArgusWatcher`.

**v0.6.2** — [changelog](https://github.com/VaradDurge/ARGUS/releases/tag/v0.6.2)

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/varaddurge/argus

Awesome Lists containing this project

README