https://github.com/priyanshuchawda/tracepilot-gemini-cli

TracePilot: Gemini CLI fork with Phoenix/OpenInference tracing, MCP self-introspection, safety gates, redaction, evals, and a verified broken-repo repair loop

ai-agents arize-phoenix evals gemini-cli mcp openinference opentelemetry typescript

# TracePilot - edited Gemini CLI

TracePilot is a forked Gemini CLI TypeScript agent runtime with Phoenix
observability, OpenInference-style spans, Phoenix MCP self-introspection, safety
gates, redaction, deterministic evals, and a broken-repo repair demo.

Most coding agents reason primarily from the current prompt, repository, and
tool output. TracePilot additionally stores sanitized evidence from successful,
verified repairs and retrieves relevant prior repair evidence through Phoenix
before deciding how to handle a similar failure. Retrieved evidence informs the
repair plan; it does not bypass command safety or verification.

This public repository is being hardened from a successful proof demo into the
foundation of a developer-facing repair-reliability product. It does not claim
that TracePilot is production-ready for arbitrary repositories unless the
documented safety, redaction, and verification gates pass in that target
environment.
Productization work is tracked in
[Epic #124](https://github.com/priyanshuchawda/tracepilot-gemini-cli/issues/124).

The TypeScript monorepo now lives at the repository root so reviewers can see
the actual project immediately: `package.json`, `packages/`, `docs/`,
`examples/`, `scripts/`, and `.github/` are all top-level.

## What It Proves

TracePilot is built to prove this repair loop:

1. A user asks the agent to fix or debug a broken repo.
2. The agent runs a command.
3. The command fails.
4. The failure is traced to Phoenix.
5. The agent queries Phoenix through MCP.
6. The repair plan references trace evidence.
7. The agent applies the fix.
8. The agent reruns the test.
9. The test passes.
10. A deterministic eval result is logged.
11. Judge input/result artifacts are written for repair-quality review.
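
The ordering of the steps above can be sketched as a small driver function. This is a minimal illustration, not the TracePilot scheduler: every type and function name here (`RepairLoopDeps`, `runRepairLoop`, and the injected callbacks) is hypothetical, and the callbacks stand in for the real command runner, Phoenix exporter, MCP query, and planner.

```typescript
// Hypothetical names throughout; none of these come from the TracePilot codebase.
type CommandResult = { exitCode: number; output: string };

interface RepairPlan {
  evidenceRefs: string[]; // trace evidence the plan cites (step 6)
  apply: () => void; // applies the fix (step 7)
}

interface RepairLoopDeps {
  runCommand: () => CommandResult; // steps 2 and 8
  traceFailure: (result: CommandResult) => string; // step 4, returns a trace id
  queryEvidence: (traceId: string) => string[]; // step 5
  planRepair: (evidence: string[]) => RepairPlan; // step 6
  logEval: (passed: boolean) => void; // step 10
}

type LoopOutcome = "already_passing" | "repaired" | "repair_failed";

function runRepairLoop(deps: RepairLoopDeps): LoopOutcome {
  const first = deps.runCommand();
  if (first.exitCode === 0) return "already_passing";

  const traceId = deps.traceFailure(first);
  const evidence = deps.queryEvidence(traceId);
  const plan = deps.planRepair(evidence);

  // Assumed rule: a plan that cites no trace evidence is never applied.
  if (plan.evidenceRefs.length === 0) {
    deps.logEval(false);
    return "repair_failed";
  }

  plan.apply();
  const passed = deps.runCommand().exitCode === 0;
  deps.logEval(passed);
  return passed ? "repaired" : "repair_failed";
}
```

Injecting the dependencies keeps the control flow testable without a live Phoenix instance, which matches the offline and controlled modes described later in this README.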

## Repair Memory

A completed repair produces a structured, sanitized repair report containing
the failure signature, root cause, actions taken, verification results,
confidence, and proof level. TracePilot exports that report as Phoenix-visible
trace evidence.

On a later failure, TracePilot derives a failure signature and queries Phoenix
through MCP for matching successful repairs. The repair planner receives the
retrieved evidence with provenance and confidence. It can reuse the prior
diagnosis or reject it when the current repository state does not match. A
repair is not recorded as successful until its verification matrix passes.
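
As a rough sketch of the data flow described above, the report and lookup could look like the following. The field names mirror the prose, but the actual shapes in `repairReport.ts` may differ. The signature normalization, the `[0, 1]` confidence range, and the exact-match lookup are assumptions made for illustration; the real lookup goes through Phoenix MCP rather than an in-memory array.

```typescript
// Hypothetical shapes; field names follow the README prose, not repairReport.ts.
interface VerificationResult {
  check: string;
  passed: boolean;
}

interface RepairReport {
  failureSignature: string;
  rootCause: string;
  actionsTaken: string[];
  verificationResults: VerificationResult[];
  confidence: number; // assumed to be in [0, 1]
  proofLevel: string;
}

// Assumed normalization: lowercase, replace digit runs, and collapse whitespace
// so the same failure on different lines or runs maps to one signature.
function deriveFailureSignature(errorOutput: string): string {
  return errorOutput
    .toLowerCase()
    .replace(/\d+/g, "N")
    .replace(/\s+/g, " ")
    .trim();
}

// A repair counts as verified only when at least one check ran and all passed.
function isVerifiedRepair(report: RepairReport): boolean {
  return (
    report.verificationResults.length > 0 &&
    report.verificationResults.every((r) => r.passed)
  );
}

// Return verified prior repairs with a matching signature, highest confidence first.
function findMatchingRepairs(
  signature: string,
  memory: RepairReport[],
): RepairReport[] {
  return memory
    .filter((r) => r.failureSignature === signature && isVerifiedRepair(r))
    .sort((a, b) => b.confidence - a.confidence);
}
```

Filtering on `isVerifiedRepair` at retrieval time is one way to keep unverified repairs out of the planner's evidence, in line with the rule that a repair is not recorded as successful until its verification matrix passes.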

The implementation and focused tests are in:

- `packages/core/src/tracepilot/repairReport.ts`
- `packages/core/src/tracepilot/repairPlanner.ts`
- `packages/core/src/telemetry/phoenixSelfIntrospection.ts`
- `scripts/demo-phoenix-repair-memory-replay.ts`

## Phoenix Workflow

Phoenix is both the observability backend and the retrieval surface for repair
memory. OpenInference-style spans capture failures and completed repair reports.
Phoenix MCP lets the agent inspect matching prior evidence during a later repair
session. The smoke tests separately verify OTEL export and MCP visibility so a
local-only result cannot be presented as live Phoenix proof.

Latest verified Phoenix evidence:

- Phoenix OTEL smoke passed for session `tracepilot-smoke-1778699160858`.
- Phoenix MCP smoke passed for session `tracepilot-mcp-smoke-1778699158476`.
- Strict broken-node demo passed for session
  `tracepilot-broken-node-app-1778699160588`.
- Demo trace evidence: `de13112b1dadd28dda63a83365d92344`.

TracePilot reports now include `proofLevel` and `strictLiveProof`. Treat
`local_offline`, `controlled_substitute`, and `degraded_gemini` as development
evidence only; strict review evidence requires `live_phoenix`,
`live_gemini_phoenix`, or `hosted_cloud_run`.
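
A reviewer-side check for this rule might look like the sketch below. The six level names are taken from the paragraph above; the helper names and the `ProofFields` shape are illustrative, and the assumption that `strictLiveProof` may only be `true` at a strict level is inferred from the prose rather than taken from the code.

```typescript
// Level names come from the README; helper names are hypothetical.
const DEVELOPMENT_LEVELS = [
  "local_offline",
  "controlled_substitute",
  "degraded_gemini",
] as const;

const STRICT_LEVELS = [
  "live_phoenix",
  "live_gemini_phoenix",
  "hosted_cloud_run",
] as const;

type ProofLevel =
  | (typeof DEVELOPMENT_LEVELS)[number]
  | (typeof STRICT_LEVELS)[number];

// True only for levels that qualify as strict review evidence.
function isStrictProofLevel(level: ProofLevel): boolean {
  return (STRICT_LEVELS as readonly string[]).indexOf(level) !== -1;
}

interface ProofFields {
  proofLevel: ProofLevel;
  strictLiveProof: boolean;
}

// Assumption: a report may set strictLiveProof only when its level is strict.
function hasConsistentProofFlags(report: ProofFields): boolean {
  return !report.strictLiveProof || isStrictProofLevel(report.proofLevel);
}
```

For example, a report with `proofLevel: "degraded_gemini"` and `strictLiveProof: true` would fail `hasConsistentProofFlags` and should not be presented as strict evidence.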

Cloud Run is intentionally not live right now. The repo contains low-cost Cloud
Run deploy and smoke tooling, but a public hosted URL should only be shared
after redeploying and re-running the hosted smoke test.

## Repository Structure

```text
.
|-- README.md # GitHub landing page
|-- AGENT.md # Operating rules for coding agents
|-- PLAN.md # Original TracePilot implementation plan
|-- docs.md # Phoenix/OpenInference research notes
|-- .github/ # Workflows, issue templates, PR template
|-- packages/
| |-- cli/ # Gemini CLI terminal package
| |-- core/ # Agent runtime, tools, scheduler, telemetry
| |-- test-utils/ # Shared test helpers
| `-- vscode-ide-companion/
|-- packages/core/src/
| |-- telemetry/ # Phoenix OTEL, redaction, spans, MCP query
| |-- tracepilot/ # Repair planner and deterministic evals
| |-- policy/ # Command safety and risk classification
| |-- scheduler/ # Agent turn and tool execution path
| `-- tools/ # Shell, file, MCP, and related tools
|-- examples/
| `-- broken-node-app/ # Fail-plan-fix-rerun demo fixture
|-- scripts/ # Smoke tests, evals, demos, deploy helpers
|-- docs/ # TracePilot verification and Gemini CLI docs
`-- cloudbuild.tracepilot-cloud-run.yaml
```

## Key Links

- [Verification guide](docs/tracepilot.md)
- [Release and demo checklist](docs/tracepilot-release-demo-checklist.md)
- [Full implementation README snapshot](docs/tracepilot-implementation-readme.md)
- [Broken demo fixture](examples/broken-node-app)
- [Telemetry code](packages/core/src/telemetry)
- [Repair and eval code](packages/core/src/tracepilot)
- [Command safety policy](packages/core/src/policy)

## Quick Start

```bash
npm ci
npm run lint
npm run typecheck
npm run build
```

Focused TracePilot checks:

```bash
npm run ci:tracepilot # fast local tier
npm run ci:tracepilot -- --tier=medium
npm run doctor:tracepilot
npm run tracepilot:check
npm run smoke:phoenix
npm run smoke:phoenix:mcp
npm run demo:broken-node-app
npm run demo:idempotency-race:controlled
npm run demo:trace-ablation:controlled
npm run demo:trace-ablation
npm run judge:tracepilot -- --repair-artifact .ai-logs/tracepilot-check/repair-artifact.json --eval-report .ai-logs/demo-broken-node-app/result.json --judge-input-output .ai-logs/judge-input.json --judge-result-output .ai-logs/judge-result.json
npm run dashboard:tracepilot -- --report .ai-logs/demo-broken-node-app/result.json --repair-artifact .ai-logs/tracepilot-check/repair-artifact.json --judge-input .ai-logs/judge-input.json --judge-result .ai-logs/judge-result.json
npm run workbench:tracepilot
npm run smoke:cloud-run:local
```

## Judge Evaluation

The shortest credential-free evaluation path is:

```bash
npm ci
npm run test:tracepilot
npm run demo:phoenix-repair-memory:controlled
npm run demo:broken-node-app:offline
npm run smoke:cloud-run:local
```

This verifies the repair planner, repair-memory replay contract, deterministic
evaluation, evidence generation, and local Cloud Run service contract without
claiming live Phoenix proof. Judges with Phoenix credentials can then run
`npm run smoke:phoenix`, `npm run smoke:phoenix:mcp`, and
`npm run demo:phoenix-repair-memory` to verify live export and retrieval.

Use focused checks during development because the full root `npm test` run is slow.
`ci:tracepilot` writes required, optional, and skipped gate results to
`.ai-logs/tracepilot-ci/summary.json`. Save full logs under ignored `.ai-logs/`
files and share only pass/fail status, exit codes, and short redacted tails on
failure.
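
One way to produce a short redacted tail before sharing is sketched below. The two patterns are assumptions chosen for illustration (environment-style `KEY=value` assignments and bearer tokens); they are not the pattern set used by TracePilot's sanitizer, and `redact` and `redactedTail` are hypothetical helper names.

```typescript
// Hypothetical patterns; the real sanitizer's pattern set may differ.
const REDACTIONS: Array<[RegExp, string]> = [
  // NAME=value where NAME contains API_KEY, TOKEN, or SECRET
  [/\b([A-Z0-9_]*(?:API_KEY|TOKEN|SECRET)[A-Z0-9_]*)=\S+/g, "$1=[REDACTED]"],
  // "Bearer <token>" as found in Authorization headers
  [/\bBearer\s+[A-Za-z0-9._~+\/-]+=*/g, "Bearer [REDACTED]"],
];

// Apply every redaction pattern in order.
function redact(text: string): string {
  return REDACTIONS.reduce(
    (out, [pattern, replacement]) => out.replace(pattern, replacement),
    text,
  );
}

// Keep only the last maxLines lines of a log, then redact them.
function redactedTail(log: string, maxLines: number): string {
  if (maxLines <= 0) return "";
  const lines = log.split(/\r?\n/);
  return redact(lines.slice(-maxLines).join("\n"));
}
```

Pattern-based redaction only catches what its patterns describe, so a tail produced this way still deserves a quick manual read before it leaves the machine.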

`doctor:tracepilot` checks local readiness for strict proof prerequisites.
`tracepilot:check` writes a sanitized repair artifact/report for the current
folder. `judge:tracepilot` turns a repair artifact plus deterministic eval
report into `judge-input.json` and `judge-result.json`; supplied scored judge
results may also be bundled by demo commands with `--judge-result-input`. Judge
artifacts are repair-quality evidence, not a replacement for strict Phoenix
proof. `dashboard:tracepilot` renders a self-contained sanitized HTML proof
viewer for demos and judging. `workbench:tracepilot` starts the interactive
repair workbench at `http://127.0.0.1:4310`; controlled mode works without live
credentials.

## Environment

Copy [`.env.example`](.env.example) and set real values locally:

```bash
GEMINI_API_KEY=...
PHOENIX_API_KEY=...
PHOENIX_HOST=https://app.phoenix.arize.com/s/YOUR_SPACE
PHOENIX_BASE_URL=https://app.phoenix.arize.com/s/YOUR_SPACE
PHOENIX_COLLECTOR_ENDPOINT=
PHOENIX_PROJECT=tracepilot-gemini-cli
```

Never commit `.env` files, API keys, bearer tokens, private keys, or full
command outputs containing secrets. Any credential pasted into chat or
transcripts should be rotated before public submission.

## Current Status

| Area | Status | Evidence |
| -------------------------- | -------------------------------- | -------------------------------------------------------------------------------------------------- |
| Gemini CLI baseline | Working | Build, lint, and typecheck slices passed during P0 work. |
| Phoenix OTEL export | Working | `npm run smoke:phoenix` passed with real Phoenix config. |
| Phoenix MCP visibility | Working | `npm run smoke:phoenix:mcp` returned the smoke trace. |
| Phoenix self-introspection | Working | Queries Phoenix MCP for matching failed span evidence and degrades when absent. |
| Broken repo repair demo | Working | Strict demo exported/queried trace evidence and passed retry tests. |
| Redaction | Working for implemented patterns | Sanitizer, eval, and demo paths redact secrets before traces/reports. |
| Command safety gate | Working | Blocks destructive and credential-dumping commands in policy tests. |
| Proof dashboard | Working locally | Renders sanitized proof, repair, eval, and judge artifacts as self-contained HTML. |
| Cloud Run hosted URL | Not currently deployed | Redeploy only with approval, then run `npm run smoke:cloud-run -- --url "$CLOUD_RUN_SERVICE_URL"`. |
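
To make the command safety gate row concrete, a pattern-based classifier could be sketched as follows. The rules, names, and two-level `Risk` type are hypothetical and far narrower than a real policy; the implementation under `packages/core/src/policy` may classify risk differently.

```typescript
// Hypothetical rules for illustration; not the policy shipped in the repo.
type Risk = "allow" | "block";

interface Rule {
  reason: string;
  pattern: RegExp;
}

const BLOCK_RULES: Rule[] = [
  {
    reason: "recursive force delete",
    pattern: /\brm\s+(-[a-z]*r[a-z]*f|-[a-z]*f[a-z]*r)\b/i,
  },
  {
    reason: "dumps environment variables",
    pattern: /^\s*(printenv|env)\s*$/,
  },
  {
    reason: "reads a .env file",
    pattern: /\b(cat|less|more|type)\s+\S*\.env\b/,
  },
];

// Return the first matching block rule, or allow when none match.
function classifyCommand(command: string): { risk: Risk; reason?: string } {
  for (const rule of BLOCK_RULES) {
    if (rule.pattern.test(command)) {
      return { risk: "block", reason: rule.reason };
    }
  }
  return { risk: "allow" };
}
```

A deny-list like this is easy to test but easy to evade, so in this sketch it should be read as a floor under the agent's behavior rather than a complete safety boundary.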

The GitHub repository is public. Keep public claims tied to verified evidence:
offline or degraded runs are not strict live proof, and hosted Cloud Run proof
requires a freshly deployed and smoked URL.