{"id":46472387,"url":"https://github.com/shepard-system/shepard-obs-stack","last_synced_at":"2026-04-01T18:01:37.258Z","repository":{"id":340068208,"uuid":"1164356806","full_name":"shepard-system/shepard-obs-stack","owner":"shepard-system","description":"Self-hosted observability for AI coding agents. Clone. Configure. See.","archived":false,"fork":false,"pushed_at":"2026-03-19T00:39:14.000Z","size":1486,"stargazers_count":63,"open_issues_count":0,"forks_count":4,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-03-19T14:00:51.095Z","etag":null,"topics":["ai-agents","claude-code","codex","gemini-cli","grafana","loki","observability","opentelemetry","prometheus"],"latest_commit_sha":null,"homepage":null,"language":"Shell","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/shepard-system.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":"CODEOWNERS","security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null},"funding":{"github":"digitalashes"}},"created_at":"2026-02-23T01:34:40.000Z","updated_at":"2026-03-19T00:39:17.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/shepard-system/shepard-obs-stack","commit_stats":null,"previous_names":["shepard-system/shepard-obs-stack"],"tags_count":4,"template":false,"template_full_name":null,"purl":"pkg:github/shepard-system/shepard-obs-stack","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/shepard-system%2Fshepard-obs-stack","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHu
b/repositories/shepard-system%2Fshepard-obs-stack/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/shepard-system%2Fshepard-obs-stack/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/shepard-system%2Fshepard-obs-stack/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/shepard-system","download_url":"https://codeload.github.com/shepard-system/shepard-obs-stack/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/shepard-system%2Fshepard-obs-stack/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31290741,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-01T13:12:26.723Z","status":"ssl_error","status_checked_at":"2026-04-01T13:12:25.102Z","response_time":53,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai-agents","claude-code","codex","gemini-cli","grafana","loki","observability","opentelemetry","prometheus"],"created_at":"2026-03-06T06:23:13.380Z","updated_at":"2026-04-01T18:01:37.250Z","avatar_url":"https://github.com/shepard-system.png","language":"Shell","funding_links":["https://github.com/sponsors/digitalashes"],"categories":[],"sub_categories":[],"readme":"# 
shepard-obs-stack\n\n[![Grafana](https://img.shields.io/badge/Grafana-12.4.0-F46800?logo=grafana\u0026logoColor=white)](https://grafana.com/)\n[![Prometheus](https://img.shields.io/badge/Prometheus-v3.9.1-E6522C?logo=prometheus\u0026logoColor=white)](https://prometheus.io/)\n[![Loki](https://img.shields.io/badge/Loki-3.6.7-2C3239?logo=grafana\u0026logoColor=white)](https://grafana.com/oss/loki/)\n[![OTel Collector](https://img.shields.io/badge/OTel_Collector-0.146.0-4B44CE?logo=opentelemetry\u0026logoColor=white)](https://opentelemetry.io/docs/collector/)\n[![License: Elastic-2.0](https://img.shields.io/badge/License-Elastic--2.0-blue.svg)](LICENSE)\n[![Tests](https://github.com/shepard-system/shepard-obs-stack/actions/workflows/test.yml/badge.svg)](https://github.com/shepard-system/shepard-obs-stack/actions/workflows/test.yml)\n\n**The Eye** — self-hosted observability for AI coding assistants.\n\nYou use Claude Code, Codex, or Gemini CLI every day.\nYou have no idea how much they cost, which tools they call, or whether they're actually helping.\nThis fixes that.\n\n![Cost Dashboard](docs/screenshots/cost-dashboard.png)\n\n\u003cdetails\u003e\n\u003csummary\u003eMore screenshots\u003c/summary\u003e\n\n**Tools** — 5K calls across all three CLIs, top tools ranked, failing tools by error count:\n\n![Tools Dashboard](docs/screenshots/tools-dashboard.png)\n\n**Operations** — live event rate, breakdown by source and event type:\n\n![Operations Dashboard](docs/screenshots/operations-dashboard.png)\n\n**Claude Code Deep Dive** — per-model cost, token breakdown, cache efficiency, productivity ratio:\n\n![Claude Deep Dive](docs/screenshots/claude-deep-dive.png)\n\n**Claude Code Deep Dive (Tools)** — tool decisions, active time breakdown:\n\n![Claude Deep Dive Tools](docs/screenshots/claude-deep-dive-tools.png)\n\n**Quality** — cache hit rates, error rates, session trends:\n\n![Quality Dashboard](docs/screenshots/quality-dashboard.png)\n\n\u003c/details\u003e\n\n## Table of 
Contents\n\n- [Highlights](#highlights)\n- [Quick Start](#quick-start)\n- [Dashboards](#dashboards)\n- [How It Works](#how-it-works)\n- [Hook Setup](#hook-setup)\n- [Claude Code Skills](#claude-code-skills)\n- [Rust Accelerator](#rust-accelerator-optional)\n- [Alerting](#alerting)\n- [Services](#services)\n- [Architecture](#architecture)\n- [Project Structure](#project-structure)\n- [Testing](#testing)\n- [Contributing](#contributing)\n- [License](#license)\n\n## Highlights\n\n- **One command** to start: `./scripts/init.sh` — 6 services, 9 dashboards, under a minute\n- **Three CLIs supported**: Claude Code, Codex, Gemini CLI — hooks + native OpenTelemetry\n- **Nine Grafana dashboards** auto-provisioned: cost, tools, operations, quality, per-provider deep dives, session timeline, and side-by-side provider comparison\n- **Minimal dependencies** — Docker, plus `bash`, `curl`, and `jq` on the host for hooks and tests. No Python, no Node, no cloud accounts\n- **Optional [Rust accelerator](https://github.com/shepard-system/shepard-hooks-rs)** — drop-in `shepard-hook` binary replaces bash+jq+curl. Hooks auto-detect it; falls back to bash if absent\n- **Seven Claude Code [skills](#claude-code-skills)** — `/obs-status`, `/obs-cost`, `/obs-sessions`, `/obs-tools`, `/obs-alerts`, `/obs-compare`, `/obs-query` — query the stack without leaving your terminal\n- **Works offline** — everything runs on localhost, your data stays on your machine\n\n## Quick Start\n\n**Prerequisites:** Docker (with Compose v2), `curl`, `jq`, and at least one AI CLI installed.\n\n```bash\ngit clone https://github.com/shepard-system/shepard-obs-stack.git\ncd shepard-obs-stack\n./scripts/init.sh          # starts stack + health check\n./hooks/install.sh         # injects hooks into your CLI configs\n```\n\nOpen [localhost:3000](http://localhost:3000) (admin / shepherd). 
Use your CLI as usual — data appears in dashboards within seconds.\n\n```bash\n./scripts/test-signal.sh   # verify the full pipeline (11 checks)\n```\n\n## Dashboards\n\n### Unified (cross-provider)\n\n| Dashboard      | Question it answers                     |\n|----------------|-----------------------------------------|\n| **Cost**       | How much is this costing me?            |\n| **Tools**      | Who is performing and who is wandering? |\n| **Operations** | What is happening right now?            |\n| **Quality**    | How well is the system working?         |\n\n### Deep Dive (per-provider)\n\n| Dashboard       | What you see                                            |\n|-----------------|---------------------------------------------------------|\n| **Claude Code** | Token usage, cost by model, tool decisions, active time |\n| **Codex**       | Sessions, API latency percentiles, reasoning tokens     |\n| **Gemini CLI**  | Token breakdown, latency heatmap, tool call routing     |\n\n### Session Timeline \u0026 Comparative\n\n| Dashboard            | What you see                                                                               |\n|----------------------|--------------------------------------------------------------------------------------------|\n| **Session Timeline** | Synthetic traces from all 3 CLI session logs — tool call waterfall, MCP timing, sub-agents |\n| **Comparative**      | Side-by-side provider comparison: sessions, cost, tokens, tools, errors, top repos         |\n\nClick any Trace ID to open the full waterfall in Grafana Explore → Tempo.\n\nDashboard template variables: **Tools** and **Operations** support `$source` and `$git_repo` filtering.\n**Deep Dive** dashboards use `$model`. **Session Timeline** uses `$provider`. **Comparative** uses `$git_repo`. 
**Cost** and **Quality** show aggregated data without filters.\n\n## How It Works\n\nAI CLIs emit telemetry through two channels:\n\n```\nAI CLI (Claude Code / Codex / Gemini)\n    │\n    ├── bash hooks → OTLP metrics (tool calls, events, git context)\n    │                 └─→ OTel Collector :4318\n    │\n    └── native OTel → gRPC (tokens, cost, logs, traces)\n                       └─→ OTel Collector :4317\n                             │\n                             ├── metrics → Prometheus :9090\n                             ├── logs → Loki :3100\n                             └── traces → Tempo :3200\n                                           │\nLoki recording rules ──── remote_write ───→ Prometheus\n                                           │\nGrafana :3000 ←── PromQL + LogQL ──────────┘\n```\n\n**Hooks** provide what native OTel cannot: git repo context and labeled tool/event counters. \nEverything else (tokens, cost, sessions) comes from native OTel export.\n\n## Hook Setup\n\n```bash\n./hooks/install.sh              # all detected CLIs\n./hooks/install.sh claude       # specific CLI\n./hooks/install.sh codex gemini # selective\n./hooks/uninstall.sh            # clean removal\n```\n\nThe installer auto-detects installed CLIs and merges hook configuration into their config files (creating backups first).\n\n| CLI         | Hooks                                                 | Native OTel signals     |\n|-------------|-------------------------------------------------------|-------------------------|\n| Claude Code | `PreToolUse`, `PostToolUse`, `SessionStart`, `Stop`   | metrics + logs          |\n| Codex CLI   | `agent-turn-complete`                                 | logs                    |\n| Gemini CLI  | `AfterTool`, `AfterAgent`, `AfterModel`, `SessionEnd` | metrics + logs + traces |\n\n## Claude Code Skills\n\nSeven slash-command skills for querying the obs stack directly from Claude Code — no browser needed.\n\n| Skill | What it does 
|\n|-------|-------------|\n| `/obs-status` | Stack health: service status, scrape targets, last telemetry, active alerts |\n| `/obs-cost` | Cost report by provider and model (supports `today`, `yesterday`, `week`, `24h`) |\n| `/obs-sessions` | Recent sessions with model, duration, tool count, cost |\n| `/obs-tools` | Top tools, error rates, usage by provider and repo |\n| `/obs-alerts` | Active alerts with severity and resolution hints |\n| `/obs-compare` | Side-by-side provider comparison: sessions, cost, tokens, tools, errors |\n| `/obs-query` | Free-form PromQL or LogQL — run any query inline |\n\nSkills are installed automatically when you clone the repo (they live in `.claude/skills/`). All API calls go through `scripts/obs-api.sh` — a centralized helper that's ready for auth and TLS when you need it:\n\n```bash\n# Default: plain HTTP to localhost (single-machine, no auth)\n./scripts/obs-api.sh prometheus /api/v1/query --data-urlencode 'query=up'\n\n# With auth (set env vars when hardening or going multi-machine)\nSHEPARD_API_TOKEN=secret ./scripts/obs-api.sh prometheus /api/v1/query ...\nSHEPARD_CA_CERT=/path/to/ca.pem ./scripts/obs-api.sh loki /ready\n```\n\n## Rust Accelerator (optional)\n\nAll hooks work out of the box with bash + jq + curl. For faster execution, you can optionally install the [Rust accelerator](https://github.com/shepard-system/shepard-hooks-rs) — a single static binary that replaces the entire bash pipeline:\n\n```bash\n./scripts/install-accelerator.sh           # latest release → hooks/bin/ (no sudo)\n./scripts/install-accelerator.sh v0.4.0    # specific version\n```\n\nThe installer downloads a pre-built binary from [GitHub Releases](https://github.com/shepard-system/shepard-hooks-rs/releases) (linux/macOS, x64/arm64) and verifies it against the `SHA256SUMS` file published with each release. 
The binary is placed in `hooks/bin/` (gitignored, project-local).\n\nHooks auto-detect it via `hooks/lib/accelerator.sh` (project-local → PATH → bash fallback). No configuration needed — if the binary is present, hooks use it; if not, they fall back to bash.\n\nRemove with `./hooks/uninstall.sh` or simply delete `hooks/bin/`.\n\n## Alerting\n\nAlertmanager runs on :9093 with 16 alert rules in three tiers:\n\n| Tier               | Alerts | Examples                                                                                                                              |\n|--------------------|--------|---------------------------------------------------------------------------------------------------------------------------------------|\n| **Infrastructure** | 6      | `OTelCollectorDown`, `CollectorHighMemory`, `PrometheusHighMemory`, export failures                                                   |\n| **Pipeline**       | 5      | `LokiDown`, `ShepherdServicesDown`, `TempoDown`, `PrometheusTargetDown`, `LokiRecordingRulesFailing`                                   |\n| **Business logic** | 5      | `HighSessionCost` (\u003e$10/hr), `HighTokenBurn` (\u003e50k tok/min), `HighToolErrorRate` (\u003e10%), `SensitiveFileAccess`, `NoTelemetryReceived` |\n\nInhibit rules suppress business-logic alerts when infrastructure is down.\n\nNative Telegram, Slack, and Discord receivers are included — uncomment and configure in `configs/alertmanager/alertmanager.yaml`:\n\n```yaml\n# telegram_configs:\n#   - bot_token: 'YOUR_BOT_TOKEN'\n#     chat_id: YOUR_CHAT_ID\n#     send_resolved: true\n```\n\n## Services\n\n| Service        | Port      | Purpose              |\n|----------------|-----------|----------------------|\n| Grafana        | 3000      | Dashboards \u0026 explore |\n| Prometheus     | 9090      | Metrics \u0026 alerts     |\n| Loki           | 3100      | Log aggregation      |\n| Tempo          | 3200      | Distributed tracing  |\n| Alertmanager   | 9093      | 
Alert routing        |\n| OTel Collector | 4317/4318 | OTLP gRPC + HTTP     |\n\n## Architecture\n\n\u003cdetails\u003e\n\u003csummary\u003eC4 diagrams (click to expand)\u003c/summary\u003e\n\n### System Context\n\n![C1 System Context](docs/c4/c1-system-context.svg)\n\n### Containers\n\n![C2 Container](docs/c4/c2-container.svg)\n\n### Hook Components\n\n![C3 Hook Components](docs/c4/c3-hooks-components.svg)\n\n### Hook Event Flow\n\n![C4 Hook Event Flow](docs/c4/c4-hook-event-flow.svg)\n\n### Event Schema Normalization\n\n![C5 Event Schema Normalization](docs/c4/c5-event-schema-normalization.svg)\n\n\u003c/details\u003e\n\n## Project Structure\n\n```\nshepard-obs-stack/\n├── docker-compose.yaml\n├── .env.example\n├── .claude/skills/            # Claude Code slash-command skills\n│   ├── obs-status/            # /obs-status — stack health\n│   ├── obs-cost/              # /obs-cost — cost report\n│   ├── obs-sessions/          # /obs-sessions — session summary\n│   ├── obs-tools/             # /obs-tools — tool usage\n│   ├── obs-alerts/            # /obs-alerts — active alerts\n│   ├── obs-compare/           # /obs-compare — provider comparison\n│   └── obs-query/             # /obs-query — free-form PromQL/LogQL\n├── hooks/\n│   ├── bin/                   # Rust accelerator binary (gitignored, downloaded)\n│   ├── lib/                   # shared: accelerator, git context, OTLP metrics + traces, sensitive file detection, session parser\n│   ├── claude/                # PreToolUse + PostToolUse + SessionStart + Stop\n│   ├── codex/                 # notify.sh (agent-turn-complete)\n│   ├── gemini/                # AfterTool + AfterAgent + AfterModel + SessionEnd\n│   ├── install.sh             # auto-detect + inject\n│   └── uninstall.sh           # clean removal\n├── scripts/\n│   ├── init.sh                # bootstrap\n│   ├── install-accelerator.sh # download Rust accelerator to hooks/bin/\n│   ├── obs-api.sh             # centralized API client (auth-ready)\n│   
├── test-signal.sh         # pipeline verification (11 checks)\n│   └── render-c4.sh           # render PlantUML → SVG\n├── tests/\n│   ├── run-all.sh             # test orchestrator (--e2e for Docker smoke)\n│   ├── test-shell-syntax.sh   # bash -n + shellcheck\n│   ├── test-config-validate.sh # JSON + YAML validation\n│   ├── test-hooks.sh          # behavioral tests (41 tests)\n│   ├── test-parsers.sh        # session parser tests (37 tests)\n│   └── fixtures/              # minimal session logs (Claude, Codex, Gemini)\n├── configs/\n│   ├── otel-collector/        # receivers → processors → exporters\n│   ├── prometheus/            # scrape targets + alert rules\n│   ├── alertmanager/          # routing, Telegram/Slack/Discord receivers\n│   ├── loki/                  # storage + 15 recording rules\n│   ├── tempo/                 # trace storage, 7d retention\n│   └── grafana/               # provisioning + 9 dashboard JSONs\n└── docs/c4/                   # architecture diagrams\n```\n\n## Testing\n\n128 automated tests across 4 suites, plus a Docker-based E2E smoke test:\n\n```bash\nbash tests/run-all.sh         # unit tests: syntax, configs, hooks, parsers\nbash tests/run-all.sh --e2e   # + Docker E2E (starts stack, runs test-signal.sh)\n```\n\n| Suite | Tests | What it checks |\n|-------|-------|----------------|\n| Shell Syntax | 24 | `bash -n` on all scripts, shellcheck (if installed) |\n| Config Validation | 26 | JSON dashboards (jq) + YAML configs (PyYAML) + promtool rules + alert regression |\n| Hook Behavior | 41 | PreToolUse guard, PostToolUse metrics, Stop compaction, all Gemini hooks, Codex, install/uninstall |\n| Session Parsers | 37 | Span count, required fields, attributes, error status, trace_id consistency, context breakdown, per-turn spans |\n\nCI runs automatically on push/PR via [GitHub Actions](.github/workflows/test.yml).\n\n## Contributing\n\nIssues and pull requests are welcome. 
Before submitting changes, run the tests:\n\n```bash\nbash tests/run-all.sh\n```\n\n## License\n\n[Elastic License 2.0](LICENSE) — free to use, modify, and distribute. Cannot be offered as a hosted or managed service.\n\nPart of the [Shepard System](https://github.com/shepard-system).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fshepard-system%2Fshepard-obs-stack","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fshepard-system%2Fshepard-obs-stack","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fshepard-system%2Fshepard-obs-stack/lists"}