https://github.com/helmcode/k8s-watchdog-ai
Autonomous Kubernetes cluster observability with AI-powered weekly health reports
- Host: GitHub
- URL: https://github.com/helmcode/k8s-watchdog-ai
- Owner: helmcode
- License: mit
- Created: 2026-01-13T14:31:00.000Z (5 months ago)
- Default Branch: main
- Last Pushed: 2026-01-22T13:20:06.000Z (5 months ago)
- Last Synced: 2026-01-23T04:22:41.992Z (5 months ago)
- Language: Python
- Homepage:
- Size: 106 KB
- Stars: 8
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# K8s Watchdog AI
> Autonomous Kubernetes cluster observability with AI-powered weekly health reports
[Python](https://www.python.org/downloads/) | [MIT License](https://opensource.org/licenses/MIT) | [FastAPI](https://fastapi.tiangolo.com/)
An intelligent Kubernetes monitoring agent that uses Claude AI to autonomously investigate cluster health, analyze metrics from Prometheus, and generate comprehensive weekly PDF reports delivered via Slack.
## Features
- **AI-Powered Analysis**: Claude AI autonomously investigates cluster issues using direct Python tools
- **Prometheus Integration**: Analyzes metrics to detect resource inefficiencies (optional)
- **Read-Only by Design**: All operations are read-only for safety
- **PDF Reports**: Professional HTML reports converted to PDF via WeasyPrint
- **Slack Integration**: Reports delivered via Slack with detailed tool usage information
- **Historical Tracking**: SQLite storage for report history
- **REST API**: FastAPI server for on-demand report generation
- **Graceful Degradation**: Works with or without Prometheus
## Architecture
```
┌─────────────────────────────────────────────────────┐
│                 Kubernetes Cluster                  │
│                                                     │
│  ┌──────────────────────────────────────────────┐   │
│  │          K8s Watchdog AI (FastAPI)           │   │
│  │                                              │   │
│  │  ┌────────────────────────────────────┐      │   │
│  │  │ Claude AI Agent                    │      │   │
│  │  │  - Autonomous investigation        │      │   │
│  │  │  - Tool selection & execution      │      │   │
│  │  │  - Report generation               │      │   │
│  │  └─────────┬──────────────────────────┘      │   │
│  │            │                                 │   │
│  │  ┌─────────▼──────────┐  ┌─────────────────┐ │   │
│  │  │ Kubernetes Tools   │  │ Prometheus      │ │   │
│  │  │ - get pods/nodes   │  │ Tools           │ │   │
│  │  │ - describe         │  │ - query         │ │   │
│  │  │ - logs             │  │ - range query   │ │   │
│  │  │ - events           │  │ - memory/cpu    │ │   │
│  │  └────────────────────┘  └─────────────────┘ │   │
│  │                                              │   │
│  │  ┌────────────────────────────────────┐      │   │
│  │  │ Report Generator & Storage         │      │   │
│  │  │  - WeasyPrint (HTML → PDF)         │      │   │
│  │  │  - SQLite (history)                │      │   │
│  │  │  - Slack Files API v2              │      │   │
│  │  └────────────────────────────────────┘      │   │
│  └──────────────────┬───────────────────────────┘   │
└─────────────────────┼───────────────────────────────┘
                      │
                      ▼
                Slack Webhook
```
## Quick Start
### Prerequisites
- Kubernetes cluster with kubectl access (or kubeconfig for local development)
- Anthropic API key ([Get one here](https://console.anthropic.com/))
- Slack webhook URL ([Create one](https://api.slack.com/messaging/webhooks))
- Slack Bot Token and Channel ID for file uploads ([Create bot](https://api.slack.com/apps))
- Prometheus running in cluster (optional - reports work without it)
### Local Development with Docker Compose
```bash
# 1. Clone the repository
git clone https://github.com/helmcode/k8s-watchdog-ai.git
cd k8s-watchdog-ai
# 2. Copy and configure environment
cp .env.example .env
# Edit .env with your API keys and settings
# 3. Run with docker-compose
docker-compose up -d
# 4. Trigger report generation
curl -X POST http://localhost:8000/report
# 5. Check status
curl http://localhost:8000/health
# 6. View logs
docker-compose logs -f
```
### Deploy to Kubernetes with Helm
```bash
# 1. Store secrets in Vault (if using Vault)
vault kv put helmcode_platform/k8s_watchdog_ai \
ANTHROPIC_API_KEY="sk-ant-..." \
SLACK_WEBHOOK_URL="https://hooks.slack.com/..." \
SLACK_BOT_TOKEN="xoxb-..." \
SLACK_CHANNEL="C123456789"
# 2. Install with Helm
helm install k8s-watchdog-ai ./helm \
--namespace watchdog-ai \
--create-namespace \
--values ./helm/values/prod.yaml
# 3. Verify deployment
kubectl get pods -n watchdog-ai
kubectl logs -f deployment/k8s-watchdog-ai -n watchdog-ai
```
For detailed Helm deployment instructions, see [helm/README.md](helm/README.md).
### Deploy with ArgoCD
```bash
# Apply ArgoCD Application
kubectl apply -f helm/argocd/application.yaml
# Monitor deployment
argocd app get k8s-watchdog-ai
```
For ArgoCD configuration details, see [helm/argocd/README.md](helm/argocd/README.md).
## Configuration
| Variable | Required | Default | Description |
|----------|----------|---------|-------------|
| `ANTHROPIC_API_KEY` | ✅ | - | Claude API key |
| `ANTHROPIC_MODEL` | ❌ | claude-sonnet-4-20250514 | AI model to use |
| `SLACK_WEBHOOK_URL` | ✅ | - | Slack webhook for messages |
| `SLACK_BOT_TOKEN` | ✅ | - | Bot token for file uploads |
| `SLACK_CHANNEL` | ✅ | - | Channel ID (e.g., C123456789) |
| `PROMETHEUS_URL` | ❌ | http://prometheus:9090 | Prometheus server URL |
| `CLUSTER_NAME` | ❌ | default | Cluster identifier |
| `CLIENT_NAME` | ❌ | default | Client/customer name |
| `EXCLUDED_NAMESPACES` | ❌ | kube-system,kube-public,... | Namespaces to exclude |
| `REPORT_LANGUAGE` | ❌ | spanish | Report language (spanish/english) |
| `JOB_POLL_INTERVAL` | ❌ | 5 | Seconds between queue polls |
| `JOB_MAX_RETRIES` | ❌ | 3 | Max retry attempts for failed jobs |
| `SQLITE_PATH` | ❌ | /app/data/reports.db | SQLite database path |
| `LOG_LEVEL` | ❌ | INFO | Logging level |
See [.env.example](.env.example) for complete list.
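
As an illustration of how the table above maps onto code, here is a minimal sketch of an environment loader. It is not the project's actual settings module: the `Settings` class, `load_settings` function, and field names are hypothetical, and `EXCLUDED_NAMESPACES` is left out because its default is truncated in the table.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

# Environment variable -> hypothetical field name (required variables).
REQUIRED = {
    "ANTHROPIC_API_KEY": "anthropic_api_key",
    "SLACK_WEBHOOK_URL": "slack_webhook_url",
    "SLACK_BOT_TOKEN": "slack_bot_token",
    "SLACK_CHANNEL": "slack_channel",
}

# Environment variable -> (hypothetical field name, type conversion).
OPTIONAL = {
    "ANTHROPIC_MODEL": ("anthropic_model", str),
    "PROMETHEUS_URL": ("prometheus_url", str),
    "CLUSTER_NAME": ("cluster_name", str),
    "CLIENT_NAME": ("client_name", str),
    "REPORT_LANGUAGE": ("report_language", str),
    "JOB_POLL_INTERVAL": ("job_poll_interval", int),
    "JOB_MAX_RETRIES": ("job_max_retries", int),
    "SQLITE_PATH": ("sqlite_path", str),
    "LOG_LEVEL": ("log_level", str),
}


@dataclass(frozen=True)
class Settings:
    anthropic_api_key: str
    slack_webhook_url: str
    slack_bot_token: str
    slack_channel: str
    # Defaults copied from the configuration table.
    anthropic_model: str = "claude-sonnet-4-20250514"
    prometheus_url: str = "http://prometheus:9090"
    cluster_name: str = "default"
    client_name: str = "default"
    report_language: str = "spanish"
    job_poll_interval: int = 5
    job_max_retries: int = 3
    sqlite_path: str = "/app/data/reports.db"
    log_level: str = "INFO"


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Build Settings from a mapping (os.environ by default), failing fast on gaps."""
    env = os.environ if env is None else env
    missing = [var for var in REQUIRED if not env.get(var)]
    if missing:
        raise ValueError("missing required variables: " + ", ".join(missing))
    values = {field: env[var] for var, field in REQUIRED.items()}
    for var, (field, cast) in OPTIONAL.items():
        if var in env:
            values[field] = cast(env[var])
    return Settings(**values)
```

Passing the environment as a plain mapping keeps the loader easy to exercise in tests without touching the real process environment.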
## How It Works
1. **FastAPI Server**: Runs continuously, exposing `/report` and `/health` endpoints
2. **Trigger**: Can be called via HTTP POST or scheduled with Kubernetes CronJob
3. **AI Investigation**:
   - Claude receives a system prompt with available tools
   - Agent autonomously decides what to investigate
   - Makes iterative queries to Kubernetes and Prometheus (if available)
4. **Analysis**: AI analyzes cluster health, resource usage, and metrics
5. **Report Generation**: Creates HTML report, converts to PDF with WeasyPrint
6. **Delivery**: Uploads PDF to Slack with detailed tool usage information
7. **Storage**: Saves report to SQLite for history tracking
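
The storage step can be pictured with a small sketch using Python's standard `sqlite3` module. The table layout and the `open_store`, `save_report`, and `recent_reports` helpers are assumptions made for illustration; the project's real schema may differ.

```python
import sqlite3
from datetime import datetime, timezone

# Hypothetical schema: one row per generated report.
SCHEMA = """
CREATE TABLE IF NOT EXISTS reports (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    cluster_name TEXT NOT NULL,
    created_at TEXT NOT NULL,
    html TEXT NOT NULL
)
"""


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the report database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def save_report(conn: sqlite3.Connection, cluster_name: str, html: str) -> int:
    """Insert one report and return its row id."""
    created_at = datetime.now(timezone.utc).isoformat()
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "INSERT INTO reports (cluster_name, created_at, html) VALUES (?, ?, ?)",
            (cluster_name, created_at, html),
        )
    return cur.lastrowid


def recent_reports(conn: sqlite3.Connection, cluster_name: str, limit: int = 5):
    """Return (id, created_at) for the newest reports of one cluster."""
    return conn.execute(
        "SELECT id, created_at FROM reports WHERE cluster_name = ? "
        "ORDER BY id DESC LIMIT ?",
        (cluster_name, limit),
    ).fetchall()
```

In a deployment the path would presumably come from `SQLITE_PATH`; the in-memory default here only keeps the sketch self-contained.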
### Example AI Investigation Flow
```
Claude: "Let me check the overall pod status"
→ Calls: kubectl_get_pods(namespace="default", all_namespaces=True)
Claude: "I see pod X has 15 restarts. Let me investigate"
→ Calls: kubectl_describe_pod(pod="X", namespace="production")
→ Calls: kubectl_get_pod_logs(pod="X", namespace="production", tail=100)
Claude: "This looks like OOMKilled. Let me check memory metrics"
→ Calls: prometheus_check_pod_memory(pod="X", namespace="production")
→ Calls: prometheus_query(query="container_memory_working_set_bytes{pod='X'}")
Claude: "Memory usage is consistently above request. Recommending increase"
→ Generates HTML report with specific recommendations
→ Report includes: issue analysis, metrics charts, action plan
```
```
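
The flow above is an iterative loop: the model picks a tool, the tool runs, and the result is fed back until the model produces a final answer. The sketch below shows that control flow only. It does not call the Anthropic API; `scripted_model` is a hypothetical stand-in that replays a fixed plan, and the entries in `FAKE_TOOLS` return canned text rather than querying a cluster.

```python
from typing import Any, Callable, Dict, List, Tuple

# Hypothetical stand-ins for the real tools; each returns canned text.
FAKE_TOOLS: Dict[str, Callable[..., str]] = {
    "kubectl_get_pods": lambda namespace="default", all_namespaces=False: "pod X: 15 restarts",
    "kubectl_describe_pod": lambda pod, namespace: f"{namespace}/{pod}: last state OOMKilled",
}

Call = Tuple[str, Dict[str, Any], str]  # (tool name, arguments, result)


def run_investigation(decide, tools, max_steps: int = 10):
    """Ask `decide` for the next action until it returns a final answer."""
    history: List[Call] = []
    for _ in range(max_steps):
        action = decide(history)
        if action["type"] == "final":
            return action["text"], history
        name, args = action["tool"], action.get("args", {})
        tool = tools.get(name)
        if tool is None:
            result = f"error: unknown tool {name}"
        else:
            try:
                result = tool(**args)
            except Exception as exc:  # report tool failures back instead of crashing
                result = f"error: {exc}"
        history.append((name, args, result))
    return "step limit reached", history


def scripted_model(history: List[Call]) -> Dict[str, Any]:
    """Stand-in for the model: replays a fixed plan, one step per call."""
    plan = [
        {"type": "tool", "tool": "kubectl_get_pods", "args": {"all_namespaces": True}},
        {"type": "tool", "tool": "kubectl_describe_pod",
         "args": {"pod": "X", "namespace": "production"}},
        {"type": "final", "text": "Pod X was OOMKilled; consider raising its memory request."},
    ]
    return plan[len(history)]


summary, calls = run_investigation(scripted_model, FAKE_TOOLS)
```

Recording every call in `history` is also what makes a per-report tool usage summary possible, and the `max_steps` cap keeps a confused model from looping indefinitely.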
### Tool Availability Detection
The system detects which data sources are reachable and states the result alongside the report, for example:
```
✅ Kubernetes API: 5 tool types used
• Tools: kubectl_describe_pod, kubectl_get_deployments, kubectl_get_events, ...
❌ Prometheus: Connection failed
• Prometheus not available: All connection attempts failed
ℹ️ Report generated using Kubernetes data only
```
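
A status summary like the one above could be assembled from per-source records. The following is a sketch under assumptions: `SourceStatus` and `format_status` are hypothetical names, and plain ASCII markers are used in place of the emoji shown in the sample output.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SourceStatus:
    name: str
    tools_used: List[str] = field(default_factory=list)
    error: Optional[str] = None  # None means the source was reachable


def format_status(sources: List[SourceStatus]) -> List[str]:
    """Render one status block per data source, plus a note on partial data."""
    lines: List[str] = []
    for src in sources:
        if src.error is None:
            kinds = sorted(set(src.tools_used))  # count distinct tool types
            lines.append(f"[ok] {src.name}: {len(kinds)} tool types used")
            lines.append("  Tools: " + ", ".join(kinds))
        else:
            lines.append(f"[failed] {src.name}: Connection failed")
            lines.append(f"  {src.name} not available: {src.error}")
    reachable = [src.name for src in sources if src.error is None]
    if reachable and len(reachable) < len(sources):
        lines.append("[info] Report generated using only: " + ", ".join(reachable))
    return lines
```

Treating an unreachable source as data to report, rather than as a fatal error, is what lets the report still be produced from Kubernetes data alone.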
## Report Structure
Reports include:
1. **Executive Summary**: Overall health status (🟢🟡🔴)
2. **Top Issues**: 3-5 critical problems with severity levels
3. **Resource Analysis**: Over/under-provisioned workloads
4. **Prometheus Metrics**: CPU, memory, disk usage (when available)
5. **Action Plan**: Prioritized, actionable recommendations
6. **Footer**: Generated by Watchdog AI - Helmcode
The PDF report is accompanied by a Slack message showing:
- Report generation time
- Data sources used (Kubernetes API, Prometheus)
- Tool usage statistics
- Connection status for each service
## Development
```bash
# Install dependencies
pip install -e ".[dev]"
# Run locally (requires kubeconfig)
python -m src.main
# Format and lint code
black src/
ruff check src/
# Type check
mypy src/
# Build Docker image
docker build -t k8s-watchdog-ai:latest .
```
## Security
- **Read-only access**: All operations are read-only (get, list, watch, describe, logs)
- **RBAC**: Minimal permissions required in Kubernetes
- **No cluster modifications**: Agent cannot modify cluster state
- **Secrets management**: Kubernetes secrets for sensitive data
- **Connection errors**: Gracefully handles unavailable services
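
The read-only claim can be checked mechanically against a role definition. In Kubernetes RBAC the read-only verbs are `get`, `list`, and `watch`; `describe` is composed client-side from those calls, and logs are read with `get` on the `pods/log` subresource. The sketch below flags any other verb. `EXAMPLE_RULES` is a hypothetical rule set in the shape of a ClusterRole `rules` list, not the chart's actual role.

```python
from typing import Any, Dict, List

READ_ONLY_VERBS = frozenset({"get", "list", "watch"})

# Hypothetical rules in the shape of a ClusterRole's `rules` list.
EXAMPLE_RULES: List[Dict[str, Any]] = [
    {"apiGroups": [""], "resources": ["pods", "pods/log", "nodes", "events"],
     "verbs": ["get", "list", "watch"]},
    {"apiGroups": ["apps"], "resources": ["deployments"],
     "verbs": ["get", "list", "watch"]},
]


def write_verbs(rules: List[Dict[str, Any]]) -> List[str]:
    """Return the sorted verbs that are not read-only (the wildcard '*' included)."""
    found = set()
    for rule in rules:
        for verb in rule.get("verbs", []):
            if verb not in READ_ONLY_VERBS:
                found.add(verb)
    return sorted(found)
```

An empty result means every rule grants read access only, so a check like this could run in CI against the rendered role before deployment.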
## API Endpoints
- `POST /report` - Generate and send report immediately (returns 202 Accepted)
- `GET /health` - Health check endpoint
- `GET /stats` - Report generation statistics
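
For callers that prefer Python over `curl`, a minimal client sketch using only the standard library might look like this. The function names are hypothetical, the base URL is whatever address the service is exposed on, and response bodies are returned as raw text because their format is not specified here.

```python
import urllib.request


def build_report_request(base_url: str) -> urllib.request.Request:
    """Build the POST /report request without sending it."""
    url = base_url.rstrip("/") + "/report"
    return urllib.request.Request(url, data=b"", method="POST")


def trigger_report(base_url: str, timeout: float = 10.0) -> bool:
    """Send POST /report; True when the server answers 202 Accepted."""
    with urllib.request.urlopen(build_report_request(base_url), timeout=timeout) as resp:
        return resp.status == 202


def fetch_text(base_url: str, path: str, timeout: float = 10.0) -> str:
    """GET an endpoint such as /health or /stats and return the body as text."""
    url = base_url.rstrip("/") + path
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

Because `POST /report` returns 202 Accepted, a successful call only confirms that generation was accepted; the report itself arrives later in Slack.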
## Documentation
- [CLAUDE.md](CLAUDE.md) - Detailed technical documentation for AI assistants
- [Architecture Overview](#architecture) - System design
- [API Documentation](#api-endpoints) - REST endpoints
## Contributing
Contributions are welcome! Please:
1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Test thoroughly
5. Submit a pull request
## License
MIT License - see [LICENSE](LICENSE) for details.
## Acknowledgments
- [Anthropic Claude](https://www.anthropic.com/claude) - AI engine
- [FastAPI](https://fastapi.tiangolo.com/) - Web framework
- [WeasyPrint](https://weasyprint.org/) - PDF generation
- [Kubernetes Python Client](https://github.com/kubernetes-client/python) - K8s integration
---
**Made with ❤️ by [Helmcode](https://helmcode.com)**