{"id":50595821,"url":"https://github.com/tonytech83/llm-init-trblsh","last_synced_at":"2026-06-05T14:01:28.916Z","repository":{"id":355842814,"uuid":"1187303752","full_name":"tonytech83/llm-init-trblsh","owner":"tonytech83","description":"Automated initial troubleshooting of failed Linux systemd services using LLM analysis","archived":false,"fork":false,"pushed_at":"2026-05-05T13:10:05.000Z","size":203,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-05-05T15:04:42.206Z","etag":null,"topics":["fastapi","mpc","python"],"latest_commit_sha":null,"homepage":"","language":"HTML","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/tonytech83.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-03-20T15:15:04.000Z","updated_at":"2026-05-05T13:00:40.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/tonytech83/llm-init-trblsh","commit_stats":null,"previous_names":["tonytech83/llm-init-trblsh"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/tonytech83/llm-init-trblsh","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tonytech83%2Fllm-init-trblsh","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tonytech83%2Fllm-init-trblsh/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tonytech83%2Fllm-init-trblsh/releases","manifests_ur
l":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tonytech83%2Fllm-init-trblsh/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/tonytech83","download_url":"https://codeload.github.com/tonytech83/llm-init-trblsh/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tonytech83%2Fllm-init-trblsh/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33944671,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-05T02:00:06.157Z","response_time":120,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["fastapi","mpc","python"],"created_at":"2026-06-05T14:01:27.782Z","updated_at":"2026-06-05T14:01:28.910Z","avatar_url":"https://github.com/tonytech83.png","language":"HTML","funding_links":[],"categories":[],"sub_categories":[],"readme":"# LLM Initial Troubleshooting\n\nA proof-of-concept system for **automated initial troubleshooting of failed Linux systemd services** using LLM analysis. When a service fails on a monitored host, the pipeline automatically collects logs, analyzes them with a local LLM, and delivers a structured diagnosis with investigation steps — all without human intervention.\n\n## Architecture\n\n![Architecture](docs/architecture.png)\n\n## Pipeline\n\n```\nAlloy → Loki → Alertmanager → Agent (MCP Client) → MCP Server\n```\n\n1. 
**Alloy** collects systemd journal entries (error level and above) from the target host and ships them to Loki.\n2. **Loki** aggregates logs and evaluates alert rules. When critical log activity is detected, it fires an alert to Alertmanager.\n3. **Alertmanager** groups alerts by hostname and routes them via webhook to the Agent.\n4. **Agent** (MCP Client) receives the webhook, creates an incident record, and invokes MCP Server tools to gather forensic data via SSH.\n5. **MCP Server** SSHes into the target host, retrieves failed services (`systemctl`) and their logs (`journalctl`).\n6. The Agent sends the collected logs to a local **Ollama** LLM instance for analysis.\n7. The analysis (root cause, investigation steps, possible causes) is stored in SQLite and a **Telegram** notification is sent with a link to the incident UI.\n\n## Infrastructure\n\nThree Vagrant VMs on a private network (`192.168.56.0/24`), provisioned automatically with shell scripts:\n\n| VM             | Hostname                   | IP            | Resources    | Services                         |\n| -------------- | -------------------------- | ------------- | ------------ | -------------------------------- |\n| target         | target.concept.lab         | 192.168.56.13 | 1 CPU / 1 GB | Alloy, dummy-fail.service        |\n| monitor        | monitor.concept.lab        | 192.168.56.14 | 2 CPU / 2 GB | Loki, Alertmanager, Grafana      |\n| troubleshooter | troubleshooter.concept.lab | 192.168.56.15 | 2 CPU / 4 GB | Agent, MCP Server, Docker, Gitea |\n\n## Components\n\n### Alloy (target VM — port 12345)\n\nGrafana Alloy reads the systemd journal and filters log entries at `err`, `crit`, `alert`, and `emerg` priority levels. 
Logs are labelled and pushed to Loki.\n\n### Loki (monitor VM — port 3100)\n\nStores logs and evaluates the `CriticalLogDetected` alert rule:\n\n```\nsum by (host, ip) (count_over_time({job=\"journald\", level=~\"err|crit|alert|emerg\"}[1m])) \u003e 0\n```\n\nRule is evaluated every 15 seconds. On match, an alert fires to Alertmanager.\n\n### Alertmanager (monitor VM — port 9093)\n\nGroups alerts by hostname with a 30-second group wait and routes them as HTTP webhooks to the Agent at `http://192.168.56.15:8080/alert`.\n\n### Grafana (monitor VM — port 3000)\n\nProvides a dashboard for manual log exploration using Loki as a datasource.\n\n### Agent + MCP Server (troubleshooter VM — port 8080)\n\nA FastAPI webhook receiver (MCP Client) paired with a FastMCP server. Together they collect forensic data from the target host via SSH, run LLM analysis through Ollama, persist incidents in SQLite, serve a web UI, and send Telegram notifications.\n\nSee [vagrant/trblsh/README.md](vagrant/trblsh/README.md) for full details: API endpoints, MCP tools, database schema, LLM prompt and output format, configuration reference, and known limitations.\n\n### Ollama (external)\n\nA local Ollama instance (default: `http://192.168.0.88:11434`) serves the `qwen2.5:3b` model. The address is configurable in the Agent's environment. Any Ollama-compatible model can be substituted.\n\n## Prerequisites\n\n- [Vagrant](https://www.vagrantup.com/) with [VirtualBox](https://www.virtualbox.org/)\n- Ollama running locally with a supported model pulled (e.g. `ollama pull qwen2.5:3b`)\n- A Telegram bot token and chat ID for notifications\n\n## Setup\n\n### 1. Clone and start VMs\n\n```bash\ngit clone \u003crepo-url\u003e\ncd llm-init-trblsh\nvagrant up\n```\n\nAll three VMs are provisioned automatically. This installs and configures Alloy, Loki, Alertmanager, Grafana, Docker, Gitea, and the Docker registry.\n\n### 2. 
Start the Troubleshooter FastAPI app\n\nSSH into the troubleshooter VM with `vagrant ssh troubleshooter`, then run:\n\n```sh\ncd /vagrant/trblsh\n\nsource .venv/bin/activate\n\nuvicorn app.main:app --host 0.0.0.0 --port 8080\n```\n\n## Testing\n\nSSH into the target VM with `vagrant ssh target`, then install and start the dummy failing service:\n\n```sh\ncd /vagrant/target\n\nsudo cp dummy-fail.service /etc/systemd/system\n\nsudo systemctl daemon-reload\nsudo systemctl enable dummy-fail.service\nsudo systemctl start dummy-fail.service\n```\n\nA helper script is also provided on the target VM to trigger a test failure event:\n\n```sh\n./create_event.sh\n```\n\nThis starts `dummy-fail.service`, which immediately fails and produces error-level journal entries. Within ~1 minute the alert fires and the Agent processes the incident. Check the UI or your Telegram chat for the result.\n\n## Project Structure\n\n```\n.\n├── Vagrantfile                        # VM definitions (3 VMs)\n├── scripts/                           # Provisioning shell scripts\n│   ├── add-hosts.sh\n│   ├── install-and-setup-alloy.sh\n│   ├── install-and-setup-loki.sh\n│   ├── install-and-setup-alertmanager.sh\n│   ├── install-and-setup-grafana.sh\n│   ├── install-docker.sh\n│   ├── install-docker-registry.sh\n│   ├── install-gitea.sh\n│   ├── set-and-copy-ssh-key.sh\n│   └── setup-gitea.sh\n└── vagrant/\n    ├── gitea/\n    │   └── gitea-compose.yaml         # Gitea + MySQL Docker Compose\n    ├── target/\n    │   ├── dummy-fail.service         # Systemd service that always fails (test)\n    │   └── create_event.sh            # Script to trigger a test failure event\n    └── trblsh/\n        ├── agent.py                   # MCP Client + FastAPI webhook handler\n        ├── server.py                  # MCP Server (SSH tools)\n        ├── requirements.txt\n        ├── .env.example\n        ├── ignore_list.txt            # Services to exclude from analysis\n        └── templates/\n            ├── home.html              # Incident list UI\n            └── alert.html             # Incident 
detail UI\n```\n\n## Configuration Reference\n\n| Setting               | Location             | Default | Description                     |\n| --------------------- | -------------------- | ------- | ------------------------------- |\n| Alert group wait      | `alertmanager.yml`   | 30s     | Delay before first alert fires  |\n| Alert repeat interval | `alertmanager.yml`   | 2m      | Interval for repeat alerts      |\n| Loki rule evaluation  | `journald-logs.yaml` | 15s     | How often alert rule is checked |\n\nFor Agent and MCP Server configuration (Ollama URL, model, SSH key, Telegram credentials, ignore list) see [vagrant/trblsh/README.md](vagrant/trblsh/README.md).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftonytech83%2Fllm-init-trblsh","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftonytech83%2Fllm-init-trblsh","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftonytech83%2Fllm-init-trblsh/lists"}