{"id":51312393,"url":"https://github.com/pgedge/pg-healthcheck","last_synced_at":"2026-07-01T05:02:04.796Z","repository":{"id":365253749,"uuid":"1230615281","full_name":"pgEdge/pg-healthcheck","owner":"pgEdge","description":"pgEdge Labs: Enterprise-grade PostgreSQL health diagnostics for single instances and pgEdge Spock clusters","archived":false,"fork":false,"pushed_at":"2026-06-16T13:54:27.000Z","size":20996,"stargazers_count":1,"open_issues_count":2,"forks_count":1,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-06-16T15:27:13.751Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/pgEdge.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-05-06T06:54:20.000Z","updated_at":"2026-05-31T14:11:12.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/pgEdge/pg-healthcheck","commit_stats":null,"previous_names":["pgedge/pg-healthcheck"],"tags_count":2,"template":false,"template_full_name":null,"purl":"pkg:github/pgEdge/pg-healthcheck","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pgEdge%2Fpg-healthcheck","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pgEdge%2Fpg-healthcheck/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pgEdge%2Fpg-healthcheck/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/Git
Hub/repositories/pgEdge%2Fpg-healthcheck/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/pgEdge","download_url":"https://codeload.github.com/pgEdge/pg-healthcheck/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pgEdge%2Fpg-healthcheck/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34993438,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-07-01T02:00:05.325Z","response_time":130,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2026-07-01T05:02:03.826Z","updated_at":"2026-07-01T05:02:04.788Z","avatar_url":"https://github.com/pgEdge.png","language":"Go","funding_links":[],"categories":[],"sub_categories":[],"readme":"# pg-healthcheck\n\n\u003e Enterprise-grade PostgreSQL health diagnostics for single instances and pgEdge multi-node Spock clusters.\n\n![CI](https://github.com/ahsanhadi/pg-healthcheck/actions/workflows/ci.yml/badge.svg)\n![Go](https://img.shields.io/badge/Go-1.25+-00ADD8?logo=go\u0026logoColor=white)\n![PostgreSQL](https://img.shields.io/badge/PostgreSQL-13+-336791?logo=postgresql\u0026logoColor=white)\n![License](https://img.shields.io/badge/license-PostgreSQL-blue)\n![Platforms](https://img.shields.io/badge/platforms-linux%20%7C%20macOS%20%7C%20windows-lightgrey)\n\nRuns **180+ checks across 14 groups** and 
queries live PostgreSQL system catalog views — no estimates, no simulated data. Output is coloured terminal text or structured JSON for GUI/API consumption.\n\n---\n\n## How the App Works — Architecture\n\nIf you are new to Go, this section explains how all the pieces fit together before you start reading any code.\n\n```\n┌─────────────────────────────────────────────────────────────────┐\n│  You run:  ./pg-healthcheck --host db1 --dbname mydb             │\n└───────────────────────────┬─────────────────────────────────────┘\n                            │\n                    ┌───────▼────────┐\n                    │   main.go      │  ← Entry point. Parses flags\n                    │   (CLI layer)  │    and kicks everything off.\n                    └───────┬────────┘\n                            │\n               ┌────────────▼────────────┐\n               │     config.go           │  ← Loads healthcheck.yaml +\n               │  (Configuration layer)  │    applies CLI flags on top.\n               └────────────┬────────────┘\n                            │\n          ┌─────────────────▼──────────────────┐\n          │          connector/pg.go            │  ← Opens a PostgreSQL\n          │       (Database connection)         │    connection pool.\n          └─────────────────┬──────────────────┘\n                            │\n          ┌─────────────────▼──────────────────┐\n          │          Check Groups               │  ← The core work happens\n          │   G01 → G14  (14 groups, 100+ checks)   here. 
Each group is an\n          │         checks/*.go                 │    independent Go file.\n          └─────────────────┬──────────────────┘\n                            │  (list of Findings)\n          ┌─────────────────▼──────────────────┐\n          │         report/reporter.go          │  ← Formats and prints the\n          │         (Output layer)              │    results as text or JSON.\n          └────────────────────────────────────┘\n```\n\n### The five layers explained simply\n\n**1. CLI layer (`cmd/pg-healthcheck/main.go`)**\nThis is where `main()` lives. It uses a library called Cobra to define all the `--flags` you pass on the command line. Once the flags are parsed, it calls `run()` which orchestrates everything else. Think of this as the \"front door\" of the application.\n\n**2. Configuration layer (`internal/config/config.go`)**\nThere is one central `Config` struct that holds every threshold the tool uses — things like \"warn when WAL dir \u003e 20 GB\" or \"critical when TLS cert expires within 7 days\". Defaults are hardcoded. You can override any value with `healthcheck.yaml`. CLI flags override the YAML. The final Config is handed to every check.\n\n**3. Database connection (`internal/connector/pg.go`)**\nOpens a PostgreSQL connection pool using the `pgx` library. In cluster mode, it opens one pool per node. The pool is passed to every check — checks never open their own connections.\n\n**4. Check groups (`internal/checks/`)**\nThis is where all the actual work happens. Each group (G01, G02, … G14) is one Go file. They all follow the same simple pattern:\n\n```\n┌──────────────────────────────────────────────────────────┐\n│  Every check group implements the Checker interface:      │\n│                                                           │\n│  type Checker interface {                                 │\n│      GroupID() string          ← e.g. \"G05\"              │\n│      Name()    string          ← e.g. 
\"Vacuum \u0026 Bloat\"   │\n│      Run(ctx, db, cfg)         ← runs the SQL queries     │\n│          ([]Finding, error)       and returns results     │\n│  }                                                        │\n└──────────────────────────────────────────────────────────┘\n```\n\nEach check returns one or more **Findings**. A Finding is a simple struct:\n\n```\nFinding {\n    CheckID     \"G05-001\"                   ← unique check ID\n    Group       \"Vacuum \u0026 Bloat\"            ← which group\n    Severity    OK / INFO / WARN / CRITICAL ← how bad is it?\n    Title       \"TXID wraparound\"           ← what was checked\n    Observed    \"450M transactions left\"    ← what was found\n    Recommended \"Run VACUUM FREEZE\"         ← what to do\n    Detail      \"...\"                       ← extra context\n    DocURL      \"https://postgresql.org/…\"  ← link to docs\n    NodeName    \"node1:5432\"                ← (cluster mode only)\n}\n```\n\n**5. Output layer (`internal/report/reporter.go`)**\nTakes the full list of Findings and either:\n- Prints coloured terminal text grouped by check group, sorted by severity\n- Writes a single JSON object to stdout (for GUI / API consumption)\n\nAlso responsible for composite alerts — if two related groups are simultaneously CRITICAL (e.g. G02 archiving + G14 WAL growth), it prints a combined banner explaining the combined danger.\n\n---\n\n### How a single check actually works\n\nHere is a simplified real example from G13 (`g13_os_resources.go`):\n\n```go\n// Check if the background writer had to stop mid-scan\nfunc g13MaxwrittenClean(ctx context.Context, db *pgxpool.Pool) []Finding {\n\n    // 1. Run a SQL query against a PostgreSQL catalog view\n    var count int64\n    db.QueryRow(ctx, \"SELECT maxwritten_clean FROM pg_stat_bgwriter\").Scan(\u0026count)\n\n    // 2. Compare the result against a threshold\n    if count \u003e 0 {\n        // 3a. 
Return a WARN Finding with a recommendation\n        return []Finding{NewWarn(\"G13-003\", g13, \"maxwritten_clean\",\n            fmt.Sprintf(\"maxwritten_clean = %d\", count),\n            \"Increase bgwriter_lru_maxpages or shared_buffers.\",\n            \"bgwriter is stopping cleaning passes mid-scan.\",\n            \"https://postgresql.org/docs/...\")}\n    }\n\n    // 3b. Return an OK Finding\n    return []Finding{NewOK(\"G13-003\", g13, \"maxwritten_clean\",\n        fmt.Sprintf(\"maxwritten_clean = %d\", count),\n        \"https://postgresql.org/docs/...\")}\n}\n```\n\nEvery single check in the codebase follows this exact pattern. If you want to add a new check, you just add a function like this to the right group file and call it from that group's `Run()` method.\n\n---\n\n### Severity levels\n\n| Level | Meaning |\n|---|---|\n| ✓ **OK** | Everything is fine |\n| ⓘ **INFO** | Advisory — good to know, no action urgently needed |\n| ⚠ **WARN** | Should be fixed before the next incident window |\n| ✗ **CRITICAL** | Requires immediate attention |\n\nExit codes: `0` = all OK · `1` = any WARN · `2` = any CRITICAL\n\n---\n\n### WAL health — the three-group picture\n\nThree groups together cover the full WAL lifecycle:\n\n```\n  WAL is written here          WAL sits on disk here       WAL leaves the server\n  ┌──────────────────┐         ┌────────────────────┐      ┌───────────────────┐\n  │ G14 WAL Growth   │ ──────► │  pg_wal directory  │ ───► │  G02 pgBackRest   │\n  │ (generation rate)│         │  G09 slot retention│      │  (archive pipeline│\n  └──────────────────┘         └────────────────────┘      └───────────────────┘\n```\n\n- **G02** — monitors whether WAL is leaving the server successfully (archiving)\n- **G09** — monitors whether WAL is being held back by inactive replication slots\n- **G14** — monitors WAL as a raw resource: how fast it is produced and how much disk it occupies\n\nIf two of these are simultaneously CRITICAL, the reporter prints a 
**composite alert** explaining the combined danger.\n\n---\n\n## Quick Start\n\n### Download a release (no Go required)\n\nPre-built binaries for Linux, macOS, and Windows are available on the [Releases page](https://github.com/ahsanhadi/pg-healthcheck/releases).\n\n```bash\n# macOS (Apple Silicon)\ncurl -L https://github.com/ahsanhadi/pg-healthcheck/releases/latest/download/pg-healthcheck_macOS_arm64.tar.gz | tar xz\n\n# macOS (Intel)\ncurl -L https://github.com/ahsanhadi/pg-healthcheck/releases/latest/download/pg-healthcheck_macOS_amd64.tar.gz | tar xz\n\n# Linux (amd64)\ncurl -L https://github.com/ahsanhadi/pg-healthcheck/releases/latest/download/pg-healthcheck_linux_amd64.tar.gz | tar xz\n```\n\nEach archive includes the binary, `LICENSE`, `README.md`, and a ready-to-edit `healthcheck.yaml`.\n\n### Build from source\n\n```bash\ngit clone https://github.com/ahsanhadi/pg-healthcheck.git\ncd pg-healthcheck\ngo build -o pg-healthcheck ./cmd/pg-healthcheck/\n```\n\nRequires Go 1.25+. Install with `brew install go` on macOS or `apt install golang-go` on Ubuntu.\n\n### Run against a local database\n\n```bash\n./pg-healthcheck --host localhost --port 5432 --dbname mydb --user postgres\n```\n\n### Password — use an environment variable\n\n```bash\nPGPASSWORD=secret ./pg-healthcheck --host db1 --dbname prod --user postgres\n```\n\n### Run only specific groups\n\n```bash\n./pg-healthcheck --groups G01,G05,G09,G14\n```\n\n### Show all checks including OK ones\n\n```bash\n./pg-healthcheck --verbose\n```\n\n### JSON output (for GUI or scripting)\n\n```bash\n./pg-healthcheck --output json | jq '.summary'\n./pg-healthcheck --output json \u003e report.json\n```\n\n### Cluster mode (pgEdge / Spock)\n\n```bash\n./pg-healthcheck \\\n  --mode cluster \\\n  --nodes node1:5432,node2:5432,node3:5432 \\\n  --dbname mydb --user postgres\n```\n\n\u003e **Note:** If two entries in `--nodes` resolve to the same database (e.g. 
during testing\n\u003e with a single node), duplicate findings are automatically suppressed — each check\n\u003e appears exactly once per unique node.\n\n### Upgrade readiness check\n\n```bash\n./pg-healthcheck --groups G10 --target-version 17\n```\n\n---\n\n## Natural Language Interface — `ask`\n\nThe `ask` subcommand lets you describe what you want to check in plain English.\nIt maps your query to the right check group(s) and runs only those.\n\n```bash\npg-healthcheck ask \"check for TOAST table corruption\" --host db1 --dbname mydb\npg-healthcheck ask \"how is WAL disk usage?\" --host db1 --output json\npg-healthcheck ask \"are there any lock contention issues?\" --verbose\n```\n\nUnder the hood `ask` uses an LLM to understand your query.\nThree providers are supported:\n\n| Provider | When to use |\n|---|---|\n| `ollama` | Default. Local model, no internet, no API key — ideal for air-gapped servers |\n| `openai` | OpenAI GPT-4o / GPT-4o-mini, or any OpenAI-compatible endpoint (Azure, Groq, Together AI, …) |\n| `gemini` | Google Gemini — fastest and cheapest cloud option for this task |\n\nIf the LLM provider is unreachable or no API key is found, `ask` falls back to\nbuilt-in keyword matching automatically — no error, no crash.\n\n---\n\n### Provider 1 — Ollama (air-gapped, no API key)\n\n[Ollama](https://ollama.com) runs models locally. 
Nothing leaves your machine.\n\n**Install Ollama and pull a model:**\n\n```bash\n# Install (Linux)\ncurl -fsSL https://ollama.com/install.sh | sh\n\n# macOS\nbrew install ollama\n\n# Pull a model — llama3.2 is small (2 GB) and works well\nollama pull llama3.2\n\n# Or use a larger model for better accuracy\nollama pull mistral\nollama pull llama3.1:8b\n```\n\n**Run `ask` — Ollama is the default, no flags needed:**\n\n```bash\npg-healthcheck ask \"check for dead tuples and bloat\" --host db1 --dbname mydb\n```\n\n**Specify a different local model:**\n\n```bash\npg-healthcheck ask \"WAL disk usage?\" --ollama-model mistral --host db1\n```\n\n**Custom Ollama host (remote server or different port):**\n\n```bash\npg-healthcheck ask \"any replication lag?\" --ollama-host http://10.0.0.50:11434 --host db1\n```\n\n**Set Ollama as default in `healthcheck.yaml`:**\n\n```yaml\nllm_provider:   ollama\nollama_host:    http://localhost:11434\nollama_model:   llama3.2\n```\n\n---\n\n### Provider 2 — OpenAI (GPT-4o, GPT-4o-mini)\n\nRequires an API key from [platform.openai.com](https://platform.openai.com).\n\n**Pass the key on the command line:**\n\n```bash\npg-healthcheck ask \"check index health\" \\\n  --provider openai \\\n  --api-key sk-proj-... \\\n  --host db1 --dbname mydb\n```\n\n**Pass the key via environment variable (recommended — keeps it out of shell history):**\n\n```bash\nexport OPENAI_API_KEY=sk-proj-...\npg-healthcheck ask \"any security issues?\" --provider openai --host db1\n```\n\n**Set OpenAI as default in `healthcheck.yaml`:**\n\n```yaml\nllm_provider: openai\nllm_api_key:  \"\"               # leave empty and use OPENAI_API_KEY env var\nollama_model: gpt-4o-mini      # or gpt-4o, gpt-4-turbo, etc.\n```\n\n**OpenAI-compatible endpoints (Azure OpenAI, Groq, Together AI, Anyscale, …):**\n\nThese APIs use the same request format as OpenAI. 
Point `--ollama-host` at the\nbase URL of the compatible endpoint:\n\n```bash\n# Groq (free tier, fast)\nexport OPENAI_API_KEY=gsk_...\npg-healthcheck ask \"check vacuum health\" \\\n  --provider openai \\\n  --ollama-host https://api.groq.com/openai \\\n  --ollama-model llama-3.3-70b-versatile \\\n  --host db1\n\n# Azure OpenAI\npg-healthcheck ask \"security audit\" \\\n  --provider openai \\\n  --ollama-host https://mycompany.openai.azure.com \\\n  --api-key \u003cazure-key\u003e \\\n  --host db1\n```\n\n---\n\n### Provider 3 — Google Gemini\n\nRequires an API key from [Google AI Studio](https://aistudio.google.com/apikey) — free tier available.\n\n**Pass the key on the command line:**\n\n```bash\npg-healthcheck ask \"check for TOAST corruption\" \\\n  --provider gemini \\\n  --api-key AIza... \\\n  --host db1 --dbname mydb\n```\n\n**Pass the key via environment variable:**\n\n```bash\nexport GEMINI_API_KEY=AIza...\npg-healthcheck ask \"WAL growth rate?\" --provider gemini --host db1\n```\n\n**Set Gemini as default in `healthcheck.yaml`:**\n\n```yaml\nllm_provider: gemini\nllm_api_key:  \"\"              # leave empty and use GEMINI_API_KEY env var\nollama_model: gemini-2.0-flash   # or gemini-2.5-flash, gemini-2.5-pro, etc.\n```\n\n**Available Gemini models** (`ollama pull` is not needed; they're cloud models):\n\n| Model | Notes |\n|---|---|\n| `gemini-2.0-flash` | Default. 
Fast, cheap, accurate for this task |\n| `gemini-2.5-flash` | Latest flash, slightly better reasoning |\n| `gemini-2.5-pro` | Highest accuracy, higher cost |\n\n---\n\n### How the query mapping works\n\n```\nYour query: \"check for TOAST table corruption and data checksum issues\"\n                              │\n                    ┌─────────▼──────────┐\n                    │   LLM Provider     │  ← Ollama / OpenAI / Gemini\n                    │  (or keyword match)│\n                    └─────────┬──────────┘\n                              │  Returns: \"G07\"\n                    ┌─────────▼──────────┐\n                    │  selectCheckers()  │  ← same as --groups G07\n                    └─────────┬──────────┘\n                              │\n                    ┌─────────▼──────────┐\n                    │   G07 checks run   │  ← normal health-check output\n                    └────────────────────┘\n```\n\nThe LLM is shown a list of all 15 check groups with their scope and asked to return\nmatching group IDs (`G07`, `G14`, …). The response is parsed with a regex — the tool\nis not affected by the LLM adding explanations or punctuation.\n\n**Fallback chain:**\n\n```\n1. Configured LLM provider    ← tried first\n2. 
Built-in keyword matching  ← used automatically if LLM fails or key is missing\n```\n\nThe status line tells you which was used:\n\n```\nAnalyzing query (provider: gemini)...\nProvider: gemini/gemini-2.0-flash          ← LLM was used\nMatched groups: G07 (TOAST \u0026 Corruption)\n\n# or:\n\nNote: LLM unavailable — using keyword matching   ← fallback\nMatched groups: G07 (TOAST \u0026 Corruption)\n```\n\n---\n\n### What queries work well\n\nThe LLM understands natural language so you are not limited to exact keywords.\nThese all work:\n\n```bash\nask \"is the database healthy?\"\nask \"check everything related to WAL\"\nask \"do we have any bloat or dead row issues?\"\nask \"are there slow queries or missing indexes?\"\nask \"how safe are we from transaction ID wraparound?\"\nask \"is replication keeping up?\"\nask \"check all security-related settings\"\nask \"what about upgrade readiness for PostgreSQL 17?\"\nask \"run the OS resource checks\"\n```\n\n**Multi-group queries** — the LLM returns multiple groups when appropriate:\n\n```bash\nask \"check WAL health and replication lag\"\n# → G09 (WAL \u0026 Replication Slots), G14 (WAL Growth), G15 (Replication Health)\n\nask \"check for corruption, bloat, and index issues\"\n# → G05 (Vacuum \u0026 Bloat), G06 (Indexes), G07 (TOAST \u0026 Corruption)\n```\n\n---\n\n### `ask` flags reference\n\n| Flag | Default | Description |\n|---|---|---|\n| `--provider` | `ollama` | LLM backend: `ollama`, `openai`, or `gemini` |\n| `--api-key` | | API key for cloud providers (or use `OPENAI_API_KEY` / `GEMINI_API_KEY`) |\n| `--ollama-host` | `http://localhost:11434` | Ollama URL or OpenAI-compatible base URL |\n| `--ollama-model` | `llama3.2` | Model name (auto-defaults to `gpt-4o-mini` / `gemini-2.0-flash` for cloud) |\n| `--ollama-timeout` | `30` | LLM request timeout in seconds before falling back to keyword matching |\n\nAll standard connection flags (`--host`, `--port`, `--dbname`, `--user`, `--password`,\n`--output`, 
`--verbose`, `--no-color`, `--mode`, etc.) work with `ask` exactly as they\ndo with the main command.\n\n---\n\n### `healthcheck.yaml` — NLP configuration\n\n```yaml\n# ── Natural Language Interface (ask subcommand) ───────────────────────────────\n#\n# Provider options:\n#   ollama  — local Ollama server (air-gapped, no API key needed)\n#   openai  — OpenAI or any OpenAI-compatible API (Azure, Groq, etc.)\n#   gemini  — Google Gemini\n#\n# If the provider fails or no key is found, keyword matching is used automatically.\n\nllm_provider:  ollama         # ollama | openai | gemini\nllm_api_key:   \"\"             # leave empty to read from OPENAI_API_KEY or GEMINI_API_KEY\n\nollama_host:   http://localhost:11434   # Ollama URL (also used as OpenAI base URL)\nollama_model:  llama3.2                 # auto-defaults to gpt-4o-mini / gemini-2.0-flash for cloud\nollama_timeout_seconds: 30              # seconds before falling back to keyword matching\n```\n\n---\n\n## All Flags\n\n| Flag | Default | Description |\n|---|---|---|\n| `--host` | `localhost` | PostgreSQL host |\n| `--port` | `5432` | PostgreSQL port |\n| `--dbname` | `postgres` | Database name |\n| `--user` | `postgres` | Role name (or `PGUSER` env) |\n| `--password` | `` | Password (prefer `PGPASSWORD` env) |\n| `--mode` | `single` | `single` or `cluster` |\n| `--nodes` | | Comma-separated `host:port` list (cluster mode) |\n| `--config` | | Path to YAML config file |\n| `--output` | `text` | `text` or `json` |\n| `--groups` | all | Comma-separated group IDs, e.g. 
`G01,G05,G14` |\n| `--target-version` | `0` | Target PG major version for G10 upgrade checks |\n| `--backrest-config` | | Path to `pgbackrest.conf` |\n| `--no-color` | false | Disable terminal colour |\n| `--verbose` | false | Show OK findings (hidden by default) |\n\n**`ask` subcommand flags** (all standard flags above also apply):\n\n| Flag | Default | Description |\n|---|---|---|\n| `--provider` | `ollama` | LLM backend: `ollama`, `openai`, or `gemini` |\n| `--api-key` | | API key for cloud providers (or `OPENAI_API_KEY` / `GEMINI_API_KEY` env) |\n| `--ollama-host` | `http://localhost:11434` | Ollama URL or OpenAI-compatible base URL |\n| `--ollama-model` | `llama3.2` | Model name (auto-defaults for cloud providers) |\n| `--ollama-timeout` | `30` | LLM timeout in seconds before keyword fallback |\n\n---\n\n## Configuration File (`healthcheck.yaml`)\n\n### How configuration loading works\n\nThresholds are applied in this order — later steps always win:\n\n```\n1. Built-in defaults (safe baselines hardcoded in config.go)\n        ↓\n2. healthcheck.yaml  (your environment-specific overrides)\n        ↓\n3. CLI flags         (one-off overrides for a single run)\n```\n\nYou never have to edit the file. But tuning it is how you make the tool fit your environment rather than the defaults.\n\n### Where to put the file\n\nThe tool looks for `healthcheck.yaml` in the **current working directory** automatically. To use a different path pass `--config`:\n\n```bash\n./pg-healthcheck --config /etc/pg-healthcheck/prod.yaml\n```\n\nA common pattern is one file per environment:\n\n```\n/etc/pg-healthcheck/\n    prod.yaml\n    staging.yaml\n    dev.yaml\n```\n\n### YAML editing rules\n\n- Use **2 spaces** for indentation — no tabs\n- You only need to include the keys you want to change. 
Omitted keys keep their built-in default\n- Lists can be written inline `[\"a\", \"b\"]` or as block items:\n  ```yaml\n  cross_node_tables:\n    - public.orders\n    - public.users\n  ```\n- Numbers are plain integers or decimals — no quotes needed\n- Comments start with `#`\n\nIf the file has a syntax error the tool will print a warning and fall back to built-in defaults:\n```\nconfig warning: parsing config prod.yaml: yaml: line 12: ...\n```\n\n### Test your file before deploying\n\n```bash\n./pg-healthcheck --config /etc/pg-healthcheck/prod.yaml --groups G01 --verbose\n```\n\n---\n\n### All configuration keys explained\n\n#### Connection (G01)\n\n```yaml\nconnection_timeout_ms:    5000   # milliseconds to wait for a TCP connection\npg_isready_warn_ms:        500   # WARN if SELECT 1 round-trip takes longer than this\nwarn_connections_pct:       75   # WARN when active connections exceed 75% of max_connections\ncritical_connections_pct:   90   # CRITICAL when active connections exceed 90% of max_connections\nidle_in_tx_warn_seconds:    30   # WARN on sessions sitting idle-in-transaction longer than this\n```\n\n\u003e **Tip:** On a busy OLTP server with a connection pooler (PgBouncer), you can safely raise\n\u003e `warn_connections_pct` to 85 since the pooler manages bursts.\n\n#### TLS certificates (G01)\n\n```yaml\nssl_cert_warn_days:      30   # WARN when the server TLS cert expires within 30 days\nssl_cert_critical_days:   7   # CRITICAL when the cert expires within 7 days\n```\n\n#### pgBackRest backup (G02)\n\n```yaml\nbackrest_config:   /etc/pgbackrest/pgbackrest.conf   # path to your pgbackrest.conf\nbackrest_stanza:   main                              # stanza name — run `pgbackrest info` to find yours\nbackup_max_age_hours:     26   # WARN if no successful backup in the last 26 hours\nmin_retention_full:        2   # WARN if fewer than 2 full backups exist\nwal_ready_warn_count:    100   # WARN if \u003e100 WAL files are waiting to be 
archived\nwal_ready_critical_count: 500  # CRITICAL if \u003e500 WAL files waiting\n```\n\n\u003e **Tip:** `backrest_stanza` is the most common thing to change. Check your pgbackrest.conf\n\u003e or run `pgbackrest info` — the stanza name appears at the top of the output.\n\n#### Queries \u0026 locks (G03, G04)\n\n```yaml\nlong_query_warn_seconds:     60   # WARN on queries running longer than 1 minute\nlong_query_critical_seconds: 300  # CRITICAL on queries running longer than 5 minutes\n```\n\n#### Vacuum \u0026 TXID wraparound (G05)\n\n```yaml\ntxid_wrap_warn_million:     500   # WARN when fewer than 500M transaction IDs remain\ntxid_wrap_critical_million: 200   # CRITICAL when fewer than 200M remain\n```\n\n\u003e **Tip:** Tighten these on high-write databases (lower the numbers). If you see frequent\n\u003e false positives on a read-heavy replica, you can safely raise them.\n\n#### WAL \u0026 replication slots (G09)\n\n```yaml\nreplication_lag_warn_bytes:     52428800    # WARN at  50 MB of replication lag\nreplication_lag_critical_bytes: 524288000   # CRITICAL at 500 MB\nwal_slot_retain_warn_gb:      5    # WARN when a slot is retaining \u003e 5 GB of WAL\nwal_slot_retain_critical_gb:  20   # CRITICAL when retaining \u003e 20 GB\n```\n\n\u003e **New in G09:** Checks G09-009 through G09-013 cover logical replication slot health —\n\u003e invalidated slots, missing workers, and subscription relation sync state. 
These fire\n\u003e automatically; no additional YAML configuration is required.\n\n#### pgEdge / Spock cluster (G12)\n\n```yaml\nspock_exception_log_warn_rows:   10000   # WARN if spock exception log has \u003e 10k rows\nspock_exception_log_crit_rows:  100000   # CRITICAL at 100k rows\nspock_resolutions_warn_rows:     50000   # WARN if resolutions table has \u003e 50k rows\nspock_old_exception_days:            7   # WARN on unresolved exceptions older than 7 days\n\ncross_node_count_threshold_pct: 1.0   # WARN if row counts differ by more than 1% between nodes\ncross_node_tables:                    # tables to sample for row-count parity (leave empty to skip)\n  - public.orders\n  - public.accounts\n```\n\n#### amcheck — B-tree structural verification (G07)\n\n```yaml\namcheck_table_list:          # tables to run structural B-tree checks on\n  - public.orders\n  - public.accounts\n```\n\n\u003e Leave as `[]` to skip amcheck entirely. Add your most critical indexed tables here.\n\u003e Requires the `amcheck` extension: `CREATE EXTENSION amcheck;`\n\n#### pg_visibility — VM integrity checks (G08)\n\n```yaml\npg_visibility_table_list:    # tables to run pg_check_visible() and pg_check_frozen() on\n  - public.orders\n  - public.accounts\n```\n\n\u003e Leave as `[]` to skip G08-006 entirely. 
Add your most critical tables here.\n\u003e Requires the `pg_visibility` extension: `CREATE EXTENSION pg_visibility;`\n\u003e G08-006 detects file-level visibility map / heap mismatches — the class of corruption\n\u003e where a page is marked ALL_FROZEN in the VM but still contains unfrozen tuples.\n\u003e This state cannot be detected by VACUUM and can persist silently across major version upgrades.\n\n#### WAL growth \u0026 generation rate (G14)\n\n```yaml\nwal_rate_warn_mb_s:            50    # WARN if WAL is generating faster than 50 MB/s\nwal_rate_critical_mb_s:       200    # CRITICAL at 200 MB/s\n\nwal_dir_warn_gb:               20    # WARN if the pg_wal directory exceeds 20 GB\nwal_dir_critical_gb:           50    # CRITICAL if it exceeds 50 GB\n\nwal_rate_baseline_multiplier:  3.0   # WARN if current rate is \u003e3× the rolling average\nwal_rate_baseline_samples:      12   # how many past samples to keep for the rolling average\n                                     # (12 samples × run frequency = your baseline window)\n\nwal_fpi_ratio_warn:            0.40  # WARN if full-page writes exceed 40% of all WAL records\n\nwal_filesystem_warn_pct:        60   # WARN if the pg_wal filesystem is \u003e60% full\nwal_filesystem_critical_pct:    80   # CRITICAL at \u003e80% — pg_wal exhaustion crashes PostgreSQL\n\nwal_rate_state_file: /var/lib/pg-healthcheck/wal_rate.json   # where to store the rolling baseline\n```\n\n\u003e **Important:** Change `wal_rate_state_file` from `/tmp/` to a persistent path like\n\u003e `/var/lib/pg-healthcheck/`. 
Files in `/tmp/` are cleared on reboot and the rolling\n\u003e baseline resets, giving false spike alerts on startup.\n\u003e\n\u003e Set `wal_dir_warn_gb` to roughly 40% of your actual pg_wal partition size, and\n\u003e `wal_dir_critical_gb` to 70%.\n\n#### Per-check timeout\n\n```yaml\ncheck_timeout_seconds: 10   # each individual check is cancelled after this many seconds\n```\n\n\u003e Increase to `30` if the tool is connecting over a slow network or the database is under\n\u003e heavy load and catalog queries are slow.\n\n---\n\n### Minimal example for a production server\n\nYou do not need to include every key — only what differs from the defaults:\n\n```yaml\n# /etc/pg-healthcheck/prod.yaml\n\n# Our backups run every 12 hours\nbackup_max_age_hours:        13\nbackrest_stanza:             prod-db\n\n# Tighter wraparound thresholds for our high-write workload\ntxid_wrap_warn_million:      300\ntxid_wrap_critical_million:  100\n\n# pg_wal lives on a 100 GB dedicated volume\nwal_dir_warn_gb:             40\nwal_dir_critical_gb:         70\nwal_filesystem_warn_pct:     50\nwal_filesystem_critical_pct: 70\n\n# Persistent baseline file\nwal_rate_state_file: /var/lib/pg-healthcheck/wal_rate.json\n\n# Slow network — give queries more time\ncheck_timeout_seconds: 30\n```\n\n---\n\n## Check Groups\n\n| Group | Name | Checks |\n|---|---|---|\n| G01 | Connection \u0026 Availability | 9 |\n| G02 | pgBackRest Backup | 14 |\n| G03 | Performance \u0026 Query Stats | 17 |\n| G04 | Locks \u0026 Blocking | 10 |\n| G05 | Vacuum \u0026 Bloat | 11 |\n| G06 | Indexes | 9 |\n| G07 | TOAST \u0026 Data Integrity | 9 |\n| G08 | Visibility Map | 6 |\n| G09 | WAL \u0026 Replication Slots | 13 |\n| G10 | Upgrade Readiness | 15 |\n| G11 | Security | 8 |\n| G12 | pgEdge / Spock Cluster | 20 |\n| G13 | OS \u0026 Resource-Level | 11 |\n| G14 | WAL Growth \u0026 Generation Rate | 14 |\n\n### G08 — Visibility Map\n\n| Check | What it detects |\n|---|---|\n| G08-001 | Tables with 
disproportionately high heap block reads relative to index scans |\n| G08-002 | `relallvisible \u003e relpages` in pg_class — stale or corrupted VM catalog statistics |\n| G08-003 | Post-crash visibility map advisory — recommends VACUUM after unclean shutdown |\n| G08-004 | pg_visibility extension installation status |\n| G08-005 | Tables with suspiciously low dead tuple counts despite high write activity |\n| G08-006 | **File-level VM/heap mismatches via pg_check_visible() and pg_check_frozen()** — detects pages the VM marks ALL_FROZEN that still contain unfrozen tuples; requires `pg_visibility` extension and `pg_visibility_table_list` in `healthcheck.yaml` |\n\n\u003e G08-006 catches the specific corruption class where `vacuumlazy.c` sets `VISIBILITYMAP_ALL_FROZEN`\n\u003e without `VISIBILITYMAP_ALL_VISIBLE` due to a race condition fixed in PG 10.4 (commit `e1d634758e4`).\n\u003e Corruption from older versions can persist silently through upgrades — only `pg_check_frozen()` surfaces it.\n\n### G09 — WAL \u0026 Replication Slots (recent additions)\n\n| Check | What it detects |\n|---|---|\n| G09-009 | Invalidated logical replication slots (PG 16+ marks slots invalid when WAL is gone) |\n| G09-010 | `max_slot_wal_keep_size` not set — slots can retain unlimited WAL and fill the disk |\n| G09-011 | Inactive logical replication slots older than 1 hour |\n| G09-012 | Logical replication worker status — workers not running for active subscriptions |\n| G09-013 | Subscription relation sync state — tables stuck in error or non-ready state |\n\n### G12 — pgEdge / Spock Cluster (recent additions)\n\n| Check | What it detects |\n|---|---|\n| G12-022 | Per-subscription conflict and DCA counters from `spock.channel_summary_stats` |\n| G12-023 | Replication LSN lag in MB between each node pair from `spock.progress` |\n\n\u003e All G12 Spock catalog queries have been verified against a live pgEdge Spock schema.\n\u003e Checks that reference tables or columns not present on the 
installed Spock version\n\u003e skip gracefully with an INFO message rather than erroring.\n\n### G13 — OS \u0026 Resource-Level (recent additions)\n\n| Check | What it detects |\n|---|---|\n| G13-008 | **Transparent Huge Pages** — warns when THP is set to `always`; causes unpredictable latency spikes in PostgreSQL due to background `khugepaged` compaction (Linux only) |\n| G13-009 | **CPU frequency governor** — warns when governor is `powersave` or `schedutil`, which throttles CPU frequency under load (Linux only; skips gracefully on cloud VMs without cpufreq) |\n| G13-010 | **Data directory disk space** — checks free space on the `data_directory` filesystem via `syscall.Statfs`; WARN at 80% used, CRITICAL at 90% (the data dir may be on a different mount than `pg_wal`, which is checked by G14-013) |\n| G13-011 | **Postmaster uptime** — queries `pg_postmaster_start_time()`; WARN if restarted within the last hour (possible crash/OOM kill), INFO if within 24 hours |\n\n\u003e **Note:** G13-010 requires pg-healthcheck to run directly on the PostgreSQL host (same as G14-013).\n\u003e Remote connections will receive an INFO skip with instructions to run locally.\n\n### G14 checks at a glance\n\n| Check | What it detects |\n|---|---|\n| G14-001 | pg_wal directory size vs configured GB thresholds |\n| G14-002 | Live WAL generation rate in MB/s |\n| G14-003 | Current rate vs rolling baseline (detects sudden spikes) |\n| G14-004 | WAL statistics summary from pg_stat_wal (PG 14+) |\n| G14-005 | Full-page write ratio — warns when FPI \u003e 40% of all records |\n| G14-006 | Top 5 WAL-generating tables by modification count |\n| G14-007 | wal_compression advisory |\n| G14-008 | wal_level=logical with no logical consumers (wastes WAL) |\n| G14-009 | WAL segment file count (high count = recycling blocked) |\n| G14-010 | WAL archiver status and time since last successful archive |\n| G14-011 | UNLOGGED tables (converting them causes a WAL spike) |\n| G14-012 | Forced checkpoint 
rate (checkpoints_req \u003e 20% = max_wal_size too small) |\n| G14-013 | pg_wal filesystem percentage (CRITICAL at 80% — no graceful degradation) |\n| G14-014 | Long transactions blocking WAL segment recycling |\n\n---\n\n## JSON Output Schema\n\n```json\n{\n  \"timestamp\": \"2025-01-15T10:30:00Z\",\n  \"hostname\": \"db1:5432\",\n  \"pg_version\": \"16.2\",\n  \"mode\": \"single\",\n  \"summary\": {\n    \"ok\": 88,\n    \"info\": 5,\n    \"warn\": 2,\n    \"critical\": 0,\n    \"total\": 95\n  },\n  \"checks\": [\n    {\n      \"check_id\": \"G14-002\",\n      \"group\": \"WAL Growth \u0026 Generation Rate\",\n      \"severity\": \"WARN\",\n      \"title\": \"WAL generation rate\",\n      \"observed\": \"WAL rate: 67.3 MB/s  (over 2.1s sample)\",\n      \"recommended\": \"Identify top WAL-generating tables; look for bulk writes or FPI storms.\",\n      \"detail\": \"\",\n      \"doc_url\": \"https://www.postgresql.org/docs/current/wal-configuration.html\",\n      \"node_name\": \"\"\n    }\n  ]\n}\n```\n\n---\n\n## Exit Codes\n\n```\n0  — all checks passed (or only INFO)\n1  — at least one WARN finding\n2  — at least one CRITICAL finding\n```\n\nUse in scripts and CI:\n\n```bash\n./pg-healthcheck --host prod-db \u0026\u0026 echo \"healthy\" || echo \"issues found (exit $?)\"\n```\n\n---\n\n## Project Structure\n\n```\npg-healthcheck/\n│\n├── cmd/pg-healthcheck/\n│   └── main.go                  CLI entry point — flags, orchestration, ask subcommand\n│\n├── internal/\n│   ├── config/\n│   │   └── config.go            Config struct, YAML loader, defaults\n│   │\n│   ├── connector/\n│   │   └── pg.go                PostgreSQL connection pool helper\n│   │\n│   ├── severity/\n│   │   └── severity.go          OK / INFO / WARN / CRITICAL type\n│   │\n│   ├── nlp/                     Natural language → check group mapping\n│   │   ├── provider.go          Provider interface + NewProvider() factory\n│   │   ├── ollama.go            Ollama /api/generate client\n│   │   
├── openai.go            OpenAI chat completions client\n│   │   ├── gemini.go            Google Gemini generateContent client\n│   │   ├── keywords.go          Keyword-to-group fallback mapping\n│   │   ├── mapper.go            MapQuery() — tries LLM, falls back to keywords\n│   │   └── mapper_test.go       Unit tests (mock servers for all three providers)\n│   │\n│   ├── checks/\n│   │   ├── checker.go           Finding struct + Checker interface\n│   │   ├── g01_connection.go    Connection \u0026 availability (9 checks)\n│   │   ├── g02_backrest.go      pgBackRest backup (14 checks)\n│   │   ├── g03_performance.go   Performance \u0026 query stats (17 checks)\n│   │   ├── g04_locks.go         Locks \u0026 blocking (10 checks)\n│   │   ├── g05_vacuum.go        Vacuum \u0026 bloat (11 checks)\n│   │   ├── g06_indexes.go       Indexes (9 checks)\n│   │   ├── g07_toast.go         TOAST \u0026 data integrity (9 checks)\n│   │   ├── g08_visibility.go    Visibility map (6 checks)\n│   │   ├── g09_wal_slots.go     WAL \u0026 replication slots (13 checks)\n│   │   ├── g10_upgrade.go       Upgrade readiness (15 checks)\n│   │   ├── g11_security.go      Security (8 checks)\n│   │   ├── g12_spock.go         pgEdge / Spock cluster (20 checks)\n│   │   ├── g13_os_resources.go  OS \u0026 resource-level (11 checks)\n│   │   └── g14_wal_growth.go    WAL growth \u0026 generation rate (14 checks)\n│   │\n│   └── report/\n│       └── reporter.go          Text + JSON output, composite alerts, exit code\n│\n├── healthcheck.yaml             All tunable thresholds (copy and customise)\n├── .goreleaser.yaml             GoReleaser — multi-platform release builds\n├── .golangci.yml                golangci-lint configuration\n├── .github/\n│   └── workflows/\n│       ├── ci.yml               CI — lint, vet, build, test on every push/PR\n│       └── release.yml          Release — builds \u0026 publishes binaries on v* tags\n├── go.mod\n└── README.md\n```\n\n---\n\n## Requirements\n\n- **Go 
1.25+** — install with `brew install go`\n- **PostgreSQL 13+** — checks that need PG 14/15/16/17 skip gracefully on older versions\n- **`pg_monitor` role** — recommended minimum privilege; grants access to all catalog views and `pg_stat_*` functions without superuser. Some checks (G11 security inspection, `amcheck` index verification) benefit from superuser privileges and will skip or return partial results without them\n- **pgBackRest** — G02 checks skip gracefully with a clear message if pgBackRest is not installed; no config needed on non-pgBackRest environments\n- **amcheck extension** — G07 B-tree and heap integrity checks skip if not installed (`CREATE EXTENSION amcheck`)\n- **pg_visibility extension** — G08-006 VM integrity checks (pg_check_visible / pg_check_frozen) skip if not installed (`CREATE EXTENSION pg_visibility`). Tables to scan must be listed in `pg_visibility_table_list` in `healthcheck.yaml`\n- **pgEdge Spock extension** — G12 emits a single INFO finding and skips all 20 checks if Spock is not installed; safe to run on standard PostgreSQL\n- **Local execution** — required for G13-010 (data directory disk space) and G14-013 (pg_wal filesystem); both use `syscall.Statfs` and must run on the PostgreSQL host. Remote connections receive a graceful INFO skip\n\n\u003e **Threshold tuning:** All warning and critical thresholds have safe built-in defaults but\n\u003e **should be reviewed and tuned to your workload** before treating findings as actionable.\n\u003e A threshold appropriate for a small reporting database will produce false positives on a\n\u003e high-throughput OLTP system, and vice versa. 
See the [Configuration File](#configuration-file-healthcheckyaml)\n\u003e section for a full list of tunable keys.\n\n---\n\n## CI \u0026 Releases\n\nEvery push and pull request to `main` runs the full CI pipeline:\n\n- **gofmt** — formatting check\n- **go vet** — static analysis\n- **golangci-lint** — errcheck, staticcheck, unused, ineffassign\n- **Cross-compile** — verified to build on linux/amd64, linux/arm64, darwin/amd64, darwin/arm64, windows/amd64\n- **go test -race** — race detector enabled\n\n### Cutting a release\n\nTag the commit and push — GoReleaser does the rest:\n\n```bash\ngit tag v0.2.0\ngit push origin v0.2.0\n```\n\nThis automatically builds binaries for all platforms, packages each one with `LICENSE`, `README.md`, and `healthcheck.yaml`, and publishes a GitHub Release with a generated changelog.\n\n---\n\n## License\n\npg-healthcheck is released under the [PostgreSQL License](LICENSE) — the same permissive license used by PostgreSQL itself.\n\nCopyright (c) 2025, Ahsan Hadi\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpgedge%2Fpg-healthcheck","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fpgedge%2Fpg-healthcheck","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpgedge%2Fpg-healthcheck/lists"}