{"id":19155931,"url":"https://github.com/lyokha/nginx-healthcheck-plugin","last_synced_at":"2025-03-17T07:08:41.810Z","repository":{"id":46271510,"uuid":"141179103","full_name":"lyokha/nginx-healthcheck-plugin","owner":"lyokha","description":"Active health checks and monitoring of Nginx upstreams","archived":false,"fork":false,"pushed_at":"2024-09-14T12:44:41.000Z","size":165,"stargazers_count":41,"open_issues_count":0,"forks_count":3,"subscribers_count":6,"default_branch":"master","last_synced_at":"2025-03-11T09:54:46.888Z","etag":null,"topics":["healthcheck","monitoring","nginx","upstream"],"latest_commit_sha":null,"homepage":"","language":"Haskell","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/lyokha.png","metadata":{"files":{"readme":"README.md","changelog":"Changelog.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-07-16T18:40:31.000Z","updated_at":"2024-11-09T17:04:29.000Z","dependencies_parsed_at":"2023-09-24T11:41:31.933Z","dependency_job_id":"ec30e15a-5265-4304-b697-df75cf4f31fa","html_url":"https://github.com/lyokha/nginx-healthcheck-plugin","commit_stats":{"total_commits":112,"total_committers":1,"mean_commits":112.0,"dds":0.0,"last_synced_commit":"63d479a8bdc2e92d461a0a93f2da5c1b584e72c1"},"previous_names":[],"tags_count":19,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lyokha%2Fnginx-healthcheck-plugin","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lyokha%2Fnginx-healthcheck-plugin/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lyokha%2Fng
inx-healthcheck-plugin/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lyokha%2Fnginx-healthcheck-plugin/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/lyokha","download_url":"https://codeload.github.com/lyokha/nginx-healthcheck-plugin/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":243989576,"owners_count":20379648,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["healthcheck","monitoring","nginx","upstream"],"created_at":"2024-11-09T08:32:43.824Z","updated_at":"2025-03-17T07:08:41.775Z","avatar_url":"https://github.com/lyokha.png","language":"Haskell","funding_links":[],"categories":[],"sub_categories":[],"readme":"Active health checks and monitoring of Nginx upstreams\n======================================================\n\n[![Build Status](https://github.com/lyokha/nginx-healthcheck-plugin/workflows/CI/badge.svg)](https://github.com/lyokha/nginx-healthcheck-plugin/actions?query=workflow%3ACI)\n[![Hackage](https://img.shields.io/hackage/v/ngx-export-healthcheck.svg?label=hackage%20%7C%20ngx-export-healthcheck\u0026logo=haskell\u0026logoColor=%239580D1)](https://hackage.haskell.org/package/ngx-export-healthcheck)\n\n**Disclaimer**: this is not an Nginx module in the traditional sense! It\ncompiles to a shared library that gets loaded in Nginx using directive\n`haskell load` from Nginx module\n[*nginx-haskell-module*](https://github.com/lyokha/nginx-haskell-module). Let's\ncall this *plugin*. 
The plugin provides support for active health checks and\nmonitoring of peers in normal *per-worker* and *shared* upstreams (those\ndeclared with directive `zone`, possibly with dynamic peers). It supports all\nkinds of balancing models present in Nginx (*round-robin*, *least_conn*, *hash*\nand so forth). Both health checks and monitoring are optional, meaning that they\ndo not depend on each other, and one feature may be switched off while the other\nis on.\n\nTable of contents\n-----------------\n\n- [What the active health checks here means](#what-the-active-health-checks-here-means)\n- [What the monitoring here means](#what-the-monitoring-here-means)\n- [Examples](#examples)\n    + [Normal upstreams, health checks and monitoring](#normal-upstreams-health-checks-and-monitoring)\n    + [Normal upstreams, only health checks](#normal-upstreams-only-health-checks)\n    + [Normal upstreams, only monitoring](#normal-upstreams-only-monitoring)\n    + [Shared upstreams, health checks and monitoring](#shared-upstreams-health-checks-and-monitoring)\n- [Nginx-based monitoring of normal upstreams](#nginx-based-monitoring-of-normal-upstreams)\n    + [Normal upstreams, health checks and monitoring, related changes](#normal-upstreams-health-checks-and-monitoring-related-changes)\n- [Periodic checks of healthy peers](#periodic-checks-of-healthy-peers)\n- [Collecting Prometheus metrics](#collecting-prometheus-metrics)\n    + [Normal upstreams, related changes](#normal-upstreams-related-changes)\n    + [Shared upstreams, related changes](#shared-upstreams-related-changes)\n- [Corner cases](#corner-cases)\n- [Building and installation](#building-and-installation)\n\nWhat the active health checks here means\n----------------------------------------\n\nWell, they are not completely *active*. They are active only in one direction.\nIf all the peers in a checked upstream respond successfully then the active\nhealth check does nothing! 
Success of a peer response is checked against\nthe rules of directive `proxy_next_upstream` and settings in directive `server`.\nThis is quite standard in Nginx: a peer gets marked as *failed* when *N* of its\nresponses do not pass the `proxy_next_upstream` rules within a time period of\n*T*. Values of *N* and *T* are defined in arguments *max_fails* and\n*fail_timeout* in directive `server` respectively: by default they are equal to\n*1* and *10s*. When a peer *fails*, it cannot be chosen for proxying requests\nduring the next time period of *T*.\n\nAnd here the active health checks come into play! They merely move the time of\nthe last check of failed peers forward periodically, say, every 5 seconds, which\nmakes Nginx unable to return them to the set of valid peers. So, do they stay\nfailed forever? No! The active health check tests them every 5 seconds against\nits own rules and recovers them when the check passes.\n\nWhat the monitoring here means\n------------------------------\n\nThere are traditional *per-worker* and *shared* upstreams. Monitoring of\ntraditional (or normal) upstreams is built in a dedicated *location* that\nreturns a JSON object which contains a list of *failed* peers nested in\n**worker's PID / health check service / upstream** hierarchy. Monitoring of\nshared upstreams is likewise built in a dedicated location that returns a JSON\nobject with a list of failed peers nested in **health check service / upstream**\nhierarchy. 
The meaning of the *health check service* will be explained later.\n\nExamples\n--------\n\nIn all examples below, the shared library gets built from a simple Haskell\ncode which only exports service handlers from module *NgxExport.Healthcheck*.\nThe code is saved in file *ngx_healthcheck.hs*.\n\n```haskell\nmodule NgxHealthcheck where\n\nimport NgxExport.Healthcheck ()\n```\n\n### Normal upstreams, health checks and monitoring\n\n```nginx\nuser                    nginx;\nworker_processes        4;\n\nevents {\n    worker_connections  1024;\n}\n\nhttp {\n    default_type        application/octet-stream;\n    sendfile            on;\n\n    upstream u_backend {\n        server localhost:8020 fail_timeout=600s;\n        server localhost:8030 fail_timeout=600s;\n    }\n\n    upstream u_backend0 {\n        server localhost:8040 fail_timeout=600s;\n        server localhost:8050 fail_timeout=600s backup;\n    }\n\n    upstream u_backend1 {\n        server localhost:8060 fail_timeout=600s;\n        server localhost:8050 fail_timeout=600s backup;\n    }\n\n    proxy_next_upstream error timeout http_502;\n\n    haskell load /var/lib/nginx/ngx_healthcheck.so;\n\n    haskell_run_service checkPeers $hs_service_healthcheck\n        'hs_service_healthcheck\n         Conf { upstreams     = [\"u_backend\"\n                                ,\"u_backend0\"\n                                ,\"u_backend1\"\n                                ]\n              , interval      = Sec 5\n              , peerTimeout   = Sec 2\n              , endpoint      = Just Endpoint { epUrl = \"/healthcheck\"\n                                              , epProto = Http\n                                              , epPassRule = DefaultPassRule\n                                              }\n              , sendStatsPort = Just 8100\n              }\n        ';\n\n    haskell_service_update_hook updatePeers $hs_service_healthcheck;\n\n    haskell_run_service checkPeers $hs_service_healthcheck0\n      
  'hs_service_healthcheck0\n         Conf { upstreams     = [\"u_backend\"]\n              , interval      = Sec 5\n              , peerTimeout   = Sec 2\n              , endpoint      = Just Endpoint { epUrl = \"/healthcheck\"\n                                              , epProto = Http\n                                              , epPassRule =\n                                                      PassRuleByHttpStatus\n                                                      [200, 404]\n                                              }\n              , sendStatsPort = Just 8100\n              }\n        ';\n\n    haskell_service_update_hook updatePeers $hs_service_healthcheck0;\n\n    haskell_run_service statsServer $hs_service_stats\n        'StatsServerConf { ssPort          = 8100\n                         , ssPurgeInterval = Min 5\n                         }\n        ';\n\n    haskell_service_var_in_shm stats 64k /tmp $hs_service_stats;\n\n    server {\n        listen          8010;\n        server_name     main;\n\n        location /pass {\n            proxy_pass http://u_backend;\n        }\n\n        location /pass0 {\n            proxy_pass http://u_backend0;\n        }\n\n        location /pass1 {\n            proxy_pass http://u_backend1;\n        }\n\n        location /stat {\n            allow 127.0.0.1;\n            deny all;\n            proxy_pass http://127.0.0.1:8100;\n        }\n    }\n}\n```\n\nIn this configuration two *health check services* are declared: they are bound\nto *service variables* `$hs_service_healthcheck` and `$hs_service_healthcheck0`.\nThe services run two instances of an *asynchronous* Haskell handler `checkPeers`\nin every Nginx worker process when the workers start. 
The services have\ncomplementary *service update hooks* that run a Haskell handler `updatePeers`\n*synchronously* in the main Nginx thread whenever the service variable is\nupdated, i.e. every 5 (or up to 7) seconds as stated in values *interval* and\n*peerTimeout* in *constructor* *Conf*. Each service has an associated unique\n*key* specified before the *Conf* declaration: in this example, keys are equal\nto the names of service variables. Service keys also define values of header\n*Host* (with port if given) and the host name for validation of server\ncertificates used in health checks over *https* (without port). The host name\nis derived from a service key according to the following rules:\n\n1. *if the service key contains no slashes*: the host name is equal to the\n   service key,\n2. *otherwise, if the service key contains a single slash and it is at its\n   end*: the host name is taken from the name of the server bound to the peer\n   in the upstream configuration,\n3. *otherwise (if the service key contains a slash before its end)*: the host\n   name is equal to the part of the service key after the first slash, for\n   example, if a service key is equal to *1/healthcheck* then the host name is\n   equal to *healthcheck*.\n\nBesides time intervals, constructor *Conf* defines a list of upstreams to check,\nan optional *endpoint*, and an optional monitoring port to which statistics\nabout failed peers are sent. Endpoints contain a URL on which health checks\nwill test failed peers, a transport protocol (*Http* or *Https*), and *passing\nrules* that describe in which cases failed peers must be regarded as recovered.\nCurrently, only two kinds of passing rules are supported: *DefaultPassRule*\n(when a peer responds with *HTTP status 200*) and *PassRuleByHttpStatus*. 
Internally, `updatePeers` calls a C function that reads\nand updates Nginx data related to peer statuses.\n\nDo not hesitate to declare as many Haskell services as you want, because they\nare very cheap. When you have to check against a large number of upstreams, it\nis even recommended to split services because this may shorten the duration of\nthe synchronous service hook.\n\nMonitoring service *statsServer* collects statistics of failed peers. This is a\n*shared* service because the associated service variable is declared as shared\nin directive `haskell_service_var_in_shm`, and therefore it runs on a single\narbitrary Nginx worker all the time. Internally, it is implemented using [*Snap\nframework*](http://snapframework.com/). The server is bound to the local IP\naddress *127.0.0.1* and is only accessible from outside via a dedicated location\n*/stat*. It will listen to incoming requests on the port specified in option\n*ssPort*. Option *ssPurgeInterval* defines how often stats from dead workers\nwill be purged.\n\nBelow is a typical response from the monitoring service.\n\n```json\n{\n  \"31493\": [\n    \"2020-10-08T11:27:05.914756785Z\",\n    {\n      \"hs_service_healthcheck\": {\n        \"u_backend\": [\n          \"127.0.0.1:8020\"\n        ],\n        \"u_backend1\": [\n          \"127.0.0.1:8060\",\n          \"127.0.0.1:8050\"\n        ]\n      },\n      \"hs_service_healthcheck0\": {\n        \"u_backend\": [\n          \"127.0.0.1:8020\"\n        ]\n      }\n    }\n  ],\n  \"31494\": [\n    \"2020-10-08T11:27:05.945503075Z\",\n    {\n      \"hs_service_healthcheck\": {\n        \"u_backend\": [\n          \"127.0.0.1:8020\"\n        ]\n      },\n      \"hs_service_healthcheck0\": {\n        \"u_backend\": [\n          \"127.0.0.1:8020\"\n        ]\n      }\n    }\n  ],\n  \"31496\": [\n    \"2020-10-08T11:27:05.946501076Z\",\n    {}\n  ],\n  \"31497\": [\n    \"2020-10-08T11:27:05.932294049Z\",\n    {\n      \"hs_service_healthcheck\": {\n     
   \"u_backend\": [\n          \"127.0.0.1:8020\"\n        ]\n      },\n      \"hs_service_healthcheck0\": {\n        \"u_backend\": [\n          \"127.0.0.1:8020\"\n        ]\n      }\n    }\n  ]\n}\n```\n\nWhen the monitoring service is accessed via URL */stat/merge*, its response data\ngets merged from all worker processes.\n\n```json\n{\n  \"hs_service_healthcheck\": {\n    \"u_backend\": [\n      [\n        \"2020-10-08T11:27:05.945503075Z\",\n        \"127.0.0.1:8020\"\n      ]\n    ],\n    \"u_backend1\": [\n      [\n        \"2020-10-08T11:27:05.914756785Z\",\n        \"127.0.0.1:8060\"\n      ],\n      [\n        \"2020-10-08T11:27:05.914756785Z\",\n        \"127.0.0.1:8050\"\n      ]\n    ]\n  },\n  \"hs_service_healthcheck0\": {\n    \"u_backend\": [\n      [\n        \"2020-10-08T11:27:05.945503075Z\",\n        \"127.0.0.1:8020\"\n      ]\n    ]\n  }\n}\n```\n\nIn this *merged view*, all faulty peers are tagged with times of their latest\nchecks. Notice also that upstreams in the merged view never contain duplicate\npeers.\n\n### Normal upstreams, only health checks\n\nHenceforth I am going to skip unrelated parts of the configuration for brevity.\n\n```nginx\n    haskell_run_service checkPeers $hs_service_healthcheck\n        'hs_service_healthcheck\n         Conf { upstreams     = [\"u_backend\"\n                                ,\"u_backend0\"\n                                ,\"u_backend1\"\n                                ]\n              , interval      = Sec 5\n              , peerTimeout   = Sec 2\n              , endpoint      = Just Endpoint { epUrl = \"/healthcheck\"\n                                              , epProto = Http\n                                              , epPassRule = DefaultPassRule\n                                              }\n              , sendStatsPort = Nothing\n              }\n        ';\n\n    haskell_service_update_hook updatePeers $hs_service_healthcheck;\n\n    haskell_run_service checkPeers 
$hs_service_healthcheck0\n        'hs_service_healthcheck0\n         Conf { upstreams     = [\"u_backend\"]\n              , interval      = Sec 5\n              , peerTimeout   = Sec 2\n              , endpoint      = Just Endpoint { epUrl = \"/healthcheck\"\n                                              , epProto = Http\n                                              , epPassRule =\n                                                      PassRuleByHttpStatus\n                                                      [200, 404]\n                                              }\n              , sendStatsPort = Nothing\n              }\n        ';\n\n    server {\n        listen          8010;\n        server_name     main;\n\n        location /pass {\n            proxy_pass http://u_backend;\n        }\n\n        location /pass0 {\n            proxy_pass http://u_backend0;\n        }\n\n        location /pass1 {\n            proxy_pass http://u_backend1;\n        }\n    }\n}\n```\n\nValues of *sendStatsPort* in both the health check services are not defined\n(i.e. 
they are equal to the special value *Nothing*), service *statsServer* and\nlocation */stat* are removed.\n\n### Normal upstreams, only monitoring\n\n```nginx\n    haskell_run_service checkPeers $hs_service_healthcheck\n        'hs_service_healthcheck\n         Conf { upstreams     = [\"u_backend\"\n                                ,\"u_backend0\"\n                                ,\"u_backend1\"\n                                ]\n              , interval      = Sec 5\n              , peerTimeout   = Sec 2\n              , endpoint      = Nothing\n              , sendStatsPort = Just 8100\n              }\n        ';\n\n    haskell_service_update_hook updatePeers $hs_service_healthcheck;\n\n    haskell_run_service checkPeers $hs_service_healthcheck0\n        'hs_service_healthcheck0\n         Conf { upstreams     = [\"u_backend\"]\n              , interval      = Sec 5\n              , peerTimeout   = Sec 2\n              , endpoint      = Nothing\n              , sendStatsPort = Just 8100\n              }\n        ';\n\n    haskell_run_service statsServer $hs_service_stats\n        'StatsServerConf { ssPort          = 8100\n                         , ssPurgeInterval = Min 5\n                         }\n        ';\n\n    haskell_service_var_in_shm stats 64k /tmp $hs_service_stats;\n\n    server {\n        listen          8010;\n        server_name     main;\n\n        location /pass {\n            proxy_pass http://u_backend;\n        }\n\n        location /pass0 {\n            proxy_pass http://u_backend0;\n        }\n\n        location /pass1 {\n            proxy_pass http://u_backend1;\n        }\n\n        location /stat {\n            allow 127.0.0.1;\n            deny all;\n            proxy_pass http://127.0.0.1:8100;\n        }\n    }\n}\n```\n\nEndpoints are not defined, while the monitoring service and location */stat*\nare intact.\n\n### Shared upstreams, health checks and monitoring\n\n```nginx\n    upstream u_backend {\n        zone u_backend 64k;\n        server 
localhost:8020 fail_timeout=600s;\n        server localhost:8030 fail_timeout=600s;\n    }\n\n    upstream u_backend0 {\n        zone u_backend 64k;\n        server localhost:8040 fail_timeout=600s;\n        server localhost:8050 fail_timeout=600s backup;\n    }\n\n    upstream u_backend1 {\n        zone u_backend 64k;\n        server localhost:8060 fail_timeout=600s;\n        server localhost:8050 fail_timeout=600s backup;\n    }\n\n    haskell_run_service checkPeers $hs_service_healthcheck\n        'hs_service_healthcheck\n         Conf { upstreams     = [\"u_backend\"\n                                ,\"u_backend0\"\n                                ,\"u_backend1\"\n                                ]\n              , interval      = Sec 5\n              , peerTimeout   = Sec 2\n              , endpoint      = Just Endpoint { epUrl = \"/healthcheck\"\n                                              , epProto = Http\n                                              , epPassRule = DefaultPassRule\n                                              }\n              , sendStatsPort = Nothing\n              }\n        ';\n\n    haskell_service_update_hook updatePeers $hs_service_healthcheck;\n\n    haskell_run_service checkPeers $hs_service_healthcheck0\n        'hs_service_healthcheck0\n         Conf { upstreams     = [\"u_backend\"]\n              , interval      = Sec 5\n              , peerTimeout   = Sec 2\n              , endpoint      = Just Endpoint { epUrl = \"/healthcheck\"\n                                              , epProto = Http\n                                              , epPassRule =\n                                                      PassRuleByHttpStatus\n                                                      [200, 404]\n                                              }\n              , sendStatsPort = Nothing\n              }\n        ';\n\n    haskell_service_update_hook updatePeers $hs_service_healthcheck0;\n\n    haskell_service_var_in_shm upstreams 
64k /tmp\n            $hs_service_healthcheck $hs_service_healthcheck0;\n\n    server {\n        listen          8010;\n        server_name     main;\n\n        location /pass {\n            proxy_pass http://u_backend;\n        }\n\n        location /pass0 {\n            proxy_pass http://u_backend0;\n        }\n\n        location /pass1 {\n            proxy_pass http://u_backend1;\n        }\n\n        location /stat {\n            allow 127.0.0.1;\n            deny all;\n            haskell_async_content reportPeers;\n        }\n    }\n\n```\n\nThe upstreams are now associated with a *shared memory zone*. The health check\nservices become *shared*, and the monitoring service gets replaced with a simple\ncontent handler `reportPeers` because peers are now shared between all Nginx\nworker processes.\n\nCombinations of *only health checks / only monitoring* are built trivially like\nin the case of normal per-worker upstreams.\n\nBelow is a sample stats output for shared upstreams.\n\n```json\n{\n  \"hs_service_healthcheck\": {\n    \"u_backend\": [\n      \"127.0.0.1:8020\"\n    ],\n    \"u_backend1\": [\n      \"127.0.0.1:8060\",\n      \"127.0.0.1:8050\"\n    ]\n  },\n  \"hs_service_healthcheck0\": {\n    \"u_backend\": [\n      \"127.0.0.1:8020\"\n    ]\n  }\n}\n```\n\nNginx-based monitoring of normal upstreams\n------------------------------------------\n\nService *statsServer* is implemented using *Snap framework*. Basically, a native\nNginx implementation is not easy because the service must listen on a single\n(not duplicated) file descriptor, which is not the case when Nginx spawns more\nthan one worker process. Running *statsServer* as a shared service is an\nelegant solution as shared services guarantee that they occupy only one worker\nat a time. However, *nginx-haskell-module* provides directive *single_listener*\nwhich can be used to apply the required restriction in a custom Nginx virtual\nserver. 
This directive requires that the virtual server listens with option\n*reuseport* and is only available on Linux with socket option\n*SO_ATTACH_REUSEPORT_CBPF*.\n\nLet's replace *statsServer* from the example with normal upstreams and\nmonitoring with an Nginx-based monitoring service using *single_listener* and\nlistening on port *8200*.\n\n### Normal upstreams, health checks and monitoring, related changes\n\n```nginx\n    haskell_run_service checkPeers $hs_service_healthcheck\n        'hs_service_healthcheck\n         Conf { upstreams     = [\"u_backend\"\n                                ,\"u_backend0\"\n                                ,\"u_backend1\"\n                                ]\n              , interval      = Sec 5\n              , peerTimeout   = Sec 2\n              , endpoint      = Just Endpoint { epUrl = \"/healthcheck\"\n                                              , epProto = Http\n                                              , epPassRule = DefaultPassRule\n                                              }\n              , sendStatsPort = Just 8200\n              }\n        ';\n\n# ...\n\n\n    haskell_run_service checkPeers $hs_service_healthcheck0\n        'hs_service_healthcheck0\n         Conf { upstreams     = [\"u_backend\"]\n              , interval      = Sec 5\n              , peerTimeout   = Sec 2\n              , endpoint      = Just Endpoint { epUrl = \"/healthcheck\"\n                                              , epProto = Http\n                                              , epPassRule =\n                                                      PassRuleByHttpStatus\n                                                      [200, 404]\n                                              }\n              , sendStatsPort = Just 8200\n              }\n        ';\n\n# ...\n\n    haskell_var_empty_on_error $hs_stats;\n\n# ...\n\n        location /stat {\n            allow 127.0.0.1;\n            deny all;\n            proxy_pass 
http://127.0.0.1:8200;\n        }\n\n# ...\n\n    server {\n        listen          8200 reuseport;\n        server_name     stats;\n\n        single_listener on;\n\n        location /report {\n            haskell_run_async_on_request_body receiveStats $hs_stats \"Min 1\";\n\n            if ($hs_stats = '') {\n                return 400;\n            }\n\n            return 200;\n        }\n\n        location /stat {\n            haskell_async_content sendStats noarg;\n        }\n\n        location /stat/merge {\n            haskell_async_content sendMergedStats noarg;\n        }\n    }\n}\n```\n\nHandler *receiveStats* accepts a time interval corresponding to the value of\n*ssPurgeInterval* from service *statsServer*. If the value is not readable (say,\n*noarg*) then it defaults to *Min 5*.\n\nNotice that the monitoring virtual server must listen on address *127.0.0.1*\nbecause service *checkPeers* reports stats to this address.\n\nPeriodic checks of healthy peers\n--------------------------------\n\nThe plugin periodically checks only faulty peers. Healthy peers become faulty\nif they do not pass the standard Nginx rules, but this may happen only when\nclient requests trigger forwarding to the corresponding upstreams. Sometimes, it\nmakes sense to check healthy peers unconditionally. One way to achieve this is\nto run periodic active checks of the corresponding upstreams. 
For this, let's\nuse Haskell module\n[*NgxExport.Tools.Subrequest*](https://hackage.haskell.org/package/ngx-export-tools-extra/docs/NgxExport-Tools-Subrequest.html).\n\nBelow is the source code of the shared library.\n\n```haskell\n{-# LANGUAGE TemplateHaskell #-}\n\nmodule NgxHealthcheckPeriodic where\n\nimport           NgxExport\nimport           NgxExport.Tools\nimport           NgxExport.Tools.Subrequest\n\nimport           NgxExport.Healthcheck ()\n\nimport           Data.ByteString (ByteString)\nimport qualified Data.ByteString.Lazy as L\n\nmakeRequest :: ByteString -\u003e Bool -\u003e IO L.ByteString\nmakeRequest = const . makeSubrequest\n\nngxExportSimpleService 'makeRequest $ PersistentService $ Just $ Sec 10\n```\n\nThe periodic services must be declared in the Nginx configuration for all the\nupstreams to check.\n\n```nginx\n    haskell_run_service simpleService_makeRequest $hs_check_u_backend\n            '{\"uri\": \"http://127.0.0.1:8010/Local/check/0/u_backend\"}';\n\n    haskell_run_service simpleService_makeRequest $hs_check_u_backend0\n            '{\"uri\": \"http://127.0.0.1:8010/Local/check/0/u_backend0\"}';\n\n    haskell_run_service simpleService_makeRequest $hs_check_u_backend1\n            '{\"uri\": \"http://127.0.0.1:8010/Local/check/0/u_backend1\"}';\n```\n\nLocation */Local/check/0/* shall proxy to the specified upstream.\n\n```nginx\n        location ~ ^/Local/check/0/(.+) {\n            allow 127.0.0.1;\n            deny all;\n            haskell_run_async makeSubrequest $hs_subrequest\n                    '{\"uri\": \"http://127.0.0.1:8010/Local/check/1/$1\"}';\n            return 200;\n        }\n\n        location ~ ^/Local/check/1/(.+) {\n            allow 127.0.0.1;\n            deny all;\n            proxy_pass http://$1/healthcheck;\n        }\n```\n\nThe intermediate *makeSubrequest* catches possible *5xx* and other bad HTTP\nstatuses received from the upstream to prevent the periodic checks from throwing\nexceptions. 
Additionally, checking the response status in the intermediate\nlocation can be used for alerting that all servers in the upstream have failed.\n\nSadly, the subrequests may arrive at arbitrary worker processes, which means\nthat there is no guarantee that faulty peers in *normal* upstreams will be\nchecked in all worker processes! The worker processes get chosen not only\nrandomly but also not fairly: a single worker process may serve all incoming\nrequests for a very long time. To make load between the workers more uniform, we\ncan forward the subrequests to some dedicated virtual server listening on a port\nwith *reuseport* and, additionally, randomize the hash for the *reuseport* by\nalways changing ports of the subrequests with HTTP request header\n*Connection: close*.\n\n```nginx\n    haskell_run_service simpleService_makeRequest $hs_check_u_backend\n            '{\"uri\": \"http://127.0.0.1:8011/Local/check/0/u_backend\"}';\n\n    haskell_run_service simpleService_makeRequest $hs_check_u_backend0\n            '{\"uri\": \"http://127.0.0.1:8011/Local/check/0/u_backend0\"}';\n\n    haskell_run_service simpleService_makeRequest $hs_check_u_backend1\n            '{\"uri\": \"http://127.0.0.1:8011/Local/check/0/u_backend1\"}';\n```\n\n```nginx\n    server {\n        listen          8011 reuseport;\n        server_name     aux_fair_load;\n\n        location ~ ^/Local/check/0/(.+) {\n            allow 127.0.0.1;\n            deny all;\n            haskell_run_async makeSubrequest $hs_subrequest\n                    '{\"uri\": \"http://127.0.0.1:8011/Local/check/1/$1\"\n                     ,\"headers\": [[\"Connection\", \"close\"]]\n                     }';\n            return 200;\n        }\n\n        location ~ ^/Local/check/1/(.+) {\n            allow 127.0.0.1;\n            deny all;\n            proxy_pass http://$1/healthcheck;\n        }\n    }\n```\n\nSince faulty servers from *normal* upstreams may still appear arbitrarily\nin different worker 
processes, it makes sense to monitor them using the *merged\nview*, i.e. via URL */stat/merge*.\n\nCollecting Prometheus metrics\n-----------------------------\n\nWith modules\n[*NgxExport.Tools.Prometheus*](https://hackage.haskell.org/package/ngx-export-tools-extra/docs/NgxExport-Tools-Prometheus.html),\n[*NgxExport.Tools.Subrequest*](https://hackage.haskell.org/package/ngx-export-tools-extra/docs/NgxExport-Tools-Subrequest.html),\nand [*nginx-custom-counters-module*](https://github.com/lyokha/nginx-custom-counters-module),\ncustom Prometheus metrics can be collected. Let's monitor the number of\ncurrently failed peers in all checked upstreams by extending examples from\nsection [*Examples*](#examples).\n\nBelow is the source code of the shared library.\n\n```haskell\n{-# LANGUAGE TemplateHaskell, TypeApplications, ViewPatterns #-}\n\nmodule NgxHealthcheckPrometheus where\n\nimport           NgxExport\nimport           NgxExport.Tools\nimport           NgxExport.Tools.Prometheus ()\nimport           NgxExport.Tools.Subrequest ()\n\nimport           NgxExport.Healthcheck\n\nimport qualified Data.Map.Strict as M\nimport qualified Data.Map.Lazy as ML\nimport           Data.ByteString (ByteString)\nimport qualified Data.ByteString.Lazy as L\nimport qualified Data.ByteString.Char8 as C8\nimport qualified Data.ByteString.Lazy.Char8 as C8L\nimport qualified Data.Text.Encoding as T\nimport           Data.Binary\nimport           Data.Maybe\n\ntype MergedStats = MServiceKey AnnotatedFlatPeers\ntype SharedStats = MServiceKey FlatPeers\ntype FlatStats = MUpstream Int\n\ntoFlatStats :: MServiceKey a -\u003e FlatStats\ntoFlatStats = ML.foldr (flip $ M.foldrWithKey $\n                           \\k v -\u003e M.alter (setN $ length v) k\n                       ) M.empty\n    where setN n Nothing = Just n\n          setN n (Just m) = Just $ max n m\n\nmergedFlatStats :: ByteString -\u003e L.ByteString\nmergedFlatStats =\n    encode . toFlatStats . fromJust . 
readFromByteStringAsJSON @MergedStats\n\nngxExportYY 'mergedFlatStats\n\nsharedFlatStats :: ByteString -\u003e L.ByteString\nsharedFlatStats =\n    encode . toFlatStats . fromJust . readFromByteStringAsJSON @SharedStats\n\nngxExportYY 'sharedFlatStats\n\nnFailedServers :: ByteString -\u003e L.ByteString\nnFailedServers v =\n    let (T.decodeUtf8 -\u003e u, decode @FlatStats . L.fromStrict . C8.tail -\u003e s) =\n            C8.break (== '|') v\n    in C8L.pack $ show $ fromMaybe 0 $ M.lookup u s\n\nngxExportYY 'nFailedServers\n```\n\n### Normal upstreams, related changes\n\n```nginx\n    haskell_run_service simpleService_prometheusConf $hs_prometheus_conf\n            'PrometheusConf\n                { pcMetrics = fromList\n                    [(\"cnt_upstream_failure\"\n                      ,\"Number of servers which are currently failed\")\n                    ]\n                , pcGauges = fromList\n                    [\"cnt_upstream_failure@upstream=(u_backend)\"\n                    ,\"cnt_upstream_failure@upstream=(u_backend0)\"\n                    ,\"cnt_upstream_failure@upstream=(u_backend1)\"\n                    ]\n                , pcScale1000 = fromList []\n                }';\n\n# ...\n\n        location /stat {\n            allow 127.0.0.1;\n            deny all;\n            proxy_pass http://127.0.0.1:8100;\n        }\n\n        location /stat/merge {\n            allow 127.0.0.1;\n            deny all;\n\n            counter $cnt_upstream_failure@upstream=(u_backend)\n                    set $hs_n_u_backend;\n            counter $cnt_upstream_failure@upstream=(u_backend0)\n                    set $hs_n_u_backend0;\n            counter $cnt_upstream_failure@upstream=(u_backend1)\n                    set $hs_n_u_backend1;\n\n            haskell_run_async makeSubrequestFull $hs_subrequest\n                    '{\"uri\": \"http://127.0.0.1:8100/stat/merge\"}';\n            haskell_run extractBodyFromFullResponse $hs_subrequest_body\n                
    $hs_subrequest;\n            haskell_run mergedFlatStats $hs_failed_backends\n                    $hs_subrequest_body;\n            haskell_run nFailedServers $hs_n_u_backend\n                    u_backend|$hs_failed_backends;\n            haskell_run nFailedServers $hs_n_u_backend0\n                    u_backend0|$hs_failed_backends;\n            haskell_run nFailedServers $hs_n_u_backend1\n                    u_backend1|$hs_failed_backends;\n\n            haskell_content fromFullResponse $hs_subrequest;\n        }\n\n        location /metrics {\n            allow 127.0.0.1;\n            deny all;\n\n            haskell_run_async makeSubrequest $hs_subrequest\n                    '{\"uri\": \"http://127.0.0.1:8010/stat/merge\"}';\n            haskell_async_content prometheusMetrics \n                    '[\"main\", $cnt_collection, {}, {}]';\n        }\n```\n\n### Shared upstreams, related changes\n\n```nginx\n    haskell_run_service simpleService_prometheusConf $hs_prometheus_conf\n            'PrometheusConf\n                { pcMetrics = fromList\n                    [(\"cnt_upstream_failure\"\n                      ,\"Number of servers which are currently failed\")\n                    ]\n                , pcGauges = fromList\n                    [\"cnt_upstream_failure@upstream=(u_backend)\"\n                    ,\"cnt_upstream_failure@upstream=(u_backend0)\"\n                    ,\"cnt_upstream_failure@upstream=(u_backend1)\"\n                    ]\n                , pcScale1000 = fromList []\n                }';\n\n# ...\n\n        location /stat {\n            allow 127.0.0.1;\n            deny all;\n            haskell_async_content reportPeers;\n        }\n\n        location /stat/shared {\n            allow 127.0.0.1;\n            deny all;\n\n            counter $cnt_upstream_failure@upstream=(u_backend)\n                    set $hs_n_u_backend;\n            counter $cnt_upstream_failure@upstream=(u_backend0)\n                    set 
$hs_n_u_backend0;\n            counter $cnt_upstream_failure@upstream=(u_backend1)\n                    set $hs_n_u_backend1;\n\n            haskell_run_async makeSubrequestFull $hs_subrequest\n                    '{\"uri\": \"http://127.0.0.1:8010/stat\"}';\n            haskell_run extractBodyFromFullResponse $hs_subrequest_body\n                    $hs_subrequest;\n            haskell_run sharedFlatStats $hs_failed_backends\n                    $hs_subrequest_body;\n            haskell_run nFailedServers $hs_n_u_backend\n                    u_backend|$hs_failed_backends;\n            haskell_run nFailedServers $hs_n_u_backend0\n                    u_backend0|$hs_failed_backends;\n            haskell_run nFailedServers $hs_n_u_backend1\n                    u_backend1|$hs_failed_backends;\n\n            haskell_content fromFullResponse $hs_subrequest;\n        }\n\n        location /metrics {\n            allow 127.0.0.1;\n            deny all;\n\n            haskell_run_async makeSubrequest $hs_subrequest\n                    '{\"uri\": \"http://127.0.0.1:8010/stat/shared\"}';\n            haskell_async_content prometheusMetrics\n                    '[\"main\", $cnt_collection, {}, {}]';\n        }\n```\n\nBelow is a typical sample of the Prometheus metrics.\n\n```ShellSession\n# HELP cnt_upstream_failure Number of servers which are currently failed\n# TYPE cnt_upstream_failure gauge\ncnt_upstream_failure{upstream=\"u_backend\"} 2.0\ncnt_upstream_failure{upstream=\"u_backend0\"} 0.0\ncnt_upstream_failure{upstream=\"u_backend1\"} 0.0\n```\n\nCorner cases\n------------\n\n- In Nginx before version *1.12.0*, there was a so-called *quick recovery*\n  mechanism: when all peers in the main or the backup part of an upstream\n  failed, they were recovered immediately. Health checks are therefore\n  impossible in that case.\n\n- When the main or the backup part of an upstream contains only one peer, then\n  the peer never fails. 
For the sake of the health checks, such an upstream\n  could be extended by a fake server marked as *down*, for example\n\n    ```nginx\n        upstream u_backend {\n            server localhost:8020;\n            server localhost:8020 down;\n        }\n    ```\n\n  But there is a better solution proposed in Nginx module\n  [*nginx-combined-upstreams-module*](https://github.com/lyokha/nginx-combined-upstreams-module).\n\n    ```nginx\n        upstream u_backend {\n            server localhost:8020;\n            extend_single_peers;\n        }\n    ```\n\n  Directive `extend_single_peers` adds a fake peer only when an upstream part\n  (the main or the backup) contains a single peer.\n\nBuilding and installation\n-------------------------\n\nThe plugin contains Haskell and C parts, and thus, it requires *ghc*, *Cabal*,\n*gcc*, and a directory with the Nginx sources. The build tool also requires\n[*patchelf*](https://github.com/NixOS/patchelf) and utility *nhm-tool* which is\nshipped with package\n[*ngx-export-distribution*](https://hackage.haskell.org/package/ngx-export-distribution).\n\nLet's first install the Nginx module. For this, go to the directory with the\nNginx source code,\n\n```ShellSession\n$ cd /path/to/nginx/sources\n```\n\ncompile,\n\n```ShellSession\n$ ./configure --add-dynamic-module=/path/to/nginx-healthcheck-plugin/sources\n$ make modules\n```\n\nand install *ngx_healthcheck_plugin.so*.\n\n```ShellSession\n$ export NGX_HS_INSTALL_DIR=/var/lib/nginx\n$ sudo install -d $NGX_HS_INSTALL_DIR\n$ sudo cp objs/ngx_healthcheck_plugin.so $NGX_HS_INSTALL_DIR/libngx_healthcheck_plugin.so\n```\n\nNotice that we added prefix *lib* to the module's name!\n\nNow let's build the Haskell code. 
For this, go to one of the directories with\nHaskell handlers: *simple/*, *periodic/*, or *prometheus/*.\n\n```ShellSession\n$ cd -\n$ cd simple\n```\n\nBefore running *make*, tune the *constraints* stanza in *cabal.project*.\nCurrently, it should look similar to\n\n```Cabal Config\nconstraints: ngx-export-healthcheck +snapstatsserver +healthcheckhttps\n```\n\nThis line forces building the Snap monitoring server and support for secure\nconnections to endpoints. To disable them, replace *+snapstatsserver* with\n*-snapstatsserver* and *+healthcheckhttps* with *-healthcheckhttps*. To let\nCabal decide automatically whether to build these features, remove the\nconstraints.\n\nNow run\n\n```ShellSession\n$ make PREFIX=$NGX_HS_INSTALL_DIR\n$ sudo make PREFIX=$NGX_HS_INSTALL_DIR install\n```\n\nor simply\n\n```ShellSession\n$ make\n$ sudo make install\n```\n\nif the installation directory is */var/lib/nginx/*.\n\nWith ghc older than *9.0.1*, build with\n\n```ShellSession\n$ make LINKRTS=-lHSrts_thr-ghc$(ghc --numeric-version)\n```\n\nBy default, package *ngx-export-healthcheck* gets installed from *Hackage*. To\nbuild it locally, augment stanza *packages* inside *cabal.project* as described\nin the comment attached to it.\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flyokha%2Fnginx-healthcheck-plugin","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Flyokha%2Fnginx-healthcheck-plugin","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flyokha%2Fnginx-healthcheck-plugin/lists"}