{"id":13582125,"url":"https://github.com/openshift/autoheal","last_synced_at":"2025-04-11T09:31:13.990Z","repository":{"id":57516510,"uuid":"126371616","full_name":"openshift/autoheal","owner":"openshift","description":"Autoheals based on monitoring alerts","archived":false,"fork":false,"pushed_at":"2020-06-22T13:02:34.000Z","size":25033,"stargazers_count":67,"open_issues_count":0,"forks_count":24,"subscribers_count":205,"default_branch":"master","last_synced_at":"2025-03-25T12:05:29.238Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/openshift.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2018-03-22T17:25:28.000Z","updated_at":"2024-10-12T02:31:10.000Z","dependencies_parsed_at":"2022-09-26T18:00:50.949Z","dependency_job_id":null,"html_url":"https://github.com/openshift/autoheal","commit_stats":null,"previous_names":[],"tags_count":526,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/openshift%2Fautoheal","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/openshift%2Fautoheal/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/openshift%2Fautoheal/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/openshift%2Fautoheal/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/openshift","download_url":"https://codeload.github.com/openshift/autoheal/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":24
8368207,"owners_count":21092317,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-01T15:02:26.496Z","updated_at":"2025-04-11T09:31:12.163Z","avatar_url":"https://github.com/openshift.png","language":"Go","readme":"# Auto-heal Service\n\nThis project contains the _auto-heal_ service. It receives alert notifications\nfrom the Prometheus alert manager and executes Ansible playbooks to resolve the\nroot cause.\n\n## Configuration\n\nMost of the configuration of the auto-heal service is kept in a YAML\nconfiguration file. The name of the configuration file is specified using the\n`--config-file` command line option. If this option isn't explicitly given then\nthe service will try to load the `autoheal.yml` file from the current working\ndirectory.\n\nIn addition to the configuration file the auto-heal service also uses command\nline options to configure the connection to the Kubernetes API and the log\nlevel. Use the `-h` option to get a complete list of these command line options.\n\nThe `--kubeconfig` command line option is used to specify the location of the\nKubernetes client configuration file. When running outside of a Kubernetes\ncluster the auto-heal service will use `$HOME/.kube/config` by default, the same\nused by the `kubectl` command. When running inside a Kubernetes cluster it will\nuse the configuration that Kubernetes mounts automatically in the pod file\nsystem. 
So in most cases this command line option won't have to be explicitly\nincluded.\n\nAssuming that you want to have your own `my.yml` configuration file a typical\ncommand line will be the following:\n\n```bash\n$ autoheal server --config-file=my.yml --logtostderr\n```\n\nSee the `autoheal.yml` file for a complete example.\n\n### AWX or AnsibleTower configuration\n\nThe first section of the configuration file is named `awx` and it contains all\nthe details needed to connect to the [AWX](https://www.ansible.com/products/awx-project)\nor [Ansible Tower](https://www.ansible.com/products/tower) server:\n\n```yaml\nawx:\n  address: https://myawx.example.com/api\n  proxy: http://myproxy.example.com:3128\n  credentialsRef:\n    namespace: my-namespace\n    name: my-awx-credentials\n  tlsRef:\n    namespace: my-namespace\n    name: my-awx-ca\n  project: \"Auto-heal\"\n```\n\nThe `address` parameter is the URL of the API of the AWX server. It should\ncontain the `/api` suffix, but not the `/v1` or `/v2` suffix, as the auto-heal\nservice will internally decide which version to use.\n\nThe `proxy` parameter is optional, and it indicates what HTTP proxy should be\nused to connect to the AWX API. If this parameter is not specified, or if it is\nempty, then the connection will be direct to the AWX server, without a proxy.\n\nThe `credentialsRef` parameter is a reference to the [Kubernetes\nsecret](https://kubernetes.io/docs/concepts/configuration/secret) that contains\nthe user name and password used to connect to the AWX API. That secret should\ncontain the `username` and `password` keys. For example:\n\n```yaml\napiVersion: v1\nkind: Secret\nmetadata:\n  namespace: my-namespace\n  name: my-awx-credentials\ndata:\n  username: YWxlcnQtaGVhbGVy\n  password: ...\n```\n\nAlternatively it is also possible to specify the user name and password directly\ninside the configuration file, using the `credentials` section. 
For example:\n\n```yaml\ncredentials:\n  username: autoheal\n  password: ...\n```\n\nThis is very convenient for development environments, but it is not recommended\nfor production environments, as then the configuration file needs to be\nprotected very carefully. For example, you can create a separate file for the\ncredentials, give it restricted permissions, and then load it using the\n`--config-file` option twice:\n\n```\n$ cat \u003e general.yml \u003c\u003c.\nawx:\n  address: https://myawx.example.com/api\n.\n$ cat \u003e credentials.yml \u003c\u003c.\ncredentials:\n  username: \"autoheal\"\n  password: \"...\"\n.\n$ chmod u=r,g=,o= credentials.yml\n$ autoheal server --config-file=general.yml --config-file=credentials.yml\n```\n\nThe `tlsRef` parameter is a reference to the [Kubernetes\nsecret](https://kubernetes.io/docs/concepts/configuration/secret) that contains\nthe certificates used to connect to the AWX API. That secret should contain the\n`ca.crt` key, for example:\n\n```yaml\napiVersion: v1\nkind: Secret\nmetadata:\n  namespace: my-namespace\n  name: my-awx-tls\ndata:\n  ca.crt: |-\n    LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUMvVENDQWVXZ0F3SUJBZ0lKQUxNRXB6OWxa\n    VkVzdzI3Sm5BYlMyejNhbUF0YTc1QmNnVGcvOUFCdDV0VVc2VTJOKzkKbXc9PQotLS0tLUVORCBD\n    ...\n```\n\nAlternatively it is also possible to specify the CA certificates directly inside\nthe configuration file, using the `tls` section. For example:\n\n```yaml\ntls:\n  caCerts: |-\n    -----BEGIN CERTIFICATE-----\n    MIIFgzCCA2ugAwIBAgIPXZONMGc2yAYdGsdUhGkHMA0GCSqGSIb3DQEBCwUAMDsx\n    CzAJBgNVBAYTAkVTMREwDwYDVQQKDAhGTk1ULVJDTTEZMBcGA1UECwwQQUMgUkFJ\n    ...\n    -----END CERTIFICATE-----\n```\n\nThey can also be specified indirectly, putting the name of a PEM file in the\n`caFile` parameter:\n\n```yaml\ntls:\n  caFile: /etc/autoheal/my-ca.pem\n```\n\nThe `insecure` parameter controls whether to use an insecure connection to the\nAWX server. 
If the connection is insecure then the TLS certificate of the server will not\nbe verified. It should always be set to `false` (the default) in production\nenvironments.\n\nThe `project` parameter is the name of the AWX project that contains the job\ntemplates that will be used to run the playbooks.\n\n### Throttling configuration\n\nThe `throttling` section of the configuration describes how to throttle the\nexecution of healing actions. This is intended to prevent _healing storms_ that\ncould happen if the same alerts are sent repeatedly to the service.\n\nThe `interval` parameter controls the time that the service will remember an\nexecuted healing action. If an action is triggered more than once in the given\ninterval it will be executed only the first time. Subsequent triggers will be\nlogged and ignored (see `autoheal.yml` for an example).\n\nThe default interval value is one hour. Setting the `interval` parameter to 0\nwill *disable* throttling altogether.\n\nNote that for throttling purposes actions are considered the same if they\nhave exactly the same fields with exactly the same values *after* processing\nthem as templates. For example, an action defined like this:\n\n```yaml\nawxJob:\n  template: \"Restart {{ $labels.service }}\"\n```\n\nwill have different values for the `template` field if the triggering alerts\nhave different `service` labels.\n\nThe auto-heal service periodically checks the status of the active jobs that\nit has triggered on the AWX server.\nThe `jobStatusCheckInterval` parameter determines how often to perform this check.\nIt is optional, and the default is `5m` (every 5 minutes).\n\n### Healing rules configuration\n\nThe second important section of the configuration file is `rules`. It contains\nthe list of _healing rules_ used by the auto-heal service to decide which action\nto run for each received alert. 
For example:\n\n```yaml\nrules:\n\n- metadata:\n    name: start-node\n  labels:\n    alertname: \"NodeDown\"\n  awxJob:\n    template: \"Start node\"\n    extraVars: \n      node: \"{{ $labels.instance }}\"\n\n- metadata:\n    name: start-service\n  labels:\n    alertname: \".*Down\"\n    service: \".*\"\n  awxJob:\n    template: \"Start service\"\n```\n\nThe above example contains two _healing rules_. The first rule will be\nexecuted when the alert received contains a label named `alertname` with\na value that matches the regular expression `NodeDown`.\n\nThe second rule will be executed when the alert received contains the\nlabels `alertname` *and* `service`, matching the regular expressions\n`.*Down` and `.*` respectively.\n\nThe `metadata` parameter of each rule is used to specify the `name` of\nthe rule, which is used by the auto-heal service to reference it in log\nmessages and in metrics.\n\nThe `labels` and `annotations` parameters of a rule are maps of strings\nused to specify the labels and annotations that the alerts should\ncontain in order to match the rule. The keys of these maps are the names\nof the labels or annotations. The values of these maps are regular\nexpressions that the values of those labels or annotations should match.\n\nThe `awxJob` parameter indicates which job template should be executed\nwhen an alert matches the rule.\n\nThe `template` parameter is the name of the AWX job template.\n\nThe `extraVars` parameter is optional, and if specified it is used to\npass additional variables to the playbook, like with the `--extra-vars`\noption of the `ansible-playbook` command.\n\nRegardless of the `extraVars` setting, the content of the alert that \ntriggered the AWX job will be passed to the playbook as part of \n`extraVars`, in a variable named `alert`.\n\nThe `limit` parameter is optional, and if specified it is passed to\nAWX to constrain the list of hosts managed or affected by the \nplaybook. 
Multiple patterns can be separated by colons (`:`). \nAs with core Ansible, `a:b` means \"in group a or b\", `a:b:\u0026c` means \n\"in a or b but must be in c\", and `a:!b` means \"in a, and definitely \nnot in b\".\n\n\u003e Note that in order to be able to use the `extraVars` and `limit`\n\u003e mechanisms the AWX job template should have the \n\u003e _Prompt on launch_ box checked, otherwise the variables passed \n\u003e will be ignored.\n\nThe values of all the parameters inside `awxJob` are processed as [Go\ntemplates](https://golang.org/pkg/text/template) before executing the\njob. These templates receive the details of the alert inside the\n`$labels` and `$annotations` variables. For example, to dynamically\ngenerate the name of the job template to execute from the value of\nthe `template` annotation of the alert:\n\n```yaml\nawxJob:\n  template: \"{{ $annotations.template }}\"\n```\n\nOr to pass a variable `node` to the playbook, calculated from the\n`instance` label:\n\n```yaml\nawxJob:\n  template: \"My template\"\n  extraVars: \n    node: \"{{ $labels.instance }}\"\n```\n\nOr to limit execution to a host, calculated from the `instance` label:\n\n```yaml\nawxJob:\n  template: \"My template\"\n  limit: \"{{ $labels.instance }}\"\n```\n\n### Alertmanager Configuration\n\nFollow the upstream [Prometheus Alertmanager documentation](https://prometheus.io/docs/alerting/configuration/)\nto configure alerts.\n\nFor reference, here is an example Alertmanager configuration that sends\nan alert to the auto-heal service with authentication. 
This example assumes\nautoheal and the Alertmanager are running on the same OpenShift cluster,\nand requires Alertmanager 0.15 or newer.\n\n```yaml\nglobal:\n  resolve_timeout: 1m\n\nroute:\n  group_wait: 1s\n  group_interval: 1s\n  repeat_interval: 5m\n  receiver: autoheal\n  routes:\n  - match:\n      alertname: DeadMansSwitch\n    repeat_interval: 5m\n    receiver: autoheal \nreceivers:\n- name: default\n- name: deadmansswitch\n- name: autoheal\n  webhook_configs:\n  - url: https://autoheal.openshift-autoheal.svc/alerts\n    http_config:\n      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token\n      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/service-ca.crt\n```\n\nWhen using the cluster-monitoring-operator, save the configuration as\n`alertmanager.yaml` and use this command to apply it:\n\n```\n$ oc create secret generic alertmanager-main \\\n   --namespace=openshift-monitoring \\\n   --from-literal=alertmanager.yaml=\"$(\u003c alertmanager.yaml)\" \\\n   --dry-run -oyaml \\\n   | \\\n   oc replace secret \\\n   --namespace=openshift-monitoring \\\n   --filename=-\n```\n\n## Building\n\nTo build the binary run this command:\n\n```\n$ make\n```\n\nTo build the RPM and the images, run this command:\n\n```\n$ make build-images\n```\n\n## Testing\n\nTo run the automated tests of the project run this command:\n\n```\n$ make check\n```\n\nTo manually test the service, without having to have a running Prometheus alert\nmanager that generates the alert notifications, you can use the `*-alert.json`\nfiles that are inside the `examples` directory. For example, to simulate the\n`NodeDown` alert start the server and then use [curl](https://curl.haxx.se) to\nsend the alert notification:\n\n```\n$ autoheal server --config-file=my.yml\n$ curl --data @examples/node-down-alert.json http://localhost:9099/alerts\n```\n\n## Installing\n\nTo install the service to an _OpenShift_ cluster use the template contained in\nthe `template.yml` file. 
This template requires at the very minimum the address\nand the credentials to connect to the AWX or Ansible Tower server. See the\n`template.sh` script for an example of how to use it.\n\n## Development\n\nIf needed for development, we can run the server without an OpenShift cluster,\nsimulating OpenShift's alert manager using curl commands.\n\nThe `examples` directory contains example alerts and a configuration file\nthat does not require a connection to a working OpenShift cluster.\n\nTo run autoheal in dev mode (without a running OpenShift cluster) developers\ncan use the dev config file in the `examples` directory:\n\n```\n$ make build\n$ make run-dev\n```\n\nTo simulate alerts firing, developers can use the example alerts:\n\n```\n$ curl --data @examples/node-down-alert.json http://localhost:9099/alerts\n```\n\nWhen developing features that do not require an AWX server, developers can use\nthe mock-awx server from the `examples` directory. The mock server will listen\non port 8080.\n\n```\n$ cd examples/mock-awx\n$ go run mock-awx.go\n```\n","funding_links":[],"categories":["Go"],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fopenshift%2Fautoheal","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fopenshift%2Fautoheal","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fopenshift%2Fautoheal/lists"}