{"id":13562108,"url":"https://github.com/planetlabs/draino","last_synced_at":"2025-12-29T23:28:40.448Z","repository":{"id":33073469,"uuid":"145902689","full_name":"planetlabs/draino","owner":"planetlabs","description":"Automatically cordon and drain Kubernetes nodes based on node conditions","archived":false,"fork":false,"pushed_at":"2024-03-26T15:33:02.000Z","size":183,"stargazers_count":630,"open_issues_count":35,"forks_count":84,"subscribers_count":53,"default_branch":"master","last_synced_at":"2024-11-15T00:25:00.680Z","etag":null,"topics":["autoremediation","drain","kubernetes","kubernetes-node"],"latest_commit_sha":null,"homepage":null,"language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/planetlabs.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-08-23T20:17:32.000Z","updated_at":"2024-11-14T18:30:15.000Z","dependencies_parsed_at":"2024-05-28T17:13:40.845Z","dependency_job_id":"14e462eb-0780-48d0-bbff-8edceeaf9bc9","html_url":"https://github.com/planetlabs/draino","commit_stats":null,"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/planetlabs%2Fdraino","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/planetlabs%2Fdraino/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/planetlabs%2Fdraino/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/planetlabs%2Fdraino/manifests","owner_url":"https://repos.ecosyste.ms/api/v1
/hosts/GitHub/owners/planetlabs","download_url":"https://codeload.github.com/planetlabs/draino/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247056799,"owners_count":20876463,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["autoremediation","drain","kubernetes","kubernetes-node"],"created_at":"2024-08-01T13:01:04.639Z","updated_at":"2025-12-29T23:28:40.394Z","avatar_url":"https://github.com/planetlabs.png","language":"Go","readme":"# draino [![Docker Pulls](https://img.shields.io/docker/pulls/planetlabs/draino.svg)](https://hub.docker.com/r/planetlabs/draino/) [![Godoc](https://img.shields.io/badge/godoc-reference-blue.svg)](https://godoc.org/github.com/planetlabs/draino) [![Travis](https://img.shields.io/travis/com/planetlabs/draino.svg?maxAge=300)](https://travis-ci.com/planetlabs/draino/) [![Codecov](https://img.shields.io/codecov/c/github/planetlabs/draino.svg?maxAge=3600)](https://codecov.io/gh/planetlabs/draino/)\nDraino automatically drains Kubernetes nodes based on labels and node\nconditions. 
Nodes that match _all_ of the supplied labels and _any_ of the
supplied node conditions will be cordoned immediately and drained after a
configurable `drain-buffer` time.

Draino is intended for use alongside the Kubernetes [Node Problem Detector](https://github.com/kubernetes/node-problem-detector)
and [Cluster Autoscaler](https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler).
The Node Problem Detector can set a node condition when it detects something
wrong with a node - for instance by watching node logs or running a script. The
Cluster Autoscaler can be configured to delete nodes that are underutilised.
Adding Draino to the mix enables autoremediation:

1. The Node Problem Detector detects a permanent node problem and sets the
   corresponding node condition.
2. Draino notices the node condition. It immediately cordons the node to prevent
   new pods being scheduled there, and schedules a drain of the node.
3. Once the node has been drained the Cluster Autoscaler will consider it
   underutilised. It will be eligible for scale down (i.e. termination) by the
   Autoscaler after a configurable period of time.

## Usage
```
$ docker run planetlabs/draino /draino --help
usage: draino [<flags>] <node-conditions>...

Automatically cordons and drains nodes that match the supplied conditions.

Flags:
      --help                     Show context-sensitive help (also try --help-long and --help-man).
  -d, --debug                    Run with debug logging.
      --listen=":10002"          Address at which to expose /metrics and /healthz.
      --kubeconfig=KUBECONFIG    Path to kubeconfig file. Leave unset to use in-cluster config.
      --master=MASTER            Address of Kubernetes API server. Leave unset to use in-cluster config.
      --dry-run                  Emit an event without cordoning or draining matching nodes.
      --max-grace-period=8m0s    Maximum time evicted pods will be given to terminate gracefully.
      --eviction-headroom=30s    Additional time to wait after a pod's termination grace period for it to have been deleted.
      --drain-buffer=10m0s       Minimum time between starting each drain. Nodes are always cordoned immediately.
      --node-label="foo=bar"     (DEPRECATED) Only nodes with this label will be eligible for cordoning and draining. May be specified multiple times.
      --node-label-expr="metadata.labels.foo == 'bar'"
                                 An expr string (https://github.com/antonmedv/expr) that must return true or false. See `nodefilters_test.go` for examples.
      --namespace="kube-system"  Namespace used to create leader election lock object.
      --leader-election-lease-duration=15s
                                 Lease duration for leader election.
      --leader-election-renew-deadline=10s
                                 Leader election renew deadline.
      --leader-election-retry-period=2s
                                 Leader election retry period.
      --skip-drain               Whether to skip draining nodes after cordoning.
      --evict-daemonset-pods     Evict pods that were created by an extant DaemonSet.
      --evict-emptydir-pods      Evict pods with local storage, i.e. with emptyDir volumes.
      --evict-unreplicated-pods  Evict pods that were not created by a replication controller.
      --protected-pod-annotation=KEY[=VALUE] ...
                                 Protect pods with this annotation from eviction. May be specified multiple times.

Args:
  <node-conditions>  Nodes for which any of these conditions are true will be cordoned and drained.
```

### Labels and Label Expressions

Draino allows filtering the eligible set of nodes using `--node-label` and `--node-label-expr`.
The original `--node-label` flag is limited to the boolean AND of the specified labels. To express more complex predicates, the newer `--node-label-expr`
flag allows mixed OR/AND/NOT logic via https://github.com/antonmedv/expr.

An example of `--node-label-expr`:

```
(metadata.labels.region == 'us-west-1' && metadata.labels.app == 'nginx') || (metadata.labels.region == 'us-west-2' && metadata.labels.app == 'nginx')
```

## Considerations
Keep the following in mind before deploying Draino:

* Always run Draino in `--dry-run` mode first to ensure it would drain the nodes
  you expect it to. In dry-run mode Draino will emit logs, metrics, and events
  but will not actually cordon or drain nodes.
* Draino immediately cordons nodes that match its configured labels and node
  conditions, but will wait a configurable amount of time (10 minutes by default)
  between draining nodes, i.e. if two nodes begin exhibiting a node condition
  simultaneously, one node will be drained immediately and the other in 10 minutes.
* Draino considers a drain to have failed if at least one pod eviction triggered
  by that drain fails. If Draino fails to evict two of five pods it will consider
  the drain to have failed, but the remaining three pods will always be evicted.
* Pods that can't be evicted by the Cluster Autoscaler won't be evicted by Draino.
  See the `"cluster-autoscaler.kubernetes.io/safe-to-evict": "false"` annotation in the
  [cluster-autoscaler documentation](https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#what-types-of-pods-can-prevent-ca-from-removing-a-node).

## Deployment

Draino is automatically built from master and pushed to [Docker Hub](https://hub.docker.com/r/planetlabs/draino/).
Builds are tagged `planetlabs/draino:$(git rev-parse --short HEAD)`.

**Note:** As of September 2020 we no longer publish `planetlabs/draino:latest`
in order to encourage explicit and pinned releases.

An [example Kubernetes deployment manifest](manifest.yml) is provided.

## Monitoring

### Metrics
Draino provides a simple healthcheck endpoint at `/healthz` and Prometheus
metrics at `/metrics`. The following metrics exist:

```bash
$ kubectl -n kube-system exec -it ${DRAINO_POD} -- apk add curl
$ kubectl -n kube-system exec -it ${DRAINO_POD} -- curl http://localhost:10002/metrics
# HELP draino_cordoned_nodes_total Number of nodes cordoned.
# TYPE draino_cordoned_nodes_total counter
draino_cordoned_nodes_total{result="succeeded"} 2
draino_cordoned_nodes_total{result="failed"} 1
# HELP draino_drained_nodes_total Number of nodes drained.
# TYPE draino_drained_nodes_total counter
draino_drained_nodes_total{result="succeeded"} 1
draino_drained_nodes_total{result="failed"} 1
```

### Events
Draino generates an event for every relevant step of the eviction process. Here is an example that ends with reason `DrainFailed`.
When everything is fine, the last event for a given node will have reason `DrainSucceeded`.
```
> kubectl get events -n default | grep -E '(^LAST|draino)'

LAST SEEN   FIRST SEEN   COUNT   NAME                          KIND   TYPE      REASON             SOURCE   MESSAGE
5m          5m           1       node-demo.15fe0c35f0b4bd10    Node   Warning   CordonStarting     draino   Cordoning node
5m          5m           1       node-demo.15fe0c35fe3386d8    Node   Warning   CordonSucceeded    draino   Cordoned node
5m          5m           1       node-demo.15fe0c360bd516f8    Node   Warning   DrainScheduled     draino   Will drain node after 2020-03-20T16:19:14.91905+01:00
5m          5m           1       node-demo.15fe0c3852986fe8    Node   Warning   DrainStarting      draino   Draining node
4m          4m           1       node-demo.15fe0c48d010ecb0    Node   Warning   DrainFailed        draino   Draining failed: timed out waiting for evictions to complete: timed out
```

### Conditions
When a drain is scheduled, in addition to the event, a condition is added to the status of the node. This condition holds information about the beginning and the end of the drain procedure.
This is something that you can see by describing the node resource:

```
> kubectl describe node {node-name}
......
Unschedulable:      true
Conditions:
  Type                  Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----                  ------  -----------------                 ------------------                ------                       -------
  OutOfDisk             False   Fri, 20 Mar 2020 15:52:41 +0100   Fri, 20 Mar 2020 14:01:59 +0100   KubeletHasSufficientDisk     kubelet has sufficient disk space available
  MemoryPressure        False   Fri, 20 Mar 2020 15:52:41 +0100   Fri, 20 Mar 2020 14:01:59 +0100   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure          False   Fri, 20 Mar 2020 15:52:41 +0100   Fri, 20 Mar 2020 14:01:59 +0100   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure           False   Fri, 20 Mar 2020 15:52:41 +0100   Fri, 20 Mar 2020 14:01:59 +0100   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready                 True    Fri, 20 Mar 2020 15:52:41 +0100   Fri, 20 Mar 2020 14:02:09 +0100   KubeletReady                 kubelet is posting ready status. AppArmor enabled
  ec2-host-retirement   True    Fri, 20 Mar 2020 15:23:26 +0100   Fri, 20 Mar 2020 15:23:26 +0100   NodeProblemDetector          Condition added with tooling
  DrainScheduled        True    Fri, 20 Mar 2020 15:50:50 +0100   Fri, 20 Mar 2020 15:23:26 +0100   Draino                       Drain activity scheduled 2020-03-20T15:50:34+01:00
```

Later, when the drain activity completes, the condition is amended to let you know whether it succeeded or failed:

```
> kubectl describe node {node-name}
......
Unschedulable:      true
Conditions:
  Type                  Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----                  ------  -----------------                 ------------------                ------                       -------
  OutOfDisk             False   Fri, 20 Mar 2020 15:52:41 +0100   Fri, 20 Mar 2020 14:01:59 +0100   KubeletHasSufficientDisk     kubelet has sufficient disk space available
  MemoryPressure        False   Fri, 20 Mar 2020 15:52:41 +0100   Fri, 20 Mar 2020 14:01:59 +0100   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure          False   Fri, 20 Mar 2020 15:52:41 +0100   Fri, 20 Mar 2020 14:01:59 +0100   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure           False   Fri, 20 Mar 2020 15:52:41 +0100   Fri, 20 Mar 2020 14:01:59 +0100   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready                 True    Fri, 20 Mar 2020 15:52:41 +0100   Fri, 20 Mar 2020 14:02:09 +0100   KubeletReady                 kubelet is posting ready status. AppArmor enabled
  ec2-host-retirement   True    Fri, 20 Mar 2020 15:23:26 +0100   Fri, 20 Mar 2020 15:23:26 +0100   NodeProblemDetector          Condition added with tooling
  DrainScheduled        True    Fri, 20 Mar 2020 15:50:50 +0100   Fri, 20 Mar 2020 15:23:26 +0100   Draino                       Drain activity scheduled 2020-03-20T15:50:34+01:00 | Completed: 2020-03-20T15:50:50+01:00
```

If the drain had failed, the condition line would look like:
```
  DrainScheduled        True    Fri, 20 Mar 2020 15:50:50 +0100   Fri, 20 Mar 2020 15:23:26 +0100   Draino                       Drain activity scheduled 2020-03-20T15:50:34+01:00 | Failed: 2020-03-20T15:55:50+01:00
```

## Retrying drain

In some cases the drain activity may fail because of a restrictive PodDisruptionBudget or some other reason external to Draino. The node remains cordoned and the drain condition
is marked `Failed`. If you want to schedule a new drain attempt on that node, add the annotation `draino/drain-retry: true`. A new drain schedule will be created. Note that the annotation is not removed, so it will keep triggering retries if the drain fails again.

```
kubectl annotate node {node-name} draino/drain-retry=true
```

## Modes

### Dry Run
Draino can be run in dry-run mode using the `--dry-run` flag.

### Cordon Only
Draino can also optionally be run in a mode where nodes are only cordoned, not drained. This can be achieved by using the `--skip-drain` flag.
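## Appendix: the matching rule, sketched

The eligibility rule described at the top of this README - a node is cordoned when it matches _all_ of the supplied labels and _any_ of the supplied node conditions - can be sketched as a small Go predicate. This is a simplified, hypothetical illustration, not Draino's actual implementation; the `node` type and the label/condition values are made up for the example (Draino itself operates on Kubernetes API objects and supports `--node-label-expr` as well):

```go
package main

import "fmt"

// node is a minimal stand-in for a Kubernetes Node: its labels and the
// names of the conditions currently set to True. Hypothetical type, for
// illustration only.
type node struct {
	labels     map[string]string
	conditions []string
}

// shouldCordon reports whether a node matches ALL of the required labels
// (boolean AND, as with repeated --node-label flags) and ANY of the
// watched node conditions (the <node-conditions> arguments).
func shouldCordon(n node, requiredLabels map[string]string, watched []string) bool {
	for k, v := range requiredLabels {
		if n.labels[k] != v {
			return false // every supplied label must match
		}
	}
	for _, w := range watched {
		for _, c := range n.conditions {
			if c == w {
				return true // any one watched condition suffices
			}
		}
	}
	return false
}

func main() {
	n := node{
		labels:     map[string]string{"draino-enabled": "true"},
		conditions: []string{"KernelDeadlock"},
	}
	// Matches the label and one of the two watched conditions.
	fmt.Println(shouldCordon(n,
		map[string]string{"draino-enabled": "true"},
		[]string{"KernelDeadlock", "OutOfDisk"}))
}
```

Note the asymmetry: a node with the right labels but no watched condition is left alone, while a node with a watched condition but a missing label is also left alone - both halves of the predicate must hold.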