https://github.com/linkerd/linkerd-failover
Linkerd Failover Operator
https://github.com/linkerd/linkerd-failover
Last synced: 10 months ago
JSON representation
Linkerd Failover Operator
- Host: GitHub
- URL: https://github.com/linkerd/linkerd-failover
- Owner: linkerd
- Created: 2022-02-03T21:04:21.000Z (over 4 years ago)
- Default Branch: main
- Last Pushed: 2025-03-20T04:57:55.000Z (about 1 year ago)
- Last Synced: 2025-04-17T08:59:38.548Z (about 1 year ago)
- Language: Rust
- Size: 726 KB
- Stars: 37
- Watchers: 12
- Forks: 8
- Open Issues: 6
-
Metadata Files:
- Readme: README.md
- Changelog: CHANGES.md
- Codeowners: .github/CODEOWNERS.yml
Awesome Lists containing this project
README
# linkerd-failover
Linkerd-failover is a Linkerd extension whose goal is to provide a failover
mechanism so that a service that exists in multiple clusters may continue to
operate as long as the service is available in any cluster.
The mechanism relies on Linkerd’s traffic-splitting functionality by providing
an operator to alter the backend services' weights in real time depending on
their readiness.
**Note:** if you find issues in this extension, please file an issue against
the [main Linkerd2 repo](https://github.com/linkerd/linkerd2/issues).
## Table of contents
- [Issue Tracking](#issue-tracking)
- [Requirements](#requirements)
- [Configuration](#configuration)
- [Installation](#installation)
- [Example](#example)
- [Implementation details](#implementation-details)
- [Failover criteria](#failover-criteria)
- [Failover logic](#failover-criteria)
## Issue Tracking
Issues are disabled in this repository since we track issues with this
extension in the [main Linkerd2
repo](https://github.com/linkerd/linkerd2/issues). The rationale here is
simply that we've found over time that having issues scattered across multiple
repos makes it too easy to lose track of things.
## Requirements
- Linkerd `stable-2.11.2` or later
- Linkerd-smi `v0.2.0` or later (required if using Linkerd `stable-2.12.0` or
later)
## Configuration
The following Helm values are available:
- `selector`: determines which `TrafficSplit` instances to consider for
failover. It defaults to `failover.linkerd.io/controlled-by={{.Release.Name}}`
(the value refers to the release name used in `helm install`).
- `logLevel`, `logFormat`: for configuring the operator's logging.
## Installation
Note the SMI extension CRD is included in Linkerd 2.11.x so you can skip this
step for those versions. As of version `stable-2.12.0`, it's no longer included
so you need to install it as described here.
The SMI extension and the operator are to be installed in the local cluster
(where the clients consuming the service are located).
Linkerd-smi installation:
```console
helm repo add linkerd-smi https://linkerd.github.io/linkerd-smi
helm repo up
helm install linkerd-smi -n linkerd-smi --create-namespace linkerd-smi/linkerd-smi
```
Linkerd-failover installation:
```console
# For edge releases
helm repo add linkerd-edge https://helm.linkerd.io/edge
helm repo up
helm install linkerd-failover -n linkerd-failover --create-namespace --devel linkerd-edge/linkerd-failover
# For stable releases
helm repo add linkerd https://helm.linkerd.io/stable
helm repo up
helm install linkerd-failover -n linkerd-failover --create-namespace linkerd/linkerd-failover
```
## Example
The following `TrafficSplit` serves as the initial state for a failover setup.
Clients should send requests to the apex service `sample-svc`. The primary
service that will serve these requests is declared through the
`failover.linkerd.io/primary-service` annotation, `sample-svc` in this case. If
the `TrafficSplit` does not include this annotation, it will treat the first
backend as the primary service.
When `sample-svc` starts failing, the weights will be switched over the other
backends.
Note that the failover services can be located in the local cluster, or they can
point to mirror services backed by services in other clusters (through Linkerd's
multicluster functionality).
```yaml
apiVersion: split.smi-spec.io/v1alpha2
kind: TrafficSplit
metadata:
name: sample-svc
annotations:
failover.linkerd.io/primary-service: sample-svc
labels:
failover.linkerd.io/controlled-by: linkerd-failover
spec:
service: sample-svc
backends:
- service: sample-svc
weight: 1
- service: sample-svc-central1
weight: 0
- service: sample-svc-east1
weight: 0
- service: sample-svc-east2
weight: 0
- service: sample-svc-asia1
weight: 0
```
## Implementation details
### Failover criteria
The failover criteria is readiness failures on the targeted Pods. This is
directly reflected on the Endpoints object associated with those Pods: only when
Pods are ready, does the `addresses` field of the relevant Endpoints get
populated.
### Failover logic
The following describes the logic used to change the `TrafficSplit` weights:
- Whenever the primary backend is ready, all the weight is set to it, setting
the weights for all the secondary backends to zero.
- Whenever the primary backend is not ready, the following rules apply only if
there is at least one secondary backend that is ready:
- The primary backend’s weight is set to zero.
- The weight is distributed equally among all the secondary backends that
are ready.
- Whenever a secondary backend changes its readiness, the weight is
redistributed among all the secondary backends that are ready
- Whenever both the primary and secondaries are unavailable, the connection will
fail at the client-side, as expected.