Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

https://github.com/kubernetes-csi/external-provisioner

Sidecar container that watches Kubernetes PersistentVolumeClaim objects and triggers CreateVolume/DeleteVolume against a CSI endpoint
https://github.com/kubernetes-csi/external-provisioner

k8s-sig-storage

Last synced: 21 days ago
JSON representation

Sidecar container that watches Kubernetes PersistentVolumeClaim objects and triggers CreateVolume/DeleteVolume against a CSI endpoint

Host: GitHub
URL: https://github.com/kubernetes-csi/external-provisioner
Owner: kubernetes-csi
License: apache-2.0
Created: 2017-10-27T15:24:12.000Z (over 6 years ago)
Default Branch: master
Last Pushed: 2024-06-04T14:54:00.000Z (23 days ago)
Last Synced: 2024-06-05T13:58:51.830Z (22 days ago)
Topics: k8s-sig-storage
Language: Go
Homepage:
Size: 101 MB
Stars: 326
Watchers: 30
Forks: 312
Open Issues: 29
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG/CHANGELOG-1.1.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Code of conduct: code-of-conduct.md
- Security: SECURITY_CONTACTS

Lists

awesome-stars - external-provisioner - csi | 326 | (Go)

README

# CSI provisioner

The external-provisioner is a sidecar container that dynamically provisions volumes by calling `CreateVolume` and `DeleteVolume` functions of CSI drivers. It is necessary because internal persistent volume controller running in Kubernetes controller-manager does not have any direct interfaces to CSI drivers.

## Overview
The external-provisioner is an external controller that monitors `PersistentVolumeClaim` objects created by user and creates/deletes volumes for them.
The [Kubernetes Container Storage Interface (CSI) Documentation](https://kubernetes-csi.github.io/docs/) explains how to develop, deploy, and test a Container Storage Interface (CSI) driver on Kubernetes.

## Compatibility

This information reflects the head of this branch.

| Compatible with CSI Version | Container Image | [Min K8s Version](https://kubernetes-csi.github.io/docs/kubernetes-compatibility.html#minimum-version) | [Recommended K8s Version](https://kubernetes-csi.github.io/docs/kubernetes-compatibility.html#recommended-version) |
| ------------------------------------------------------------------------------------------ | -------------------------------| --------------- | ------------- |
| [CSI Spec v1.7.0](https://github.com/container-storage-interface/spec/releases/tag/v1.7.0) | registry.k8s.io/sig-storage/csi-provisioner | 1.20 | 1.27 |

## Feature status

Various external-provisioner releases come with different alpha / beta features. Check `--help` output for alpha/beta features in each release.

Following table reflects the head of this branch.

| Feature | Status | Default | Description | Provisioner Feature Gate Required |
| -------------- | ------- | ------- |----------------------------------------------------------------------------------------------------------------------------------------------------------------------| --------------------------------- |
| Snapshots | GA | On | [Snapshots and Restore](https://kubernetes-csi.github.io/docs/snapshot-restore-feature.html). | No |
| CSIMigration | GA | On | [Migrating in-tree volume plugins to CSI](https://kubernetes.io/docs/concepts/storage/volumes/#csi-migration). | No |
| CSIStorageCapacity | GA | On | Publish [capacity information](https://kubernetes.io/docs/concepts/storage/volumes/#storage-capacity) for the Kubernetes scheduler. | No |
| ReadWriteOncePod | Beta | On | [Single pod access mode for PersistentVolumes](https://kubernetes.io/docs/concepts/storage/persistent-volumes/#access-modes). | No |
| CSINodeExpandSecret | GA | On | [CSI Node expansion secret](https://github.com/kubernetes/enhancements/tree/master/keps/sig-storage/3107-csi-nodeexpandsecret) | No |
| HonorPVReclaimPolicy| Beta | On | [Honor the PV reclaim policy](https://github.com/kubernetes/enhancements/tree/master/keps/sig-storage/2644-honor-pv-reclaim-policy) | No |
| PreventVolumeModeConversion | Beta |On | [Prevent unauthorized conversion of source volume mode](https://github.com/kubernetes/enhancements/tree/master/keps/sig-storage/3141-prevent-volume-mode-conversion) | `--prevent-volume-mode-conversion` (No in-tree feature gate) |
| CrossNamespaceVolumeDataSource | Alpha |Off | [Cross-namespace volume data source](https://github.com/kubernetes/enhancements/tree/master/keps/sig-storage/3294-provision-volumes-from-cross-namespace-snapshots) | `--feature-gates=CrossNamespaceVolumeDataSource=true` |

All other external-provisioner features and the external-provisioner itself is considered GA and fully supported.

## Usage

It is necessary to create a new service account and give it enough privileges to run the external-provisioner, see `deploy/kubernetes/rbac.yaml`. The provisioner is then deployed as single Deployment as illustrated below:

```sh
kubectl create deploy/kubernetes/deployment.yaml
```

The external-provisioner may run in the same pod with other external CSI controllers such as the external-attacher, external-snapshotter and/or external-resizer.

Note that the external-provisioner does not scale with more replicas. Only one external-provisioner is elected as leader and running. The others are waiting for the leader to die. They re-elect a new active leader in ~15 seconds after death of the old leader.

### Command line options

#### Recommended optional arguments
* `--csi-address `: This is the path to the CSI driver socket inside the pod that the external-provisioner container will use to issue CSI operations (`/run/csi/socket` is used by default).

* `--leader-election`: Enables leader election. This is mandatory when there are multiple replicas of the same external-provisioner running for one CSI driver. Only one of them may be active (=leader). A new leader will be re-elected when current leader dies or becomes unresponsive for ~15 seconds.

* `--leader-election-namespace`: Namespace where leader election object will be created. It is recommended that this parameter is populated from Kubernetes DownwardAPI with the namespace where the external-provisioner runs in.

* `--leader-election-lease-duration `: Duration, in seconds, that non-leader candidates will wait to force acquire leadership. Defaults to 15 seconds.

* `--leader-election-renew-deadline `: Duration, in seconds, that the acting leader will retry refreshing leadership before giving up. Defaults to 10 seconds.

* `--leader-election-retry-period `: Duration, in seconds, the LeaderElector clients should wait between tries of actions. Defaults to 5 seconds.

* `--timeout `: Timeout of all calls to CSI driver. It should be set to value that accommodates majority of `ControllerCreateVolume` and `ControllerDeleteVolume` calls. See [CSI error and timeout handling](#csi-error-and-timeout-handling) for details. 15 seconds is used by default.

* `--retry-interval-start `: Initial retry interval of failed provisioning or deletion. It doubles with each failure, up to `--retry-interval-max` and then it stops increasing. Default value is 1 second. See [CSI error and timeout handling](#csi-error-and-timeout-handling) for details.

* `--retry-interval-max `: Maximum retry interval of failed provisioning or deletion. Default value is 5 minutes. See [CSI error and timeout handling](#csi-error-and-timeout-handling) for details.

* `--worker-threads `: Number of simultaneously running `ControllerCreateVolume` and `ControllerDeleteVolume` operations. Default value is `100`.

* `--kube-api-qps `: The number of requests per second sent by a Kubernetes client to the Kubernetes API server. Defaults to `5.0`.

* `--kube-api-burst `: The number of requests to the Kubernetes API server, exceeding the QPS, that can be sent at any given time. Defaults to `10`.

* `--cloning-protection-threads `: Number of simultaneously running threads, handling cloning finalizer removal. Defaults to `1`.

* `--http-endpoint`: The TCP network address where the HTTP server for diagnostics, including metrics and leader election health check, will listen (example: `:8080` which corresponds to port 8080 on local host). The default is empty string, which means the server is disabled.

* `--metrics-path`: The HTTP path where prometheus metrics will be exposed. Default is `/metrics`.

* `--extra-create-metadata`: Enables the injection of extra PVC and PV metadata as parameters when calling `CreateVolume` on the driver (keys: "csi.storage.k8s.io/pvc/name", "csi.storage.k8s.io/pvc/namespace", "csi.storage.k8s.io/pv/name")

* `controller-publish-readonly`: This option enables PV to be marked as readonly at controller publish volume call if PVC accessmode has been set to ROX. Defaults to `false`.

* `--enable-pprof`: Enable pprof profiling on the TCP network address specified by `--http-endpoint`. The HTTP path is `/debug/pprof/`.

##### Storage capacity arguments

See the [storage capacity section](#capacity-support) below for details.

* `--enable-capacity`: This enables producing CSIStorageCapacity objects with capacity information from the driver's GetCapacity call. The default is to not produce CSIStorageCapacity objects.

* `--capacity-ownerref-level `: The level indicates the number of objects that need to be traversed starting from the pod identified by the POD_NAME and NAMESPACE environment variables to reach the owning object for CSIStorageCapacity objects: 0 for the pod itself, 1 for a StatefulSet and DaemonSet, 2 for a Deployment, etc. Defaults to `1` (= StatefulSet). Ownership is optional and can be disabled with -1.

* `--capacity-threads `: Number of simultaneously running threads, handling CSIStorageCapacity objects. Defaults to `1`.

* `--capacity-poll-interval `: How long the external-provisioner waits before checking for storage capacity changes. Defaults to `1m`.

* `--capacity-for-immediate-binding `: Enables producing capacity information for storage classes with immediate binding. Not needed for the Kubernetes scheduler, maybe useful for other consumers or for debugging. Defaults to `false`.

##### Distributed provisioning

* `--node-deployment`: Enables deploying the external-provisioner together with a CSI driver on nodes to manage node-local volumes. Off by default.

* `--node-deployment-immediate-binding`: Determines whether immediate binding is supported when deployed on each node. Enabled by default, use `--node-deployment-immediate-binding=false` if not desired. Disabling it may be useful for example when a custom controller will select nodes for PVCs.

* `--node-deployment-base-delay`: Determines how long the external-provisioner sleeps initially before trying to own a PVC with immediate binding. Defaults to 20 seconds.

* `--node-deployment-max-delay`: Determines how long the external-provisioner sleeps at most before trying to own a PVC with immediate binding. Defaults to 60 seconds.

#### Other recognized arguments
* `--feature-gates `: A set of comma separated `=` pairs that describe feature gates for alpha/experimental features. See [list of features](#feature-status) or `--help` output for list of recognized features. Example: `--feature-gates Topology=true` to enable Topology feature that's disabled by default.

* `--strict-topology`: This controls what topology information is passed to `CreateVolumeRequest.AccessibilityRequirements` in case of delayed binding. See [the table below](#topology-support) for an explanation how this option changes the result. This option has no effect if either `Topology` feature is disabled or `Immediate` volume binding mode is used.

* `--immediate-topology`: This controls what topology information is passed to `CreateVolumeRequest.AccessibilityRequirements` in case of immediate binding. See [the table below](#topology-support) for an explanation how this option changes the result. This option has no effect if either `Topology` feature is disabled or `WaitForFirstConsumer` (= delayed) volume binding mode is used. The default is true, so use `--immediate-topology=false` to disable it. It should not be disabled if the CSI driver might create volumes in a topology segment that is not accessible in the cluster. Such a driver should use the topology information to create new volumes where they can be accessed.

* `--kubeconfig `: Path to Kubernetes client configuration that the external-provisioner uses to connect to Kubernetes API server. When omitted, default token provided by Kubernetes will be used. This option is useful only when the external-provisioner does not run as a Kubernetes pod, e.g. for debugging. Either this or `--master` needs to be set if the external-provisioner is being run out of cluster.

* `--master `: Master URL to build a client config from. When omitted, default token provided by Kubernetes will be used. This option is useful only when the external-provisioner does not run as a Kubernetes pod, e.g. for debugging. Either this or `--kubeconfig` needs to be set if the external-provisioner is being run out of cluster.

* `--metrics-address`: (deprecated) The TCP network address where the prometheus metrics endpoint will run (example: `:8080` which corresponds to port 8080 on local host). The default is empty string, which means metrics endpoint is disabled.

* `--volume-name-prefix `: Prefix of PersistentVolume names created by the external-provisioner. Default value is "pvc", i.e. created PersistentVolume objects will have name `pvc-`.

* `--volume-name-uuid-length`: Length of UUID to be added to `--volume-name-prefix`. Default behavior is to NOT truncate the UUID.

* `--version`: Prints current external-provisioner version and quits.

* `--prevent-volume-mode-conversion`: Prevents an unauthorized user from modifying the volume mode when creating a PVC from an existing VolumeSnapshot. Defaults to true.

* All glog / klog arguments are supported, such as `-v ` or `-alsologtostderr`.

### Design

External-provisioner interacts with Kubernetes by watching PVCs and
PVs and implementing the [external provisioner
protocol](https://github.com/kubernetes/design-proposals-archive/blob/main/storage/container-storage-interface.md#provisioning-and-deleting).
The [design document](./doc/design.md) explains this in more detail.

### Topology support
When `Topology` feature is enabled* and the driver specifies `VOLUME_ACCESSIBILITY_CONSTRAINTS` in its plugin capabilities, external-provisioner prepares `CreateVolumeRequest.AccessibilityRequirements` while calling `Controller.CreateVolume`. The driver has to consider these topology constraints while creating the volume. Below table shows how these `AccessibilityRequirements` are prepared:

[Delayed binding](https://kubernetes.io/docs/concepts/storage/storage-classes/#volume-binding-mode) | Strict topology | [Allowed topologies](https://kubernetes.io/docs/concepts/storage/storage-classes/#allowed-topologies) | Immediate Topology | [Resulting accessibility requirements](https://github.com/container-storage-interface/spec/blob/master/spec.md#createvolume)
:---: |:---:|:---:|:---:|:---|
Yes | Yes | Irrelevant | Irrelevant | `Requisite` = `Preferred` = Selected node topology
Yes | No | No | Irrelevant | `Requisite` = Aggregated cluster topology
`Preferred` = `Requisite` with selected node topology as first element
Yes | No | Yes | Irrelevant | `Requisite` = Allowed topologies
`Preferred` = `Requisite` with selected node topology as first element
No | Irrelevant | Yes | Irrelevant | `Requisite` = Allowed topologies
`Preferred` = `Requisite` with randomly selected node topology as first element
No | Irrelevant | No | Yes | `Requisite` = Aggregated cluster topology
`Preferred` = `Requisite` with randomly selected node topology as first element
No | Irrelevant | No | No | `Requisite` and `Preferred` both nil

*) `Topology` feature gate is enabled by default since v5.0.

When enabling topology support in a CSI driver that had it disabled, please make sure the topology is first enabled in the driver's node DaemonSet and topology labels are populated on all nodes. The topology can be then updated in the driver's Deployment and its external-provisioner sidecar.

### Capacity support

The external-provisioner can be used to create CSIStorageCapacity
objects that hold information about the storage capacity available
through the driver. The Kubernetes scheduler then [uses that
information](https://kubernetes.io/docs/concepts/storage/storage-capacity)
when selecting nodes for pods with unbound volumes that wait for the
first consumer.

All CSIStorageCapacity objects created by an instance of
the external-provisioner have certain labels:

* `csi.storage.k8s.io/drivername`: the CSI driver name
* `csi.storage.k8s.io/managed-by`: external-provisioner for central
provisioning, external-provisioner- for distributed
provisioning

They get created in the namespace identified with the `NAMESPACE`
environment variable.

Each external-provisioner instance manages exactly those objects with
the labels that correspond to the instance.

Optionally, all CSIStorageCapacity objects created by an instance of
the external-provisioner can have an
[owner](https://kubernetes.io/docs/concepts/workloads/controllers/garbage-collection/#owners-and-dependents).
This ensures that the objects get removed automatically when uninstalling
the CSI driver. The owner is
determined with the `POD_NAME/NAMESPACE` environment variables and
the `--capacity-ownerref-level` parameter. Setting an owner reference is highly
recommended whenever possible (i.e. in the most common case that drivers are
run inside containers).

If ownership is disabled the storage admin is responsible for removing
orphaned CSIStorageCapacity objects, and the following command can be
used to clean up orphaned objects of a driver:

```
kubectl delete csistoragecapacities -l csi.storage.k8s.io/drivername=my-csi.example.com
```

When switching from a deployment without ownership to one with
ownership, managed objects get updated such that they have the
configured owner. When switching in the other direction, the owner
reference is not removed because the new deployment doesn't know
what the old owner was.

To enable this feature in a driver deployment with a central controller (see also the
[`deploy/kubernetes/storage-capacity.yaml`](deploy/kubernetes/storage-capacity.yaml)
example):

- Set the `POD_NAME` and `NAMESPACE` environment variables like this:
```yaml
env:
- name: NAMESPACE
valueFrom:
fieldRef:
fieldPath: metadata.namespace
- name: POD_NAME
valueFrom:
fieldRef:
fieldPath: metadata.name
```
- Add `--enable-capacity` to the command line flags.
- Add `StorageCapacity: true` to the CSIDriver information object.
Without it, external-provisioner will publish information, but the
Kubernetes scheduler will ignore it. This can be used to first
deploy the driver without that flag, then when sufficient
information has been published, enabled the scheduler usage of it.
- If external-provisioner is not deployed with a StatefulSet, then
configure with `--capacity-ownerref-level` which object is meant to own
CSIStorageCapacity objects.
- Optional: configure how often external-provisioner polls the driver
to detect changed capacity with `--capacity-poll-interval`.
- Optional: configure how many worker threads are used in parallel
with `--capacity-threads`.
- Optional: enable producing information also for storage classes that
use immediate volume binding with
`--capacity-for-immediate-binding`. This is usually not needed
because such volumes are created by the driver without involving the
Kubernetes scheduler and thus the published information would just
be ignored.

To determine how many different topology segments exist,
external-provisioner uses the topology keys and labels that the CSI
driver instance on each node reports to kubelet in the
`NodeGetInfoResponse.accessible_topology` field. The keys are stored
by kubelet in the CSINode objects and the actual values in Node
annotations.

CSI drivers must report topology information that matches the storage
pool(s) that it has access to, with granularity that matches the most
restrictive pool.

For example, if the driver runs in a node with region/rack topology
and has access to per-region storage as well as per-rack storage, then
the driver should report topology with region/rack as its keys. If it
only has access to per-region storage, then it should just use region
as key. If it uses region/rack, then redundant CSIStorageCapacity
objects will be published, but the information is still correct. See
the
[KEP](https://github.com/kubernetes/enhancements/tree/master/keps/sig-storage/1472-storage-capacity-tracking#with-central-controller)
for details.

For each segment and each storage class, CSI `GetCapacity` is called
once with the topology of the segment and the parameters of the
class. If there is no error and the capacity is non-zero, a
CSIStorageCapacity object is created or updated (if it
already exists from a prior call) with that information. Obsolete
objects are removed.

To ensure that CSIStorageCapacity objects get removed when the
external-provisioner gets removed from the cluster, they all have an
owner and therefore get garbage-collected when that owner
disappears. The owner is not the external-provisioner pod itself but
rather one of its parents as specified by `--capacity-ownerref-level`.
This way, it is possible to switch between external-provisioner
instances without losing the already gathered information.

CSIStorageCapacity objects are namespaced and get created in the
namespace of the external-provisioner. Only CSIStorageCapacity objects
with the right owner are modified by external-provisioner and their
name is generated, so it is possible to deploy different drivers in
the same namespace. However, Kubernetes does not check who is creating
CSIStorageCapacity objects, so in theory a malfunctioning or malicious
driver deployment could also publish incorrect information about some
other driver.

The deployment with [distributed
provisioning](#distributed-provisioning) is almost the same as above,
with some minor change:
- Use `--capacity-ownerref-level=1` and the `NAMESPACE/POD_NAME`
variables to make the DaemonSet that contains the external-provisioner
the owner of CSIStorageCapacity objects for the node.

Deployments of external-provisioner outside the Kubernetes cluster
are also possible, albeit only without an owner for the objects.
`NAMESPACE` still needs to be set to some existing namespace also
in this case.

### CSI error and timeout handling
The external-provisioner invokes all gRPC calls to CSI driver with timeout provided by `--timeout` command line argument (15 seconds by default).

Correct timeout value and number of worker threads depends on the storage backend and how quickly it is able to process `ControllerCreateVolume` and `ControllerDeleteVolume` calls. The value should be set to accommodate majority of them. It is fine if some calls time out - such calls will be retried after exponential backoff (starting with 1s by default), however, this backoff will introduce delay when the call times out several times for a single volume.

Frequency of `ControllerCreateVolume` and `ControllerDeleteVolume` retries can be configured by `--retry-interval-start` and `--retry-interval-max` parameters. The external-provisioner starts retries with `retry-interval-start` interval (1s by default) and doubles it with each failure until it reaches `retry-interval-max` (5 minutes by default). The external provisioner stops increasing the retry interval when it reaches `retry-interval-max`, however, it still retries provisioning/deletion of a volume until it's provisioned. The external-provisioner keeps its own number of provisioning/deletion failures for each volume.

The external-provisioner can invoke up to `--worker-threads` (100 by default) `ControllerCreateVolume` **and** up to `--worker-threads` (100 by default) `ControllerDeleteVolume` calls in parallel, i.e. these two calls are counted separately. The external-provisioner assumes that the storage backend can cope with such high number of parallel requests and that the requests are handled in relatively short time (ideally sub-second). Lower value should be used for storage backends that expect slower processing related to newly created / deleted volumes or can handle lower amount of parallel calls.

Details of error handling of individual CSI calls:
* `ControllerCreateVolume`: The call might have timed out just before the driver provisioned a volume and was sending a response. From that reason, timeouts from `ControllerCreateVolume` is considered as "*volume may be provisioned*" or "*volume is being provisioned in the background*." The external-provisioner will retry calling `ControllerCreateVolume` after exponential backoff until it gets either successful response or final (non-timeout) error that the volume cannot be created.
* `ControllerDeleteVolume`: This is similar to `ControllerCreateVolume`, The external-provisioner will retry calling `ControllerDeleteVolume` with exponential backoff after timeout until it gets either successful response or a final error that the volume cannot be deleted.
* `Probe`: The external-provisioner retries calling Probe until the driver reports it's ready. It retries also when it receives timeout from `Probe` call. The external-provisioner has no limit of retries. It is expected that ReadinessProbe on the driver container will catch case when the driver takes too long time to get ready.
* `GetPluginInfo`, `GetPluginCapabilitiesRequest`, `ControllerGetCapabilities`: The external-provisioner expects that these calls are quick and does not retry them on any error, including timeout. Instead, it assumes that the driver is faulty and exits. Note that Kubernetes will likely start a new provisioner container and it will start with `Probe` call.

### HTTP endpoint

The external-provisioner optionally exposes an HTTP endpoint at address:port specified by `--http-endpoint` argument. When set, these two paths are exposed:

* Metrics path, as set by `--metrics-path` argument (default is `/metrics`).
* Leader election health check at `/healthz/leader-election`. It is recommended to run a liveness probe against this endpoint when leader election is used to kill external-provisioner leader that fails to connect to the API server to renew its leadership. See https://github.com/kubernetes-csi/csi-lib-utils/issues/66 for details.

### Deployment on each node

Normally, external-provisioner is deployed once in a cluster and
communicates with a control instance of the CSI driver which then
provisions volumes via some kind of storage backend API. CSI drivers
which manage local storage on a node don't have such an API that a
central controller could use.

To support this case, external-provisioner can be deployed alongside
each CSI driver on different nodes. The CSI driver deployment must:
- support topology, usually with one topology key
("csi.example.org/node") and the Kubernetes node name as value
- use a service account that has the same RBAC rules as for a normal
deployment
- invoke external-provisioner with `--node-deployment`
- tweak `--node-deployment-base-delay` and `--node-deployment-max-delay`
to match the expected cluster size and desired response times
(only relevant when there are storage classes with immediate binding,
see below for details)
- set the `NODE_NAME` environment variable to the name of the Kubernetes node
- implement `GetCapacity`

Usage of `--strict-topology` and `--immediate-topology=false` is
recommended because it makes the `CreateVolume` invocations simpler.
Topology information is always derived exclusively from the
information returned by the CSI driver that runs on the same node,
without combining that with information stored for other nodes. This
works as long as each node is in its own topology segment,
i.e. usually with a single topology key and one unique value for each
node.

Volume provisioning with late binding works as before, except that
each external-provisioner instance checks the "selected node"
annotation and only creates volumes if that node is the one it runs
on. It also only deletes volumes on its own node.

Immediate binding is also supported, but not recommended. It is
implemented by letting the external-provisioner instances assign a PVC
to one of them: when they see a new PVC with immediate binding, they
all attempt to set the "selected node" annotation with their own node
name as value. Only one update request can succeed, all others get a
"conflict" error and then know that some other instance was faster. To
avoid the thundering herd problem, each instance waits for a random
period before issuing an update request.

When `CreateVolume` call fails with `ResourcesExhausted`, the normal
recovery mechanism is used, i.e. the external-provisioner instance
removes the "selected node" annotation and the process repeats. But
this triggers events for the PVC and delays volume creation, in
particular when storage is exhausted on most nodes. Therefore
external-provisioner checks with `GetCapacity` *before* attempting to
own a PVC whether the currently available capacity is sufficient for
the volume. When it isn't, the PVC is ignored and some other instance
can own it.

The `--node-deployment-base-delay` parameter determines the initial
wait period. It also sets the jitter, so in practice the initial wait period will be
in the range from zero to the base delay. If the value is high,
volumes with immediate binding get created more slowly. If it is low,
then the risk of conflicts while setting the "selected node"
annotation increases and the apiserver load will be higher.

There is an exponential backoff per PVC which is used for unexpected
problems. Normally, an owner for a PVC is chosen during the first
attempt, so most PVCs will use the base delays. A maximum can be set
with `--node-deployment-max-delay` anyway, to avoid very long delays
when something went wrong repeatedly.

During scale testing with 100 external-provisioner instances, a base
delay of 20 seconds worked well. When provisioning 3000 volumes, there
were only 500 conflicts which the apiserver handled without getting
overwhelmed. The average provisioning rate of around 40 volumes/second
was the same as with a delay of 10 seconds. The worst-case latency per
volume was probably higher, but that wasn't measured.

Note that the QPS settings of kube-controller-manager and
external-provisioner have to be increased at the moment (Kubernetes
1.19) to provision volumes faster than around 4 volumes/second. Those
settings will eventually get replaced with [flow control in the API
server
itself](https://kubernetes.io/docs/concepts/cluster-administration/flow-control/).

Beware that if *no* node has sufficient storage available, then also
no `CreateVolume` call is attempted and thus no events are generated
for the PVC, i.e. some other means of tracking remaining storage
capacity must be used to detect when the cluster runs out of storage.

Because PVCs with immediate binding get distributed randomly among
nodes, they get spread evenly. If that is not desirable, then it is
possible to disable support for immediate binding in distributed
provisioning with `--node-deployment-immediate-binding=false` and
instead implement a custom policy in a separate controller which sets
the "selected node" annotation to trigger local provisioning on the
desired node.

### Deleting local volumes after a node failure or removal

When a node with local volumes gets removed from a cluster before
deleting those volumes, the PV and PVC objects may still exist. It may
be possible to remove the PVC normally if the volume was not in use by
any pod on the node, but normal deletion of the volume and thus
deletion of the PV is not possible anymore because the CSI driver
instance on the node is not available or reachable anymore and therefore
Kubernetes cannot be sure that it is okay to remove the PV.

When an administrator is sure that the node is never going to come
back, then the local volumes can be removed manually:
- force-delete objects: `kubectl delete pv --wait=false --grace-period=0 --force`
- remove all finalizers: `kubectl patch pv -p '{"metadata":{"finalizers":null}}'`

It may also be necessary to scrub disks before reusing them because
the CSI driver had no chance to do that.

If there still was a PVC which was bound to that PV, it then will be
moved to phase "Lost". It has to be deleted and re-created if still
needed because no new volume will be created for it. Editing the PVC
to revert it to phase "Unbound" is not allowed by the Kubernetes
API server.

## Community, discussion, contribution, and support

Learn how to engage with the Kubernetes community on the [community page](http://kubernetes.io/community/).

You can reach the maintainers of this project at:

- [Slack channel](https://kubernetes.slack.com/messages/sig-storage)
- [Mailing list](https://groups.google.com/forum/#!forum/kubernetes-sig-storage)

### Code of conduct

Participation in the Kubernetes community is governed by the [Kubernetes Code of Conduct](code-of-conduct.md).