# Capacity management

Table of contents:

- [Get started](#get-started)
- [Experiments](#experiments)
  - [Slower processing, service is down](#slower-processing-service-is-down)
  - [Slower processing, service is up](#slower-processing-service-is-up)
  - [Fixed in-flight requests quota](#fixed-in-flight-requests-quota)
  - [Adaptive in-flight requests quota](#adaptive-in-flight-requests-quota)

> The most common mechanism available in both open source and commercial API gateways is rate limiting:
> making sure a particular client sends no more than a certain number of requests per unit time.
> As it turns out, this is exactly the wrong abstraction for multiple reasons.

This small project aims to reproduce results from Jon Moore's talk
[Stop Rate Limiting! Capacity Management Done Right](https://www.youtube.com/watch?v=m64SWl9bfvk).
He illustrated where rate limiting can break down using
[Little's law](https://en.wikipedia.org/wiki/Little%27s_law) `N = X * R`:

- N is capacity (number of workers)
- X is throughput (request arrival rate)
- R is service time (how long it takes a worker to process a request)

In the examples, the client sends 5 requests per second using 10 workers,
each of which waits for a response for no longer than 2.5 seconds.

```shell script
$ ./client -worker=10 -rps=5 -origin=http://origin:8000
```
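
For reference, here is a minimal Go sketch of such a load generator. The flag names mirror
the command above, but this is an illustration, not the repository's actual client code.

```go
// A load-generator sketch: paced request submission with a worker pool.
package main

import (
	"context"
	"flag"
	"log"
	"net/http"
	"time"
)

func main() {
	workers := flag.Int("worker", 10, "number of concurrent workers")
	rps := flag.Int("rps", 5, "requests per second to send")
	origin := flag.String("origin", "http://origin:8000", "origin server URL")
	flag.Parse()

	// A ticker paces request submission at the target rate.
	requests := make(chan struct{})
	go func() {
		ticker := time.NewTicker(time.Second / time.Duration(*rps))
		defer ticker.Stop()
		for range ticker.C {
			requests <- struct{}{}
		}
	}()

	for i := 0; i < *workers; i++ {
		go func() {
			for range requests {
				// Abandon a request that takes longer than 2.5 seconds.
				ctx, cancel := context.WithTimeout(context.Background(), 2500*time.Millisecond)
				req, err := http.NewRequestWithContext(ctx, http.MethodGet, *origin, nil)
				if err != nil {
					log.Fatal(err)
				}
				if resp, err := http.DefaultClient.Do(req); err != nil {
					log.Print(err) // timeouts land here
				} else {
					resp.Body.Close()
				}
				cancel()
			}
		}()
	}
	select {} // block forever; stop with Ctrl+C
}
```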

The origin server has an SLO to serve 99% of requests within 1 second.
It has a fixed number of workers, each taking 1 second on average to process a request.
Requests are enqueued when all workers are busy and discarded when the queue is full.

```shell script
$ ./origin -worker=7 -worktime=1s -queue=100
```
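
The queueing behavior described above can be modeled with a buffered channel as the bounded
queue and a fixed pool of goroutines as workers. This is a simplified Go sketch, not
necessarily how the repository's origin server is implemented.

```go
// An origin-server sketch: fixed worker pool, bounded queue, overflow discarded.
package main

import (
	"flag"
	"log"
	"net/http"
	"time"
)

func main() {
	workers := flag.Int("worker", 7, "number of workers")
	worktime := flag.Duration("worktime", time.Second, "time to process one request")
	queueSize := flag.Int("queue", 100, "maximum number of queued requests")
	flag.Parse()

	// The buffered channel is the queue; each element carries a completion signal.
	queue := make(chan chan struct{}, *queueSize)
	for i := 0; i < *workers; i++ {
		go func() {
			for done := range queue {
				time.Sleep(*worktime) // simulate the work
				done <- struct{}{}
			}
		}()
	}

	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		done := make(chan struct{})
		select {
		case queue <- done: // enqueued; wait for a worker to finish the job
			<-done
			w.WriteHeader(http.StatusOK)
		default: // the queue is full, discard the request
			http.Error(w, "overloaded", http.StatusServiceUnavailable)
		}
	})
	log.Fatal(http.ListenAndServe(":8000", nil))
}
```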

According to Little's law, origin should be able to handle 7 requests per second.

```
N = X * R
7 workers = X rps * 1s
X = 7/1 = 7 rps
```

[Run the programs](#get-started) in Docker Compose and check out the Grafana dashboard:

- client sends 5 requests per second and receives HTTP 200 OK responses
- origin processes 5 requests per second
- origin has served requests with average latency 1 second
- origin has served 50% of requests (50th percentile) within 1 second
- origin has served 99% of requests (99th percentile) within 1 second

Note that [quantiles are estimated](https://prometheus.io/docs/practices/histograms/#errors-of-quantile-estimation).
Almost all observations fall into the bucket `{le="1.05"}`, i.e., the bucket from 1s to 1.05s.
The histogram implementation only guarantees that the true 99th percentile is somewhere between 1s and 1.05s.
If bucket boundaries are not chosen appropriately, the calculated quantile can give
the impression that the API is close to breaching the SLO (sharp spikes on the dashboard).
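
For example, with the official Prometheus Go client, bucket boundaries are fixed when the
histogram is declared; the metric name and boundaries below are illustrative, not taken
from this repository.

```go
// Package metrics sketches a histogram whose buckets bracket the 1s SLO.
package metrics

import "github.com/prometheus/client_golang/prometheus"

// requestDuration tracks how long the origin takes to serve a request.
// With most observations landing between the 1 and 1.05 second boundaries,
// histogram_quantile can only place the true p99 somewhere in that bucket.
var requestDuration = prometheus.NewHistogram(prometheus.HistogramOpts{
	Name:    "origin_request_duration_seconds",
	Help:    "Time to serve a request.",
	Buckets: []float64{0.5, 0.9, 1, 1.05, 2, 4},
})

func init() {
	prometheus.MustRegister(requestDuration)
}
```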

![Normal processing, service is up](images/normal.png)

## Get started

Clone the repository.

```shell script
$ git clone https://github.com/yashbhutwala/concurrency-limits-demo.git
$ cd ./concurrency-limits-demo/docker
```

Run the origin server, the client (load generator), Grafana, and Prometheus using Docker Compose.

```shell script
$ docker-compose up -d
```

Open Grafana [http://localhost:3000](http://localhost:3000) with default credentials admin/admin.
Prometheus dashboard is available at [http://localhost:9090](http://localhost:9090).

Clean up once you're done experimenting.

```shell script
$ docker-compose down
$ docker image prune --filter label=stage=intermediate
```

## Experiments

### Slower processing, service is down

A new release of the origin server has a bug that makes workers take 2 seconds to process a request.

```shell script
$ ORIGIN_WORKTIME=2s docker-compose up -d
```

According to Little's law, origin should be able to handle 3.5 requests per second.

```
N = X * R
7 workers = X rps * 2s
X = 7/2 = 3.5 rps
```

Since the worker pool processes 3.5 requests per second, it can drain the queue at no more than that rate.
Requests arrive at 5 rps, which means the queue will grow without bound,
and therefore the processing time will grow without bound as well.
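
Concretely, the queue grows at the difference between the arrival rate and the drain rate:

```
5 rps arriving - 3.5 rps drained = 1.5 requests/s of queue growth
```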

```
N = X * R
∞ = 3.5 rps * R
R = ∞ / 3.5 = ∞ seconds
```

Observations:

- client sends 5 requests per second and they all time out
- origin processes 3.5 requests per second
- origin's average latency grows with queue
- origin's 50th and 99th percentiles show the maximum configured latency of 4 seconds (the largest bucket)

![Slower processing, service is down](images/slow-down.png)

### Slower processing, service is up

Developers increased the number of origin workers to 20 while they investigate
why a worker takes 2 seconds instead of 1 second to process a request.

```shell script
$ ORIGIN_WORKER=20 ORIGIN_WORKTIME=2s docker-compose up -d
```

According to Little's law, origin should be able to handle 10 requests per second.

```
N = X * R
20 workers = X rps * 2s
X = 20/2 = 10 rps
```

Observations:

- client sends 5 requests per second and receives HTTP 200 OK responses
- origin processes 5 requests per second
- origin has served requests with average latency 2 seconds
- origin has served 50% of requests (50th percentile) within 2 seconds
- origin has served 99% of requests (99th percentile) within 2 seconds

![Slower processing, service is up](images/slow-up.png)

### Fixed in-flight requests quota

The client should be able to send X=5 requests per second with an average response time of R=1 second.
This means the client should be limited to N=5 concurrent requests at the proxy.

```
N = X * R
N = 5 rps * 1s = 5 requests in flight
```

When the processing time increases to R=2 seconds, the client is still limited to 5 requests in flight by the proxy.
Therefore the client ends up limited to X=2.5 rps.

```
N = X * R
5 requests in flight = X rps * 2s
X = 5/2 = 2.5 rps
```

To allow the service to recover, the client is forced to back off: it sends 2.5 rps instead of 5 rps.
The proxy limits concurrency (how many requests are in flight), not the request rate (rps).
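
Such a fixed in-flight quota can be sketched as reverse-proxy middleware guarded by a
semaphore (a buffered channel). This sketch rejects excess requests immediately; the
repository's proxy may handle overflow differently.

```go
// A proxy sketch: at most 5 requests in flight to the origin at once.
package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
)

func main() {
	origin, err := url.Parse("http://origin:8000")
	if err != nil {
		log.Fatal(err)
	}
	proxy := httputil.NewSingleHostReverseProxy(origin)

	// A buffered channel acts as a semaphore holding the in-flight quota.
	semaphore := make(chan struct{}, 5)
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		select {
		case semaphore <- struct{}{}: // a slot is free, forward the request
			defer func() { <-semaphore }()
			proxy.ServeHTTP(w, r)
		default: // quota exhausted, shed load
			http.Error(w, "too many requests in flight", http.StatusTooManyRequests)
		}
	})
	log.Fatal(http.ListenAndServe(":7000", nil))
}
```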

```shell script
$ ORIGIN_WORKTIME=2s CLIENT_ORIGIN=http://proxy:7000 docker-compose up -d
```

Observations:

- client sends 5 requests per second and receives 2.3 rps (HTTP 200)
- proxy oscillates between 4 and 5 in-flight requests
- origin processes 2.3 requests per second
- origin has served requests with average latency 2 seconds
- origin has served 50% of requests (50th percentile) within 2 seconds
- origin has served 99% of requests (99th percentile) within 2 seconds

![Fixed in-flight requests quota](images/fixed-quota.png)

### Adaptive in-flight requests quota

If the origin's capacity is unknown, it can be estimated using the
[AIMD algorithm](https://en.wikipedia.org/wiki/Additive_increase/multiplicative_decrease)
(additive-increase/multiplicative-decrease), sketched after this list:

- increase the target concurrency by a constant `c` per unit time, e.g., allow 1 more in-flight request every second
- set the target concurrency to a fraction `p` of its current size (0 <= p <= 1), e.g.,
back off to 75% when the service is overloaded (429 or 50x status codes, connection timeout)
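
A minimal Go sketch of an AIMD-controlled limit follows; for simplicity it applies the
additive increase per successful response rather than per unit time, which is a common
variant. It is an illustration, not the repository's proxy code.

```go
// Package aimd sketches an AIMD-controlled concurrency limit.
package aimd

import "sync"

// Limiter tracks the target concurrency limit and the current in-flight count.
type Limiter struct {
	mu       sync.Mutex
	limit    float64
	inflight int
}

// NewLimiter starts with an initial limit, e.g. 1.
func NewLimiter(initial float64) *Limiter {
	return &Limiter{limit: initial}
}

// TryAcquire reports whether a new request may be sent to the origin.
func (l *Limiter) TryAcquire() bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	if float64(l.inflight) >= l.limit {
		return false
	}
	l.inflight++
	return true
}

// Release records a finished request and adjusts the limit: additive
// increase on success, multiplicative decrease (back off to 75%) when
// the origin signals overload (429/50x responses, connection timeouts).
func (l *Limiter) Release(overloaded bool) {
	l.mu.Lock()
	defer l.mu.Unlock()
	l.inflight--
	if overloaded {
		l.limit *= 0.75
		if l.limit < 1 {
			l.limit = 1
		}
	} else {
		l.limit++
	}
}
```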

The client waits 2.5 seconds before timing out (cancelling the request).
To receive an HTTP 200 OK response, a request must be processed in less than 2.5 seconds.

A request is processed in 2 seconds by the origin's workers (N=7).

```
N = X * R
7 workers = X rps * 2s
X = 7/2 = 3.5 rps
```

A request can spend only 2.5s - 2s = 0.5 seconds in the queue, otherwise the client won't receive the response due to its timeout.
The expected length of the origin's queue is therefore 1.75 requests.

```
N = X * R
N = 3.5 rps * 0.5s = 1.75 requests in queue
```

The expected number of concurrent requests to the origin is 8.75.

```
N = X * R
N = 3.5 rps * 2.5s = 8.75 requests in flight
```

My naive proxy performs worse than Jon Moore's Nginx/Lua implementation, so the results don't match his chart.

```shell script
$ ORIGIN_WORKTIME=2s CLIENT_ORIGIN=http://proxy:7000 PROXY_ADAPTIVE=true docker-compose up -d
```

Observations:

- client sends 5 requests per second and receives between 2.4 rps and 2.7 rps (HTTP 200)
- proxy oscillates between 4 and 6 in-flight requests
- origin processes between 2.7 and 3 requests per second
- origin has served requests with average latency 2.2 seconds
- origin has served 50% of requests (50th percentile) within 2 seconds
- origin has served 99% of requests (99th percentile) within 3 seconds with periodic spikes

![Adaptive in-flight requests quota](images/adaptive-quota.png)