Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
Scripting Benchmark Tool
https://github.com/txstate-etc/wrkbench
Last synced: about 1 month ago
JSON representation
Scripting Benchmark Tool
- Host: GitHub
- URL: https://github.com/txstate-etc/wrkbench
- Owner: txstate-etc
- License: apache-2.0
- Created: 2021-09-23T21:01:54.000Z (over 3 years ago)
- Default Branch: main
- Last Pushed: 2021-09-27T18:38:45.000Z (over 3 years ago)
- Last Synced: 2024-11-08T14:48:23.843Z (3 months ago)
- Language: Lua
- Size: 35.2 KB
- Stars: 0
- Watchers: 7
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# wrkbench
Scripting Benchmark Tool

This is a repo to build wrk2 and run benchmarks against services. The readme also collects material about load testing, gathered from different sites around the web, that we have found useful for understanding benchmarking best practices.
## How to build example:
```bash
REGISTRY=https://registry.dept.example.com/
docker build -t ${REGISTRY}wrkbench:qual .
```

## How to run wrk2 example:
```bash
RPS=""
USERS=""
# NOTE: the rate (-R) is the total across all user connections, and the
# connections make their requests at the same time. With 8 user connections
# and a setting of 4 requests per second, 8 requests will be made every 2 seconds.
THREADS=""
TIMEOUT=""
DURATION=""
# NOTE: Duration must be 20s or more because wrk2 has a 10 second calibration period.
URL=""
SCRIPT=""
# NOTE: we include a graphql.lua script to aid with query requests.
# if arguments are required by the script they can be passed in
# by adding them after the "--" dashes as seen in the example.
# -L is used for reporting out Latency Statistics generated by HdrHistrogram
docker run --rm \
-v `pwd`/scripts:/scripts \
-v `pwd`/data:/data \
${REGISTRY}wrkbench:qual wrk2 -R$RPS -t$THREADS --timeout ${TIMEOUT} -c$USERS -d$DURATION -L -s /scripts/$SCRIPT $URL -- -i$INDEX -d/data/$DATA
```

A bench_example.sh script is included as a guide. `scripts/graphql.lua` and `data/queries.example.json` are also included to get started with GraphQL benchmarking. The arguments passed to the script are the index of the query to run and the JSON-formatted data file containing the queries, token keys, and variable lists to use for the queries.
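The details of `scripts/graphql.lua` are not reproduced here. As a rough sketch only (the one-request-body-per-line data format below is an assumption made to keep it short, not the format of `data/queries.example.json`), a script invoked this way could pick up the `-i` and `-d` arguments in init() and build a pre-made POST request:

```lua
-- Illustrative sketch only: not the actual scripts/graphql.lua, and the
-- one-request-body-per-line data format is an assumption to keep it short.
-- Shows how a script can pick up the "-i" and "-d" arguments passed after
-- "--" and turn them into a pre-built POST request.

local index = 1
local body  = "{}"

-- init() receives the extra command line arguments passed after "--",
-- e.g. { "-i2", "-d/data/queries.example.json" }
function init(args)
  local datafile
  for _, a in ipairs(args) do
    local i = a:match("^%-i(%d+)")
    if i then index = tonumber(i) end
    local d = a:match("^%-d(.+)")
    if d then datafile = d end
  end

  if datafile then
    local n = 0
    for line in io.lines(datafile) do
      n = n + 1
      if n == index then
        body = line
        break
      end
    end
  end

  -- Static values only need to be set once on the wrk table.
  wrk.method = "POST"
  wrk.path = "/graphql"
  wrk.headers["Content-Type"] = "application/json"
end

-- Nothing expensive happens per request; the body was built in init().
function request()
  return wrk.format(nil, nil, nil, body)
end
```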
While benchmarking, monitor resources on the host being tested.
```bash
# Watch CPU %idle column for drops with mpstat command set for every 2 seconds
mpstat 2
Linux ... ... (2 CPU)

10:27:44 AM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
10:27:46 AM all 0.25 0.00 0.25 0.00 0.00 0.00 0.00 0.00 0.00 99.50
...
10:27:54 AM all 2.51 0.00 0.75 0.00 0.00 0.25 0.00 0.00 0.00 96.48
10:27:56 AM all 32.75 0.00 2.77 0.00 0.00 0.00 0.00 0.00 0.00 64.48
10:27:58 AM all 44.72 0.00 5.03 0.25 0.00 0.25 0.00 0.00 0.00 49.75
10:28:00 AM all 41.35 0.00 3.26 0.00 0.00 0.25 0.00 0.00 0.00 55.14
10:28:02 AM all 8.33 0.00 1.52 0.00 0.00 0.25 0.00 0.00 0.00 89.90
10:28:04 AM all 37.44 0.00 5.03 0.00 0.00 0.25 0.00 0.00 0.00 57.29
10:28:06 AM all 63.91 0.00 2.51 0.00 0.00 0.00 0.00 0.00 0.00 33.58
...

# Watch available memory column for drops with free command
free -h
              total        used        free      shared  buff/cache   available
Mem:           3.7G        816M        376M        183M        2.5G        2.4G
Swap:          1.0G        256K        1.0G
```

## Benchmarking Tips:
_This is taken from the wg/wrk GitHub repo_: The machine running wrk must have a sufficient number of ephemeral ports available, and closed sockets should be recycled quickly. To handle the initial connection burst the server's listen(2) backlog should be greater than the number of concurrent connections being tested.
A user script that only changes the HTTP method, path, headers, or body will have no performance impact. If multiple HTTP requests are necessary they should be pre-generated and returned via a quick lookup in the request() call, as in the sketch below. Per-request actions, particularly building a new HTTP request, and use of response() will necessarily reduce the amount of load that can be generated.
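As a rough illustration of that tip (the paths below are placeholders, not endpoints from this repo), a script can pre-build every request string once in init() and return them by table lookup in request():

```lua
-- Rough illustration of the tip above: pre-generate every request string once
-- in init() and only do a table lookup in request(). The paths are placeholders.

local requests = {}
local counter  = 0

function init(args)
  local paths = { "/health", "/api/items", "/api/items/1" }
  for i, path in ipairs(paths) do
    -- wrk.format merges these values with the global wrk table (scheme, host, ...)
    requests[i] = wrk.format("GET", path)
  end
end

function request()
  -- Cycle through the pre-built requests with no per-request string building.
  counter = counter + 1
  return requests[(counter % #requests) + 1]
end
```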
## Wrk2 script hooks:
```
Lua script pipeline:
                            +-->[setup]-->(running)--+
                           /                          \
  (resolve ip)--(threads)-+                            +-->[done]
                           \                          /
                            +-->[setup]-->(running)--+

  +-(running)--------------------------------------------------------------------+
  |                                                                              |
  |  [setup]-->[init]--+-->[request]-->[response]--+--finished?--<yes>-->[done]  |
  |                    |                           |                             |
  |                    +------------<no>-----------+                             |
  |                                                                              |
  +------------------------------------------------------------------------------+
```

We can use the following methods inside a Lua script:
- setup(thread): Executed when all threads have been initialized but not yet started. Used to pass data to the threads.
- init(args): Called when each thread is initialized. This function receives the extra command line arguments for the script, which must be separated from wrk arguments with `--`. WARN: scripts that override init() but not request() must call `wrk.init()`.
- request(): Needs to return the HTTP object for each request. In this function we can modify the method, headers, path, and body. Use the wrk.format helper function to shape the request object. Example: `return wrk.format(method, path, headers, body)`. NOTE: wrk.format(...) returns an HTTP request string containing the passed parameters merged with values from the following wrk table:
```
wrk = {
scheme = "http",
host = "localhost",
port = nil,
method = "GET",
path = "/",
headers = {},
body = nil
}
```
- response(status, headers, body): Optional function called when the response comes back with HTTP response data.
- done(summary, latency, requests): Optional function executed with the results of the run once all requests are finished and the statistics are computed. The done() function receives a table containing result data, and two statistics objects representing the sampled per-request latency and the per-thread request rate. Duration and latency are microsecond values and rate is measured in requests per second. NOTE: wrk2 maintains and updates these same statistics.
```
Property                  Description
summary.duration          run duration in microseconds
summary.requests          total completed requests
summary.bytes             total bytes received
summary.errors.connect    total socket connection errors
summary.errors.read       total socket read errors
summary.errors.write      total socket write errors
summary.errors.status     total HTTP status codes > 399
summary.errors.timeout    total request timeouts
latency.min               minimum latency value reached during test
latency.max               maximum latency value reached during test
latency.mean              average latency value reached during test
latency.stdev             latency standard deviation
latency:percentile(99.0)  99th percentile value
latency[i]                raw latency data of request i
```
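As a hedged sketch of how the hooks and statistics above fit together (this is not one of the scripts shipped in this repo), a script passed to wrk2 with -s could tag each thread in setup() and report a few percentiles in done():

```lua
-- Hedged sketch (not one of this repo's scripts): tag each thread in setup()
-- and report a few percentiles in done() using the statistics described above.

local threads = {}

function setup(thread)
  -- setup() runs once per thread before the run starts; thread:set() makes
  -- "id" visible as a global inside that thread's script environment.
  thread:set("id", #threads + 1)
  table.insert(threads, thread)
end

function done(summary, latency, requests)
  local errors = summary.errors.connect + summary.errors.read +
                 summary.errors.write + summary.errors.status +
                 summary.errors.timeout
  io.write(string.format("requests: %d  errors: %d\n", summary.requests, errors))
  -- Latency values are reported in microseconds.
  for _, p in ipairs({ 50, 90, 99, 99.9 }) do
    io.write(string.format("p%-5s latency: %.2f ms\n", p, latency:percentile(p) / 1000))
  end
  io.write(string.format("max latency: %.2f ms\n", latency.max / 1000))
end
```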
## Benchmark / load testing:
We use load testing to determine the maximum throughput, measured in requests per second (RPS), under a specified number of connections, while all response times still satisfy the latency target. This means that generally, before we load test, we have a Service Level Agreement (SLA) that must be met. One way to break the SLA down is by latency percentiles of response times and load:
We use load testing to determine the maximum throughput, measured in requests per second (RPS) under specified number of connections, where all response time satisfying the latency target. This means that generally before we load test we have an Service Level Agreement (SLA) that must be met. One way to break the SLA down is by Latency Percentiles of response times and load:
1) What is our good response time, and what percent of expected requests should meet this level at a specified load?
2) What is the bad-case response time, and what percent of requests are allowed at this level at a specified load?
3) What response time should we never exceed, considered the max worst case? i.e. zero requests should pass this level at a specified load.

## Benchmark terminology:
- Throughput: Throughput is how many requests the server can handle during a specific time interval. For HTTP requests this is referred to as requests per second (RPS). Generally, as the number of connections increases, system throughput goes down and latency suffers.
- Connection: The number of simultaneous TCP connections. Sometimes referred to as Number of Users.
- Latency: This is a measure of how fast a server responds to requests from a client. This measurement is made on the client side and starts from the time the request is sent until the response is received. Note that network overhead is included in the measurement. For HTTP requests this is often referred to as response time.
- Latency Percentiles: This is a way of grouping the resulting response/latency times by their percentage of the whole sample set. It is the most common Quality of Service (QoS) metric and is meant to give a better representation of the client experience over a sustained rate of requests than average response times, which lose information. Latency percentiles can be used to verify SLAs are met for a specified load. If our 95th percentile response time is 100ms, that means 95% of our requests were returned in 100ms or less.

## Load testing answers the following questions:
- Does the server have enough resources (CPU, memory, etc.) to handle the anticipated load? Aside from discovering memory leaks, load testing can show whether the system requires a certain amount of memory per unit of workload, or whether the service needs to throttle traffic if that memory use eventually causes Out of Memory issues on the host. It can also show whether more CPU resources would help.
- Does the server respond quickly enough to provide a good user experience? The target varies: an API service endpoint may need lower latency than the commonly cited 200ms page load response time for a browser.
- Is our application running efficiently? Some tools are just heavy, requiring a lot of resources for little result. A good example is Virtual Machines vs containers: they serve different purposes and should be used accordingly. Another example is the language used to code the application. Often we find ourselves implementing caches as a band-aid when the issue stems from the language used for a particular service.
- Where are the bottlenecks, and can we fix them by "scaling up" our server hardware, or "scaling out" to multiple servers? Sometimes the bottleneck is in the backend, such as a database; this shows up when adding more forward-facing services yields no overall throughput increase. At this point we may need to change the architecture design in order to be able to "scale out", if "scaling up" the database is not an option or would be of limited benefit.
- Are there any pages or API calls that are particularly resource intensive? There may be parts of the code that need optimization if they are in hot spots.
- Do version upgrades, feature additions, or other code changes negatively impact the performance of the server? This can be seen by keeping track of previous benchmarks and monitoring the progress of service development.

## Hiccups:
Hiccups in a system are generally not load based (OS scheduling, OS swap space, application GC, ...). Glitches, hiccups, or semi-random noise have an unavoidable effect on averages. Hiccups are strongly multi-modal rather than evenly spread around, so they look like periodic freezes in the service. Because of this the system has distinct modes of behavior, from "good" to "bad" to "terrible". These modes are separate shifts in behavior and do not exhibit a smooth distribution between them. Because of hiccups, load tests that assume a normal distribution are generally meaningless. As a sanity check, should we still want to use standard deviation:
- In a normal distribution, roughly 3 standard deviations above the mean corresponds to about the 99.9th percentile. If the distance between the maximum and the median is more than 5 standard deviations, then the standard deviation is meaningless.
- Always track the max time. It is a good indicator that the standard deviation is suspect, and the number one warning sign that something in the report may be off, such as when the benchmark is affected by the coordinated omission problem. Note that people tend to throw the max time away, when it can give valuable insight into the system and can impact the application SLA by surpassing the max worst case. One way to automate this check is sketched below.
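As one possible way to automate that sanity check (a sketch, not part of this repo), a done() hook can compare the max-to-median distance with the standard deviation reported by wrk2:

```lua
-- Sketch only: one way to automate the sanity check above from inside done().
-- If the max is more than 5 standard deviations away from the median, the
-- latency distribution is multi-modal and the standard deviation is suspect.

function done(summary, latency, requests)
  local median = latency:percentile(50.0)
  local stdev  = latency.stdev
  if stdev > 0 and (latency.max - median) / stdev > 5 then
    io.write(string.format(
      "WARN: max is %.1f stdevs above the median; stdev-based reasoning is suspect\n",
      (latency.max - median) / stdev))
  end
  io.write(string.format("median %.2f ms  max %.2f ms  stdev %.2f ms\n",
    median / 1000, latency.max / 1000, stdev / 1000))
end
```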
## Coordinated omission problem:
The coordinated omission problem is where we omit data because response times are taking longer than the request interval, and as a result the tooling backs off on requests.
Example of a common approach:
1) Client issues requests one by one at a certain rate.
2) Measure and log the response time for each request.
3) Use the results log to produce histograms, percentiles, ...

What can go wrong with this technique?
1) It only works if the time taken by each and every request is shorter than the interval between two requests, i.e. it works well only when all responses fit within the rate interval.
2) The technique often includes implicit "automatic backoff" and coordination. During a period of slow responses we have no measurements, and by ignoring the lost slow samples we dramatically skew the 99th percentile results, making the system look faster. For example, at 100 requests per second a 10 second stall should contribute roughly 1,000 slow samples, but a client that waits for each response before issuing the next records only one.
3) But our requirements concern random, uncoordinated events, and thus we get inaccurate results.

## Why use wrk2 for benchmark / load testing:
- Like wrk, wrk2 allows scripting to customize testing. In this case it uses Lua, which allows us to customize query requests when benchmarking systems like GraphQL servers.
- wrk2 has a rate setting that keeps the request rate consistent, allowing us to get good latency statistics while emulating a sustained load.
- wrk2 utilizes histograms, has improved precision over wrk, and automatically compensates for the coordinated omission problem, thus giving a more accurate report.

Histograms are good at measuring percentiles and we want to capture as many as we can. This gives a better picture of how our systems are performing.
A good open source tool for this is the High Dynamic Range Histogram (HdrHistogram):
- Covers a configurable dynamic range value (which is what you need to measure in percentiles)
- At configurable precision (expressed as a number of significant digits)
- Provides tools for iteration (Linear, Logarithmic, Percentile)

Examples of what HdrHistogram can do:
- Can track values between 1us and 1hr
- With 3 decimal points of resolution
- Built in compensation for Coordinated Omission if you tell it the interval where you expect to see results.
- wrk2 uses a version of HdrHistogram