https://github.com/bigcommerce/nomad-workload-cpu-actuals-report-generator


Host: GitHub
URL: https://github.com/bigcommerce/nomad-workload-cpu-actuals-report-generator
Owner: bigcommerce
Created: 2020-04-29T18:15:34.000Z (about 6 years ago)
Default Branch: master
Last Pushed: 2020-10-23T17:48:18.000Z (over 5 years ago)
Last Synced: 2025-04-02T00:51:18.371Z (over 1 year ago)
Language: Groovy
Size: 7.81 KB
Stars: 2
Watchers: 7
Forks: 4
Open Issues: 0
Metadata Files:
- Readme: README.md

# nomad-workload-cpu-actuals-report

## Summary

A tool for balancing workloads: it finds over- and under-allocated jobs/tasks, particularly when the Nomad client cluster has non-uniform CPU classes or clock speeds.

## Usage

```

./nomadWorkloadCPUActualsReport --help

error: Missing required option: e

usage: ./nomadWorkloadCPUActualsReport -e  

Nomad Workload CPU Actuals Report

 -adstats,--advancedStats              Writes advanced stats including Std Dev, Kurtosis, and Skewness

 -avg,--avgKnownNomadClients           Use instead of --nomadFallbackClock to average the known Nomad clients for historical workloads

 -d1d,--disableOneDayQuery             Disable 1 day query

 -d1h,--disableOneHourQuery            Disable 1 hour query

 -d30d,--disableThirtyDayQuery         Disable 30 day query

 -d7d,--disableSevenDayQuery           Disable 7 day query

 -e,--environments                Environments (comma delimited)

 -e60d,--enableSixytDayQuery           Enable 60 day query

 -e90d,--enableNinetyDayQuery          Enable 90 day query

 -fbc,--nomadFallbackClock        The approximate clockspeed of out-of-service Nomad nodes that were used to run historical workloads tracked in Prometheus [defaults to 2300]

 -h,--help                             Usage Information

 -jobs,--targetJobs               Target specific jobs (comma delimited)

 -jst,--taskSleepTime             Task sleep time in seconds [defaults to 5]

 -miss,--includeMissRate               Write the miss rate

 -nc,--nomadTLSCertFilename       Nomad TLS Key Filename [defaults to %env%.nomad.key]

 -nca,--nomadTLSCACertFilename    Nomad TLS CA Certificate Filename [defaults to nomadca.crt]

 -nh,--nomadHost                  Nomad host [defaults to https://nomad.service.%env%-dc1.consul:4646]

 -nk,--nomadTLSKeyFilename        Nomad TLS Certificate Filename [defaults to %env%.nomad.crt]

 -ph,--prometheusHost             Prometheus host [defaults to http://prometheus-read.service.%env%-dc1.consul:9090]

 -qst,--querySleepTime            Query sleep time in seconds [defaults to 5]

Environment queries are run in parallel to reduce report generation time. Use %env% to inject environment into --nomadTLSKeyFilename, --nomadTLSCertFilename, --nomadHost, --prometheusHost

```
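The `%env%` placeholder described in the help text can be illustrated with a minimal Python sketch. This is not the tool's own code (the tool is written in Groovy and its substitution logic is not shown here). The helper names `inject_env` and `resolve`, and the environment names `staging` and `production`, are hypothetical; the two templates are the defaults quoted in the help output above.

```python
# Hypothetical illustration of %env% injection; not the tool's actual implementation.

# Default templates as quoted in the --help output.
DEFAULTS = {
    "nomadHost": "https://nomad.service.%env%-dc1.consul:4646",
    "prometheusHost": "http://prometheus-read.service.%env%-dc1.consul:9090",
}


def inject_env(template: str, env: str) -> str:
    """Replace every %env% placeholder in a template with the environment name."""
    return template.replace("%env%", env)


def resolve(environments: str) -> dict:
    """Expand a comma-delimited environment list into per-environment settings."""
    resolved = {}
    for env in (part.strip() for part in environments.split(",")):
        if env:  # skip empty entries such as a trailing comma
            resolved[env] = {
                name: inject_env(template, env)
                for name, template in DEFAULTS.items()
            }
    return resolved


if __name__ == "__main__":
    # "staging" and "production" are made-up environment names.
    for env, settings in resolve("staging,production").items():
        print(env, settings["nomadHost"], settings["prometheusHost"])
```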

## Prometheus Queries

Two metrics are pulled from Prometheus:

1. `nomad_client_allocs_cpu_total_percent` - A gauge of the percentage of the total CPU resources consumed by the task across all cores.

2. `nomad_client_allocs_memory_usage` - A gauge of the total amount of memory, in bytes, used by the task.

For more information see [the Nomad metrics](https://www.nomadproject.io/docs/telemetry/metrics) page.

## Queries

The report queries several time ranges, depending on which intervals are enabled through the arguments listed above. The following are enabled by default:

- 1 Hour

- 1 Day

- 7 Days

- 30 Days

Each of these default windows can be disabled with its flag (`-d1h`, `-d1d`, `-d7d`, `-d30d`). 60 day and 90 day windows are also available, but they are disabled by default and must be enabled explicitly (`-e60d`, `-e90d`). Every query applies `avg_over_time` to one of the two metrics over an inner range vector selector, then evaluates the result over the period's window at a fixed resolution (the outer `[window:resolution]` selector):

| Period     | Range Selectors  |
| ---------- | ---------------- |
| 1 Hour     | `[5s])[1h:5s]`   |
| 1 Day      | `[1m])[1d:1m]`   |
| 7 Days     | `[1m])[7d:1m]`   |
| 30 Days    | `[1h])[30d:6h]`  |
| 60 Days    | `[1h])[60d:12h]` |
| 90 Days    | `[1h])[90d:24h]` |

Query examples as issued to `/api/v1/query?query=`:

```

avg_over_time(nomad_client_allocs_cpu_total_percent{task="container-job-x"}[5s])[1h:5s]

avg_over_time(nomad_client_allocs_memory_usage{task="container-job-x"}[5s])[1h:5s]

avg_over_time(nomad_client_allocs_cpu_total_percent{task="container-job-x"}[1m])[1d:1m]

avg_over_time(nomad_client_allocs_memory_usage{task="container-job-x"}[1m])[1d:1m]

avg_over_time(nomad_client_allocs_cpu_total_percent{task="container-job-x"}[1m])[7d:1m]

avg_over_time(nomad_client_allocs_memory_usage{task="container-job-x"}[1m])[7d:1m]

avg_over_time(nomad_client_allocs_cpu_total_percent{task="container-job-x"}[1h])[30d:6h]

avg_over_time(nomad_client_allocs_memory_usage{task="container-job-x"}[1h])[30d:6h]

avg_over_time(nomad_client_allocs_cpu_total_percent{task="container-job-x"}[1h])[60d:12h]

avg_over_time(nomad_client_allocs_memory_usage{task="container-job-x"}[1h])[60d:12h]

avg_over_time(nomad_client_allocs_cpu_total_percent{task="container-job-x"}[1h])[90d:24h]

avg_over_time(nomad_client_allocs_memory_usage{task="container-job-x"}[1h])[90d:24h]

```
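The query strings above follow a single pattern, which the following Python sketch reproduces from the period table. It only builds strings and URLs and makes no network calls. The function names `build_query` and `build_url` and the host `http://prometheus.example:9090` are hypothetical, and this is not how the tool itself (written in Groovy) necessarily assembles its queries.

```python
from urllib.parse import urlencode

# The two metrics named in the "Prometheus Queries" section.
METRICS = (
    "nomad_client_allocs_cpu_total_percent",
    "nomad_client_allocs_memory_usage",
)

# Period -> (inner range, outer window, resolution), copied from the table above.
PERIODS = {
    "1 Hour": ("5s", "1h", "5s"),
    "1 Day": ("1m", "1d", "1m"),
    "7 Days": ("1m", "7d", "1m"),
    "30 Days": ("1h", "30d", "6h"),
    "60 Days": ("1h", "60d", "12h"),
    "90 Days": ("1h", "90d", "24h"),
}


def build_query(metric: str, task: str, period: str) -> str:
    """Build the avg_over_time query string for one metric, task, and period."""
    inner, window, resolution = PERIODS[period]
    return f'avg_over_time({metric}{{task="{task}"}}[{inner}])[{window}:{resolution}]'


def build_url(host: str, query: str) -> str:
    """URL-encode the query for the /api/v1/query endpoint."""
    return f"{host}/api/v1/query?{urlencode({'query': query})}"


if __name__ == "__main__":
    # Hypothetical host and task name, for illustration only.
    for metric in METRICS:
        query = build_query(metric, "container-job-x", "1 Hour")
        print(build_url("http://prometheus.example:9090", query))
```

URL-encoding matters here because the query contains braces, quotes, and brackets that are not safe to send unescaped.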

## Understand the report

Reports are output with the timestamp at which report generation started. Each report workbook has one tab per environment passed on the command line. Shaded (colored) sections of the workbook compare the following values to the Allocated MHz of each `Nomad Job -> Task Group -> Task`:

1. Mean

2. 50th Percentile

3. 95th Percentile

4. Max

... with the colors:

- green: the value is somewhere between 0 and the Allocated MHz. A richer shade means further from the Allocated MHz (0 is the darkest or richest green, while values closest to the Allocated MHz are green/white or white).

- yellow: the value is somewhere between the Allocated MHz and 2x the Allocated MHz. A richer shade means further from the Allocated MHz (2x the Allocated MHz is the darkest or richest yellow, while values closest to the Allocated MHz are yellow/white or white).

- red: the value is equal to or greater than 2x the Allocated MHz.
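The comparison described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the tool's implementation: the function names `summarize` and `classify` are hypothetical, samples are assumed to be already expressed in MHz, the intensity is assumed to scale linearly, and a value exactly equal to the Allocated MHz is assumed to fall in the green band at its palest shade (the README does not say which band owns that boundary).

```python
import statistics


def summarize(samples_mhz):
    """Return the four compared values: mean, 50th percentile, 95th percentile, max.

    Needs at least two samples (a requirement of statistics.quantiles).
    """
    ordered = sorted(samples_mhz)
    # n=100 yields 99 cut points; index 49 is the 50th percentile, 94 the 95th.
    cuts = statistics.quantiles(ordered, n=100, method="inclusive")
    return {
        "mean": statistics.fmean(ordered),
        "p50": cuts[49],
        "p95": cuts[94],
        "max": ordered[-1],
    }


def classify(value_mhz, allocated_mhz):
    """Map a value to (color, intensity), with intensity in [0, 1] (1 = richest)."""
    if allocated_mhz <= 0:
        raise ValueError("allocated_mhz must be positive")
    ratio = value_mhz / allocated_mhz
    if ratio >= 2.0:
        return ("red", 1.0)  # at or above 2x the allocation
    if ratio > 1.0:
        return ("yellow", ratio - 1.0)  # richest as the value approaches 2x
    # Assumption: a value equal to the allocation is the palest green.
    return ("green", 1.0 - max(ratio, 0.0))  # richest at 0


if __name__ == "__main__":
    # Made-up samples and allocation, for illustration only.
    stats = summarize([100, 200, 300, 400])
    for name, value in stats.items():
        print(name, value, classify(value, 250))
```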

### See Also

- [Using Prometheus to Monitor Nomad Metrics](https://www.nomadproject.io/guides/operations/monitoring-and-alerting/prometheus-metrics.html)

- [Shape of data](https://brownmath.com/stat/shape.htm)

- [Understanding Descriptive Statistics](https://towardsdatascience.com/understanding-descriptive-statistics-c9c2b0641291)
