{"id":13583621,"url":"https://github.com/danielfm/prometheus-for-developers","last_synced_at":"2025-12-29T23:44:34.251Z","repository":{"id":66529120,"uuid":"139440171","full_name":"danielfm/prometheus-for-developers","owner":"danielfm","description":"Practical introduction to Prometheus for developers.","archived":false,"fork":false,"pushed_at":"2024-03-28T17:23:28.000Z","size":2002,"stargazers_count":453,"open_issues_count":2,"forks_count":25,"subscribers_count":13,"default_branch":"master","last_synced_at":"2024-11-06T00:39:19.785Z","etag":null,"topics":["docker","monitoring","prometheus","tutorial"],"latest_commit_sha":null,"homepage":"https://danielfm.github.io/prometheus-for-developers/","language":"JavaScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"bsd-2-clause","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/danielfm.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":".github/FUNDING.yml","license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null},"funding":{"github":"danielfm","liberapay":"danielfm"}},"created_at":"2018-07-02T12:23:46.000Z","updated_at":"2024-11-04T17:18:45.000Z","dependencies_parsed_at":"2024-01-16T23:45:50.696Z","dependency_job_id":"d13c66d4-a2a7-40e3-9296-5a7e6dc49915","html_url":"https://github.com/danielfm/prometheus-for-developers","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/danielfm%2Fprometheus-for-developers","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/danielfm%2Fprometheus-for-developers/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories
/danielfm%2Fprometheus-for-developers/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/danielfm%2Fprometheus-for-developers/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/danielfm","download_url":"https://codeload.github.com/danielfm/prometheus-for-developers/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247556034,"owners_count":20957883,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["docker","monitoring","prometheus","tutorial"],"created_at":"2024-08-01T15:03:39.108Z","updated_at":"2025-12-29T23:44:34.244Z","avatar_url":"https://github.com/danielfm.png","language":"JavaScript","readme":"# Prometheus For Developers\n\nThis is an introductory tutorial I created for telling the software developers\nin my [company](https://descomplica.com.br) the basics about\n[Prometheus](https://prometheus.io).\n\nIf you have any suggestion to improve this content, don't hesitate to contact\nme. 
Pull Requests are welcome!\n\n## Table of Contents\n\n- [The Project](#the-project)\n  - [Pre-Requisites](#pre-requisites)\n  - [Running the Code](#running-the-code)\n  - [Cleaning Up](#cleaning-up)\n- [Prometheus Overview](#prometheus-overview)\n  - [Push vs Pull](#push-vs-pull)\n  - [Metrics Endpoint](#metrics-endpoint)\n  - [Duplicate Metrics Names?](#duplicate-metrics-names)\n  - [Monitoring Uptime](#monitoring-uptime)\n  - [A Basic Uptime Alert](#a-basic-uptime-alert)\n- [Instrumenting Your Applications](#instrumenting-your-applications)\n  - [Measuring Request Durations](#measuring-request-durations)\n    - [Quantile Estimation Errors](#quantile-estimation-errors)\n  - [Measuring Throughput](#measuring-throughput)\n  - [Measuring Memory/CPU Usage](#measuring-memorycpu-usage)\n  - [Measuring SLOs and Error Budgets](#measuring-slos-and-error-budgets)\n  - [Monitoring Applications Without a Metrics Endpoint](#monitoring-applications-without-a-metrics-endpoint)\n  - [Final Gotchas](#final-gotchas)\n- [References](#references)\n\n## The Project\n\nThis tutorial follows a more practical approach (with hopefully just the\nright amount of theory!), so we provide a simple Docker Compose configuration\nfor simplifying the project bootstrap.\n\n### Pre-Requisites\n\n- Docker + Docker Compose\n\n### Running the Code\n\nRun the following command to start everything up:\n\n```bash\n$ docker-compose up -d\n\n# Or, if you use podman:\n$ podman-compose up -d\n```\n\n- Alertmanager: \u003chttp://localhost:9093\u003e\n- Grafana: \u003chttp://localhost:3000\u003e (user/password: `admin`)\n- Prometheus: \u003chttp://localhost:9090\u003e\n- Sample Node.js Application: \u003chttp://localhost:4000\u003e\n\n### Cleaning Up\n\nRun the following command to stop all running containers from this project:\n\n```bash\n$ docker-compose rm -fs\n```\n\n## Prometheus Overview\n\nPrometheus is an open source monitoring and time-series database (TSDB)\ndesigned 
after\n[Borgmon](https://landing.google.com/sre/book/chapters/practical-alerting.html),\nthe monitoring tool created internally at Google for collecting metrics\nfrom jobs running in their cluster orchestration platform,\n[Borg](https://ai.google/research/pubs/pub43438).\n\nThe following image shows an overview of the Prometheus architecture.\n\n![Prometheus architecture](./img/prometheus-architecture.png)\nSource: [Prometheus documentation](https://prometheus.io/docs/introduction/overview/)\n\nIn the center we have a **Prometheus server**, which is the component\nresponsible for periodically collecting and storing metrics from various\n **targets** (e.g. the services you want to collect metrics from).\n\nThe list of **targets** can be statically defined in the Prometheus\nconfiguration file, or we can use other means to automatically discover\nthose targets via **Service discovery**. For instance, if you want to monitor\na service that's deployed in EC2 instances in AWS, you can configure Prometheus\nto use the AWS EC2 API to discover which instances are running a particular\nservice and then _scrape_ metrics from those servers; this is preferred over\nstatically listing the IP addresses for our application in the Prometheus\nconfiguration file, which will eventually get out of sync, especially in a\ndynamic environment such as a public cloud provider.\n\nPrometheus also provides a basic **Web UI** for running queries on the stored\ndata, as well as integrations with popular visualization tools, such as\n[Grafana](https://grafana.net).\n\n### Push vs Pull\n\nPreviously, we mentioned that the **Prometheus server** _scrapes_ (or pulls)\nmetrics from our **target** applications.\n\nThis means Prometheus took a different approach than other \"traditional\"\nmonitoring tools, such as [StatsD](https://github.com/etsy/statsd), in\nwhich applications _push_ metrics to the metrics server or aggregator,\ninstead of having the metrics server _pulling_ metrics from 
applications.\n\nThe consequence of this design is a better separation of concerns; when\nthe application pushes metrics to a metrics server or aggregator, it has\nto make decisions like: where to push the metrics to; how often to push the\nmetrics; should the application aggregate/consolidate any metrics before\npushing them; among other things.\n\nIn _pull-based_ monitoring systems like Prometheus, these decisions go\naway; for instance, we no longer have to re-deploy our applications if we want\nto change the metrics resolution (how many data points collected per minute) or\nthe monitoring server endpoint (we can architect the monitoring system in a\nway completely transparent to application developers).\n\n---\n\n**Want to know more?** The Prometheus documentation provides a\n[comparison](https://prometheus.io/docs/introduction/comparison/) with\nother tools in the monitoring space regarding scope, data model, and storage.\n\n---\n\nNow, if the application doesn't push metrics to the metrics server, how do\nthe application's metrics end up in Prometheus?\n\n### Metrics Endpoint\n\nApplications expose metrics to Prometheus via a _metrics endpoint_. To see how\nthis works, let's start everything by running `docker-compose up -d` if you\nhaven't already.\n\nVisit \u003chttp://localhost:3000\u003e to open Grafana and log in with the default\n`admin` user and password. Then, click on the top link \"Home\" and select the\n\"Prometheus 2.0 Stats\" dashboard.\n\n![Prometheus 2.0 Stats Dashboard](./img/dashboard-prometheus.png)\n\nYes, Prometheus is _scraping_ metrics from itself!\n\nLet's pause for a moment to understand what happened. First, Grafana is already\nconfigured with a\n[Prometheus data source](http://docs.grafana.org/features/datasources/prometheus/)\nthat points to the local Prometheus server. This is how Grafana is able to\ndisplay data from Prometheus. 
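There is very little magic to a metrics endpoint: it is just plain text in the Prometheus exposition format. As a minimal sketch (the `renderGauge` helper is hypothetical; real applications should use a client library such as [prom-client](https://github.com/siimon/prom-client) rather than hand-rolling this):

```javascript
// Hypothetical helper: renders a single gauge in the Prometheus text
// exposition format. For illustration only; use a client library in practice.
function renderGauge(name, help, value, labels = {}) {
  const labelStr = Object.entries(labels)
    .map(([key, val]) => `${key}="${val}"`)
    .join(',');
  const series = labelStr ? `${name}{${labelStr}}` : name;
  return [
    `# HELP ${name} ${help}`,
    `# TYPE ${name} gauge`,
    `${series} ${value}`,
  ].join('\n');
}

console.log(renderGauge('go_goroutines', 'Number of goroutines that currently exist.', 39));
// # HELP go_goroutines Number of goroutines that currently exist.
// # TYPE go_goroutines gauge
// go_goroutines 39
```

This is the same `# HELP` / `# TYPE` / sample-line triplet you will see on any real `/metrics` endpoint.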
Also, if you look at the Prometheus configuration\nfile, you can see that we listed Prometheus itself as a target.\n\n```yaml\n# config/prometheus/prometheus.yml\n\n# Simple scrape configuration for each service\nscrape_configs:\n  - job_name: prometheus\n    static_configs:\n      - targets:\n          - localhost:9090\n```\n\nBy default, Prometheus gets metrics via the `/metrics` endpoint in each target,\nso if you hit \u003chttp://localhost:9090/metrics\u003e, you should see something like\nthis:\n\n```\n# HELP go_gc_duration_seconds A summary of the GC invocation durations.\n# TYPE go_gc_duration_seconds summary\ngo_gc_duration_seconds{quantile=\"0\"} 5.95e-05\ngo_gc_duration_seconds{quantile=\"0.25\"} 0.0001589\ngo_gc_duration_seconds{quantile=\"0.5\"} 0.0002188\ngo_gc_duration_seconds{quantile=\"0.75\"} 0.0004158\ngo_gc_duration_seconds{quantile=\"1\"} 0.0090565\ngo_gc_duration_seconds_sum 0.0331214\ngo_gc_duration_seconds_count 47\n# HELP go_goroutines Number of goroutines that currently exist.\n# TYPE go_goroutines gauge\ngo_goroutines 39\n# HELP go_info Information about the Go environment.\n# TYPE go_info gauge\ngo_info{version=\"go1.10.3\"} 1\n# HELP go_memstats_alloc_bytes Number of bytes allocated and still in use.\n# TYPE go_memstats_alloc_bytes gauge\ngo_memstats_alloc_bytes 3.7429992e+07\n# HELP go_memstats_alloc_bytes_total Total number of bytes allocated, even if freed.\n# TYPE go_memstats_alloc_bytes_total counter\ngo_memstats_alloc_bytes_total 1.37005104e+08\n...\n```\n\nIn this snippet alone we can notice a few interesting things:\n\n1. Each metric has a user friendly description that explains its purpose\n2. Each metric may define additional dimensions, also known as **labels**. For\n   instance, the metric `go_info` has a `version` label\n   - Every time series is uniquely identified by its metric name and the set of\n     label-value pairs\n3. 
Each metric has a specific type, such as `summary`, `gauge`,\n   `counter`, and `histogram`. More information on each data type can be found\n   [here](https://prometheus.io/docs/concepts/metric_types/)\n\nBut how does this text-based response turn into data points in a time series\ndatabase?\n\nThe best way to understand this is by running a few simple queries.\n\nOpen the Prometheus UI at \u003chttp://localhost:9090/graph\u003e, type\n`process_resident_memory_bytes` in the text field and hit _Execute_.\n\n![Prometheus Query Example](./img/prometheus-query.png)\n\nYou can use the graph controls to zoom into a specific region.\n\nThis first query is very simple as it only plots the value of the\n`process_resident_memory_bytes` gauge as time passes, and as you might\nhave guessed, that query displays the resident memory usage for each target,\nin bytes.\n\nSince our setup uses a 5-second scrape interval, Prometheus will hit the\n`/metrics` endpoint of our targets every 5 seconds to fetch the current\nmetrics and store those data points sequentially, indexed by timestamp.\n\n```yaml\n# In prometheus.yml\nglobal:\n  scrape_interval: 5s\n```\n\nYou can see all the samples from that metric in the past minute by querying\n`process_resident_memory_bytes{job=\"grafana\"}[1m]` (select _Console_ in the\nPrometheus UI):\n\n| Element | Value |\n|---------|-------|\n| `process_resident_memory_bytes{instance=\"grafana:3000\",job=\"grafana\"}` | `40861696@1530461477.446`\u003cbr/\u003e`43298816@1530461482.447`\u003cbr/\u003e`43778048@1530461487.451`\u003cbr/\u003e`44785664@1530461492.447`\u003cbr/\u003e`44785664@1530461497.447`\u003cbr/\u003e`45043712@1530461502.448`\u003cbr/\u003e`45043712@1530461507.44`\u003cbr/\u003e`45301760@1530461512.451`\u003cbr/\u003e`45301760@1530461517.448`\u003cbr/\u003e`45301760@1530461522.448`\u003cbr/\u003e`45895680@1530461527.448`\u003cbr/\u003e`45895680@1530461532.447` |\n\nQueries that have an appended range duration in square brackets 
after\nthe metric name (i.e. `\u003cmetric\u003e[\u003cduration\u003e]`) return what is called a\n_range vector_, in which `\u003cduration\u003e` specifies how far back in time\nvalues should be fetched for each resulting range vector element.\n\nIn this example, the value for the `process_resident_memory_bytes`\nmetric at the timestamp `1530461477.446` was `40861696`, and so on.\n\n### Duplicate Metrics Names?\n\nIf you inspect the contents of the `/metrics` endpoint at all our targets,\nyou'll see that multiple targets export metrics under the same name.\n\nBut isn't this a problem? If we are exporting metrics under the same name,\nhow can we be sure we are not mixing metrics between different applications\ninto the same time series data?\n\nConsider the previous metric, `process_resident_memory_bytes`: Grafana,\nPrometheus, and our sample application all export a gauge metric under the\nsame name. However, did you notice in the previous plot that somehow we were\nable to get a separate time series from each application?\n\nQuoting the\n[documentation](https://prometheus.io/docs/concepts/jobs_instances/):\n\n\u003e In Prometheus terms, an endpoint you can scrape is called an **instance**,\n\u003e usually corresponding to a single process. 
A collection of instances with\n\u003e the same purpose, a process replicated for scalability or reliability for\n\u003e example, is called a **job**.\n\u003e\n\u003e When Prometheus scrapes a target, it attaches some labels automatically to\n\u003e the scraped time series which serve to identify the scraped target:\n\u003e - `job` - The configured job name that the target belongs to.\n\u003e - `instance` - The `\u003chost\u003e:\u003cport\u003e` part of the target's URL that was scraped.\n\nSince our configuration has three different targets (with one instance each)\nexposing this metric, we can see three lines in that plot.\n\n### Monitoring Uptime\n\nFor each instance scrape, Prometheus stores an `up` metric with the value `1`\nwhen the instance is healthy, i.e. reachable, or `0` if the scrape failed.\n\nTry plotting the query `up` in the Prometheus UI.\n\nIf you followed every instruction up until this point, you'll notice that\nso far all targets were reachable at all times.\n\nLet's change that. Run `docker-compose stop sample-app` and after a few\nseconds you should see the `up` metric reporting our sample application\nis down.\n\nNow run `docker-compose restart sample-app` and the `up` metric should\nreport the application is back up again.\n\n![Sample application downtime](./img/sample-app-downtime.png)\n\n---\n\n**Want to know more?** The Prometheus query UI provides a combo box with all\navailable metric names registered in its database. Do some exploring, try\nquerying different ones. For instance, can you plot the file descriptor\nhandles usage (in %) for all targets? 
**Tip:** the metric names end with\n`_fds`.\n\n---\n\n#### A Basic Uptime Alert\n\nWe don't want to keep staring at dashboards on a big TV screen all day\njust to detect issues in our applications quickly; after all, we have\nbetter things to do with our time, right?\n\nLuckily, Prometheus provides a facility for defining alerting rules that,\nwhen triggered, will notify\n[Alertmanager](https://prometheus.io/docs/alerting/alertmanager/), which is\nthe component that takes care of deduplicating, grouping, and routing alerts\nto the correct receiver integration (e.g. email, Slack, PagerDuty,\nOpsGenie). It also takes care of silencing and inhibition of alerts.\n\nConfiguring Alertmanager to send notifications to PagerDuty, Slack, or any\nother receiver is out of the scope of this tutorial, but we can still play\naround with alerts.\n\nWe already have the following alerting rule defined in\n`config/prometheus/prometheus.rules.yml`:\n\n```yaml\ngroups:\n  - name: uptime\n    rules:\n      # Uptime alerting rule\n      # Ref: https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/\n      - alert: ServerDown\n        expr: up == 0\n        for: 1m\n        labels:\n          severity: page\n        annotations:\n          summary: One or more targets are down\n          description: Instance {{ $labels.instance }} of {{ $labels.job }} is down\n```\n\n![Prometheus alerts](./img/prometheus-alerts-1.png)\n\nEach alerting rule in Prometheus is also a time series, so in this case you can\nquery `ALERTS{alertname=\"ServerDown\"}` to see the state of that alert at any\npoint in time; this metric will not return any data points now because so far\nno alerts have been triggered.\n\nLet's change that. Run `docker-compose stop grafana` to kill Grafana. 
After a\nfew seconds you should see the `ServerDown` alert transition to a yellow state,\nor `PENDING`.\n\n![Pending alert](./img/prometheus-alerts-2.png)\n\nThe alert will stay as `PENDING` for one minute, which is the threshold we\nconfigured in our alerting rule. After that, the alert will transition to a red\nstate, or `FIRING`.\n\n![Firing alert](./img/prometheus-alerts-3.png)\n\nAfter that point, the alert will show up in Alertmanager. Visit\n\u003chttp://localhost:9093\u003e to open the Alertmanager UI.\n\n![Alert in Alertmanager](./img/prometheus-alerts-4.png)\n\nLet's restore Grafana. Run `docker-compose restart grafana` and the alert\nshould go back to a green state after a few seconds.\n\n---\n\n**Want to know more?** There are several alerting rule examples in the\n[awesome-prometheus-alerts](https://github.com/samber/awesome-prometheus-alerts)\nrepository for common scenarios and popular systems.\n\n---\n\n## Instrumenting Your Applications\n\nLet's examine a sample Node.js application we created for this tutorial.\n\nOpen the `./sample-app/index.js` file in your favorite text editor. The\ncode is fully commented, so you should not have a hard time understanding\nit.\n\n### Measuring Request Durations\n\nWe can measure request durations with\n[percentiles](https://en.wikipedia.org/wiki/Quantile) or\n[averages](https://en.wikipedia.org/wiki/Arithmetic_mean). Relying on\naverages to track request durations is not recommended because averages\ncan be very misleading (see the [References](#references) for a few posts on\nthe pitfalls of averages and how percentiles can help). 
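The pitfall is easy to demonstrate with a quick calculation over a hypothetical latency distribution:

```javascript
// Hypothetical distribution: 95 requests took 50ms, 5 requests took 1s.
const latenciesMs = [...Array(95).fill(50), ...Array(5).fill(1000)];

const sorted = [...latenciesMs].sort((a, b) => a - b);
const avg = sorted.reduce((sum, v) => sum + v, 0) / sorted.length;
const p99 = sorted[Math.ceil(0.99 * sorted.length) - 1]; // nearest-rank percentile

console.log(`avg=${avg}ms p99=${p99}ms`); // avg=97.5ms p99=1000ms
```

The average (~97.5ms) suggests everything is fine, while the 99th percentile reveals that some users are waiting a full second.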
A better way for\nmeasuring durations is via percentiles, as they track the user experience\nmore closely:\n\n![Percentiles as a way to measure user satisfaction](./img/percentiles.jpg)\nSource: [Twitter](https://twitter.com/rakyll/status/1045075510538035200)\n\nIn Prometheus, we can generate percentiles with summaries or histograms.\n\nTo show the differences between these two, our sample application exposes\ntwo custom metrics for measuring request durations: one via a summary\nand the other via a histogram:\n\n```js\n// Summary metric for measuring request durations\nconst requestDurationSummary = new prometheusClient.Summary({\n  // Metric name\n  name: 'sample_app_summary_request_duration_seconds',\n\n  // Metric description\n  help: 'Summary of request durations',\n\n  // Extra dimensions, or labels\n  // HTTP method (GET, POST, etc), and status code (200, 500, etc)\n  labelNames: ['method', 'status'],\n\n  // 50th (median), 75th, 90th, 95th, and 99th percentiles\n  percentiles: [0.5, 0.75, 0.9, 0.95, 0.99]\n});\n\n// Histogram metric for measuring request durations\nconst requestDurationHistogram = new prometheusClient.Histogram({\n  // Metric name\n  name: 'sample_app_histogram_request_duration_seconds',\n\n  // Metric description\n  help: 'Histogram of request durations',\n\n  // Extra dimensions, or labels\n  // HTTP method (GET, POST, etc), and status code (200, 500, etc)\n  labelNames: ['method', 'status'],\n\n  // Duration buckets, in seconds\n  // 5ms, 10ms, 25ms, 50ms, 100ms, 250ms, 500ms, 1s, 2.5s, 5s, 10s\n  buckets: [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10]\n});\n```\n\nAs you can see, in a summary we specify the percentiles we want the\nPrometheus client to calculate and report, while in a histogram\nwe specify the duration buckets in which the observed durations will be stored\nas a counter (i.e. 
a 300ms observation will be stored by incrementing the\ncounter corresponding to the 250ms-500ms bucket).\n\nOur sample application introduces a one-second delay in approximately 5%\nof requests, just so we can compare the average response time with\n99th percentiles:\n\n```js\n// Main route\napp.get('/', async (req, res) =\u003e {\n  // Simulate a 1s delay in ~5% of all requests\n  if (Math.random() \u003c= 0.05) {\n    const sleep = (ms) =\u003e {\n      return new Promise((resolve) =\u003e {\n        setTimeout(resolve, ms);\n      });\n    };\n    await sleep(1000);\n  }\n  res.set('Content-Type', 'text/plain');\n  res.send('Hello, world!');\n});\n```\n\nLet's put some load on this server to generate some metrics for us to play\nwith:\n\n```bash\n$ docker run --rm -it --net host williamyeh/wrk -c 4 -t 2 -d 900  http://localhost:4000\nRunning 15m test @ http://localhost:4000\n  2 threads and 4 connections\n  Thread Stats   Avg      Stdev     Max   +/- Stdev\n    Latency   269.03ms  334.46ms   1.20s    78.31%\n    Req/Sec    85.61    135.58     1.28k    89.33%\n  72170 requests in 15.00m, 14.94MB read\nRequests/sec:     80.18\nTransfer/sec:     16.99KB\n```\n\nNow run the following queries in the Prometheus UI:\n\n```bash\n# Average response time\nrate(sample_app_summary_request_duration_seconds_sum[15s]) / rate(sample_app_summary_request_duration_seconds_count[15s])\n\n# 99th percentile (via summary)\nsample_app_summary_request_duration_seconds{quantile=\"0.99\"}\n\n# 99th percentile (via histogram)\nhistogram_quantile(0.99, sum(rate(sample_app_histogram_request_duration_seconds_bucket[15s])) by (le, method, status))\n```\n\nThe result of these queries may seem surprising.\n\n![Sample application response times](./img/sample-app-response-times-1.png)\n\nThe first thing to notice is how the average response time fails to\ncommunicate the actual behavior of the response duration distribution\n(avg: 50ms; p99: 1s); the second is how the 99th percentile reported by 
the\nsummary (1s) is quite different from the one estimated by the\n`histogram_quantile()` function (~2.2s). How can this be?\n\n#### Quantile Estimation Errors\n\nQuoting the [documentation](https://prometheus.io/docs/practices/histograms/):\n\n\u003e You can use both summaries and histograms to calculate so-called\n\u003e φ-quantiles, where 0 ≤ φ ≤ 1. The φ-quantile is the observation value\n\u003e that ranks at number φ*N among the N observations. Examples for\n\u003e φ-quantiles: The 0.5-quantile is known as the median. The\n\u003e 0.95-quantile is the 95th percentile.\n\u003e\n\u003e The essential difference between summaries and histograms is that\n\u003e summaries calculate streaming φ-quantiles on the client side and\n\u003e expose them directly, while histograms expose bucketed observation\n\u003e counts and the calculation of quantiles from the buckets of a\n\u003e histogram happens on the server side using the\n\u003e `histogram_quantile()` function.\n\nIn other words, for the quantile estimation from the buckets of a\nhistogram to be accurate, we need to be careful when choosing the bucket\nlayout; if it doesn't match the range and distribution of the actual\nobserved durations, you will get inaccurate quantiles as a result.\n\nRecall our current histogram configuration:\n\n```js\n// Histogram metric for measuring request durations\nconst requestDurationHistogram = new prometheusClient.Histogram({\n  // Metric name\n  name: 'sample_app_histogram_request_duration_seconds',\n\n  // Metric description\n  help: 'Histogram of request durations',\n\n  // Extra dimensions, or labels\n  // HTTP method (GET, POST, etc), and status code (200, 500, etc)\n  labelNames: ['method', 'status'],\n\n  // Duration buckets, in seconds\n  // 5ms, 10ms, 25ms, 50ms, 100ms, 250ms, 500ms, 1s, 2.5s, 5s, 10s\n  buckets: [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10]\n});\n```\n\nHere we are using an _exponential_ bucket configuration in which the buckets\nroughly double in size at every step. 
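The ~2.2s estimate can be reproduced by hand. Here is a simplified sketch of the linear interpolation `histogram_quantile()` performs inside the target bucket (an illustration only: it assumes cumulative bucket counts and ignores edge cases such as the `+Inf` bucket, which the real implementation handles):

```javascript
// buckets: array of { le, count } with *cumulative* counts, sorted by le,
// as exposed by a Prometheus histogram. Find the bucket containing the
// target rank, then interpolate linearly inside it.
function histogramQuantile(q, buckets) {
  const total = buckets[buckets.length - 1].count;
  const rank = q * total;
  let prevLe = 0;
  let prevCount = 0;
  for (const { le, count } of buckets) {
    if (count >= rank) {
      return prevLe + (le - prevLe) * ((rank - prevCount) / (count - prevCount));
    }
    prevLe = le;
    prevCount = count;
  }
  return prevLe;
}

// 10,000 requests: 95% took under 50ms, 5% took just over 1s and all landed
// in the 1s-2.5s bucket.
const buckets = [
  { le: 0.05, count: 9500 },
  { le: 0.1, count: 9500 },
  { le: 0.25, count: 9500 },
  { le: 0.5, count: 9500 },
  { le: 1, count: 9500 },
  { le: 2.5, count: 10000 },
];

console.log(histogramQuantile(0.99, buckets)); // ≈ 2.2, though no request took longer than ~1s
```

The 0.99 rank (9,900) falls 80% of the way through the 1s-2.5s bucket, so the estimate is 1 + 1.5 × 0.8 = 2.2s, regardless of where the observations actually sit inside that bucket.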
This is a widely used pattern; since we\nalways expect our services to respond quickly (i.e. with response time\nbetween 0 and 300ms), we specify more buckets for that range, and fewer\nbuckets for request durations we think are less likely to occur.\n\nAccording to the previous plot, all slow requests from our application\nare falling into the 1s-2.5s bucket, resulting in this loss of precision\nwhen calculating the 99th percentile.\n\nSince we know our application will take at most ~1s to respond, we can\nchoose a more appropriate bucket layout:\n\n```js\n// Histogram metric for measuring request durations\nconst requestDurationHistogram = new prometheusClient.Histogram({\n  // ...\n\n  // Experimenting with a different bucket layout\n  buckets: [0.005, 0.01, 0.02, 0.05, 0.1, 0.25, 0.5, 0.8, 1, 1.2, 1.5]\n});\n```\n\nLet's start a clean Prometheus server with the modified bucket configuration\nto see if the quantile estimation improves:\n\n```bash\n$ docker-compose rm -fs\n$ docker-compose up -d\n```\n\nIf you re-run the load test, you should now get something like this:\n\n![Sample application response times](./img/sample-app-response-times-2.png)\n\nNot quite there, but it's an improvement!\n\n---\n\n**Want to know more?** If all it takes for us to achieve high-accuracy\nhistogram data is to use more buckets, why not use a large number of small\nbuckets?\n\nThe reason is efficiency. Remember:\n\n**more buckets == more time series == more space == slower queries**\n\nLet's say you have an SLO (more details on SLOs later) to serve 99% of\nrequests within 300ms. 
If all you want to know is whether you are\nhonoring your SLO or not, it doesn't really matter if the quantile\nestimation is not accurate for requests slower than 300ms.\n\nYou might also be wondering: if summaries are more precise, why not use\nsummaries instead of histograms?\n\nQuoting the\n[documentation](https://prometheus.io/docs/practices/histograms/#errors-of-quantile-estimation):\n\n\u003e A summary would have had no problem calculating the correct percentile\n\u003e value in most cases. Unfortunately, we cannot use a summary if we need\n\u003e to aggregate the observations from a number of instances.\n\nHistograms are more versatile in this regard. If you have an application\nwith multiple replicas, you can safely use the `histogram_quantile()`\nfunction to calculate the 99th percentile across all requests to all\nreplicas. You cannot do this with summaries: you can `avg()` the\n99th percentiles of all replicas, or take the `max()`, but the resulting\nvalue is statistically incorrect and cannot be used as a proxy for the\n99th percentile.\n\n---\n\n### Measuring Throughput\n\nIf you are using a histogram to measure request duration, you can use\nthe `\u003cbasename\u003e_count` time series to measure throughput without having to\nintroduce another metric.\n\nFor instance, if your histogram metric name is\n`sample_app_histogram_request_duration_seconds`, then you can use the\n`sample_app_histogram_request_duration_seconds_count` metric to measure\nthroughput:\n\n```bash\n# Number of requests per second (data from the past 30s)\nrate(sample_app_histogram_request_duration_seconds_count[30s])\n```\n\n![Sample app throughput](./img/sample-app-throughput.png)\n\n### Measuring Memory/CPU Usage\n\nMost Prometheus clients already provide a default set of metrics;\n[prom-client](https://github.com/siimon/prom-client), the Prometheus\nclient for Node.js, does this as well.\n\nTry these queries in the Prometheus UI:\n\n```bash\n# Gauge that provides the 
current memory usage, in bytes\nprocess_resident_memory_bytes\n\n# CPU usage, in CPU seconds per second (rate over a counter)\nrate(process_cpu_seconds_total[30s])\n```\n\nIf you use `wrk` to put some load on our sample application you might see\nsomething like this:\n\n![Sample app memory/CPU usage](./img/sample-app-memory-cpu-usage.png)\n\nYou can compare these metrics with the data given by `docker stats` to see if\nthey agree with each other.\n\n---\n\n**Want to know more?** Our sample application exports different metrics\nto expose some internal Node.js information, such as GC runs, heap usage\nby type, event loop lag, and current active handles/requests. Plot those\nmetrics in the Prometheus UI, and see how they behave when you put some\nload on the application.\n\nA sample dashboard containing all those metrics is also available in our\nGrafana server at \u003chttp://localhost:3000\u003e.\n\n---\n\n### Measuring SLOs and Error Budgets\n\n\u003e Managing service reliability is largely about managing risk, and managing risk\n\u003e can be costly.\n\u003e\n\u003e 100% is probably never the right reliability target: not only is it impossible\n\u003e to achieve, it's typically more reliability than a service's users want or\n\u003e notice.\n\nSLOs, or _Service Level Objectives_, are one of the main tools employed by\n[Site Reliability Engineers (SREs)](https://landing.google.com/sre/books/) for\nmaking data-driven decisions about reliability.\n\nSLOs are based on SLIs, or _Service Level Indicators_, which are the key metrics\nthat define how well (or how poorly) a given service is operating. Common SLIs\nwould be the number of failed requests, the number of requests slower than some\nthreshold, etc. 
Although different types of SLOs can be useful for different\ntypes of systems, most HTTP-based services will have SLOs that can be\nclassified into two categories: **availability** and **latency**.\n\nFor instance, let's say these are the SLOs for our sample application:\n\n| Category | SLI | SLO |\n|-|-|-|\n| Availability | The proportion of successful requests; any HTTP status other than 500-599 is considered successful | 95% successful requests |\n| Latency      | The proportion of requests with duration less than or equal to 100ms | 95% requests under 100ms |\n\nThe difference between 100% and the SLO is what we call the _Error Budget_.\nIn this example, the error budget for both SLOs is 5%; if the application\nreceives 1,000 requests during the SLO window (let's say one minute for the\npurposes of this tutorial), it means that 50 requests can fail and we'll\nstill meet our SLO.\n\nBut do we need additional metrics for keeping track of these SLOs? Probably\nnot. If you are tracking request durations with a histogram (as we are here),\nchances are you don't need to do anything else. 
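The budget arithmetic above is simple enough to sketch in plain numbers (a plain calculation, not PromQL; the figures mirror the 1,000-request example):

```javascript
// Error budget for the 95% latency SLO over a window with 1,000 requests.
const slo = 0.95;
const totalRequests = 1000;
const slowRequests = 50; // requests slower than the 100ms threshold

// Requests allowed to violate the SLO in this window (rounded to avoid
// floating-point noise in (1 - slo)).
const budget = Math.round((1 - slo) * totalRequests);
const remaining = budget - slowRequests;
const remainingRatio = remaining / budget;

console.log(budget, remaining, remainingRatio); // 50 0 0

// A stricter 99% SLO shrinks the budget to 10 requests, so the same 50 slow
// requests would blow through it well before the window ends.
console.log(Math.round((1 - 0.99) * totalRequests)); // 10
```

With 50 slow requests against a budget of 50, the remaining budget is exactly zero, which is what the PromQL queries below compute directly from the histogram counters.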
You've already got all the\nmetrics you need!\n\nLet's send a few requests to the server so we can play around with the metrics:\n\n```sh\n$ while true; do curl -s http://localhost:4000 \u003e /dev/null ; done\n```\n\n```sh\n# Number of requests served in the SLO window\nsum(increase(sample_app_histogram_request_duration_seconds_count[1m])) by (job)\n\n# Number of requests that violated the latency SLO (all requests that took more than 100ms to be served)\nsum(increase(sample_app_histogram_request_duration_seconds_count[1m])) by (job) - sum(increase(sample_app_histogram_request_duration_seconds_bucket{le=\"0.1\"}[1m])) by (job)\n\n# Number of requests in the error budget: (100% - [slo threshold]) * [number of requests served]\n(1 - 0.95) * sum(increase(sample_app_histogram_request_duration_seconds_count[1m])) by (job)\n\n# Remaining requests in the error budget: [number of requests in the error budget] - [number of requests that violated the latency SLO]\n(1 - 0.95) * sum(increase(sample_app_histogram_request_duration_seconds_count[1m])) by (job) - (sum(increase(sample_app_histogram_request_duration_seconds_count[1m])) by (job) - sum(increase(sample_app_histogram_request_duration_seconds_bucket{le=\"0.1\"}[1m])) by (job))\n\n# Remaining requests in the error budget as a ratio: ([number of requests in the error budget] - [number of requests that violated the SLO]) / [number of requests in the error budget]\n((1 - 0.95) * sum(increase(sample_app_histogram_request_duration_seconds_count[1m])) by (job) - (sum(increase(sample_app_histogram_request_duration_seconds_count[1m])) by (job) - sum(increase(sample_app_histogram_request_duration_seconds_bucket{le=\"0.1\"}[1m])) by (job))) / ((1 - 0.95) * sum(increase(sample_app_histogram_request_duration_seconds_count[1m])) by (job))\n```\n\nDue to the simulated scenario in which ~5% of requests take 1s to complete,\nif you try the last query you should see that the average budget available\nis around 0%, that is, we have no more 
budget to spend and will inevitably\nbreak the latency SLO if more requests start to take more time to be served.\nThis is not a good place to be.\n\n![Error Budget Burn Rate of 1x](./img/slo-1.png)\n\nBut what if we had a more strict SLO, say, 99% instead of 95%? What would be\nthe impact of these slow requests on the error budget?\n\nJust replace all `0.95` by `0.99` in that query to see what would happen:\n\n![Error Budget Burn Rate of 3x](./img/slo-2.png)\n\nIn the previous scenario with the 95% SLO, the SLO _burn rate_ was ~1x, which\nmeans the whole error budget was being consumed during the SLO window, that is,\nin 60 seconds. Now, with the 99% SLO, the burn rate was ~3x, which means that\ninstead of taking one minute for the error budget to exhaust, it now takes\nonly ~20 seconds!\n\nNow change the `curl` to point to the `/metrics` endpoint, which do not have\nthe simulated long latency for 5% of the requests, and you should see the error\nbudget go back to 100% again:\n\n```bash\n$ while true; do curl -s http://localhost:4000/metrics \u003e /dev/null ; done\n```\n\n![Error Budget Replenished](./img/slo-3.png)\n\n---\n\n**Want to know more?** These queries are for calculating the error budget for\nthe **latency** SLO by measuring the number of requests slower than 100ms. Now\ntry to modify those queries to calculate the error budget for the\n**availability** SLO (requests with `status=~\"5..\"`), and modify the sample\napplication to return a HTTP 5xx error for some requests so you can validate\nthe queries.\n\nThe\n[Site Reliability Workbook](https://landing.google.com/sre/books/) is a great\nresource on this topic and includes more advanced concepts such as how to alert\nbased on SLO burn rate as a way to improve alert precision/recall and\ndetection/reset times.\n\n---\n\n### Monitoring Applications Without a Metrics Endpoint\n\nWe learned that Prometheus needs all applications to expose a `/metrics`\nHTTP endpoint for it to scrape metrics. 
But what if you want to monitor
a MySQL instance, which does not provide a Prometheus metrics endpoint?
What can we do?

That's where _exporters_ come in. The
[documentation](https://prometheus.io/docs/instrumenting/exporters/) provides
a comprehensive list of official and third-party exporters for a variety of
systems, such as databases, messaging systems, cloud providers, and so forth.

For a very simple example, check out the
[aws-limits-exporter](https://github.com/danielfm/aws-limits-exporter)
project, which is about 200 lines of Go code.

### Final Gotchas

The Prometheus documentation page on
[instrumentation](https://prometheus.io/docs/practices/instrumentation/)
does a pretty good job of laying out some of the things to watch out
for when instrumenting your applications.

Also, beware that there are
[conventions](https://prometheus.io/docs/practices/naming/) on what makes
a good metric name; poorly (or wrongly) named metrics will give you a
hard time when creating queries later.

## References

- [Prometheus documentation](https://prometheus.io/docs/)
- [Prometheus example queries](https://github.com/infinityworks/prometheus-example-queries)
- [Prometheus client for Node.js](https://github.com/siimon/prom-client)
- [Keynote: Monitoring, the Prometheus Way (DockerCon 2017)](https://www.youtube.com/watch?v=PDxcEzu62jk)
- [Blog Post: Understanding Machine CPU usage](https://www.robustperception.io/understanding-machine-cpu-usage/)
- [Blog Post: #LatencyTipOfTheDay: You can't average percentiles. Period.](http://latencytipoftheday.blogspot.com/2014/06/latencytipoftheday-you-cant-average.html)
- [Blog Post: Why Averages Suck and Percentiles are Great](https://www.dynatrace.com/news/blog/why-averages-suck-and-percentiles-are-great/)
- [Site Reliability Engineering books](https://landing.google.com/sre/books/)