{"id":27260922,"url":"https://github.com/voltdb/meshmonitor","last_synced_at":"2025-04-11T04:54:37.960Z","repository":{"id":226076558,"uuid":"748786891","full_name":"VoltDB/meshmonitor","owner":"VoltDB","description":"Low overhead tool for network monitoring and diagnosing issues like network delays and instability, mysterious timeouts, hangs, and scheduling problems that delay message passing.","archived":false,"fork":false,"pushed_at":"2025-04-03T19:23:39.000Z","size":158,"stargazers_count":2,"open_issues_count":0,"forks_count":0,"subscribers_count":10,"default_branch":"master","last_synced_at":"2025-04-11T04:54:34.866Z","etag":null,"topics":["java","monitoring","network","ping"],"latest_commit_sha":null,"homepage":"https://www.voltactivedata.com/","language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/VoltDB.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-01-26T19:04:29.000Z","updated_at":"2025-04-03T19:23:41.000Z","dependencies_parsed_at":"2024-05-20T13:53:07.735Z","dependency_job_id":"6222873e-6e33-4c59-a78a-4c756d3d2f87","html_url":"https://github.com/VoltDB/meshmonitor","commit_stats":null,"previous_names":["voltdb/meshmonitor"],"tags_count":2,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/VoltDB%2Fmeshmonitor","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/VoltDB%2Fmeshmonitor/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/VoltDB%2Fmeshmonitor/releases","manifests_url":"https://r
epos.ecosyste.ms/api/v1/hosts/GitHub/repositories/VoltDB%2Fmeshmonitor/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/VoltDB","download_url":"https://codeload.github.com/VoltDB/meshmonitor/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248345270,"owners_count":21088243,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["java","monitoring","network","ping"],"created_at":"2025-04-11T04:54:37.867Z","updated_at":"2025-04-11T04:54:37.943Z","avatar_url":"https://github.com/VoltDB.png","language":"Java","funding_links":[],"categories":[],"sub_categories":[],"readme":"[![MIT License](https://img.shields.io/badge/license-MIT-blue.svg)](https://github.com/VoltDB/meshmonitor/blob/master/LICENSE)\n[![Tests](https://github.com/VoltDB/meshmonitor/actions/workflows/test.yml/badge.svg)](https://github.com/VoltDB/meshmonitor/actions/workflows/test.yml)\n[![GitHub Release](https://img.shields.io/github/v/release/VoltDB/meshmonitor)](https://github.com/VoltDB/meshmonitor/releases)\n\n# Table of Contents\n\n1. [Overview](#overview)\n    1. [What Meshmonitor Can (and Can't) Tell You](#what-meshmonitor-can-and-cant-tell-you)\n1. [What Meshmonitor does](#what-meshmonitor-does)\n    1. [Output](#output)\n    1. [Interpreting Results](#interpreting-results)\n1. [Obtaining Meshmonitor](#obtaining-meshmonitor)\n1. [Using Meshmonitor](#using-meshmonitor)\n    1. [Openmetrics / Prometheus](#openmetrics--prometheus)\n       1. [List of metrics](#list-of-metrics)\n    1. 
[Datadog monitoring](#datadog-monitoring)\n    1. [Running from a jar](#running-from-a-jar)\n\n# Overview\n\nMeshmonitor is a tool for monitoring network issues such as network delays and instability, mysterious timeouts, hangs,\nand scheduling problems that delay message passing.\n\nWhile it can be used as a general network monitoring tool, it simulates the Volt heartbeat paths and is intended to\ndiagnose situations when sites are experiencing dead host timeouts without any obvious network event.\n\nMeshmonitor can be run alongside Volt. It has very low overhead on the CPUs and network. It can also be helpful to run\nwithout Volt - as a proof point that your environment is going to have adequate baseline stability for running Volt.\n\n## What Meshmonitor Can (and Can't) Tell You\n\nFirst - an analogy. Think of Volt as a fast car. Like all cars, the quality of the streets impacts the top speed at\nwhich you can travel. If the streets you drive on are poorly paved, have random lane closings, or a lot of traffic\nlights, then your fast car goes no faster than any old jalopy.\n\nEnvironmental blips in scheduling, networking, or CPU access are like potholes to Volt. If a Volt site thread can't\nrun for 50ms - then your application can experience long-tail latency problems. If heartbeats between cluster\nnodes are delayed for long enough, then some nodes may get ejected from the cluster.\n\nThe following is an ever-growing list of things that the Volt support team has seen when looking at customers'\nMeshmonitor data:\n\n1. Batch network copies/backups that generate a high I/O load that Linux decides is more important than scheduling\n   Volt (a solution is to throttle these jobs)\n2. Other processes on the system taking too much CPU\n3. VMs/Containers that are starved by neighbors\n4. VMs/Containers with incorrect/inadequate thread pinning\n5. VMs/Containers that are migrating\n6. Power save modes that slow down \"idle\" processors\n7. 
Boot/grub kernel setting of idle=poll\n8. Network congestion between particular nodes/pods\n\nCauses of latency/timeouts that are not visible using meshmonitor (but may be visible in volt.log):\n\n1. GC stalls\n2. Volt memory compaction delays (for versions prior to V12.0)\n\n# What Meshmonitor does\n\nEach meshmonitor process has 2 threads, *send* and *receive*, for every node it is connected to. These threads measure\nand report 3 metrics:\n\n1. **Receiving heartbeats.** A *receive* thread blocks reading the socket that receives messages sent from the\n   other servers. This is the main metric to look at. It should be close to the heartbeat interval (set using the `-p`\n   option, default 5ms).\n2. **Scheduling jitter.** A *send* thread wakes up every 5 milliseconds and sends heartbeats to all the other servers\n   running meshmonitor. It reports the time between wakeups, which should be close to 5ms. This tracks the liveness of the\n   server (i.e. the ability of a thread to get scheduled in a timely manner and send a message out).\n3. **Timestamp differences.** The *receive* thread also measures the difference in time between the timestamp encoded in\n   the heartbeat by the *send* thread and when the heartbeat was processed by the *receive* thread.\n\n## Output\n\nAll messages printed by Meshmonitor contain the event time (`HH:mm:ss`) and the IP address of the node that the message\npertains to. 
If a message has no such context then the IP column will be empty:\n\n```console\n09:08:51 [   172.31.10.72] New remote endpoint - establishing connection\n09:08:51 [   172.31.10.72] Connecting\n09:08:51 [   172.31.10.72] Connected\n09:08:51 [   172.31.10.72] Handshake sent\n09:08:51 [    172.31.14.3] New remote endpoint - establishing connection\n09:08:51 [    172.31.14.3] Connecting\n09:08:51 [    172.31.14.3] Connected\n09:08:51 [    172.31.14.3] Handshake sent\n09:08:51 [   172.31.9.146] Received connection\n09:08:51 [   172.31.9.146] New remote endpoint - establishing connection\n09:08:51 [   172.31.9.146] Connecting\n09:08:51 [   172.31.9.146] Connected\n09:08:51 [   172.31.9.146] Handshake sent\n09:08:52 [    172.31.14.3] Broken pipe\n```\n\nThere are 3 kinds of measurements:\n\n* ping (delta receive) - delta between receiving heartbeats\n* jitter (delta send) - the time that has passed between sending consecutive pings.\n* timestamp delta - delta between remotely recorded timestamp when ping\n  was generated and a locally recorded timestamp when it was received.\n\nMeshmonitor will print histograms of each of the three tracked values. All of these values need to be interpreted with\nthe `--ping` interval in mind (default 5ms) that is included in the measurement values. 
The values that are printed are\nmax, mean, and percentiles: 99th, 99.9th, and 99.99th:\n\n```console\n09:08:55 [               ] ----------ping-(ms)---------- ---------jitter-(ms)--------- ----timestamp-diff-(ms)------\n09:08:55 [               ]   Max  Mean    99  99.9 99.99|  Max  Mean    99  99.9 99.99 |  Max  Mean    99  99.9 99.99\n09:08:55 [   172.31.10.72]   5.2   5.1   5.1   5.2   5.2|  5.1   5.1   5.1   5.1   5.1 |  0.2   0.0   0.0   0.1   0.2\n09:08:55 [    172.31.14.3]   5.3   5.1   5.1   5.1   5.3|  5.8   5.1   5.1   5.5   5.8 |  0.4   0.2   0.2   0.2   0.4\n09:08:55 [   172.31.9.146]   5.1   5.1   5.1   5.1   5.1|  5.1   5.1   5.1   5.1   5.1 |  5.1   2.6   5.0   5.1   5.1\n09:08:55 [   172.31.5.177]   5.1   5.1   5.1   5.1   5.1|  5.1   5.1   5.1   5.1   5.1 |  5.2   2.8   5.2   5.2   5.2\n```\n\nMeasurements exceeding `--threshold` (default 20ms) will be printed in yellow. Those that exceed 1 second will be printed in\nred.\n\n## Interpreting Results\n\nLog files from all the nodes should be compared in order to establish where the problem lies. There can be delays in\nmany parts of the system. By comparing log files from different nodes you can often match deltas in send times on one\nnode to deltas in receive times on the others. This can indicate that a sender is not properly scheduling its threads.\nDeltas in receive times with no correlated deltas in send times can indicate a bottleneck in the network.\n\n# Obtaining Meshmonitor\n\nMeshmonitor is distributed as a compiled binary for Linux (x64). It is a Java application\ncompiled to a native executable\nusing [GraalVM Community Edition](https://github.com/graalvm/graalvm-ce-builds/releases/). It has no additional\ndependencies and can be run as is.\n\nA pure Java version in jar form (meshmonitor.jar) is also available. 
The Java version should work on\nany platform with Java 8 or later installed (although it has only been tested on Linux).\n\n# Using Meshmonitor\n\nMeshmonitor strives to adhere to the industry standard [guidelines](https://clig.dev/#guidelines) on CLI design. The command\noptions can be printed using the `-h` parameter. This section describes basic usage and how the mesh is formed.\n\nThe central focus for this tool is the concept of a mesh: a set of connections between nodes such that each node\nis connected to all others forming a [complete graph](https://en.wikipedia.org/wiki/Complete_graph):\n\n```text\n\n ┌───(A)───┐\n │    │    │\n(B)───┼───(D)\n │    │    │\n └───(C)───┘\n```\n\nMeshmonitor processes include the list of all known nodes in the mesh in the \"ping\" message. Through this mechanism\neach node learns about all other nodes and a stable mesh is achieved after a few iterations of message exchange. The only\nrequirement is that each new meshmonitor needs to connect to at least one other that is already connected to the mesh.\n\nTo create the mesh, start each meshmonitor process with a bind address set to the local machine’s *external* IP address (e.g., 192.161.0.3) and pass the IP address of one of the participating nodes as the first argument. For example:\n\n```shell\n# On the initial server (192.161.0.1), start meshmonitor with only the bind\n# address (no list of servers to join):\n$ ./meshmonitor -b 192.161.0.1\n\n# On server 192.161.0.2 start meshmonitor and ask it to join 192.161.0.1\n$ ./meshmonitor -b 192.161.0.2 192.161.0.1\n\n# On server 192.161.0.3 start meshmonitor and ask it to join 192.161.0.1\n$ ./meshmonitor -b 192.161.0.3 192.161.0.1\n```\n\nThe meshmonitor processes can be killed and restarted and the mesh will heal. If a node goes down it is forgotten and\nnever has to be restarted - the mesh keeps working. 
Adding a new node can be done at any time by\npointing a new meshmonitor process at one of the existing nodes.\n\nNOTE: The IP passed to meshmonitor at startup is treated differently - meshmonitor will always try to reconnect to it.\n\n## Openmetrics / Prometheus\n\nMeshmonitor starts a simple web server on port 12223 that exposes Prometheus compatible metrics at the /metrics endpoint.\nOptionally, it can be configured to bind to a non-default network interface using the `-m` option.\n\nThe /metrics endpoint can be disabled using the `-d` option.\n\n### List of Metrics\n\nEach metric contains two basic labels:\n\n- `host_name` - the IP address of the host that meshmonitor is running on. This is the address passed to the `--bind`\n  or `-b` option.\n- `remote_host_name` - the IP address of the remote node that meshmonitor is communicating with. It's defined by the address passed to\n  the `--bind` or `-b` option of the meshmonitor process running on the remote end.\n\nMetrics contain three histograms for each host in the mesh and are encoded in\n[Prometheus format](https://prometheus.io/docs/instrumenting/exposition_formats/). This means that each histogram is\ndefined by multiple metrics like `meshmonitor_receive_seconds_sum`, `..._count`, `..._bucket{}`.\n\nMonitoring systems and their frontends like Grafana or Datadog know how to interpret histograms and will typically hide\nindividual metrics that define buckets and just expose a general histogram derived from them. These would\ntypically look like the following:\n\n| Histogram | Metric as seen in Datadog/Grafana | Description                                                                                                                  |\n|-----------|-----------------------------------|------------------------------------------------------------------------------------------------------------------------------|\n| receive   | `meshmonitor_receive_seconds`     | Time between heartbeats. 
This is the main metric to look at. It should be close to the heartbeat interval.                   |\n| delta     | `meshmonitor_delta_seconds`       | The difference between the timestamp encoded in the heartbeat and when the heartbeat was received.                           |\n| send      | `meshmonitor_send_seconds`        | Time between *send* thread wakeups, which should be close to 5ms. Measures a thread's ability to get scheduled on time.      |\n\nHistograms contain the following buckets: `10µs, 100µs, 500µs, 1ms, 2ms, 3ms, 4ms, 5ms, 6ms, 7ms, 8ms, 9ms, 10ms, 20ms, 30ms, 40ms, 50ms, 100ms, 200ms, 500ms, 1s, 2s, 5s, 10s, Inf+`.\n\n## Datadog Monitoring\n\nTo use a locally running Datadog agent to scrape meshmonitor metrics, create or\nedit `/etc/datadog-agent/conf.d/openmetrics.d/conf.yaml` with the following contents:\n\n```yaml\ninit_config:\n  service: 'meshmonitor'\n\ninstances:\n  - openmetrics_endpoint: 'http://localhost:12223/metrics'\n    namespace: 'meshmonitor'\n    metrics: [ \".*\" ]\n    histogram_buckets_as_distributions: true\n```\n\nA Datadog dashboard specific to meshmonitor is available. You can import it\nfrom the [JSON file](dashboards/datadog.json).\n\n## Running From a Jar\n\nUse the following command to run meshmonitor from the jar file:\n\n```shell\njava -jar meshmonitor.jar \u003cARGS\u003e \n```\n\nJava 8 is enough to run it, but Java 11 is required to build and run the tests.\n\n## Building\n\nA Java SDK, version 11 or above, is required to build and test Meshmonitor.\nMaven is used as the build system but does not need to be installed locally.\n\nThe `mvnw` script (or `mvnw.cmd` on Windows) is used to bootstrap the build\nand download the required Maven runtime files. 
To build Meshmonitor and run all tests:\n\n```shell\n./mvnw clean install\n```\n\nTo skip the tests, run:\n\n```shell\n./mvnw clean install -DskipTests\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvoltdb%2Fmeshmonitor","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fvoltdb%2Fmeshmonitor","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvoltdb%2Fmeshmonitor/lists"}