{"id":21997189,"url":"https://github.com/nordichpc/sonar","last_synced_at":"2026-03-03T15:07:04.830Z","repository":{"id":57469298,"uuid":"159653121","full_name":"NordicHPC/sonar","owner":"NordicHPC","description":"Tool to profile usage of HPC resources by regularly probing processes.","archived":false,"fork":false,"pushed_at":"2024-11-27T08:09:35.000Z","size":1997,"stargazers_count":8,"open_issues_count":22,"forks_count":5,"subscribers_count":4,"default_branch":"main","last_synced_at":"2024-11-27T09:21:37.832Z","etag":null,"topics":["cluster","hpc","monitoring","profiling","usage"],"latest_commit_sha":null,"homepage":"","language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/NordicHPC.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":"CITATION.cff","codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-11-29T11:07:12.000Z","updated_at":"2024-11-27T08:09:40.000Z","dependencies_parsed_at":"2023-02-19T00:15:57.969Z","dependency_job_id":"6cb3e72a-c14a-42d7-9d5e-e1efff82b866","html_url":"https://github.com/NordicHPC/sonar","commit_stats":{"total_commits":308,"total_committers":2,"mean_commits":154.0,"dds":0.08116883116883122,"last_synced_commit":"9888a86045423dfdf77a70b8ee9afc3bb5242168"},"previous_names":[],"tags_count":5,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NordicHPC%2Fsonar","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NordicHPC%2Fsonar/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NordicHPC%2Fsonar/releases","manifests_url":"https://
repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NordicHPC%2Fsonar/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/NordicHPC","download_url":"https://codeload.github.com/NordicHPC/sonar/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":227223709,"owners_count":17750386,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cluster","hpc","monitoring","profiling","usage"],"created_at":"2024-11-29T22:15:29.467Z","updated_at":"2026-03-03T15:07:04.743Z","avatar_url":"https://github.com/NordicHPC.png","language":"Rust","readme":"[![image](https://github.com/NordicHPC/sonar/workflows/Test/badge.svg)](https://github.com/NordicHPC/sonar/actions)\n[![image](https://img.shields.io/badge/license-%20GPL--v3.0-blue.svg)](LICENSE)\n\n\n# sonar\n\nTool to profile usage of HPC resources by regularly probing processes.\n\nSonar examines `/proc` and runs some diagnostic programs and filters and groups the output and\nprints it to stdout.  There are two output formats, [the old format](doc/OLD-FORMAT.md) and [the new\nformat](doc/NEW-FORMAT.md), currently coexisting but the old format will be phased out.  
Sonar can\nalso probe the system and report on its overall configuration.\n\n![image of a fish swarm](img/sonar-small.png)\n\nImage: [Midjourney](https://midjourney.com/), [CC BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/legalcode)\n\n\n## Subcommands\n\nSonar has several subcommands that collect information about nodes, jobs, clusters, and processes\nand print it on stdout:\n\n- `sonar ps` takes a snapshot of the currently running processes\n- `sonar sysinfo` extracts hardware information about the node\n- `sonar slurm` extracts information about overall job state from the Slurm databases\n- `sonar cluster` extracts information about partitions and node state from the Slurm databases\n- `sonar help` prints some useful help\n\n\n## Compilation and installation\n\n- Make sure you have [Rust installed](https://www.rust-lang.org/learn/get-started) (I install Rust through `rustup`)\n- Clone this project\n- Build it: `cargo build --release`\n- The binary is then located at `target/release/sonar`\n- Copy it to wherever it needs to be\n\nIf the build results in a link error for `libsonar-\u003csomething\u003e.a` then your binutils are too old;\nthis can be a problem on, e.g., RHEL9.  See comments in `gpuapi/Makefile` for how to resolve this.\n\n\n## Output format options\n\nThe recommended output format is the \"new\" JSON format.  Use the command line switch `--json` with\nall commands to force this format.  
Most subcommands currently default to either CSV or an older\nJSON format.\n\n\n## Collect processes with `sonar ps`\n\nIt's sensible to run `sonar ps` every 5 minutes on every compute node.\n\n```console\n$ sonar ps --help\nTake a snapshot of the currently running processes\n\nUsage: sonar ps [OPTIONS]\n\nOptions:\n      --batchless\n          Synthesize a job ID from the process tree in which a process finds itself\n      --rollup\n          Merge process records that have the same job ID and command name\n      --min-cpu-percent \u003cMIN_CPU_PERCENT\u003e\n          Include records for jobs that have on average used at least this percentage of CPU, note this is nonmonotonic [default: none]\n      --min-mem-percent \u003cMIN_MEM_PERCENT\u003e\n          Include records for jobs that presently use at least this percentage of real memory, note this is nonmonotonic [default: none]\n      --min-cpu-time \u003cMIN_CPU_TIME\u003e\n          Include records for jobs that have used at least this much CPU time (in seconds) [default: none]\n      --exclude-system-jobs\n          Exclude records for system jobs (uid \u003c 1000)\n      --exclude-users \u003cEXCLUDE_USERS\u003e\n          Exclude records for these comma-separated user names [default: none]\n      --exclude-commands \u003cEXCLUDE_COMMANDS\u003e\n          Exclude records whose commands start with these comma-separated names [default: none]\n      --lockdir \u003cLOCKDIR\u003e\n          Create a per-host lockfile in this directory and exit early if the file exists on startup [default: none]\n  -h, --help\n          Print help\n```\n\n**NOTE** that if you use `--lockdir`, it should name a directory that is cleaned on reboot, such as\n`/var/run`, `/run`, or a tmpfs, and ideally it is a directory on a disk local to the node, not a\nshared disk.\n\nHere is an example output (with the default CSV output format):\n```console\n$ sonar ps --exclude-system-jobs --min-cpu-time=10 
--rollup\n\nv=0.7.0,time=2023-08-10T11:09:41+02:00,host=somehost,cores=8,user=someone,job=0,cmd=fish,cpu%=2.1,cpukib=64400,gpus=none,gpu%=0,gpumem%=0,gpukib=0,cputime_sec=138\nv=0.7.0,time=2023-08-10T11:09:41+02:00,host=somehost,cores=8,user=someone,job=0,cmd=sonar,cpu%=761,cpukib=372,gpus=none,gpu%=0,gpumem%=0,gpukib=0,cputime_sec=137\nv=0.7.0,time=2023-08-10T11:09:41+02:00,host=somehost,cores=8,user=someone,job=0,cmd=brave,cpu%=14.6,cpukib=2907168,gpus=none,gpu%=0,gpumem%=0,gpukib=0,cputime_sec=3532\nv=0.7.0,time=2023-08-10T11:09:41+02:00,host=somehost,cores=8,user=someone,job=0,cmd=alacritty,cpu%=0.8,cpukib=126700,gpus=none,gpu%=0,gpumem%=0,gpukib=0,cputime_sec=51\nv=0.7.0,time=2023-08-10T11:09:41+02:00,host=somehost,cores=8,user=someone,job=0,cmd=pulseaudio,cpu%=0.7,cpukib=90640,gpus=none,gpu%=0,gpumem%=0,gpukib=0,cputime_sec=399\nv=0.7.0,time=2023-08-10T11:09:41+02:00,host=somehost,cores=8,user=someone,job=0,cmd=slack,cpu%=3.9,cpukib=716924,gpus=none,gpu%=0,gpumem%=0,gpukib=0,cputime_sec=266\n```\n\n## Collect system information with `sonar sysinfo`\n\nThe `sysinfo` subcommand collects information about the system and prints it in JSON form on stdout\n(this is the older JSON format):\n\n```console\n$ sonar sysinfo\n{\n \"timestamp\": \"2024-02-26T00:00:02+01:00\",\n \"hostname\": \"ml1.hpc.uio.no\",\n \"description\": \"2x14 (hyperthreaded) Intel(R) Xeon(R) Gold 5120 CPU @ 2.20GHz, 125 GB, 3x NVIDIA GeForce RTX 2080 Ti @ 11GB\",\n \"cpu_cores\": 56,\n \"mem_gb\": 125,\n \"gpu_cards\": 3,\n \"gpumem_gb\": 33\n}\n```\n\nTypical usage for `sysinfo` is to run the command after reboot and (for hot-swappable systems and\nVMs) once every 24 hours, and to aggregate the information in some database.\n\nThe `sysinfo` subcommand currently has no options.\n\n\n## Collecting job information with `sonar slurm`\n\nTo be written.\n\nThis command exists partly to allow clusters to always push data, partly to collect the data for\nlong-term storage, partly to offload the Slurm 
database manager during query processing.\n\n## Collecting partition and node information with `sonar cluster`\n\nTo be written.\n\nThis command exists partly to allow clusters to always push data, partly to collect the data for\nlong-term storage.\n\n## Collect and analyze results\n\nSonar data are used by two other tools:\n\n* [JobGraph](https://github.com/NordicHPC/jobgraph) provides high-level plots of system activity. Mapping\n  files for JobGraph can be found in the [data](data) folder.\n* [JobAnalyzer](https://github.com/NAICNO/Jobanalyzer) allows sonar logs to be queried and analyzed, and\n  provides dashboards, interactive and batch queries, and reporting of system activity, policy violations,\n  hung jobs, and more.\n\n\n## Output formats\n\nSee [doc/OLD-FORMAT.md](doc/OLD-FORMAT.md) and [doc/NEW-FORMAT.md](doc/NEW-FORMAT.md) for\nspecifications of the output data formats and the semantics of individual fields.\n\n\n## Versions and release procedures\n\n### Version numbers\n\nThe following basic versioning rules are new with v0.8.0.\n\nWe use semantic versioning.  The major version is expected to remain at zero for the foreseeable\nfuture, reflecting the experimental nature of Sonar.\n\nThe minor version is updated with changes that alter the output format deliberately: fields are\nadded, removed, or are given a new meaning (this has been avoided so far), or the record format\nitself changes.  For example, v0.8.0 both added fields and stopped printing fields that are zero.\n\nThe bugfix version is updated for changes that do not alter the output format per se but that might\naffect the output nevertheless, ie, most changes not covered by changes to the minor version number.\n\n\n### Release branches, uplifts and backports\n\nThe following branching scheme is new with v0.12.x.\n\nThe `main` branch is used for development and has a version number of the form `M.N.O-PRE` where\n\"PRE\" is some arbitrary string, eg \"devel\", \"rc4\".  
Note that this version number form will also be\npresent in the output of `sonar ps`, to properly tag those data.  If clients are exposed to\nprerelease `ps` data they must be prepared to deal with this.\n\nFor every freeze of the minor release number, a new release branch is created in the repo with\nthe name `release_\u003cmajor\u003e_\u003cminor\u003e`.  Again, we expect `\u003cmajor\u003e` to remain `0` for the foreseeable\nfuture, so `release_0_12` is the v0.12.x release branch.  At branching time, the minor release\nnumber is incremented on main (so when we created `release_0_12` for v0.12.1, the version number on\n`main` went to `0.13.0-devel`).  The version number on a release branch is strictly of the form\nM.N.O.\n\nWhen a release `M.N.O` is to be made from a release branch, a tag is created of the form\n`release_M_N_O` on that branch and the release is built from that changeset.  Once the release has\nshipped, the bugfix version number on the branch is incremented.\n\nWith the branches come some additional rules for how to move patches around:\n\n- If a bugfix is made to any release branch and the bug is present on main then the PR shall be\n  tagged \"uplift-required\"; the PR shall subsequently be uplifted to main; and following uplift the tag\n  shall be changed to \"uplifted-to-main\".\n- If a bugfix is made to main it shall be considered whether it should be backported to the most recent\n  release branch.  If so, the PR shall be tagged \"backport-required\"; the PR shall subsequently be\n  cherry-picked or backported to the release branch; and following backport the tag shall be changed\n  to \"backported-to-release\".  
No older release branches shall automatically be considered for\n  backports.\n\n\n### Policies for changing Rust edition and minimum Rust version\n\nAt the time of writing we require:\n- 2021 edition of Rust\n- Rust 1.65.0 (can be found with `cargo msrv find`)\n\nPolicy for changing the minimum Rust version:\n- Open a GitHub issue and motivate the change\n- Once we reach agreement in the issue discussion:\n  - Update the version inside the test workflow [test-minimal.yml](.github/workflows/test-minimal.yml)\n  - Update the documentation (this section)\n\n\n## Authors\n\n- [Radovan Bast](https://bast.fr)\n- Mathias Bockwoldt\n- [Lars T. Hansen](https://github.com/lars-t-hansen)\n- Henrik Rojas Nagel\n\n\n## Early design goals and design decisions\n\n- Easy installation\n- Minimal overhead for recording\n- Can be used as health check tool\n- Does not need root permissions\n\n**Use `ps` instead of `top`**:\nWe started using `top` but it turned out that `top` is dependent on locale, so\nit displays floats with comma instead of decimal point in many non-English\nlocales. `ps` always uses decimal points. In addition, `ps` is (arguably) more\nversatile/configurable and does not print the header that `top` prints. All\nthese properties make the `ps` output easier to parse than the `top` output.\n\n**Do not interact with the Slurm database at all**:\nThe initial version correlated information we gathered from `ps` (what is\nactually running) with information from Slurm (what was requested). 
This was\nuseful and nice to have but became complicated to maintain since Slurm could\nbecome unresponsive and then processes would pile up.\n\n**Why not also record the `pid`?**:\nBecause we sum over processes of the same name that may be running over many\ncores to have less output so that we can keep logs in plain text\n([csv](https://en.wikipedia.org/wiki/Comma-separated_values)) and don't have to\nmaintain a database or such.\n\n\n## Later design goals and design decisions\n\nThe needs of [Jobanalyzer](https://github.com/NAICNO/Jobanalyzer) and some bug fixes have led to\nsome feature creep (more data are reported), a bit of redesign (go directly to `/proc`, do not run\n`ps`), and some quirky semantics (`cpu%` is only a good number for the first data point but is still\nalways reported, and `cputime_sec` is reported to complement it; and there's a distinction between\nvirtual and real memory that is possibly more useful on GPU-full and interactive systems than on HPC\nCPU-only compute nodes).\n\n\n## Security and robustness\n\nThe tool does **not** need root permissions.  It does not modify anything and writes output to\nstdout (and errors to stderr).\n\nNo external commands are called by `sonar ps` or `sonar sysinfo`: Sonar reads `/proc` and probes the\nGPUs via their manufacturers' SMI libraries to collect all data.\n\nThe Slurm `sacct` command is currently run by `sonar slurm`.  A timeout mechanism is in place to\nprevent this command from hanging indefinitely.\n\nOptionally, `sonar` will use a lockfile to avoid a pile-up of processes.\n\n\n## Dependencies and updates\n\nSonar runs everywhere and all the time, and even though it currently runs without privileges it\nstrives to have as few dependencies as possible, so as not to become a target through a supply chain\nattack.  
There are some rules:\n\n- It's OK to depend on libc and to incorporate new versions of libc\n- It's better to depend on something from the rust-lang organization than on something else\n- Every dependency needs to be justified\n- Every dependency must have a compatible license\n- Every dependency needs to be vetted as to active development, apparent quality, test cases\n- Every dependency update - even for security issues - is to be considered a code change that needs review\n- Remember that indirect dependencies are dependencies for us, too, and need to be treated the same way\n- If in doubt: copy the parts we need, vet them thoroughly, and maintain them separately\n\nThere is a useful discussion of these matters [here](https://research.swtch.com/deps).\n\n\n## How we run sonar on a cluster\n\nWe let cron execute the following script every 5 minutes on every compute node:\n\n```bash\n#!/usr/bin/env bash\n\nset -euf -o pipefail\n\nsonar_directory=/cluster/shared/sonar/data\n\npath=$(date '+%Y/%m/%d')\noutput_directory=${sonar_directory}/${path}\n\nmkdir -p ${output_directory}\n\n/cluster/bin/sonar ps \u003e\u003e ${output_directory}/${HOSTNAME}.csv\n```\n\nThis produces ca. 25-50 MB data per day on Saga (using mostly the old v0.5.0 output format), 5-20 MB\nper day on Fox (including login and interactive nodes, using the new v0.8.0 output format), and 10-20 MB per\nday on the UiO ML nodes (all interactive), with significant variation.  
Being text data, it\ncompresses extremely well.\n\n\n## Similar and related tools\n\n- Reference implementation which serves as inspiration:\n  \u003chttps://github.com/UNINETTSigma2/appusage\u003e\n- [TACC Stats](https://github.com/TACC/tacc_stats)\n- [Ganglia Monitoring System](http://ganglia.info/)\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnordichpc%2Fsonar","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fnordichpc%2Fsonar","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnordichpc%2Fsonar/lists"}