https://github.com/digitalocean/vmtop
Real-time monitoring of KVM/Qemu VMs
https://github.com/digitalocean/vmtop
bcc ebpf kvm monitoring performance prometheus qemu virtualization
Last synced: 12 months ago
JSON representation
Real-time monitoring of KVM/Qemu VMs
- Host: GitHub
- URL: https://github.com/digitalocean/vmtop
- Owner: digitalocean
- License: apache-2.0
- Created: 2020-03-17T20:18:54.000Z (about 6 years ago)
- Default Branch: master
- Last Pushed: 2024-04-20T04:48:36.000Z (about 2 years ago)
- Last Synced: 2025-06-11T01:09:28.110Z (about 1 year ago)
- Topics: bcc, ebpf, kvm, monitoring, performance, prometheus, qemu, virtualization
- Language: Python
- Homepage:
- Size: 164 KB
- Stars: 66
- Watchers: 10
- Forks: 21
- Open Issues: 8
-
Metadata Files:
- Readme: readme.md
- License: LICENSE
Awesome Lists containing this project
README
# vmtop
Monitor various metrics for KVM/qemu guests and output in a top-like fashion.
Also aggregate the usage/allocation metrics at the NUMA node and host-level.
## Main features
It can output as text on the console and write CSV files. The `graph-vmtop.py`
script generates graphs from those CSV files (record with `--csv `).
This script can also be ran with a Prometheus exporter enabled, like so:
```
sudo ./vmtop.py --prometheus [host:port]
```
The host and port are optional, and will default to localhost:8000 if not
specified.
## Example output
Example output for the top 10 VMs experiencing the most vcpu steal per node:
```
$ sudo ./vmtop.py --vm --limit 10 --sort vcpu_steal
[...]
Node 0:
Name PID vcpu util vcpu steal vhost util vhost steal emu util emu steal disk read disk write rx tx
Guest01 48642 141.73 % 32.86 % 0.30 % 0.06 % 0.57 % 0.48 % 0.00 MB/s 0.00 MB/s 0.00 MB/s 0.00 MB/s
Guest02 16830 219.07 % 29.74 % 28.45 % 22.13 % 0.54 % 1.30 % 0.00 MB/s 0.01 MB/s 2.49 MB/s 2.43 MB/s
Guest03 17152 119.61 % 27.87 % 13.89 % 16.31 % 0.48 % 0.12 % 0.00 MB/s 0.00 MB/s 1.45 MB/s 1.39 MB/s
Guest04 60782 52.90 % 16.70 % 0.79 % 0.75 % 0.61 % 0.19 % 0.00 MB/s 0.01 MB/s 0.02 MB/s 0.03 MB/s
Guest05 33077 46.47 % 12.69 % 2.02 % 1.82 % 3.11 % 0.97 % 0.00 MB/s 2.06 MB/s 0.24 MB/s 0.14 MB/s
Guest06 39435 41.70 % 10.01 % 17.36 % 12.85 % 0.98 % 0.08 % 0.00 MB/s 0.00 MB/s 0.27 MB/s 0.30 MB/s
Guest07 5196 26.49 % 9.28 % 0.03 % 0.00 % 0.51 % 0.09 % 0.00 MB/s 0.00 MB/s 0.00 MB/s 0.00 MB/s
Guest08 56822 55.63 % 8.77 % 9.33 % 5.61 % 0.53 % 0.17 % 0.00 MB/s 0.00 MB/s 0.30 MB/s 0.30 MB/s
Guest09 65751 36.59 % 8.54 % 6.26 % 3.42 % 0.76 % 0.06 % 0.08 MB/s 0.04 MB/s 0.18 MB/s 0.18 MB/s
Guest10 25320 26.61 % 8.44 % 0.06 % 0.00 % 5.30 % 2.72 % 0.00 MB/s 0.06 MB/s 0.00 MB/s 0.00 MB/s
Node 0: vcpu util: 80.85%, vcpu steal: 6.49%, emulators util: 1.71%, emulators steal: 0.89%
Node 0: 79 VMs (168 vcpus, 306.00 GB mem allocated, 231.98 GB mem used)
Node 1:
Name PID vcpu util vcpu steal vhost util vhost steal emu util emu steal disk read disk write rx tx
Guest11 14506 29.49 % 0.70 % 0.00 % 0.00 % 0.45 % 0.01 % 0.00 MB/s 0.01 MB/s 0.00 MB/s 0.00 MB/s
Guest12 52580 28.69 % 0.63 % 4.55 % 0.26 % 0.60 % 0.08 % 0.12 MB/s 0.00 MB/s 0.16 MB/s 0.15 MB/s
Guest13 60864 36.45 % 0.56 % 0.06 % 0.00 % 4.25 % 0.06 % 0.00 MB/s 0.03 MB/s 0.00 MB/s 0.00 MB/s
Guest14 69426 64.98 % 0.54 % 11.06 % 0.79 % 0.09 % 0.00 % 0.00 MB/s 0.02 MB/s 4.20 MB/s 3.87 MB/s
Guest15 52138 52.45 % 0.39 % 10.66 % 0.64 % 0.75 % 0.02 % 0.22 MB/s 0.00 MB/s 0.43 MB/s 0.38 MB/s
Guest16 38308 57.05 % 0.25 % 12.03 % 0.93 % 1.18 % 0.03 % 0.51 MB/s 0.00 MB/s 0.47 MB/s 0.56 MB/s
Guest17 7700 11.87 % 0.25 % 0.08 % 0.00 % 4.16 % 0.05 % 0.00 MB/s 0.03 MB/s 0.00 MB/s 0.00 MB/s
Guest18 39542 39.46 % 0.24 % 7.61 % 0.51 % 0.77 % 0.02 % 0.16 MB/s 0.00 MB/s 0.27 MB/s 0.27 MB/s
Guest19 1381 49.03 % 0.24 % 9.01 % 0.72 % 0.76 % 0.05 % 0.25 MB/s 0.00 MB/s 0.35 MB/s 0.43 MB/s
Guest20 6945 36.15 % 0.24 % 6.42 % 0.45 % 0.77 % 0.04 % 0.21 MB/s 0.00 MB/s 0.23 MB/s 0.36 MB/s
Node 1: vcpu util: 41.88%, vcpu steal: 0.20%, emulators util: 1.87%, emulators steal: 0.08%
Node 1: 131 VMs (161 vcpus, 210.00 GB mem allocated, 193.42 GB mem used)
```
# Dependencies
The tool currently depends on `numastat` being in the `$PATH`.
Besides that dependency, one design rule of this tool is that it should be easy
to just download it and run without any additional setup. We use very common
Python modules and the advanced features such as Prometheus and eBPF are
optional.
This tool has been mostly tested on Ubuntu Bionic 18.04, but should be easy to
run on other platforms.
The optional dependencies are:
* `python3-daemon` for the the `--daemon` option
* `python3-bcc` for the `--vmexit` option
* `python3-prometheus-client` for the `--prometheus` option
## Assumptions
The node-level information assumes a VM mostly runs where most of its memory is
allocated, if a VM can float between NUMA nodes, the node-level information may
not be accurate (and a warning will be shown). The VM-level data is accurate
regardless of the pinning.
## Feedback
This tool is a quick script to solve the issue of live monitoring resource
allocation and usage by VMs. We plan to keep improving it over time.
## Extras
In the `extras` folder, there are other ad-hoc tools for monitoring various
performance aspects of KVM. The tools are in various states of maintenance,
feel free to reach out if you have questions or suggestions for improvements.
The features from those scripts may end up in vmtop at some point, but for now
they are tested outside.
### guesttime.bt
`bpftrace` tool to check statistically how long a vCPU spends inside the guest
when it is scheduled in.
### Core-scheduling and KVM
The `core-sched-stats.py` script ensures that the core scheduling
feature works properly and accounts for the time spent by a vCPU in various
scheduling modes (co-scheduled with idle, with a compatible task, or a foreign
task).
This script works with perf trace recorded:
```
sudo perf record -e kvm:kvm_entry -e kvm:kvm_exit -e sched:sched_switch -e sched:sched_waking -o perf.data -a sleep 60
```
and can be converted to CTF like so (requires perf compiled with CTF support):
```
perf data convert --to-ctf=./ctf -i perf.data
```
If you see this message and the number of chunks is greater than 1 or 2, consider writing your `perf.data` to a ramdisk instead
of your local disk:
```
Processed 7378132 events and lost 1 chunks!
Check IO/CPU overload!
```
It depends on Babeltrace compiled with the python library and Perf compiled with the CTF support. On bionic:
```
apt-get install libbabeltrace-dev libbabeltrace-ctf-dev python3-babeltrace babeltrace
```
Then rebuild `perf` so it will detect the new library.
In order to run `kvm_co_sched_stats.py`, the siblings list must be provided in a text file with the `--topology` flag.
To collect this data, run this on the target HV:
```
for i in /sys/devices/system/cpu/cpu*/topology/thread_siblings_list; do cat $i; done > out.txt
```