https://github.com/Scotchman0/GPU-heat-logging
Quick and dirty script to grep nvidia-SMI temps, and CPU temps, write them to log every X seconds for troubleshooting purposes
- Host: GitHub
- URL: https://github.com/Scotchman0/GPU-heat-logging
- Owner: Scotchman0
- Created: 2020-11-18T20:31:07.000Z (about 4 years ago)
- Default Branch: main
- Last Pushed: 2022-11-15T18:44:44.000Z (about 2 years ago)
- Last Synced: 2024-04-09T12:00:46.159Z (9 months ago)
- Topics: heat, lm-sensors, logger, logging, nvidia, nvidia-smi, sensors, temperature, temperature-monitoring, tracking, ubuntu
- Language: Shell
- Homepage:
- Size: 13.7 KB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# GPU-heat-logging
Quick and dirty script to grep nvidia-SMI temps, and CPU temps, write them to log every X seconds for troubleshooting purposes
# Supported systems:
Ubuntu 18.04 LTS+, with NVIDIA GPU(s) installed
- Requires lm-sensors and moreutils to run properly (the script will prompt you to install them if you start without them).
- lm-sensors reads the CPU core temperatures, and moreutils includes the 'ts' command needed to stamp the date/time in the logs.
# How to use:
1. Clone this repository with 'sudo git clone https://github.com/Scotchman0/GPU-heat-logging', or copy the contents of gpu-heat-log.sh into a new script file on your endpoint.
2. Make the script executable with chmod +x gpu-heat-logging.sh
3. Run the script via terminal with:
> ./gpu-heat-logging.sh
4. Select the interval length (sets the sleep command between greps). 1-5 seconds is recommended.
5. Select how many log lines you want to pull: 1-9999999 (you can always press Ctrl+C to cancel the script and exit at any time to review the logs).
6. Choose whether you'd like to view the output as it writes to the log, or just a counter showing how many passes have been written to file, which you can keep in the corner while you try to recreate your problem.
7. Review the files ~/Desktop/CPU_HEAT.log and ~/Desktop/GPU_HEAT.log for output.
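
For readers who want a feel for what the steps above amount to, here is a minimal sketch of a dependency check plus an interval-and-pass-count logging loop. It is a hypothetical illustration and not the repository's actual script: the function names `check_deps` and `log_temps`, the nvidia-smi query fields, the `grep` pattern for `sensors` output, and the `ts` timestamp format are all assumptions chosen for the example.

```shell
#!/usr/bin/env bash
# Hypothetical sketch only; the real gpu-heat-logging.sh may differ.

# check_deps CMD...: report each missing command, return 1 if any is missing.
check_deps() {
    local missing=0 cmd
    for cmd in "$@"; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            echo "missing command: $cmd" >&2
            missing=1
        fi
    done
    return "$missing"
}

# log_temps INTERVAL PASSES GPU_LOG CPU_LOG
# On each pass, append timestamped GPU and CPU readings to the two logs,
# print a running pass counter, then sleep INTERVAL seconds.
log_temps() {
    local interval=$1 passes=$2 gpu_log=$3 cpu_log=$4 i=1
    while [ "$i" -le "$passes" ]; do
        # One line per GPU, assumed format: "<index>, <temperature in C>"
        nvidia-smi --query-gpu=index,temperature.gpu --format=csv,noheader \
            | ts '%Y-%m-%d %H:%M:%S' >> "$gpu_log"
        # Assumes per-core lines in the sensors output contain "Core"
        sensors | grep -i 'core' | ts '%Y-%m-%d %H:%M:%S' >> "$cpu_log"
        printf '\rpasses written: %d' "$i"
        i=$((i + 1))
        sleep "$interval"
    done
    printf '\n'
}

# Example call (sample every 2 seconds, 100 passes):
# check_deps nvidia-smi sensors ts && \
#     log_temps 2 100 ~/Desktop/GPU_HEAT.log ~/Desktop/CPU_HEAT.log
```

The `grep -i 'core'` filter matches the per-core lines that `sensors` typically prints on Intel CPUs; other hardware may label its readings differently, so the pattern would need adjusting there.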
# Why do I need this script?
You might not, but I was having a hard time figuring out whether my GPUs were crashing because of overheating, and I wanted to be able to write the internal temps to file while I stressed the systems. It's not a stress test; all it does is log the recorded temperatures for your cores and your GPUs and timestamp them. You can spin up a job, and if you can more or less point to the log and say "oh weird, my GPU spiked above 100C right before it turned off", you might be able to solve your problem.
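
To hunt for that kind of spike after a run, a small filter over the GPU log can save some scrolling. The sketch below is hypothetical: the function name `find_spikes` is made up for illustration, and it assumes each log line ends with a comma followed by the temperature in degrees C (for example `2024-01-01 12:00:02 0, 101`), which may not match what the script actually writes.

```shell
#!/usr/bin/env bash
# Hypothetical helper; assumes the temperature is the last comma-separated field.

# find_spikes LOG_FILE LIMIT: print every line whose reading is >= LIMIT.
find_spikes() {
    awk -F', *' -v limit="$2" '$NF + 0 >= limit + 0 { print }' "$1"
}

# Example call:
# find_spikes ~/Desktop/GPU_HEAT.log 100
```

Because each matching line keeps its timestamp, the output can be compared directly against the time the machine shut down.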