https://github.com/viktor-shcherb/triage
Script running tool for optimizing GPU memory utilization
- Host: GitHub
- URL: https://github.com/viktor-shcherb/triage
- Owner: viktor-shcherb
- Created: 2022-07-14T18:54:36.000Z (over 3 years ago)
- Default Branch: master
- Last Pushed: 2025-05-02T11:33:23.000Z (11 months ago)
- Last Synced: 2025-09-20T11:45:39.753Z (6 months ago)
- Topics: automation, cli, cuda, deep-learning, devops-tools, experiment-runner, gpu-monitoring, gpu-scheduler, hyperparameter-sweep, job-queue, machine-learning, nvidia-smi, pypi-package, python, resource-management, script-runner
- Language: Python
- Homepage:
- Size: 43 KB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# TRIAGE
Triage is a lightweight Python runner that turns a pile of training scripts into a self-managed queue, launching each job only when the GPUs have enough free memory. Give it one or more JSON “run configurations”, each with a memory budget and an argument grid. Triage watches `nvidia-smi`, starts a job as soon as enough memory is free, iterates through all parameter combinations, and records each run under a unique task name. This keeps VRAM on shared servers from sitting underused and lets you squeeze every last gigabyte out of your hardware during deep-learning experiments.
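The memory check described above can be sketched roughly as follows. This is an illustration only, not Triage's actual implementation: the function names are hypothetical, and the sketch assumes the budget is compared in MiB, which is the unit `nvidia-smi` reports with `nounits`.

```python
import subprocess


def parse_free_memory_mib(csv_text):
    """Parse one free-memory value (MiB) per line, skipping blank lines."""
    return [int(line.strip()) for line in csv_text.splitlines() if line.strip()]


def query_free_memory_mib():
    """Ask nvidia-smi for free memory per GPU in MiB (needs an NVIDIA driver)."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.free", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_free_memory_mib(result.stdout)


def gpus_with_room(free_mib, needed_mib):
    """Return indices of GPUs whose free memory meets the budget (hypothetical helper)."""
    return [index for index, free in enumerate(free_mib) if free >= needed_mib]
```

A scheduler built on this idea would poll `query_free_memory_mib()` and launch the next queued job once `gpus_with_room(...)` returns a non-empty list.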
## Installation
```bash
pip install triage-runner
```
## Usage
See the `--help` option for the full list of available arguments.
Running one config:
```bash
triage run_config.json
```
Running several configs:
```bash
triage run_config1.json run_config2.json run_config3.json
```
Patterns can be used for config discovery as well:
```bash
triage run_config*.json
```
More on pattern syntax can be found here: https://docs.python.org/3.10/library/pathlib.html#pathlib.Path.glob
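As a rough illustration of how such patterns resolve with `pathlib.Path.glob`, the sketch below collects matching config files from a directory. The helper name `discover_configs` and its `root` parameter are hypothetical and are not part of Triage.

```python
from pathlib import Path


def discover_configs(patterns, root="."):
    """Expand relative glob patterns under `root`, keeping order and dropping duplicates."""
    root = Path(root)
    found = []
    for pattern in patterns:
        # Path.glob yields matches in arbitrary order, so sort for stable results.
        for path in sorted(root.glob(pattern)):
            if path not in found:
                found.append(path)
    return found
```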
## Run configurations
Run configurations are stored in JSON format. A sample run configuration looks like this:
```json
{
  "memory_needed": 10.0,
  "config_name": "sample_config",
  "command": "python3 train.py",
  "args": [
    "arg1",
    "--arg2",
    ["--seed=1", "--seed=2", "--seed=3"],
    "--arg3=3"
  ]
}
```
Every entry in the `args` list is an argument for `command`. An entry can itself be a list, in which case TRIAGE iterates through all possible combinations of the values in the list entries. The sample configuration above runs the script 3 times, with the `--seed` argument set to 1, 2, and 3.
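The expansion of list entries into combinations can be sketched with `itertools.product`. This is a minimal illustration of the behavior described above, not TRIAGE's own code; the function name `expand_args` is hypothetical.

```python
from itertools import product


def expand_args(args):
    """Return one argument list per combination of the list-valued entries."""
    # Treat a plain string as a single-choice entry so every position has choices.
    choices = [entry if isinstance(entry, list) else [entry] for entry in args]
    return [list(combo) for combo in product(*choices)]


args = ["arg1", "--arg2", ["--seed=1", "--seed=2", "--seed=3"], "--arg3=3"]
for combo in expand_args(args):
    print("python3 train.py " + " ".join(combo))
```

With two list entries of sizes 2 and 3, the same function would yield 6 combinations.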
The `config_name` parameter is optional and is used when logging results (see the `--logfile` option). Based on this parameter, the `TASK_NAME` environment variable is set so that the running script can use it.
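A launched script could read `TASK_NAME` along these lines. This is a sketch under assumptions: the exact format of the value is not specified here, and the helper names, the fallback value, and the `checkpoints` directory are all hypothetical.

```python
import os


def get_task_name(default="unnamed_task"):
    """Read TASK_NAME from the environment, falling back when run outside the runner."""
    return os.environ.get("TASK_NAME", default)


def checkpoint_path(base_dir="checkpoints"):
    """Build a per-task output path so parallel runs do not overwrite each other."""
    return os.path.join(base_dir, get_task_name() + ".pt")
```

The fallback keeps the script usable when it is started by hand rather than through the queue.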