https://github.com/klieret/ray-tune-slurm-demo
Testing ray tune with slurm batch submission and optuna and wandb
- Host: GitHub
- URL: https://github.com/klieret/ray-tune-slurm-demo
- Owner: klieret
- License: mit
- Created: 2022-08-26T17:11:36.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2024-09-02T23:26:00.000Z (4 months ago)
- Last Synced: 2024-10-04T10:24:29.946Z (3 months ago)
- Topics: hyperparameter-optimization, hyperparameter-tuning, ml, optuna, ray, slurm, wandb
- Language: Python
- Homepage:
- Size: 241 KB
- Stars: 5
- Watchers: 1
- Forks: 3
- Open Issues: 1
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE.txt
## 📝 Description
This repository demonstrates/tests hyperparameter optimization with the following frameworks:
* [Ray tune][tune] as parent framework and to start jobs with [SLURM][slurm]
* [Optuna][optuna] to suggest the hyperparameters
* [Wandb (Weights & Biases)][wandb] to log and visualize the results

> **Note**
> If you want to see this tech stack in an actual use case, see the [GNN tracking Hyperparameter Optimization repository][gnn-tracking-hpo].

## 📦 Installation
Use the conda environment, THEN `pip` install the package.
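
Before submitting any jobs, it can help to confirm that the environment is complete. The following is only a sketch, not part of this repository: the module names are assumptions (`ray`, `optuna` and `wandb` are the usual import names of the frameworks listed above, and `rtstest` is inferred from the `src/rtstest` path).

```python
from importlib.util import find_spec

# Assumed top-level module names; "rtstest" is inferred from the
# `src/rtstest` path and may differ from the actual package name.
REQUIRED_MODULES = ("ray", "optuna", "wandb", "rtstest")


def missing_modules(names=REQUIRED_MODULES):
    """Return the module names that cannot be found in the active environment."""
    return [name for name in names if find_spec(name) is None]


# Report the result for the default list of modules.
missing = missing_modules()
print("Missing modules:", ", ".join(missing) if missing else "none")
```

If anything is reported as missing, the conda environment is probably not activated or the `pip` install step has not been run inside it.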
## 🔥 Running it!
### First test without batch system
* First run `src/rtstest/dothetune.py` directly (no batch submission). This also downloads the
  data file, which is needed because the compute nodes have no internet connection.

### Option 1: All-in-one
For a single batch job that uses multiple nodes to start both the head node and the workers, see
`slurm/all-in-one`. While this is the example used in the Ray documentation, it might not be
the best choice for most use cases, as it relies on enough nodes being available at the same time
and for long enough to complete all requested trials.

#### Live syncing to wandb
Because the compute nodes usually do not have internet, we need a separate tool for this.
See the documentation of [wandb-osh] for how to start the syncer on the head node.

### Option 2: Head node and worker nodes
Here, we start the ray head on the head (login) node and then use batch submission to start
worker nodes asynchronously.
Follow these steps:

1. Run `slurm/head_workers/start-on-headnode.sh` and note down the IP and Redis password that are printed out
2. Submit several batch jobs: `sbatch slurm/head_workers/start-on-worker.slurm`
3. Start your tuning script on the head node: `slurm/head_workers/start-program.sh`

> **Note**
> In my HPO scripts at [my main ML project][gnn-tracking-hpo] I instead write out the IP
> and password to files in my home directory and have dependent scripts read from there
> rather than passing them around on the command line.

Once the batch jobs for the workers start running, you should see activity in the tuning script output.
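
The file-based handoff mentioned in the note above could be sketched roughly as follows. This is an illustration under assumptions, not the code used in the linked project: the file name `~/.ray_head_info.json`, the JSON layout, and both function names are hypothetical.

```python
import json
from pathlib import Path

# Hypothetical location in the home directory, which is assumed to be
# shared between the head (login) node and the worker nodes.
DEFAULT_PATH = Path.home() / ".ray_head_info.json"


def write_head_info(ip, password, path=DEFAULT_PATH):
    """Store the head node IP and Redis password for dependent scripts."""
    path = Path(path)
    path.touch(mode=0o600, exist_ok=True)  # create the file if needed
    path.chmod(0o600)  # the file holds a password: owner access only
    path.write_text(json.dumps({"ip": ip, "redis_password": password}))
    return path


def read_head_info(path=DEFAULT_PATH):
    """Return (ip, redis_password) as written by write_head_info."""
    info = json.loads(Path(path).read_text())
    return info["ip"], info["redis_password"]
```

In such a setup, the script on the head node would call `write_head_info` once after starting Ray, and the worker and tuning scripts would call `read_head_info` instead of taking the values as command-line arguments.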
[tune]: https://docs.ray.io/en/master/tune/index.html
[tigergpu]: https://researchcomputing.princeton.edu/systems/tiger
[optuna]: https://optuna.org/
[wandb]: https://wandb.ai/site
[slurm]: https://slurm.schedmd.com/
[wandb-osh]: https://github.com/klieret/wandb-offline-sync-hook/
[gnn-tracking-hpo]: https://github.com/gnn-tracking/hyperparameter_optimization