https://github.com/gabriel-milan/scoach
A setup for training TensorFlow models on SLURM clusters
- Host: GitHub
- URL: https://github.com/gabriel-milan/scoach
- Owner: gabriel-milan
- License: gpl-3.0
- Created: 2021-08-15T19:10:35.000Z
- Default Branch: master
- Last Pushed: 2021-10-18T13:22:24.000Z
- Last Synced: 2024-09-14T13:29:35.944Z
- Language: Python
- Homepage:
- Size: 83 KB
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# scoach
A setup for training TensorFlow models on SLURM clusters
## How does it work?
- Inputs needed (see the examples directory):
  - A `.json` file with the training parameters
  - A `.json` file with the model definition
  - A `.py` file with the training code
- A CLI app is provided for interacting with scoach:
  - Run `scoach init` to set up your configuration file (see `config_example.yaml` for an example).
  - On the login machine of the SLURM cluster, run `scoach start`. This starts a daemon that launches jobs as they are requested.
  - On any machine, run `scoach run submit` to submit jobs.
    - This uploads the Python script to MinIO and submits the configurations to the database.
    - The daemon process consumes the new runs, uses Jinja2 to render the training script, and submits it to the cluster.
    - The training script then runs on the cluster using Dask workers, which scale up as needed.
## To do
- [x] Add a `--local` option to `scoach start` for launching runs locally
- [ ] Add support for uploading/managing datasets
- [ ] Avoid uploading duplicate Python scripts
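## Example sketch
The submit-and-consume flow described under "How does it work?" can be pictured with a small, self-contained sketch. This is not scoach's actual code: the dictionary standing in for MinIO, the queue standing in for the database, the function names, the template contents, and the `--params` flag are all hypothetical. The standard library's `string.Template` is used here as a dependency-free stand-in for Jinja2.

```python
import json
from collections import deque
from string import Template

# Hypothetical in-memory stand-ins; scoach itself uses MinIO and a database.
object_store = {}        # stands in for the MinIO bucket
pending_runs = deque()   # stands in for the table of submitted runs

# Stand-in for the Jinja2 template rendered by the daemon (contents are illustrative).
JOB_TEMPLATE = Template(
    "#!/bin/bash\n"
    "#SBATCH --job-name=$run_name\n"
    "python $script_key --params '$params'\n"
)


def submit_run(run_name, script_source, params):
    """Client side: upload the script, then queue the run configuration."""
    script_key = f"scripts/{run_name}.py"
    object_store[script_key] = script_source
    pending_runs.append(
        {"run_name": run_name, "script_key": script_key, "params": params}
    )


def consume_runs():
    """Daemon side: render one batch script per queued run."""
    rendered = []
    while pending_runs:
        run = pending_runs.popleft()
        rendered.append(
            JOB_TEMPLATE.substitute(
                run_name=run["run_name"],
                script_key=run["script_key"],
                params=json.dumps(run["params"], sort_keys=True),
            )
        )
    return rendered


submit_run("demo", "print('training')", {"epochs": 5})
scripts = consume_runs()
print(scripts[0])
```

In a real deployment the rendered text would be handed to the cluster scheduler rather than printed. Keeping the client and daemon decoupled through a queue is what allows jobs to be submitted from any machine while only the login machine talks to SLURM.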