https://github.com/outerbounds/metaflow-slurm
SLURM integration for Metaflow
- Host: GitHub
- URL: https://github.com/outerbounds/metaflow-slurm
- Owner: outerbounds
- License: apache-2.0
- Created: 2024-11-05T09:40:46.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2024-12-10T20:25:42.000Z (over 1 year ago)
- Last Synced: 2025-02-09T17:14:42.989Z (over 1 year ago)
- Language: Python
- Size: 40 KB
- Stars: 1
- Watchers: 4
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- awesome-metaflow - metaflow-slurm - Run steps on HPC Slurm clusters via SSH + `sbatch`. Beta. (Orchestration & Scheduling)
README
# SLURM extension for Metaflow
This extension adds support for executing steps in Metaflow Flows on SLURM clusters.
## Basic Usage
- Have access to a SLURM cluster that you can reach over SSH.
  - At minimum, you need the username, the IP address and the PEM file.
- Add the `@slurm` decorator to the step you want to run on the SLURM cluster.
```py
@slurm(
username="ubuntu",
address="A.B.C.D",
ssh_key_file="~/path/to/ssh/pem/file.pem"
)
```
Note that the above parameters can also be configured via the following environment variables:
- `METAFLOW_SLURM_USERNAME`
- `METAFLOW_SLURM_ADDRESS`
- `METAFLOW_SLURM_SSH_KEY_FILE`
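If you rely on the environment variables instead of decorator arguments, it can help to check that all three are set before launching a flow. The following is a minimal sketch: the variable names are the ones listed above, while the helper `missing_slurm_settings` is hypothetical and not part of the extension.

```python
import os

# The three names come from the list above; the helper itself is a
# hypothetical convenience and is not provided by the extension.
SLURM_ENV_VARS = (
    "METAFLOW_SLURM_USERNAME",
    "METAFLOW_SLURM_ADDRESS",
    "METAFLOW_SLURM_SSH_KEY_FILE",
)


def missing_slurm_settings(env=None):
    """Return the names of the SLURM variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in SLURM_ENV_VARS if not env.get(name)]


# Report what is missing in the current shell environment, if anything.
missing = missing_slurm_settings()
if missing:
    print("Missing settings:", ", ".join(missing))
else:
    print("All SLURM settings are present in the environment.")
```

Passing a plain dictionary instead of `os.environ` makes the check easy to try out without touching your shell.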
A step decorated with `@slurm` creates the following directory structure on the cluster:
```bash
metaflow/
├── assets
│   └── madhurMovies218892mid13433160
│       └── metaflow
│           ├── INFO
│           ├── demo.py
│           ├── job.tar
│           ├── linux-64
│           ├── metaflow
│           ├── metaflow_extensions
│           └── micromamba
├── madhurMovies218892mid13433160.sh
├── stderr
│   └── madhurMovies218892mid13433160.stderr
└── stdout
    └── madhurMovies218892mid13433160.stdout
```
In the above output, `demo.py` was the name of our flow file.
Pass `cleanup=True` to the decorator to clear the contents of the `assets` folder.
This removes all the artifacts that Metaflow created there.
Using `cleanup=True` will not delete:
- `stdout` folder
- `stderr` folder
- the generated shell script i.e. `madhurMovies218892mid13433160.sh`
These files are kept because they are useful for debugging later. You can delete them manually by logging into the SLURM cluster.
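To find the leftovers for a given job, the paths can be derived from the directory layout shown earlier. This is a rough sketch: the helper `leftover_paths` is hypothetical, and it assumes the `metaflow/` folder is addressed relative to wherever the extension created it on the cluster.

```python
from pathlib import PurePosixPath


def leftover_paths(job_name, root="metaflow"):
    """List the files that `cleanup=True` leaves behind for one job.

    The layout mirrors the directory tree shown earlier; `root` is an
    assumption about where that tree lives on the cluster.
    """
    base = PurePosixPath(root)
    return [
        str(base / "stdout" / f"{job_name}.stdout"),
        str(base / "stderr" / f"{job_name}.stderr"),
        str(base / f"{job_name}.sh"),
    ]


# Print the files you would remove by hand for the example job.
for path in leftover_paths("madhurMovies218892mid13433160"):
    print(path)
```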
## Supplying Credentials
Credentials must be supplied to download the code package. They can be:
- already present on the SLURM cluster itself, i.e. the compute instances have access to the blob store
- supplied via the `@environment` decorator
```py
@environment(vars={
"AWS_ACCESS_KEY_ID": "XXXX",
"AWS_SECRET_ACCESS_KEY": "YYYY"
})
```
Note that this exposes the credentials in the generated shell script, i.e.
`madhurMovies218892mid13433160.sh` will contain the following lines:
```bash
export AWS_ACCESS_KEY_ID='XXXX'
export AWS_SECRET_ACCESS_KEY='YYYY'
```
- hydrated into environment variables from a secrets manager with the `@secrets` decorator
PS -- If you are on the [Outerbounds](https://outerbounds.com/) platform, the auth is taken care of and there is no need to fiddle with it.
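Because the `@environment` route writes credentials into the generated shell script, you may want to audit a script for exported secrets before sharing it. The sketch below is illustrative only: the helper `exported_secret_names` and the list of markers are assumptions, not something the extension provides.

```python
import re

# Matches lines such as: export AWS_SECRET_ACCESS_KEY='YYYY'
EXPORT_LINE = re.compile(r"^\s*export\s+([A-Za-z_][A-Za-z0-9_]*)=")

# Hypothetical markers for variable names that usually hold sensitive values.
SENSITIVE_MARKERS = ("SECRET", "TOKEN", "PASSWORD", "ACCESS_KEY")


def exported_secret_names(script_text):
    """Return exported variable names that look like credentials."""
    names = []
    for line in script_text.splitlines():
        match = EXPORT_LINE.match(line)
        if match is None:
            continue
        name = match.group(1)
        if any(marker in name.upper() for marker in SENSITIVE_MARKERS):
            names.append(name)
    return names


# Example input modelled on the export lines shown above.
script = "export AWS_ACCESS_KEY_ID='XXXX'\nexport AWS_SECRET_ACCESS_KEY='YYYY'\n"
print(exported_secret_names(script))
```

A name-based check like this only catches conventional variable names, so treat an empty result as a hint rather than a guarantee.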
## Things to be taken care of
- The extension runs workloads via shell scripts and `sbatch` in a native Linux environment.
  - i.e. the workloads are NOT run inside Docker containers.
- As such, the compute instances should have `python3` installed (preferably above 3.8).
  - If the default `python` points to `python2`, use the `path_to_python3` argument of the decorator, i.e.
```py
@slurm(
path_to_python3="/usr/bin/python3",
)
```
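To confirm that the interpreter you point `path_to_python3` at is recent enough, you can check the text it prints for `--version`. This is a small sketch: the helper `is_supported_python` is hypothetical, and treating 3.8 itself as acceptable is an assumption about the recommendation above.

```python
import re


def is_supported_python(version_output, minimum=(3, 8)):
    """Check the text printed by `python3 --version` against a minimum.

    The default minimum of (3, 8) is an assumption based on the
    recommendation above.
    """
    match = re.search(r"Python (\d+)\.(\d+)", version_output)
    if match is None:
        return False
    return (int(match.group(1)), int(match.group(2))) >= minimum


# Feed in whatever the interpreter on the compute instance reports.
print(is_supported_python("Python 3.10.12"))
print(is_supported_python("Python 2.7.18"))
```

You would typically obtain the version string by running `/usr/bin/python3 --version` on a compute instance and pasting or piping the result into the helper.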
### Fin.