# Lightweight Cluster/Cloud VM Job Management 🚀
[![Pyversions](https://img.shields.io/pypi/pyversions/mle-scheduler.svg?style=flat-square)](https://pypi.python.org/pypi/mle-scheduler)
[![PyPI version](https://badge.fury.io/py/mle-scheduler.svg)](https://badge.fury.io/py/mle-scheduler)
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
[![codecov](https://codecov.io/gh/mle-infrastructure/mle-scheduler/branch/main/graph/badge.svg)](https://codecov.io/gh/mle-infrastructure/mle-scheduler)
[![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/mle-infrastructure/mle-scheduler/blob/main/examples/getting_started.ipynb)

Are you looking for a tool to manage your training runs locally, on [Slurm](https://slurm.schedmd.com/)/[Open Grid Engine](http://gridscheduler.sourceforge.net/documentation.html) clusters, SSH servers, or [Google Cloud Platform VMs](https://cloud.google.com/gcp/)? `mle-scheduler` provides a lightweight API to launch and monitor job queues. It smoothly orchestrates simultaneous runs for different configurations and/or random seeds. It is designed to reduce boilerplate and to make job resource specification intuitive. It comes with two core pillars:

- **`MLEJob`**: Launches and monitors a single job on a resource (Slurm, Open Grid Engine, GCP, SSH, etc.).
- **`MLEQueue`**: Launches and monitors a queue of jobs with different training configurations and/or seeds.

For a quickstart, check out the [getting-started notebook](https://github.com/mle-infrastructure/mle-scheduler/blob/main/examples/getting_started.ipynb) or the example scripts 📖

| [![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/mle-infrastructure/mle-scheduler/blob/main/examples/getting_started.ipynb)| [Local](https://github.com/mle-infrastructure/mle-scheduler/blob/main/examples/run_local.py) | [Slurm](https://github.com/mle-infrastructure/mle-scheduler/blob/main/examples/run_cluster.py) | [Grid Engine](https://github.com/mle-infrastructure/mle-scheduler/blob/main/examples/run_cluster.py) | [SSH](https://github.com/mle-infrastructure/mle-scheduler/blob/main/examples/run_ssh.py) | [GCP](https://github.com/mle-infrastructure/mle-scheduler/blob/main/examples/run_gcp.py) |
|:----: |:----:|:----: | :----: | :----:| :----:|

## Installation ⏳

A PyPI installation is available via:

```
pip install mle-scheduler
```

If you want to get the most recent commit, please install directly from the repository:

```
pip install git+https://github.com/mle-infrastructure/mle-scheduler.git@main
```
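
To verify that the installation worked, a quick smoke test is importing the two core classes:

```
python -c "from mle_scheduler import MLEJob, MLEQueue"
```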

## Managing a Single Job with `MLEJob` Locally 🚀

```python
from mle_scheduler import MLEJob

# python train.py -config base_config_1.yaml -exp_dir logs_single -seed_id 1
job = MLEJob(
    resource_to_run="local",
    job_filename="train.py",
    config_filename="base_config_1.yaml",
    experiment_dir="logs_single",
    seed_id=1,
)

_ = job.run()
```
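
The launched script itself only needs to accept the flags from the generated command shown in the comment above. Below is a minimal `train.py` sketch (hypothetical; replace the body with your actual training loop):

```python
import argparse

if __name__ == "__main__":
    # Flag names mirror the command that MLEJob generates above
    parser = argparse.ArgumentParser()
    parser.add_argument("-config", type=str, help="Path to YAML config file")
    parser.add_argument("-exp_dir", type=str, help="Directory for logs/results")
    parser.add_argument("-seed_id", type=int, help="Random seed for this run")
    args = parser.parse_args()
    # ... load args.config, seed your RNGs with args.seed_id,
    # train, and write outputs into args.exp_dir ...
    print(f"Running {args.config} with seed {args.seed_id} -> {args.exp_dir}")
```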

## Managing a Queue of Jobs with `MLEQueue` Locally 🚀...🚀

```python
from mle_scheduler import MLEQueue

# python train.py -config base_config_1.yaml -seed 0 -exp_dir logs_queue/_base_config_1
# python train.py -config base_config_1.yaml -seed 1 -exp_dir logs_queue/_base_config_1
# python train.py -config base_config_2.yaml -seed 0 -exp_dir logs_queue/_base_config_2
# python train.py -config base_config_2.yaml -seed 1 -exp_dir logs_queue/_base_config_2
queue = MLEQueue(
    resource_to_run="local",
    job_filename="train.py",
    config_filenames=["base_config_1.yaml", "base_config_2.yaml"],
    random_seeds=[0, 1],
    experiment_dir="logs_queue",
)

queue.run()
```

## Launching Slurm Cluster-Based Jobs 🐒

```python
# Each job requests 5 CPU cores & 1 V100S GPU & loads CUDA 10.0
job_args = {
    "partition": "",                        # Partition to schedule jobs on
    "env_name": "mle-toolbox",              # Env to activate at job start-up
    "use_conda_venv": True,                 # Whether to use anaconda venv
    "num_logical_cores": 5,                 # Number of requested CPU cores per job
    "num_gpus": 1,                          # Number of requested GPUs per job
    "gpu_type": "V100S",                    # GPU model requested for each job
    "modules_to_load": "nvidia/cuda/10.0",  # Modules to load at start-up
}

queue = MLEQueue(
    resource_to_run="slurm-cluster",
    job_filename="train.py",
    job_arguments=job_args,
    config_filenames=["base_config_1.yaml", "base_config_2.yaml"],
    experiment_dir="logs_slurm",
    random_seeds=[0, 1],
)
queue.run()
```
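
For orientation, the arguments above map onto a submission script roughly like the following (an illustrative sketch; the exact template `mle-scheduler` generates may differ):

```
#!/bin/bash
#SBATCH --partition=<partition>     # job_args["partition"]
#SBATCH --cpus-per-task=5           # job_args["num_logical_cores"]
#SBATCH --gres=gpu:V100S:1          # job_args["gpu_type"] / ["num_gpus"]
module load nvidia/cuda/10.0        # job_args["modules_to_load"]
source activate mle-toolbox         # job_args["env_name"] / ["use_conda_venv"]
python train.py -config base_config_1.yaml -seed_id 0 -exp_dir logs_slurm
```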

## Launching GridEngine Cluster-Based Jobs 🐘

```python
# Each job requests 5 CPU cores & 1 V100S GPU w. CUDA 10.0 loaded
job_args = {
    "queue": "",                # Queue to schedule jobs on
    "env_name": "mle-toolbox",  # Env to activate at job start-up
    "use_conda_venv": True,     # Whether to use anaconda venv
    "num_logical_cores": 5,     # Number of requested CPU cores per job
    "num_gpus": 1,              # Number of requested GPUs per job
    "gpu_type": "V100S",        # GPU model requested for each job
    "gpu_prefix": "cuda",       # Resource string: #$ -l {gpu_prefix}="{num_gpus}"
}

queue = MLEQueue(
    resource_to_run="sge-cluster",
    job_filename="train.py",
    job_arguments=job_args,
    config_filenames=["base_config_1.yaml", "base_config_2.yaml"],
    experiment_dir="logs_grid_engine",
    random_seeds=[0, 1],
)
queue.run()
```
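
The Grid Engine case is analogous. An illustrative submission header follows; directives beyond `-q` and the `-l {gpu_prefix}` resource string from the comment above vary between site configurations:

```
#!/bin/bash
#$ -q <queue>       # job_args["queue"]
#$ -l cuda="1"      # built from job_args["gpu_prefix"] / ["num_gpus"]
source activate mle-toolbox
python train.py -config base_config_1.yaml -seed_id 0 -exp_dir logs_grid_engine
```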

## Launching SSH Server-Based Jobs 🦊

```python
ssh_settings = {
    "user_name": "",               # SSH server user name
    "pkey_path": "",               # Private key path (e.g. ~/.ssh/id_rsa)
    "main_server": "",             # SSH server address
    "jump_server": "",             # Jump host address
    "ssh_port": 22,                # SSH port
    "remote_dir": "mle-code-dir",  # Dir to sync code to on server
    "start_up_copy_dir": True,     # Whether to copy code to server
    "clean_up_remote_dir": True,   # Whether to delete remote_dir on exit
}

job_args = {
    "env_name": "mle-toolbox",  # Env to activate at job start-up
    "use_conda_venv": True,     # Whether to use anaconda venv
}

queue = MLEQueue(
    resource_to_run="ssh-node",
    job_filename="train.py",
    config_filenames=["base_config_1.yaml", "base_config_2.yaml"],
    random_seeds=[0, 1],
    experiment_dir="logs_ssh_queue",
    job_arguments=job_args,
    ssh_settings=ssh_settings,
)

queue.run()
```

## Launching GCP VM-Based Jobs 🦄

```python
cloud_settings = {
    "project_name": "",           # Name of your GCP project
    "bucket_name": "",            # Name of your GCS bucket
    "remote_dir": "",             # Name of code dir in bucket
    "start_up_copy_dir": True,    # Whether to copy code to bucket
    "clean_up_remote_dir": True,  # Whether to delete remote_dir on exit
}

job_args = {
    "num_gpus": 0,           # Number of requested GPUs per job
    "gpu_type": None,        # GPU requested e.g. "nvidia-tesla-v100"
    "num_logical_cores": 1,  # Number of requested CPU cores per job
}

queue = MLEQueue(
    resource_to_run="gcp-cloud",
    job_filename="train.py",
    config_filenames=["base_config_1.yaml", "base_config_2.yaml"],
    random_seeds=[0, 1],
    experiment_dir="logs_gcp_queue",
    job_arguments=job_args,
    cloud_settings=cloud_settings,
)
queue.run()
```

## Citing the MLE-Infrastructure ✏️

If you use `mle-scheduler` in your research, please cite it as follows:

```
@software{mle_infrastructure2021github,
  author = {Robert Tjarko Lange},
  title = {{MLE-Infrastructure}: A Set of Lightweight Tools for Distributed Machine Learning Experimentation},
  url = {http://github.com/mle-infrastructure},
  year = {2021},
```

## Development 👷

You can run the test suite via `python -m pytest -vv tests/`. If you find a bug or are missing your favourite feature, feel free to create an issue and/or start [contributing](CONTRIBUTING.md) 🤗.