# Lightweight Cluster/Cloud VM Job Management 🚀
[![Pyversions](https://img.shields.io/pypi/pyversions/mle-scheduler.svg?style=flat-square)](https://pypi.python.org/pypi/mle-scheduler)
[![PyPI version](https://badge.fury.io/py/mle-scheduler.svg)](https://badge.fury.io/py/mle-scheduler)
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
[![codecov](https://codecov.io/gh/mle-infrastructure/mle-scheduler/branch/main/graph/badge.svg)](https://codecov.io/gh/mle-infrastructure/mle-scheduler)
[![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/mle-infrastructure/mle-scheduler/blob/main/examples/getting_started.ipynb)

Are you looking for a tool to manage your training runs locally, on [Slurm](https://slurm.schedmd.com/)/[Open Grid Engine](http://gridscheduler.sourceforge.net/documentation.html) clusters, SSH servers, or [Google Cloud Platform VMs](https://cloud.google.com/gcp/)? `mle-scheduler` provides a lightweight API to launch and monitor job queues. It smoothly orchestrates simultaneous runs for different configurations and/or random seeds. It is designed to reduce boilerplate and to make job resource specification intuitive. It comes with two core pillars:

- **`MLEJob`**: Launches and monitors a single job on a resource (Slurm, Open Grid Engine, GCP, SSH, etc.).
- **`MLEQueue`**: Launches and monitors a queue of jobs with different training configurations and/or seeds.

For a quickstart, check out the [getting-started notebook](https://github.com/mle-infrastructure/mle-scheduler/blob/main/examples/getting_started.ipynb) or the example scripts 📖

| [![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/mle-infrastructure/mle-scheduler/blob/main/examples/getting_started.ipynb)| [Local](https://github.com/mle-infrastructure/mle-scheduler/blob/main/examples/run_local.py) | [Slurm](https://github.com/mle-infrastructure/mle-scheduler/blob/main/examples/run_cluster.py) | [Grid Engine](https://github.com/mle-infrastructure/mle-scheduler/blob/main/examples/run_cluster.py) | [SSH](https://github.com/mle-infrastructure/mle-scheduler/blob/main/examples/run_ssh.py) | [GCP](https://github.com/mle-infrastructure/mle-scheduler/blob/main/examples/run_gcp.py) |
|:----: |:----:|:----: | :----: | :----:| :----:|

## Installation ⏳

A PyPI installation is available via:

```
pip install mle-scheduler
```

If you want to get the most recent commit, please install directly from the repository:

```
pip install git+https://github.com/mle-infrastructure/mle-scheduler.git@main
```
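
To verify that the installation worked, a quick smoke test is importing the two core classes:

```
python -c "from mle_scheduler import MLEJob, MLEQueue"
```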

## Managing a Single Job with `MLEJob` Locally 🚀

```python
from mle_scheduler import MLEJob

# python train.py -config base_config_1.yaml -exp_dir logs_single -seed_id 1
job = MLEJob(
    resource_to_run="local",
    job_filename="train.py",
    config_filename="base_config_1.yaml",
    experiment_dir="logs_single",
    seed_id=1,
)

_ = job.run()
```
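
The launched script itself only needs to accept the flags from the generated command shown in the comment above. Below is a minimal `train.py` sketch (hypothetical; replace the body with your actual training loop):

```python
import argparse

if __name__ == "__main__":
    # Flag names mirror the command that MLEJob generates above
    parser = argparse.ArgumentParser()
    parser.add_argument("-config", type=str, help="Path to YAML config file")
    parser.add_argument("-exp_dir", type=str, help="Directory for logs/results")
    parser.add_argument("-seed_id", type=int, help="Random seed for this run")
    args = parser.parse_args()
    # ... load args.config, seed your RNGs with args.seed_id,
    # train, and write outputs into args.exp_dir ...
    print(f"Running {args.config} with seed {args.seed_id} -> {args.exp_dir}")
```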

## Managing a Queue of Jobs with `MLEQueue` Locally 🚀...🚀

```python
from mle_scheduler import MLEQueue

# python train.py -config base_config_1.yaml -seed 0 -exp_dir logs_queue/_base_config_1
# python train.py -config base_config_1.yaml -seed 1 -exp_dir logs_queue/_base_config_1
# python train.py -config base_config_2.yaml -seed 0 -exp_dir logs_queue/_base_config_2
# python train.py -config base_config_2.yaml -seed 1 -exp_dir logs_queue/_base_config_2
queue = MLEQueue(
    resource_to_run="local",
    job_filename="train.py",
    config_filenames=["base_config_1.yaml", "base_config_2.yaml"],
    random_seeds=[0, 1],
    experiment_dir="logs_queue",
)

queue.run()
```

## Launching Slurm Cluster-Based Jobs 🐒

```python
# Each job requests 5 CPU cores & 1 V100S GPU & loads CUDA 10.0
job_args = {
    "partition": "",                        # Partition to schedule jobs on
    "env_name": "mle-toolbox",              # Env to activate at job start-up
    "use_conda_venv": True,                 # Whether to use anaconda venv
    "num_logical_cores": 5,                 # Number of requested CPU cores per job
    "num_gpus": 1,                          # Number of requested GPUs per job
    "gpu_type": "V100S",                    # GPU model requested for each job
    "modules_to_load": "nvidia/cuda/10.0",  # Modules to load at start-up
}

queue = MLEQueue(
    resource_to_run="slurm-cluster",
    job_filename="train.py",
    job_arguments=job_args,
    config_filenames=["base_config_1.yaml", "base_config_2.yaml"],
    experiment_dir="logs_slurm",
    random_seeds=[0, 1],
)
queue.run()
```
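
For orientation, the arguments above map onto a submission script roughly like the following (an illustrative sketch; the exact template `mle-scheduler` generates may differ):

```
#!/bin/bash
#SBATCH --partition=<partition>     # job_args["partition"]
#SBATCH --cpus-per-task=5           # job_args["num_logical_cores"]
#SBATCH --gres=gpu:V100S:1          # job_args["gpu_type"] / ["num_gpus"]
module load nvidia/cuda/10.0        # job_args["modules_to_load"]
source activate mle-toolbox         # job_args["env_name"] / ["use_conda_venv"]
python train.py -config base_config_1.yaml -seed_id 0 -exp_dir logs_slurm
```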

## Launching GridEngine Cluster-Based Jobs 🐘

```python
# Each job requests 5 CPU cores & 1 V100S GPU w. CUDA 10.0 loaded
job_args = {
    "queue": "",                # Queue to schedule jobs on
    "env_name": "mle-toolbox",  # Env to activate at job start-up
    "use_conda_venv": True,     # Whether to use anaconda venv
    "num_logical_cores": 5,     # Number of requested CPU cores per job
    "num_gpus": 1,              # Number of requested GPUs per job
    "gpu_type": "V100S",        # GPU model requested for each job
    "gpu_prefix": "cuda",       # Resource string: #$ -l {gpu_prefix}="{num_gpus}"
}

queue = MLEQueue(
    resource_to_run="sge-cluster",
    job_filename="train.py",
    job_arguments=job_args,
    config_filenames=["base_config_1.yaml", "base_config_2.yaml"],
    experiment_dir="logs_grid_engine",
    random_seeds=[0, 1],
)
queue.run()
```
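
The Grid Engine case is analogous. An illustrative submission header follows; directives beyond `-q` and the `-l {gpu_prefix}` resource string from the comment above vary between site configurations:

```
#!/bin/bash
#$ -q <queue>       # job_args["queue"]
#$ -l cuda="1"      # built from job_args["gpu_prefix"] / ["num_gpus"]
source activate mle-toolbox
python train.py -config base_config_1.yaml -seed_id 0 -exp_dir logs_grid_engine
```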

## Launching SSH Server-Based Jobs 🦊

```python
ssh_settings = {
    "user_name": "",               # SSH server user name
    "pkey_path": "",               # Private key path (e.g. ~/.ssh/id_rsa)
    "main_server": "",             # SSH server address
    "jump_server": "",             # Jump host address
    "ssh_port": 22,                # SSH port
    "remote_dir": "mle-code-dir",  # Dir to sync code to on server
    "start_up_copy_dir": True,     # Whether to copy code to server
    "clean_up_remote_dir": True,   # Whether to delete remote_dir on exit
}

job_args = {
    "env_name": "mle-toolbox",  # Env to activate at job start-up
    "use_conda_venv": True,     # Whether to use anaconda venv
}

queue = MLEQueue(
    resource_to_run="ssh-node",
    job_filename="train.py",
    config_filenames=["base_config_1.yaml", "base_config_2.yaml"],
    random_seeds=[0, 1],
    experiment_dir="logs_ssh_queue",
    job_arguments=job_args,
    ssh_settings=ssh_settings,
)

queue.run()
```

## Launching GCP VM-Based Jobs 🦄

```python
cloud_settings = {
    "project_name": "",           # Name of your GCP project
    "bucket_name": "",            # Name of your GCS bucket
    "remote_dir": "",             # Name of code dir in bucket
    "start_up_copy_dir": True,    # Whether to copy code to bucket
    "clean_up_remote_dir": True,  # Whether to delete remote_dir on exit
}

job_args = {
    "num_gpus": 0,           # Number of requested GPUs per job
    "gpu_type": None,        # GPU requested e.g. "nvidia-tesla-v100"
    "num_logical_cores": 1,  # Number of requested CPU cores per job
}

queue = MLEQueue(
    resource_to_run="gcp-cloud",
    job_filename="train.py",
    config_filenames=["base_config_1.yaml", "base_config_2.yaml"],
    random_seeds=[0, 1],
    experiment_dir="logs_gcp_queue",
    job_arguments=job_args,
    cloud_settings=cloud_settings,
)
queue.run()
```

## Citing the MLE-Infrastructure ✏️

If you use `mle-scheduler` in your research, please cite it as follows:

```
@software{mle_infrastructure2021github,
  author = {Robert Tjarko Lange},
  title = {{MLE-Infrastructure}: A Set of Lightweight Tools for Distributed Machine Learning Experimentation},
  url = {http://github.com/mle-infrastructure},
  year = {2021},
```

## Development 👷

You can run the test suite via `python -m pytest -vv tests/`. If you find a bug or are missing your favourite feature, feel free to create an issue and/or start [contributing](CONTRIBUTING.md) 🤗.