when-ml-pipeline-meets-hydra :cyclone:
- Host: GitHub
- URL: https://github.com/withsmilo/when-ml-pipeline-meets-hydra
- Owner: withsmilo
- License: MIT
- Created: 2019-10-11T13:08:06.000Z (over 5 years ago)
- Default Branch: master
- Last Pushed: 2019-10-13T23:15:37.000Z (over 5 years ago)
- Last Synced: 2025-03-24T15:41:29.374Z (about 2 months ago)
- Topics: facebook-hydra, machine-learning-systems, ml-pipeline
- Language: Python
- Size: 29.3 KB
- Stars: 23
- Watchers: 1
- Forks: 2
- Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.rst
- License: LICENSE.txt
- Authors: AUTHORS.rst
README
# What happens when ML pipeline meets Hydra?
[Hydra](https://github.com/facebookresearch/hydra) is a handy and powerful tool that can dramatically reduce boilerplate code and dynamically compose various configurations. I started out with the idea that `Hydra` could be used for an **ML pipeline** as well, and this Python app is a template I quickly implemented around that idea. Feedback is always welcome.
## Assumption
Our ML pipeline consists of the following three steps. I think these are the minimum steps for an ML pipeline, and you can add other steps as you need (a minimal sketch of the steps as plain functions follows the list).
* `preprocessing` : prepare data
* `modeling` : train, validate model
* `deployment` : deploy model to serving cluster
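As a concrete illustration (a sketch under assumed names, not the repo's actual code), each step can be a plain Python function that receives the composed Hydra config; the config keys below mirror the sample output later in this README:
```
# Minimal sketch of the three steps (function and key names are illustrative).
from omegaconf import DictConfig

def preprocessing(cfg: DictConfig) -> None:
    # Prepare the data described by cfg.dataset using the options in cfg.p_param.
    print(f"preparing {cfg.dataset.name}")

def modeling(cfg: DictConfig) -> None:
    # Train and validate the model described by cfg.model with cfg.m_param.
    print(f"training {cfg.model.name}")

def deployment(cfg: DictConfig) -> None:
    # Deploy the trained model to the serving cluster described by cfg.cluster.
    print(f"deploying to {cfg.cluster.name}")
```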
## Command Architecture
This app has a two-level command architecture ([c](https://github.com/withsmilo/When-ML-pipeline-meets-Hydra/tree/master/src/when_ml_pipeline_meets_hydra/config/c)). The command-line arguments for executing each command are as follows:
```
├── preprocessing
│ ├── foo -> c=preprocessing c/preprocessing_sub=foo
│ └── bar -> c=preprocessing c/preprocessing_sub=bar
├── modeling
│ ├── foo -> c=modeling c/modeling_sub=foo
│ └── bar -> c=modeling c/modeling_sub=bar
├── deployment
│ ├── foo -> c=deployment c/deployment_sub=foo
│ └── bar -> c=deployment c/deployment_sub=bar
└── help -> c=help
```
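A single Hydra entry point can then dispatch on the composed config. The sketch below is an assumption about how such a dispatch might look (key names like `cfg.c.name` are invented for illustration, and the decorator uses the Hydra 1.x API, which is newer than what this 2019 repo targets):
```
# Hypothetical two-level dispatch; key and path names are assumptions.
import hydra
from omegaconf import DictConfig

@hydra.main(config_path="config", config_name="config")
def main(cfg: DictConfig) -> None:
    step = cfg.c.name  # first level: which pipeline step to run
    if step == "preprocessing":
        # second level: which subcommand of that step to run
        print(f"Run preprocessing's '{cfg.c.preprocessing_sub.name}' subcommand")
    elif step == "help":
        print("usage: when_ml_pipeline_meets_hydra c=<step> c/<step>_sub=<sub> ...")

if __name__ == "__main__":
    main()
```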
## Prepared Configuration
Here are the configurations prepared for this app ([preprocessing](https://github.com/withsmilo/When-ML-pipeline-meets-Hydra/tree/master/src/when_ml_pipeline_meets_hydra/config/preprocessing), [modeling](https://github.com/withsmilo/When-ML-pipeline-meets-Hydra/tree/master/src/when_ml_pipeline_meets_hydra/config/modeling/model), [deployment](https://github.com/withsmilo/When-ML-pipeline-meets-Hydra/tree/master/src/when_ml_pipeline_meets_hydra/config/deployment)). The command-line arguments for selecting each configuration are as follows:
```
├── preprocessing
│ ├── dataset
│ │ ├── dataset_1.yaml -> preprocessing/dataset=dataset_1
│ │ └── dataset_2.yaml -> preprocessing/dataset=dataset_2
│ └── param
│ ├── param_1.yaml -> preprocessing/param=param_1
│ └── param_2.yaml -> preprocessing/param=param_2
├── modeling
│ ├── model
│ │ ├── model_1.yaml -> modeling/model=model_1
│ │ └── model_2.yaml -> modeling/model=model_2
│ └── param
│ ├── param_1.yaml -> modeling/param=param_1
│ └── param_2.yaml -> modeling/param=param_2
└── deployment
└── cluster
├── cluster_1.yaml -> deployment/cluster=cluster_1
└── cluster_2.yaml -> deployment/cluster=cluster_2
```
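As a rough illustration of what selecting one of these files does, here is a sketch using the OmegaConf API that Hydra is built on (the values are copied from the sample output below; the construction is illustrative, since Hydra normally composes this for you):
```
# Approximate shape of the composed config after preprocessing/dataset=dataset_1.
from omegaconf import OmegaConf

cfg = OmegaConf.create(
    {"dataset": {"name": "dataset_1", "path": "/path/of/dataset/1"}}
)
print(OmegaConf.to_yaml(cfg))
```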
## How to Install
```
# Create a new Anaconda environment if needed.
$ conda create --name when_ml_pipeline_meets_hydra python=3.6 -y
$ conda activate when_ml_pipeline_meets_hydra

# Clone this repo.
$ git clone https://github.com/withsmilo/When-ML-pipeline-meets-Hydra.git
$ cd When-ML-pipeline-meets-Hydra

# Install this app.
$ python setup.py develop
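# Note: on modern toolchains, `pip install -e .` is the equivalent editable install.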
$ when_ml_pipeline_meets_hydra --help
```
## ML Pipeline Test
### 1. First taste
I will construct a new ML pipeline dynamically, using all `*_1.yaml` configurations and executing the same `foo` subcommand for each step. The command you need is simple and structured.
```
$ when_ml_pipeline_meets_hydra \
preprocessing/dataset=dataset_1 \
preprocessing/param=param_1 \
modeling/model=model_1 \
modeling/param=param_1 \
deployment/cluster=cluster_1 \
c/preprocessing_sub=foo \
c/modeling_sub=foo \
c/deployment_sub=foo \
c=preprocessing,modeling,deployment \
--multirun
```

```
[2019-10-13 22:12:22,032] - Launching 3 jobs locally
[2019-10-13 22:12:22,032] - Sweep output dir : .multirun/2019-10-13
[2019-10-13 22:12:22,032] - #0 : preprocessing/dataset=dataset_1 preprocessing/param=param_1 modeling/model=model_1 modeling/param=param_1 deployment/cluster=cluster_1 c/preprocessing_sub=foo c/modeling_sub=foo c/deployment_sub=foo c=preprocessing
========== Run preprocessing's 'foo' subcommand ==========
dataset:
  name: dataset_1
  path: /path/of/dataset/1
p_param:
  key_1_1: value_1_1
  key_1_2: value_1_2
  name: param_1
  output_path: /path/of/output/path/1
Do something here!
[2019-10-13 22:12:22,175] - #1 : preprocessing/dataset=dataset_1 preprocessing/param=param_1 modeling/model=model_1 modeling/param=param_1 deployment/cluster=cluster_1 c/preprocessing_sub=foo c/modeling_sub=foo c/deployment_sub=foo c=modeling
========== Run modeling's 'foo' subcommand ==========
model:
  input_path: /path/of/input/path/1
  name: model_1
  output_path: /path/of/output/path/1
m_param:
  hyperparam_key_1_1: hyperparam_value_1_1
  hyperparam_key_1_2: hyperparam_value_1_2
  name: param_1
Do something here!
[2019-10-13 22:12:22,314] - #2 : preprocessing/dataset=dataset_1 preprocessing/param=param_1 modeling/model=model_1 modeling/param=param_1 deployment/cluster=cluster_1 c/preprocessing_sub=foo c/modeling_sub=foo c/deployment_sub=foo c=deployment
========== Run deployment's 'foo' subcommand ==========
cluster:
  id: user_1
  name: cluster_1
  pw: pw_1
  url: https://cluster/1/url
Do something here!
```
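What makes this work is that `--multirun` sweeps over the comma-separated values of `c`, launching one job per pipeline step. Conceptually, the sweep unrolls as in this sketch (illustrative only; Hydra performs the job splitting internally):
```
# Conceptual expansion of the sweep c=preprocessing,modeling,deployment.
base = [
    "preprocessing/dataset=dataset_1", "preprocessing/param=param_1",
    "modeling/model=model_1", "modeling/param=param_1",
    "deployment/cluster=cluster_1",
    "c/preprocessing_sub=foo", "c/modeling_sub=foo", "c/deployment_sub=foo",
]
for i, step in enumerate(["preprocessing", "modeling", "deployment"]):
    print(f"#{i} : " + " ".join(base + [f"c={step}"]))
```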
### 2. Change hyperparameters and serving cluster for your model.
After that, if you'd like to deploy a model with changed hyperparameter settings to another serving cluster, simply change `modeling/param` to `param_2` and `deployment/cluster` to `cluster_2` before executing your command. That's it!
```
$ when_ml_pipeline_meets_hydra \
preprocessing/dataset=dataset_1 \
preprocessing/param=param_1 \
modeling/model=model_1 \
modeling/param=param_2 \
deployment/cluster=cluster_2 \
c/preprocessing_sub=foo \
c/modeling_sub=foo \
c/deployment_sub=foo \
c=preprocessing,modeling,deployment \
--multirun
```

```
[2019-10-13 22:13:13,898] - Launching 3 jobs locally
[2019-10-13 22:13:13,898] - Sweep output dir : .multirun/2019-10-13
[2019-10-13 22:13:13,898] - #0 : preprocessing/dataset=dataset_1 preprocessing/param=param_1 modeling/model=model_1 modeling/param=param_2 deployment/cluster=cluster_2 c/preprocessing_sub=foo c/modeling_sub=foo c/deployment_sub=foo c=preprocessing
========== Run preprocessing's 'foo' subcommand ==========
dataset:
  name: dataset_1
  path: /path/of/dataset/1
p_param:
  key_1_1: value_1_1
  key_1_2: value_1_2
  name: param_1
  output_path: /path/of/output/path/1
Do something here!
[2019-10-13 22:13:14,040] - #1 : preprocessing/dataset=dataset_1 preprocessing/param=param_1 modeling/model=model_1 modeling/param=param_2 deployment/cluster=cluster_2 c/preprocessing_sub=foo c/modeling_sub=foo c/deployment_sub=foo c=modeling
========== Run modeling's 'foo' subcommand ==========
model:
  input_path: /path/of/input/path/1
  name: model_1
  output_path: /path/of/output/path/1
m_param:
  hyperparam_key_2_1: hyperparam_value_2_1
  hyperparam_key_2_2: hyperparam_value_2_2
  name: param_2
Do something here!
[2019-10-13 22:13:14,179] - #2 : preprocessing/dataset=dataset_1 preprocessing/param=param_1 modeling/model=model_1 modeling/param=param_2 deployment/cluster=cluster_2 c/preprocessing_sub=foo c/modeling_sub=foo c/deployment_sub=foo c=deployment
========== Run deployment's 'foo' subcommand ==========
cluster:
  id: user_2
  name: cluster_2
  pw: pw_3 # For testing purposes, assume that this data is wrong
  url: https://cluster/2/url
Do something here!
```
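Note that a config-group override like `modeling/param=param_2` swaps in the entire YAML file for that group, not a single value. A small sketch of the effect (keys copied from the output above; the selection logic is illustrative):
```
# A group override replaces the whole node for that group.
from omegaconf import OmegaConf

params = {
    "param_1": {"name": "param_1", "hyperparam_key_1_1": "hyperparam_value_1_1"},
    "param_2": {"name": "param_2", "hyperparam_key_2_1": "hyperparam_value_2_1"},
}
choice = "param_2"  # i.e. modeling/param=param_2
print(OmegaConf.to_yaml(OmegaConf.create({"m_param": params[choice]})))
```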
### 3. Fix wrong configuration dynamically.
Oops. You found a wrong configuration value (`pw: pw_3`) and want to fix it quickly. To do this, you only need to add `cluster.pw=pw_2` to your command line.
```
$ when_ml_pipeline_meets_hydra \
preprocessing/dataset=dataset_1 \
preprocessing/param=param_1 \
modeling/model=model_1 \
modeling/param=param_2 \
deployment/cluster=cluster_2 \
cluster.pw=pw_2 \
c/preprocessing_sub=foo \
c/modeling_sub=foo \
c/deployment_sub=foo \
c=preprocessing,modeling,deployment \
--multirun
```

```
[2019-10-13 22:13:43,246] - Launching 3 jobs locally
[2019-10-13 22:13:43,246] - Sweep output dir : .multirun/2019-10-13
[2019-10-13 22:13:43,246] - #0 : preprocessing/dataset=dataset_1 preprocessing/param=param_1 modeling/model=model_1 modeling/param=param_2 deployment/cluster=cluster_2 c/preprocessing_sub=foo c/modeling_sub=foo c/deployment_sub=foo c=preprocessing cluster.pw=pw_2
========== Run preprocessing's 'foo' subcommand ==========
dataset:
  name: dataset_1
  path: /path/of/dataset/1
p_param:
  key_1_1: value_1_1
  key_1_2: value_1_2
  name: param_1
  output_path: /path/of/output/path/1
Do something here!
[2019-10-13 22:13:43,391] - #1 : preprocessing/dataset=dataset_1 preprocessing/param=param_1 modeling/model=model_1 modeling/param=param_2 deployment/cluster=cluster_2 c/preprocessing_sub=foo c/modeling_sub=foo c/deployment_sub=foo c=modeling cluster.pw=pw_2
========== Run modeling's 'foo' subcommand ==========
model:
  input_path: /path/of/input/path/1
  name: model_1
  output_path: /path/of/output/path/1
m_param:
  hyperparam_key_2_1: hyperparam_value_2_1
  hyperparam_key_2_2: hyperparam_value_2_2
  name: param_2
Do something here!
[2019-10-13 22:13:43,531] - #2 : preprocessing/dataset=dataset_1 preprocessing/param=param_1 modeling/model=model_1 modeling/param=param_2 deployment/cluster=cluster_2 c/preprocessing_sub=foo c/modeling_sub=foo c/deployment_sub=foo c=deployment cluster.pw=pw_2
========== Run deployment's 'foo' subcommand ==========
cluster:
  id: user_2
  name: cluster_2
  pw: pw_2
  url: https://cluster/2/url
Do something here!
```
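Under the hood, a dotted override like `cluster.pw=pw_2` is merged into the composed config after the group files are loaded. A sketch of that merge using OmegaConf's dotlist API (config contents copied from the output above):
```
# Dotlist overrides patch individual values in the composed config.
from omegaconf import OmegaConf

cfg = OmegaConf.create({"cluster": {"name": "cluster_2", "pw": "pw_3"}})
cfg.merge_with_dotlist(["cluster.pw=pw_2"])
print(cfg.cluster.pw)  # -> pw_2
```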
Beyond these scenarios, you can think of various other cases.

## Note
This project has been set up using PyScaffold 3.2.2. For details and usage information on PyScaffold see https://pyscaffold.org/.

## License
This app is licensed under the [MIT License](https://github.com/withsmilo/When-ML-pipeline-meets-Hydra/blob/master/LICENSE.txt).