https://github.com/drorata/ds-dask_cluster_example
Tutorial on setting EC2 based dask cluster
- Host: GitHub
- URL: https://github.com/drorata/ds-dask_cluster_example
- Owner: drorata
- License: mit
- Created: 2018-01-04T08:45:54.000Z (over 8 years ago)
- Default Branch: master
- Last Pushed: 2018-03-07T10:00:23.000Z (over 8 years ago)
- Last Synced: 2025-03-05T10:46:37.323Z (over 1 year ago)
- Language: HCL
- Size: 7.45 MB
- Stars: 2
- Watchers: 1
- Forks: 2
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE.txt
# Moving from local machine to Dask cluster using Terraform
Author: Dror Atariah
## Introduction
As part of the never-ending effort to improve [reBuy](https://www.rebuy.de/) and turn it into a market leader, we recently decided to tackle the challenges of our customer services agents.
As a first step, a dump of tagged emails was created and the first goal was set: build a POC that tags the emails automatically.
To that end, NLP had to be used and a lengthy (and greedy) [grid search](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) had to be executed.
So lengthy that the 4 cores of a notebook were working for a couple of hours with no results.
This was the point when I decided to explore [`dask`](http://dask.pydata.org/en/latest/) and its sibling [`distributed`](https://distributed.readthedocs.io/en/latest/).
In this tutorial/post we discuss how to take local code that runs a grid search with Scikit-Learn and move it to a cluster of AWS (EC2) nodes.
## Start locally
We start with a minimal example that loads data and runs a grid search over the hyperparameters.
The project's structure might be:
```
.
├── data
├── models
└── src
```
In `./src` we may include some special tools, functions and classes that we would like to use in the project or in a more complicated pipeline.
We will show later how to include these tools in the distributed environment.
Note that the project has the structure of a Python package and should include a `setup.py` at its root.
We start with a simple example:
```python
from sklearn.datasets import load_digits
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
# from src import myfoo # An example included from `src`
param_space = {'C': [1e-4, 1, 1e4],
'gamma': [1e-3, 1, 1e3],
'class_weight': [None, 'balanced']}
model = SVC(kernel='rbf')
digits = load_digits()
search = GridSearchCV(model, param_space, cv=3)
search.fit(digits.data, digits.target)
```
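To get a feeling for how much work such a search triggers, we can count the model fits it needs. The sketch below uses only the standard library and repeats the grid from the example above; it counts one fit per candidate and fold and ignores the single final refit on the whole data set that `GridSearchCV` performs by default.

```python
from itertools import product

# Same grid as in the example above
param_space = {'C': [1e-4, 1, 1e4],
               'gamma': [1e-3, 1, 1e3],
               'class_weight': [None, 'balanced']}
cv = 3

# Every combination of values is one candidate model
candidates = [dict(zip(param_space, values))
              for values in product(*param_space.values())]

n_candidates = len(candidates)   # 3 * 3 * 2
n_fits = n_candidates * cv       # one fit per candidate and fold

print(n_candidates, "candidates,", n_fits, "cross-validation fits")
```

Adding one more hyperparameter with three values would already triple both numbers, which is why the search grows out of a local machine so quickly.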
A slightly more elaborate version of this example can be found in the Docker image defined [here](./Dockerfile).
You can try it out by cloning this repository and running the following:
```bash
docker build . -t dask-example
docker run --rm dask-example ./gridsearch_local.py
```
So far, so good.
But imagine that the data set is larger and the hyperparameter space is more complicated.
The search then quickly becomes virtually impossible to run on a local machine.
At this point there are at least two possible courses of action:
1. Use more computing power
2. Optimize the search and/or be smarter
In this post we take the former.
A seemingly easy way to scale out the local machine to a cluster is [`dask`](http://dask.pydata.org/en/latest/).
To start with, staying on the local machine, let's try out the [`LocalCluster`](https://distributed.readthedocs.io/en/latest/local-cluster.html).
Check out [`gridsearch_local_dask.py`](./gridsearch_local_dask.py), which you can try out by running:
```bash
docker run -it --rm dask-example ./gridsearch_local_dask.py
```
This already feels a little faster, doesn't it?
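The reason a grid search distributes so well is that every candidate can be evaluated independently of the others. The sketch below illustrates only this idea: it is not `dask`, but a thread pool from the standard library standing in for a pool of workers, and `toy_score` is a made-up stand-in for fitting and scoring a model. The `Client` of `distributed` offers a similar futures-style interface (`submit`, `map`), so the pattern carries over.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

param_space = {'C': [1e-4, 1, 1e4], 'gamma': [1e-3, 1, 1e3]}


def toy_score(params):
    """Stand-in for fitting and scoring a model (hypothetical scoring rule)."""
    # The score is highest (zero) for C == 1 and gamma == 1e-3
    return -abs(params['C'] - 1) - abs(params['gamma'] - 1e-3)


candidates = [dict(zip(param_space, values))
              for values in product(*param_space.values())]

# Each candidate is independent, so they can be handed to different workers
with ThreadPoolExecutor(max_workers=4) as pool:
    scores = list(pool.map(toy_score, candidates))

best_score, best_params = max(zip(scores, candidates),
                              key=lambda pair: pair[0])
print(best_params)
```

Threads will not speed up CPU-bound Python code; real gains need separate processes or, as in the rest of this post, separate machines.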
But, we *need* to scale out and to that end we want to have a cluster of EC2 nodes that can be used.
There are two main steps:
1. Bundle the computation environment in a Docker image
2. Run a `dask` cluster where each node has the computation environment
## Bundle the computation environment
For the `dask` cluster to function, each node has to have the same computation environment.
Docker is a straightforward way to make this happen.
The way to go is to define a `Dockerfile`:
```docker
FROM continuumio/miniconda3
RUN mkdir project
COPY requirements.txt /project/requirements.txt
COPY src/ /project/src
COPY setup.py /project/setup.py
WORKDIR /project
RUN pip install -r requirements.txt
```
The local `requirements.txt` and `setup.py` are copied into the image.
It is recommended to include `bokeh` in `requirements.txt`; otherwise the web dashboard of `dask` won't work.
The `Dockerfile` can include further steps like `RUN apt-get update && apt-get install -y build-essential freetds-dev` or `RUN python -m nltk.downloader punkt`.
If `./src` includes needed classes, functions etc., then make sure you include something like `-e .` or merely `.` in `requirements.txt`; this way these dependencies will be available in the image.
It is important to include in the `Dockerfile` all the components needed for the computation environment!
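One way to verify that an environment really contains what the computation needs is to check for the required modules before starting anything expensive. The following is a minimal sketch; the list of modules is an assumption and should mirror your own `requirements.txt` (note that it holds import names such as `sklearn`, not package names).

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found here."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Hypothetical list; adapt it to the project's requirements.txt
required = ["sklearn", "dask", "distributed", "bokeh"]

missing = missing_modules(required)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules were found")
```

Running such a script inside the container, rather than on the host, is what tells you whether the image is complete.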
Next, the image should be placed in a location accessible to EC2 instances.
It is time to push the image to a Docker registry.
In this tutorial we use the AWS container registry (ECR), but you can use other options such as Docker Hub.
I assume you have [`awscli`](https://aws.amazon.com/cli/) installed and your credentials configured.
You can log in to the registry and push the image as follows:
```bash
# Execute from the project's root
$(aws ecr get-login --no-include-email)
docker build -t image-name .
docker tag image-name:latest repo.url/image-name:latest
docker push repo.url/image-name:latest
```
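In the commands above, `repo.url` is a placeholder. For a private ECR repository the address is built from the AWS account id and the region; the sketch below shows the pattern. The account id is made up, while `image-name` and `eu-west-1` are the values used elsewhere in this tutorial.

```python
def ecr_registry(account_id, region):
    """Host name of the private ECR registry of an account in a region."""
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com"


def ecr_image_uri(account_id, region, repository, tag="latest"):
    """Full image reference as used by `docker tag` and `docker push`."""
    return f"{ecr_registry(account_id, region)}/{repository}:{tag}"


# Placeholder account id, for illustration only
uri = ecr_image_uri("123456789012", "eu-west-1", "image-name")
print(uri)
```

Also note that `aws ecr get-login` belongs to version 1 of the AWS CLI; version 2 replaced it with `aws ecr get-login-password`, whose output is piped into `docker login`.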
It is time to set up the nodes of the cluster.
## Defining the Dask cluster
We take a declarative approach and use [`terraform`](https://www.terraform.io) to set up the nodes of the cluster.
Note that in this example we use AWS spot instances; you can easily change the code to use regular on-demand instances.
This is left as an exercise.
We use two groups of files to define the cluster:
- `.tf` instructions: parsed by `terraform` and defining what instances to use, what tags, regions, etc.
- Provisioning shell scripts: installing needed tools on the nodes
### `.tf` files
When using `terraform` all `.tf` files are read and concatenated.
There are more details of course; a good entry point would be [this](https://www.terraform.io/docs/configuration/index.html).
In our example we organize the `.tf` files as follows:
- `terraform.tf`: general settings
- `vars.tf`: variable definitions which can be set from the CLI
- `provision.tf`: instructions on how to call the provisioning scripts
- `resources.tf`: definition of the resources
- `output.tf`: definition of outputs provided by `terraform`
#### `terraform.tf`
```
provider "aws" {
region = "eu-west-1"
}
```
#### `vars.tf`
```
variable "instanceType" {
type = "string"
default = "c5.2xlarge"
}
variable "spotPrice" {
# Not needed for on-demand instances
default = "0.1"
}
variable "contact" {
type = "string"
default = "d.atariah"
}
variable "department" {
type = "string"
default = "My wonderful department"
}
variable "subnet" {
default = "subnet-007"
}
variable "securityGroup" {
type = "string"
default = "sg-42"
}
variable "workersNum" {
default = "4"
}
variable "schedulerPrivateIp" {
# We predefine a private IP for the scheduler; it will be used by the workers
default = "172.31.36.190"
}
variable "dockerRegistry" {
default = ""
}
# By defining the AWS keys as variables we can get them from the command line
# and pass them to the provisioning scripts
variable "awsKey" {}
variable "awsPrivateKey" {}
```
#### `provision.tf`
```
data "template_file" "scheduler_setup" {
template = "${file("scheduler_setup.sh")}" # see the shell script below
vars {
# Use the AWS keys passed from the terraform CLI
AWS_KEY = "${var.awsKey}"
AWS_PRIVATE_KEY = "${var.awsPrivateKey}"
DOCKER_REG = "${var.dockerRegistry}"
}
}
data "template_file" "worker_setup" {
template = "${file("worker_setup.sh")}" # see the shell script below
vars {
AWS_KEY = "${var.awsKey}"
AWS_PRIVATE_KEY = "${var.awsPrivateKey}"
DOCKER_REG = "${var.dockerRegistry}"
SCHEDULER_IP = "${var.schedulerPrivateIp}"
}
}
```
#### `resources.tf`
This is the core of the settings: here we put everything together and define the requests for the AWS spot instances.
```
resource "aws_spot_instance_request" "dask-scheduler" {
ami = "ami-4cbe0935" # [1]
instance_type = "${var.instanceType}"
spot_price = "${var.spotPrice}"
wait_for_fulfillment = true
key_name = "dask_poc"
security_groups = ["${var.securityGroup}"]
subnet_id = "${var.subnet}"
associate_public_ip_address = true
private_ip = "${var.schedulerPrivateIp}" # [2]
user_data = "${data.template_file.scheduler_setup.rendered}"
tags {
Name = "${terraform.workspace}-dask-scheduler",
Department = "${var.department}",
contact = "${var.contact}"
}
}
resource "aws_spot_instance_request" "dask-worker" {
count = "${var.workersNum}" # [3]
ami = "ami-4cbe0935" # [1]
instance_type = "${var.instanceType}"
spot_price = "${var.spotPrice}"
wait_for_fulfillment = true
key_name = "dask_poc"
subnet_id = "${var.subnet}"
security_groups = ["${var.securityGroup}"]
associate_public_ip_address = true
user_data = "${data.template_file.worker_setup.rendered}"
tags {
Name = "${terraform.workspace}-dask-worker${count.index}",
Department = "${var.department}",
contact = "${var.contact}"
}
}
```
Here are some important elements to note:
1. The [AMI](http://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs-optimized_AMI.html) I use is the Docker-optimized one that Amazon provides for `eu-west-1`. It is possible to use other images, but it is important that they support `docker`.
2. Define the private IP of the scheduler. You will need it when starting the workers, and it is easier to *know* the IP than to *find* it.
3. Indicate how many workers should be used
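A predefined private IP only works if it belongs to the subnet the instances are started in. The sketch below checks this with the standard `ipaddress` module. The CIDR block is an assumption made for illustration (look up the real one of your subnet), and the check cannot tell whether the address is already taken by another instance.

```python
import ipaddress


def is_usable_private_ip(ip, subnet_cidr):
    """Check that `ip` lies in the subnet and is not reserved by AWS."""
    network = ipaddress.ip_network(subnet_cidr)
    address = ipaddress.ip_address(ip)
    # AWS reserves the first four addresses and the last one of every subnet
    reserved = {network[0], network[1], network[2], network[3], network[-1]}
    return address in network and address not in reserved


# Assumed CIDR block; replace it with the one of your subnet
subnet_cidr = "172.31.32.0/20"
print(is_usable_private_ip("172.31.36.190", subnet_cidr))  # default in vars.tf
print(is_usable_private_ip("172.31.32.1", subnet_cidr))    # reserved address
```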
#### `output.tf`
`terraform` allows the definition of various outputs.
As always, more details can be found [here](https://www.terraform.io/intro/getting-started/outputs.html).
```
output "scheduler-info" {
value = "${aws_spot_instance_request.dask-scheduler.public_ip}"
}
output "workers-info" {
value = "${join(",",aws_spot_instance_request.dask-worker.*.public_ip)}"
}
output "scheduler-status" {
value = "http://${aws_spot_instance_request.dask-scheduler.public_ip}:8787/status"
}
```
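These outputs are plain strings, which makes them easy to consume from a script. The following sketch splits the comma-joined `workers-info` value and rebuilds the dashboard URL; the addresses are made up, and the helper names are hypothetical. Depending on the Terraform version, `terraform output` may wrap string values in quotes unless `-raw` is passed, so it is worth checking what your version prints.

```python
def parse_workers_info(raw):
    """Split the comma-joined `workers-info` output into a list of IPs."""
    return [ip.strip() for ip in raw.split(",") if ip.strip()]


def dashboard_url(scheduler_ip, port=8787):
    """Rebuild the URL that the `scheduler-status` output is made of."""
    return f"http://{scheduler_ip}:{port}/status"


# Made-up addresses, for illustration only
workers = parse_workers_info("203.0.113.10,203.0.113.11\n")
print(workers)
print(dashboard_url("203.0.113.5"))
```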
### Provisioning scripts
The `user_data` fields in `resources.tf` indicate what script should be used for the provisioning on the nodes.
We provide two templates of scripts which will be filled with the needed variables from `terraform`; one script for the scheduler and one for the workers.
```bash
#!/bin/bash
# scheduler_setup.sh
exec > >(tee /var/log/user-data.log|logger -t user-data -s 2>/dev/console) 2>&1
set -x
echo "Installing pip"
curl -O https://bootstrap.pypa.io/get-pip.py
python get-pip.py --user
~/.local/bin/pip install awscli --upgrade --user
echo "Logging in to ECS registry"
export AWS_ACCESS_KEY_ID=${AWS_KEY}
export AWS_SECRET_ACCESS_KEY=${AWS_PRIVATE_KEY}
export AWS_DEFAULT_REGION=eu-west-1
$(~/.local/bin/aws ecr get-login --no-include-email)
# Assigning tags to instance derived from spot request
# See https://github.com/hashicorp/terraform/issues/3263#issuecomment-284387578
REGION=eu-west-1
INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
SPOT_REQ_ID=$(~/.local/bin/aws --region $REGION ec2 describe-instances --instance-ids "$INSTANCE_ID" --query 'Reservations[0].Instances[0].SpotInstanceRequestId' --output text)
if [ "$SPOT_REQ_ID" != "None" ] ; then
TAGS=$(~/.local/bin/aws --region $REGION ec2 describe-spot-instance-requests --spot-instance-request-ids "$SPOT_REQ_ID" --query 'SpotInstanceRequests[0].Tags')
~/.local/bin/aws --region $REGION ec2 create-tags --resources "$INSTANCE_ID" --tags "$TAGS"
fi
echo "Starting docker container from image"
docker run -d -it --network host ${DOCKER_REG} /opt/conda/bin/dask-scheduler
```
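Keep in mind that the script above is a template: before it reaches the instance, Terraform replaces every `${NAME}` with the matching entry of `vars` in `provision.tf`. Plain shell variables written as `$REGION` are left alone, and a literal `${...}` has to be escaped as `$${...}`. The sketch below is a simplified imitation of this substitution in Python, meant only to show which parts of the script get replaced; it is not how Terraform implements it.

```python
import re

# Matches ${NAME}, but neither the escaped form $${NAME} nor plain $NAME
PLACEHOLDER = re.compile(r"(?<!\$)\$\{(\w+)\}")


def render(template, variables):
    """Replace every ${NAME} with its value; unknown names raise KeyError."""
    rendered = PLACEHOLDER.sub(lambda match: variables[match.group(1)],
                               template)
    # The escape sequence $${ stands for a literal ${
    return rendered.replace("$${", "${")


template = ("export AWS_ACCESS_KEY_ID=${AWS_KEY}\n"
            "aws --region $REGION ec2 describe-instances")
print(render(template, {"AWS_KEY": "my-key"}))
```

This is also why the provisioning script refers to its own variables as `$REGION` and `$INSTANCE_ID` rather than with curly braces.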
The scripts for the workers and for the scheduler are identical, except for the last line.
For the workers we should have
```
docker run -d -it --network host ${DOCKER_REG} /opt/conda/bin/dask-worker ${SCHEDULER_IP}:8786
```
Note that we start `dask-worker` instead of `dask-scheduler` and we indicate the private IP of the scheduler.
**Important:** note the `--network host` option.
Intuitively, this makes sure that the containers' networks and their corresponding hosts will be the same and therefore the different containers on different hosts will be able to communicate.
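If workers fail to register, a first thing to check is whether the scheduler's port can be reached at all, for instance from one of the worker nodes. The sketch below uses only the standard library; port 8786 and the private IP are the values used in this tutorial, and the helper name is hypothetical.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# The private IP is only reachable from inside the subnet
print(can_connect("172.31.36.190", 8786, timeout=1.0))
```

A `False` here points at the network (security group, subnet, missing `--network host`) rather than at `dask` itself.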
## Running the cluster
We can now run the cluster.
To that end, we need to execute two commands.
First, `terraform init`.
This prepares the tool and makes it ready to start the nodes.
Next we have to `apply` the instructions.
This we do by invoking:
```bash
TF_VAR_awsKey=YOUR_AWS_KEY \
TF_VAR_awsPrivateKey=YOUR_AWS_PRIVATE_KEY \
terraform apply -var 'workersNum=2' -var 'instanceType="t2.small"' \
-var 'spotPrice=0.2' -var 'schedulerPrivateIp="172.31.36.170"' \
-var 'dockerRegistry="repo.url/image-name:latest"'
```
Note that we use two environment variables for the AWS keys.
Other variables defined in `vars.tf` are passed as parameters.
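If you start clusters often, the same call can be assembled from a script. The following is a minimal sketch with hypothetical helper names: regular variables become `-var name=value` arguments, and the two keys are passed through `TF_VAR_*` environment variables so that they do not show up on the command line. Passing the arguments as a list avoids the nested shell quoting.

```python
import os
import shlex


def terraform_apply_command(variables):
    """Build the argument list for `terraform apply` from a dict."""
    command = ["terraform", "apply"]
    for name, value in variables.items():
        command += ["-var", f"{name}={value}"]
    return command


def terraform_environment(aws_key, aws_private_key, base_env=None):
    """Copy the environment and add the secrets as TF_VAR_* variables."""
    env = dict(os.environ if base_env is None else base_env)
    env["TF_VAR_awsKey"] = aws_key
    env["TF_VAR_awsPrivateKey"] = aws_private_key
    return env


command = terraform_apply_command({
    "workersNum": 2,
    "instanceType": "t2.small",
    "spotPrice": 0.2,
    "schedulerPrivateIp": "172.31.36.170",
    "dockerRegistry": "repo.url/image-name:latest",
})
print(" ".join(shlex.quote(part) for part in command))
# To actually run it (this starts paid instances):
# subprocess.run(command, env=terraform_environment(key, secret), check=True)
```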
Once finished, you can access the newly created scheduler node by: `ssh -i ~/.aws/key.pem ec2-user@$(terraform output scheduler-info)`.
On the node you can check the provisioning log at `/var/log/user-data.log`.
You can also check the status of the running Docker containers using `docker ps`.
Lastly, if everything went well, you should be able to access the web interface of the cluster.
Its address can be found by invoking `terraform output scheduler-status`.
## Grid search on the cluster
The moment we have been waiting for: running our hyperparameter grid search on the `dask` cluster.
To do so, we can use code similar to [`./gridsearch_local_dask.py`](./gridsearch_local_dask.py).
Only the client's address needs to change:
```python
#!/usr/bin/env python
import sys

from sklearn.datasets import load_digits
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split as tts
from sklearn.metrics import classification_report
from distributed import Client
from dask_searchcv import GridSearchCV
# from src import myfoo # An example included from `src`


def main(scheduler_ip):
    param_space = {'C': [1e-4, 1, 1e4],
                   'gamma': [1e-3, 1, 1e3],
                   'class_weight': [None, 'balanced']}
    model = SVC(kernel='rbf')
    digits = load_digits()
    X_train, X_test, y_train, y_test = tts(digits.data, digits.target,
                                           test_size=0.3)

    print("Connecting to the cluster")
    # 8786 is the port the scheduler listens on (see the worker script)
    client = Client(scheduler_ip + ":8786")
    print(client)

    print("Start searching")
    search = GridSearchCV(model, param_space, cv=3)
    search.fit(X_train, y_train)

    print("Prepare report")
    print(classification_report(
        y_true=y_test, y_pred=search.best_estimator_.predict(X_test))
    )


if __name__ == '__main__':
    # The scheduler's address (x.y.z.w) is passed as the first argument
    main(sys.argv[1])
```
Running this script would start the grid search on the `dask` cluster.
This can be monitored on the web dashboard.
If you have a running cluster at `x.y.z.w`, you can try it out:
```bash
docker run -it --rm -p 8786:8786 dask-example ./gridsearch_cluster_dask.py x.y.z.w
```
## Yet to be discussed
* You might want to explore `terraform workspace`; this can help you run several clusters from the same directory, for example when running different experiments at the same time.
* Enable a node with Jupyter server so the local notebook won't be needed