https://github.com/drorata/ds-dask_cluster_example

Tutorial on setting EC2 based dask cluster
Host: GitHub
URL: https://github.com/drorata/ds-dask_cluster_example
Owner: drorata
License: mit
Created: 2018-01-04T08:45:54.000Z (over 8 years ago)
Default Branch: master
Last Pushed: 2018-03-07T10:00:23.000Z (over 8 years ago)
Last Synced: 2025-03-05T10:46:37.323Z (over 1 year ago)
Language: HCL
Size: 7.45 MB
Stars: 2
Watchers: 1
Forks: 2
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE.txt

# Moving from local machine to Dask cluster using Terraform

Author: Dror Atariah

## Introduction

As part of the never-ending effort to improve [reBuy](https://www.rebuy.de/) and turn it into a market leader, we recently decided to tackle the challenges of our customer services agents.

As a first step, a dump of tagged emails was created and the first goal was set: build a POC that tags the emails automatically.

To that end, NLP had to be used and a lengthy (and greedy) [grid search](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) had to be executed.

So lengthy that the 4 cores of a notebook worked for a couple of hours without producing any results.

This was the point when I decided to explore [`dask`](http://dask.pydata.org/en/latest/) and its sibling [`distributed`](https://distributed.readthedocs.io/en/latest/).

In this tutorial/post we discuss how to move local code that runs a grid search with Scikit-Learn to a cluster of AWS (EC2) nodes.

## Start locally

We start with a minimal example that loads data and runs a grid search over the hyperparameters.

The project's structure might be:

```

.

├── data

├── models

└── src

```

In `./src` we may include some special tools, functions and classes that we would like to use in the project or in a more complicated pipeline.

We will show later how to include these tools in the distributed environment.

Note that the project has the structure of a Python package and should include a `setup.py` at its root.

We start with a simple example:

```python

from sklearn.datasets import load_digits

from sklearn.svm import SVC

from sklearn.model_selection import GridSearchCV

# from src import myfoo # An example included from `src`

param_space = {'C': [1e-4, 1, 1e4],

               'gamma': [1e-3, 1, 1e3],

               'class_weight': [None, 'balanced']}

model = SVC(kernel='rbf')

digits = load_digits()

search = GridSearchCV(model, param_space, cv=3)

search.fit(digits.data, digits.target)

```

A slightly more elaborate version of this example can be found in the Docker image defined [here](./Dockerfile).

You can try it out by cloning this repository and running the following:

```bash

docker build . -t dask-example

docker run --rm dask-example ./gridsearch_local.py

```

So far, so good.

But, imagine the data set is larger and the hyperparameters' space is more complicated.

The search quickly becomes virtually impossible to run on a local machine.
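
To get a feeling for how quickly the work grows, the following sketch counts the model fits implied by the small grid above. It uses only the standard library and assumes the same `param_space` and `cv=3` as in the example; it is an illustration of the arithmetic, not part of the repository's scripts.

```python
from itertools import product

# Same grid as in the example above.
param_space = {
    "C": [1e-4, 1, 1e4],
    "gamma": [1e-3, 1, 1e3],
    "class_weight": [None, "balanced"],
}
cv_folds = 3  # matches cv=3 in the example

# Every combination of values is one candidate model.
candidates = [
    dict(zip(param_space, values))
    for values in product(*param_space.values())
]

# Each candidate is fitted once per cross-validation fold.
total_fits = len(candidates) * cv_folds

print(f"{len(candidates)} candidates x {cv_folds} folds = {total_fits} fits")
```

Adding one more hyperparameter with five values would multiply the number of fits by five, which is why a richer search space outgrows a single machine so fast.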

At this point there are at least two possible courses of action:

1. Use more computing power

2. Optimize the search and/or be smarter

In this post we take the former.

A seemingly easy way to scale out the local machine to a cluster is [`dask`](http://dask.pydata.org/en/latest/).

To start with, staying on the local machine, let's try out the [`LocalCluster`](https://distributed.readthedocs.io/en/latest/local-cluster.html).

Check out [`gridsearch_local_dask.py`](./gridsearch_local_dask.py), which you can try out by running:

```bash

docker run -it --rm dask-example ./gridsearch_local_dask.py

```

This already feels a little faster, doesn't it?

But, we *need* to scale out and to that end we want to have a cluster of EC2 nodes that can be used.

There are two main steps:

1. Bundle the computation environment in a Docker image

2. Run a `dask` cluster where each node has the computation environment

## Bundle the computation environment

For the `dask` cluster to function, each node has to have the same computation environment.

Docker is a straightforward way to make this happen.

The way to go is to define a `Dockerfile`:

```docker

FROM continuumio/miniconda3

RUN mkdir project

COPY requirements.txt /project/requirements.txt

COPY src/ /project/src

COPY setup.py /project/setup.py

WORKDIR /project

RUN pip install -r requirements.txt

```

The local `requirements.txt` and `setup.py` are loaded to the image.

It is recommended to include `bokeh` in `requirements.txt`; otherwise the web dashboard of `dask` won't work.

The `Dockerfile` can include further steps like `RUN apt-get update && apt-get install -y build-essential freetds-dev` or `RUN python -m nltk.downloader punkt`.

If `./src` includes needed classes, functions etc., then make sure you include something like `-e .` or merely `.` in `requirements.txt`; this way these dependencies will be available in the image.

It is important to include in the `Dockerfile` all the components needed for the computation environment!

Next, the image should be placed in a location accessible to EC2 instances.

It is time to push the image to a Docker registry.

In this tutorial, we use the AWS container registry (ECR), but you can use other options like Docker Hub.

I assume you have [`awscli`](https://aws.amazon.com/cli/) installed and your credentials configured.

You can log in to the registry, then build, tag and push the image as follows:

```bash

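# Execute from the project's root

# AWS CLI v1 syntax; `get-login` was removed in AWS CLI v2 (use `aws ecr get-login-password` there)

$(aws ecr get-login --no-include-email)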

docker build -t image-name .

docker tag image-name:latest repo.url/image-name:latest

docker push repo.url/image-name:latest

```
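
The placeholder `repo.url/image-name:latest` stands for the full reference of the image in your registry. For ECR, the registry host follows the pattern `<account-id>.dkr.ecr.<region>.amazonaws.com`. A minimal sketch that assembles such a reference is shown below; the account ID is hypothetical, and the helper name `ecr_image_uri` is introduced only for illustration.

```python
def ecr_image_uri(account_id: str, region: str, repository: str,
                  tag: str = "latest") -> str:
    """Build the full reference of an image stored in ECR."""
    registry = f"{account_id}.dkr.ecr.{region}.amazonaws.com"
    return f"{registry}/{repository}:{tag}"


# Hypothetical account ID; eu-west-1 is the region used later in this tutorial.
uri = ecr_image_uri("123456789012", "eu-west-1", "image-name")
print(uri)
```

The resulting string is what you would pass to `docker tag` and `docker push` in place of `repo.url/image-name:latest`.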

It is time to set up the nodes of the cluster.

## Defining the Dask cluster

We take a declarative approach and use [`terraform`](https://www.terraform.io) to set up the nodes of the cluster.

Note that in this example we use AWS Spot Instances; you can easily change the code to use regular on-demand instances.

This is left as an exercise.

We use two groups of files to define the cluster:

- `.tf` instructions: parsed by `terraform` and defining what instances to use, what tags, regions, etc.

- Provisioning shell scripts: installing needed tools on the nodes

### `.tf` files

When using `terraform`, all `.tf` files in the working directory are read and treated as a single configuration.

There are more details of course; a good entry point would be [this](https://www.terraform.io/docs/configuration/index.html).

In our example we organize the `.tf` files as follows:

- `terraform.tf`: general settings

- `vars.tf`: variable definitions which can be set from the CLI

- `provision.tf`: instructions on how to call the provisioning scripts

- `resources.tf`: definition of the resources

- `output.tf`: definition of outputs provided by `terraform`

#### `terraform.tf`

```

provider "aws" {

  region = "eu-west-1"

}

```

#### `vars.tf`

```

variable "instanceType" {

  type    = "string"

  default = "c5.2xlarge"

}

variable "spotPrice" {

  # Not needed for on-demand instances

  default = "0.1"

}

variable "contact" {

  type = "string"

  default = "d.atariah"

}

variable "department" {

  type = "string"

  default = "My wonderful department"

}

variable "subnet" {

  default = "subnet-007"

}

variable "securityGroup" {

  type = "string"

  default = "sg-42"

}

variable "workersNum" {

  default = "4"

}

variable "schedulerPrivateIp" {

  # We predefine a private IP for the scheduler; it will be used by the workers

  default = "172.31.36.190"

}

variable "dockerRegistry" {

  default = ""

}

# By defining the AWS keys as variables we can get them from the command line

# and pass them to the provisioning scripts

variable "awsKey" {}

variable "awsPrivateKey" {}

```

#### `provision.tf`

```

data "template_file" "scheduler_setup" {

  template = "${file("scheduler_setup.sh")}" # see the shell script below

  vars {

    # Use the AWS keys passed from the terraform CLI

    AWS_KEY = "${var.awsKey}"

    AWS_PRIVATE_KEY = "${var.awsPrivateKey}"

    DOCKER_REG = "${var.dockerRegistry}"

  }

}

data "template_file" "worker_setup" {

  template = "${file("worker_setup.sh")}" # see the shell script below

  vars {

    AWS_KEY = "${var.awsKey}"

    AWS_PRIVATE_KEY = "${var.awsPrivateKey}"

    DOCKER_REG = "${var.dockerRegistry}"

    SCHEDULER_IP = "${var.schedulerPrivateIp}"

  }

}

```

#### `resources.tf`

This is the core of the configuration: here we put everything together and define the requests for the AWS Spot Instances.

```

resource "aws_spot_instance_request" "dask-scheduler" {

  ami                         = "ami-4cbe0935" # [1]

  instance_type               = "${var.instanceType}"

  spot_price                  = "${var.spotPrice}"

  wait_for_fulfillment        = true

  key_name                    = "dask_poc"

  security_groups             = ["${var.securityGroup}"]

  subnet_id                   = "${var.subnet}"

  associate_public_ip_address = true

  private_ip                  = "${var.schedulerPrivateIp}" # [2]

  user_data                   = "${data.template_file.scheduler_setup.rendered}"

  tags {

    Name = "${terraform.workspace}-dask-scheduler",

    Department = "${var.department}",

    contact = "${var.contact}"

  }

}

resource "aws_spot_instance_request" "dask-worker" {

  count                       = "${var.workersNum}" # [3]

  ami                         = "ami-4cbe0935" # [1]

  instance_type               = "${var.instanceType}"

  spot_price                  = "${var.spotPrice}"

  wait_for_fulfillment        = true

  key_name                    = "dask_poc"

  subnet_id                   = "${var.subnet}"

  security_groups             = ["${var.securityGroup}"]

  associate_public_ip_address = true

  user_data                   = "${data.template_file.worker_setup.rendered}"

  tags {

    Name = "${terraform.workspace}-dask-worker${count.index}",

    Department = "${var.department}",

    contact = "${var.contact}"

  }

}

```

Here are some important elements to note:

1. The [AMI](http://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs-optimized_AMI.html) I use is the one for `eu-west-1`, which is optimized for Docker and provided by Amazon. It is possible to use other images, but it is important that they support `docker`.

2. Define the private IP of the scheduler. You will need it when starting the workers, and it is easier to *know* the IP than to *find* it.

3. Indicate how many workers should be used
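
Since the scheduler's private IP is fixed in advance (note 2), it has to be an address that the subnet can actually hand out. The sketch below checks this with the standard `ipaddress` module. The CIDR block is a hypothetical value, as the tutorial does not state the one behind `subnet-007`; the check also accounts for the fact that AWS reserves the first four addresses and the last address of every subnet. It does not verify that the address is still free.

```python
import ipaddress


def is_usable_private_ip(ip: str, subnet_cidr: str) -> bool:
    """Return True if `ip` can be assigned to an instance in `subnet_cidr`."""
    address = ipaddress.ip_address(ip)
    network = ipaddress.ip_network(subnet_cidr)
    if address not in network:
        return False
    # AWS reserves offsets 0-3 and the last address of each subnet.
    offset = int(address) - int(network.network_address)
    return 4 <= offset < network.num_addresses - 1


# Hypothetical CIDR block; look up the real one for your subnet.
SUBNET_CIDR = "172.31.32.0/20"

print(is_usable_private_ip("172.31.36.190", SUBNET_CIDR))  # default from vars.tf
print(is_usable_private_ip("172.31.32.2", SUBNET_CIDR))    # reserved by AWS
print(is_usable_private_ip("10.0.0.5", SUBNET_CIDR))       # outside the subnet
```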

#### `output.tf`

`terraform` allows the definition of various outputs.

As always, more details can be found [here](https://www.terraform.io/intro/getting-started/outputs.html).

```

output "scheduler-info" {

  value = "${aws_spot_instance_request.dask-scheduler.public_ip}"

}

output "workers-info" {

  value = "${join(",",aws_spot_instance_request.dask-worker.*.public_ip)}"

}

output "scheduler-status" {

  value = "http://${aws_spot_instance_request.dask-scheduler.public_ip}:8787/status"

}

```
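
The `workers-info` output is a single comma-separated string, so a script that consumes it has to split it first. The following sketch shows one way to do that and to rebuild the dashboard URL in the same form as `scheduler-status`. The IP addresses are made-up documentation addresses, and the helper names are hypothetical.

```python
import ipaddress
from typing import List


def parse_workers_info(raw: str) -> List[str]:
    """Split the comma-separated `workers-info` output into validated IPs."""
    ips = [part.strip() for part in raw.split(",") if part.strip()]
    for ip in ips:
        ipaddress.ip_address(ip)  # raises ValueError on malformed input
    return ips


def dashboard_url(scheduler_ip: str) -> str:
    """Mirror the `scheduler-status` output for a given scheduler IP."""
    return f"http://{scheduler_ip}:8787/status"


# Hypothetical value, in the shape `terraform output workers-info` would print.
workers = parse_workers_info("203.0.113.10,203.0.113.11\n")
print(workers)
print(dashboard_url("203.0.113.5"))
```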

### Provisioning scripts

The `user_data` fields in `resources.tf` indicate what script should be used for the provisioning on the nodes.

We provide two templates of scripts which will be filled with the needed variables from `terraform`; one script for the scheduler and one for the workers.

```bash

#!/bin/bash

# scheduler_setup.sh

exec > >(tee /var/log/user-data.log|logger -t user-data -s 2>/dev/console) 2>&1

set -x

echo "Installing pip"

curl -O https://bootstrap.pypa.io/get-pip.py

python get-pip.py --user

~/.local/bin/pip install awscli --upgrade --user

echo "Logging in to ECR registry"

export AWS_ACCESS_KEY_ID=${AWS_KEY}

export AWS_SECRET_ACCESS_KEY=${AWS_PRIVATE_KEY}

export AWS_DEFAULT_REGION=eu-west-1

$(~/.local/bin/aws ecr get-login --no-include-email)

# Assigning tags to instance derived from spot request

# See https://github.com/hashicorp/terraform/issues/3263#issuecomment-284387578

REGION=eu-west-1

INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)

SPOT_REQ_ID=$(~/.local/bin/aws --region $REGION ec2 describe-instances --instance-ids "$INSTANCE_ID"  --query 'Reservations[0].Instances[0].SpotInstanceRequestId' --output text)

if [ "$SPOT_REQ_ID" != "None" ] ; then

  TAGS=$(~/.local/bin/aws --region $REGION ec2 describe-spot-instance-requests --spot-instance-request-ids "$SPOT_REQ_ID" --query 'SpotInstanceRequests[0].Tags')

  ~/.local/bin/aws --region $REGION ec2 create-tags --resources "$INSTANCE_ID" --tags "$TAGS"

fi

echo "Starting docker container from image"

docker run -d -it --network host ${DOCKER_REG} /opt/conda/bin/dask-scheduler

```

The scripts for the workers and for the scheduler are identical except for the last line.

For the workers we should have:

```

docker run -d -it --network host ${DOCKER_REG} /opt/conda/bin/dask-worker ${SCHEDULER_IP}:8786

```

Note that we start `dask-worker` instead of `dask-scheduler` and we indicate the private IP of the scheduler.

**Important:** note the `--network host` flag.

Intuitively, it makes each container share the network of its host, so containers running on different hosts can communicate with each other.
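
To see what the last line of the worker script looks like once the template variables are filled in, here is a small analogy using Python's `string.Template`, which happens to use the same `${NAME}` placeholder syntax. It is not how `terraform` renders templates internally; in particular, Python's `Template` would also replace bare `$NAME` shell variables, so this sketch is applied to the single line only. The values are the placeholders used elsewhere in this tutorial.

```python
from string import Template

# Last line of the worker script, as shown above.
WORKER_LINE = Template(
    "docker run -d -it --network host ${DOCKER_REG} "
    "/opt/conda/bin/dask-worker ${SCHEDULER_IP}:8786"
)

# Values that would come from `dockerRegistry` and `schedulerPrivateIp`.
rendered = WORKER_LINE.substitute(
    DOCKER_REG="repo.url/image-name:latest",
    SCHEDULER_IP="172.31.36.190",
)
print(rendered)
```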

## Running the cluster

We can now run the cluster.

To that end, we need to execute two commands.

First, `terraform init`.

This one prepares the tool and makes it ready to start the nodes.

Next we have to `apply` the instructions.

This we do by invoking:

```bash

TF_VAR_awsKey=YOUR_AWS_KEY \

TF_VAR_awsPrivateKey=YOUR_AWS_PRIVATE_KEY \

terraform apply -var 'workersNum=2' -var 'instanceType="t2.small"' \

-var 'spotPrice=0.2' -var 'schedulerPrivateIp="172.31.36.170"' \

-var 'dockerRegistry="repo.url/image-name:latest"'

```

Note that we use two environment variables for the AWS keys.

Other variables defined in `vars.tf` are passed as parameters.
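
If you prefer to launch the cluster from a script, the same split between environment variables and `-var` flags can be sketched in Python. The helper names below are hypothetical; the code only builds the command and the environment, and leaves the actual call to `subprocess.run` to you.

```python
import os
import shlex


def terraform_apply_command(variables: dict) -> list:
    """Return the argv for `terraform apply` with one -var flag per variable."""
    command = ["terraform", "apply"]
    for name, value in variables.items():
        command += ["-var", f"{name}={value}"]
    return command


def terraform_environment(secrets: dict, base_env=None) -> dict:
    """Copy the environment and add the secrets as TF_VAR_<name> entries."""
    env = dict(os.environ if base_env is None else base_env)
    for name, value in secrets.items():
        env[f"TF_VAR_{name}"] = value
    return env


command = terraform_apply_command({"workersNum": 2, "instanceType": "t2.small"})
env = terraform_environment({"awsKey": "YOUR_AWS_KEY"}, base_env={})

print(shlex.join(command))
print(env)
# To run it for real (assumption: terraform is on the PATH):
# subprocess.run(command, env=terraform_environment(secrets), check=True)
```

Keeping the keys in the environment rather than on the command line avoids leaving them in the shell history.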

Once finished, you can access the newly created scheduler node by: `ssh -i ~/.aws/key.pem ec2-user@$(terraform output scheduler-info)`.

On the node, you can check the provisioning log at `/var/log/user-data.log`.

You can also check the status of the running Docker containers using `docker ps`.

Lastly, if everything went well, you should be able to access the web interface of the cluster.

Its address can be found by invoking `terraform output scheduler-status`.

## Grid search on the cluster

The moment we have been waiting for: run our hyperparameters grid search on the `dask` cluster.

To do so, we can use code similar to [`./gridsearch_local_dask.py`](./gridsearch_local_dask.py).

Only the client's address needs to change:

```python

#!/usr/bin/env python

from sklearn.datasets import load_digits

from sklearn.svm import SVC

from sklearn.model_selection import train_test_split as tts

from sklearn.metrics import classification_report

from distributed import Client

from dask_searchcv import GridSearchCV

# from src import myfoo # An example included from `src`

def main():

    param_space = {'C': [1e-4, 1, 1e4],

                   'gamma': [1e-3, 1, 1e3],

                   'class_weight': [None, 'balanced']}

    model = SVC(kernel='rbf')

    digits = load_digits()

    X_train, X_test, y_train, y_test = tts(digits.data, digits.target,

                                           test_size=0.3)

    print("Connecting to the cluster")

    client = Client('x.y.z.w:8786')  # replace x.y.z.w with the scheduler's address

    print(client)

    print("Start searching")

    search = GridSearchCV(model, param_space, cv=3)

    search.fit(X_train, y_train)

    print("Prepare report")

    print(classification_report(

        y_true=y_test, y_pred=search.best_estimator_.predict(X_test))

    )

if __name__ == '__main__':

    main()

```

Running this script would start the grid search on the `dask` cluster.

This can be monitored on the web dashboard.

If you have a running cluster at `x.y.z.w`, you can try it out:

```bash

docker run -it --rm -p 8786:8786 dask-example ./gridsearch_cluster_dask.py x.y.z.w

```
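
The command above passes the scheduler's address as an argument, whereas the script shown earlier has it hardcoded. The repository's `gridsearch_cluster_dask.py` is not reproduced here, so the following is only one hypothetical way to turn that argument into the address string expected by `Client`.

```python
import argparse

DEFAULT_SCHEDULER_PORT = 8786  # port the workers connect to in this tutorial


def scheduler_address(argv: list) -> str:
    """Build the `host:port` string for the dask client from CLI arguments."""
    parser = argparse.ArgumentParser(description="Grid search on a dask cluster")
    parser.add_argument("host", help="IP or hostname of the dask scheduler")
    parser.add_argument("--port", type=int, default=DEFAULT_SCHEDULER_PORT)
    args = parser.parse_args(argv)
    return f"{args.host}:{args.port}"


address = scheduler_address(["172.31.36.190"])
print(address)
# In the script one would then write:
# client = Client(scheduler_address(sys.argv[1:]))
```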

## Yet to be discussed

* You might want to explore `terraform workspace`; it can help you run several clusters from the same directory, for example when running different experiments at the same time.

* Enable a node with Jupyter server so the local notebook won't be needed