{"id":16567918,"url":"https://github.com/drorata/ds-dask_cluster_example","last_synced_at":"2026-03-08T11:37:03.042Z","repository":{"id":150667560,"uuid":"116236575","full_name":"drorata/ds-dask_cluster_example","owner":"drorata","description":"Tutorial on setting EC2 based dask cluster","archived":false,"fork":false,"pushed_at":"2018-03-07T10:00:23.000Z","size":7811,"stargazers_count":2,"open_issues_count":0,"forks_count":2,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-03-05T10:46:37.323Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"HCL","has_issues":false,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/drorata.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-01-04T08:45:54.000Z","updated_at":"2018-06-07T21:02:05.000Z","dependencies_parsed_at":"2023-04-29T10:30:43.555Z","dependency_job_id":null,"html_url":"https://github.com/drorata/ds-dask_cluster_example","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/drorata/ds-dask_cluster_example","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/drorata%2Fds-dask_cluster_example","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/drorata%2Fds-dask_cluster_example/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/drorata%2Fds-dask_cluster_example/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/drorata%2Fds-dask_cluster_example/manifests","owner_url":"https://repo
s.ecosyste.ms/api/v1/hosts/GitHub/owners/drorata","download_url":"https://codeload.github.com/drorata/ds-dask_cluster_example/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/drorata%2Fds-dask_cluster_example/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":30254598,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-03-08T08:59:44.879Z","status":"ssl_error","status_checked_at":"2026-03-08T08:58:02.867Z","response_time":56,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-10-11T21:07:47.434Z","updated_at":"2026-03-08T11:37:03.026Z","avatar_url":"https://github.com/drorata.png","language":"HCL","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Moving from local machine to Dask cluster using Terraform\n\nAuthor: Dror Atariah\n\n## Introduction\n\nAs part of the never-ending effort to improve [reBuy](https://www.rebuy.de/) and turn it into a market leader, we recently decided to tackle the challenges of our customer services agents.\nAs a first step, a dump of tagged emails was created and the first goal was set: build a POC that tags the emails automatically.\nTo that end, NLP had to be used and a lengthy (and greedy) [grid search](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) had to be executed.\nSo lengthy, that 4 
cores of a notebook were working for a couple of hours with no results.\nThis was the point when I decided to explore [`dask`](http://dask.pydata.org/en/latest/) and its sibling [`distributed`](https://distributed.readthedocs.io/en/latest/).\nIn this tutorial/post we shall discuss how to take local code that runs a grid search with Scikit-Learn to a cluster of AWS (EC2) nodes.\n\n## Start locally\n\nWe start with a minimal example that loads data and grid searches the hyperparameters.\nThe project's structure might be:\n\n```\n.\n├── data\n├── models\n└── src\n```\n\nIn `./src` we may include some special tools, functions and classes that we would like to use in the project or in a more complicated pipeline.\nWe will show later how to include these tools in the distributed environment.\nNote that the project has the structure of a Python project and should include a `setup.py` at the project's root.\nWe start with a simple example:\n\n```python\nfrom sklearn.datasets import load_digits\nfrom sklearn.svm import SVC\nfrom sklearn.model_selection import GridSearchCV\n# from src import myfoo # An example included from `src`\n\nparam_space = {'C': [1e-4, 1, 1e4],\n               'gamma': [1e-3, 1, 1e3],\n               'class_weight': [None, 'balanced']}\n\nmodel = SVC(kernel='rbf')\n\ndigits = load_digits()\n\nsearch = GridSearchCV(model, param_space, cv=3)\nsearch.fit(digits.data, digits.target)\n```\n\nA slightly more elaborate version of this example can be found in the docker image defined [here](./Dockerfile).\nYou can try it out by cloning this repository and running the following:\n\n```bash\ndocker build . -t dask-example\ndocker run --rm dask-example ./gridsearch_local.py\n```\n\nSo far, so good.\nBut imagine the data set is larger and the hyperparameter space is more complicated.\nThe search quickly becomes virtually impossible to run on a local machine.\nAt this point there are at least two possible courses of action:\n\n1. Use more computing power\n2. 
Optimize the search and/or be smarter\n\nIn this post we take the former.\nA seemingly easy way to scale out the local machine to a cluster is [`dask`](http://dask.pydata.org/en/latest/).\nTo start with, staying on the local machine, let's try out the [`LocalCluster`](https://distributed.readthedocs.io/en/latest/local-cluster.html).\nCheck out [`gridsearch_local_dask.py`](./gridsearch_local_dask.py), which you can try out with:\n\n```bash\ndocker run -it --rm dask-example ./gridsearch_local_dask.py\n```\n\nThis already feels a little faster, doesn't it?\nBut we *need* to scale out, and to that end we want a cluster of EC2 nodes.\nThere are two main steps:\n\n1. Bundle the computation environment in a Docker image\n2. Run a `dask` cluster where each node has the computation environment\n\n\n\n## Bundle the computation environment\n\nFor the `dask` cluster to function, each node has to have the same computation environment.\nDocker is a straightforward way to make this happen.\nThe way to go is to define a `Dockerfile`:\n\n```docker\nFROM continuumio/miniconda3\n\nRUN mkdir project\n\nCOPY requirements.txt /project/requirements.txt\nCOPY src/ /project/src\nCOPY setup.py /project/setup.py\nWORKDIR /project\nRUN pip install -r requirements.txt\n```\n\nThe local `requirements.txt` and `setup.py` are copied into the image.\nIt is recommended to include `bokeh` in `requirements.txt`; otherwise the web dashboard of `dask` won't work.\nThe `Dockerfile` can include further steps like `RUN apt-get update \u0026\u0026 apt-get install -y build-essential freetds-dev` or `RUN python -m nltk.downloader punkt`.\nIf `./src` includes needed classes, functions etc., then make sure you include something like `-e .` or merely `.` in `requirements.txt`; this way these dependencies will be available in the image.\nIt is important to include in the `Dockerfile` all the components needed for the computation environment!\n\nNext, the image should be placed in a location 
accessible to EC2 instances.\nIt is time to push the image to a Docker registry.\nIn this tutorial, we use the AWS container registry (ECR), but you can use other options like `DockerHub`.\nI assume you have [`awscli`](https://aws.amazon.com/cli/) installed and the credentials are known.\nYou can log in to the registry, then build and push the image, with:\n\n```bash\n# Execute from the project's root\n$(aws ecr get-login --no-include-email)\ndocker build -t image-name .\ndocker tag image-name:latest repo.url/image-name:latest\ndocker push repo.url/image-name:latest\n```\n\nIt is time to set up the nodes of the cluster.\n\n## Defining the Dask cluster\n\nWe take a declarative approach and use [`terraform`](https://www.terraform.io) to set up the nodes of the cluster.\nNote that in this example we use AWS spot instances; you can easily change the code and use the regular on-demand instances.\nThis is left as an exercise.\nWe use two groups of files to define the cluster:\n\n- `.tf` instructions: parsed by `terraform` and defining what instances to use, what tags, regions, etc.\n- Provisioning shell scripts: installing needed tools on the nodes\n\n\n### `.tf` files\n\nWhen using `terraform` all `.tf` files are read and concatenated.\nThere are more details of course; a good entry point would be [this](https://www.terraform.io/docs/configuration/index.html).\nIn our example we organize the `.tf` files as follows:\n\n- `terraform.tf`: general settings\n- `vars.tf`: variable definitions which can be used from the CLI\n- `provision.tf`: instructions on how to call the provisioning scripts\n- `resources.tf`: definition of the resources\n- `output.tf`: definition of outputs provided by `terraform`\n\n#### `terraform.tf`\n\n```\nprovider \"aws\" {\n  region = \"eu-west-1\"\n}\n```\n\n#### `vars.tf`\n\n```\nvariable \"instanceType\" {\n  type    = \"string\"\n  default = \"c5.2xlarge\"\n}\n\nvariable \"spotPrice\" {\n  # Not needed for on-demand instances\n  default = \"0.1\"\n}\n\nvariable \"contact\" {\n  type = \"string\"\n  
default = \"d.atariah\"\n}\n\nvariable \"department\" {\n  type = \"string\"\n  default = \"My wonderful department\"\n}\n\nvariable \"subnet\" {\n  default = \"subnet-007\"\n}\n\nvariable \"securityGroup\" {\n  type = \"string\"\n  default = \"sg-42\"\n}\n\nvariable \"workersNum\" {\n  default = \"4\"\n}\n\nvariable \"schedulerPrivateIp\" {\n  # We predefine a private IP for the scheduler; it will be used by the workers\n  default = \"172.31.36.190\"\n}\n\nvariable \"dockerRegistry\" {\n  default = \"\"\n}\n\n# By defining the AWS keys as variables we can get them from the command line\n# and pass them to the provisioning scripts\nvariable \"awsKey\" {}\nvariable \"awsPrivateKey\" {}\n```\n\n#### `provision.tf`\n\n```\ndata \"template_file\" \"scheduler_setup\" {\n  template = \"${file(\"scheduler_setup.sh\")}\" # see the shell script below\n  vars {\n    # Use the AWS keys passed from the terraform CLI\n    AWS_KEY = \"${var.awsKey}\"\n    AWS_PRIVATE_KEY = \"${var.awsPrivateKey}\"\n    DOCKER_REG = \"${var.dockerRegistry}\"\n  }\n}\n\ndata \"template_file\" \"worker_setup\" {\n  template = \"${file(\"worker_setup.sh\")}\" # see the shell script below\n  vars {\n    AWS_KEY = \"${var.awsKey}\"\n    AWS_PRIVATE_KEY = \"${var.awsPrivateKey}\"\n    DOCKER_REG = \"${var.dockerRegistry}\"\n    SCHEDULER_IP = \"${var.schedulerPrivateIp}\"\n  }\n}\n```\n\n#### `resources.tf`\n\nThis is the core of the settings; here we put everything together and define the requests for the AWS spot instances.\n\n```\nresource \"aws_spot_instance_request\" \"dask-scheduler\" {\n  ami                         = \"ami-4cbe0935\" # [1]\n  instance_type               = \"${var.instanceType}\"\n  spot_price                  = \"${var.spotPrice}\"\n  wait_for_fulfillment        = true\n  key_name                    = \"dask_poc\"\n  security_groups             = [\"${var.securityGroup}\"]\n  subnet_id                   = \"${var.subnet}\"\n  associate_public_ip_address = true\n  private_ip            
      = \"${var.schedulerPrivateIp}\" # [2]\n  user_data                   = \"${data.template_file.scheduler_setup.rendered}\"\n  tags {\n    Name = \"${terraform.workspace}-dask-scheduler\",\n    Department = \"${var.department}\",\n    contact = \"${var.contact}\"\n  }\n}\n\nresource \"aws_spot_instance_request\" \"dask-worker\" {\n  count                       = \"${var.workersNum}\" # [3]\n  ami                         = \"ami-4cbe0935\" # [1]\n  instance_type               = \"${var.instanceType}\"\n  spot_price                  = \"${var.spotPrice}\"\n  wait_for_fulfillment        = true\n  key_name                    = \"dask_poc\"\n  subnet_id                   = \"${var.subnet}\"\n  security_groups             = [\"${var.securityGroup}\"]\n  associate_public_ip_address = true\n  user_data                   = \"${data.template_file.worker_setup.rendered}\"\n  tags {\n    Name = \"${terraform.workspace}-dask-worker${count.index}\",\n    Department = \"${var.department}\",\n    contact = \"${var.contact}\"\n  }\n}\n```\n\nHere are some important elements to note:\n\n1. The [AMI](http://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs-optimized_AMI.html) I use is the one for `eu-west-1` which is optimized for Docker and provided by Amazon. It is possible to use other images, but it is important that they will support `docker`.\n2. Define the private IP of the scheduler. You will need to use it when starting the workers and it is easier to *know* the IP than to *find* it\n3. 
Indicate how many workers should be used\n\n#### `output.tf`\n\n`terraform` allows the definition of various outputs.\nAs always, more details can be found [here](https://www.terraform.io/intro/getting-started/outputs.html).\n\n```\noutput \"scheduler-info\" {\n  value = \"${aws_spot_instance_request.dask-scheduler.public_ip}\"\n}\n\noutput \"workers-info\" {\n  value = \"${join(\",\",aws_spot_instance_request.dask-worker.*.public_ip)}\"\n}\n\noutput \"scheduler-status\" {\n  value = \"http://${aws_spot_instance_request.dask-scheduler.public_ip}:8787/status\"\n}\n```\n\n### Provisioning scripts\n\nThe `user_data` fields in `resources.tf` indicate what script should be used for the provisioning on the nodes.\nWe provide two templates of scripts which will be filled with the needed variables from `terraform`; one script for the scheduler and one for the workers.\n\n```bash\n#!/bin/bash\n\n# scheduler_setup.sh\n\nexec \u003e \u003e(tee /var/log/user-data.log|logger -t user-data -s 2\u003e/dev/console) 2\u003e\u00261\nset -x\n\necho \"Installing pip\"\ncurl -O https://bootstrap.pypa.io/get-pip.py\npython get-pip.py --user\n~/.local/bin/pip install awscli --upgrade --user\necho \"Logging in to ECS registry\"\nexport AWS_ACCESS_KEY_ID=${AWS_KEY}\nexport AWS_SECRET_ACCESS_KEY=${AWS_PRIVATE_KEY}\nexport AWS_DEFAULT_REGION=eu-west-1\n$(~/.local/bin/aws ecr get-login --no-include-email)\n\n# Assigning tags to instance derived from spot request\n# See https://github.com/hashicorp/terraform/issues/3263#issuecomment-284387578\nREGION=eu-west-1\nINSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)\nSPOT_REQ_ID=$(~/.local/bin/aws --region $REGION ec2 describe-instances --instance-ids \"$INSTANCE_ID\"  --query 'Reservations[0].Instances[0].SpotInstanceRequestId' --output text)\nif [ \"$SPOT_REQ_ID\" != \"None\" ] ; then\n  TAGS=$(~/.local/bin/aws --region $REGION ec2 describe-spot-instance-requests --spot-instance-request-ids \"$SPOT_REQ_ID\" --query 
'SpotInstanceRequests[0].Tags')\n  ~/.local/bin/aws --region $REGION ec2 create-tags --resources \"$INSTANCE_ID\" --tags \"$TAGS\"\nfi\n\necho \"Starting docker container from image\"\ndocker run -d -it --network host ${DOCKER_REG} /opt/conda/bin/dask-scheduler\n```\n\nThe scripts for the workers and for the scheduler are identical, except for the last line.\nFor the workers the last line should be:\n\n```\ndocker run -d -it --network host ${DOCKER_REG} /opt/conda/bin/dask-worker ${SCHEDULER_IP}:8786\n```\n\nNote that we start `dask-worker` instead of `dask-scheduler` and we pass the private IP of the scheduler.\nIt is **important** to note the `--network host` flag.\nIntuitively, it makes each container share the network of its host, so containers on different hosts are able to communicate.\n\n\n\n## Running the cluster\n\nWe can now run the cluster.\nTo that end, we need to execute two commands.\nFirst, `terraform init`.\nThis prepares the tool and makes it ready to start the nodes.\nNext we have to `apply` the instructions.\nWe do this by invoking:\n\n```bash\nTF_VAR_awsKey=YOUR_AWS_KEY \\\nTF_VAR_awsPrivateKey=YOUR_AWS_PRIVATE_KEY \\\nterraform apply -var 'workersNum=2' -var 'instanceType=\"t2.small\"' \\\n-var 'spotPrice=0.2' -var 'schedulerPrivateIp=\"172.31.36.170\"' \\\n-var 'dockerRegistry=\"repo.url/image-name:latest\"'\n```\n\nNote that we use two environment variables for the AWS keys.\nOther variables defined in `vars.tf` are passed as parameters.\nOnce finished, you can access the newly created scheduler node by: `ssh -i ~/.aws/key.pem ec2-user@$(terraform output scheduler-info)`.\nOn the node you can check the log at `/var/log/user-data.log`.\nYou can also check the status of the running Docker containers using `docker ps`.\nLastly, if everything went well, you should be able to access the web interface of the cluster.\nIts address can be found by invoking `terraform output 
scheduler-status`.\n\n\n## Grid search on the cluster\n\nThe moment we have been waiting for: running our hyperparameter grid search on the `dask` cluster.\nTo do so, we can use code similar to [`./gridsearch_local_dask.py`](./gridsearch_local_dask.py).\nOnly the client's address needs to change:\n\n```python\n#!/usr/bin/env python\n\nfrom sklearn.datasets import load_digits\nfrom sklearn.svm import SVC\nfrom sklearn.model_selection import train_test_split as tts\nfrom sklearn.metrics import classification_report\nfrom distributed import Client\nfrom dask_searchcv import GridSearchCV\n# from src import myfoo # An example included from `src`\n\n\ndef main():\n    param_space = {'C': [1e-4, 1, 1e4],\n                   'gamma': [1e-3, 1, 1e3],\n                   'class_weight': [None, 'balanced']}\n\n    model = SVC(kernel='rbf')\n\n    digits = load_digits()\n\n    X_train, X_test, y_train, y_test = tts(digits.data, digits.target,\n                                           test_size=0.3)\n\n    print(\"Connecting to the cluster\")\n    # Replace x.y.z.w with the address of the scheduler\n    client = Client('x.y.z.w:8786')\n    print(client)\n\n    print(\"Start searching\")\n    search = GridSearchCV(model, param_space, cv=3)\n    search.fit(X_train, y_train)\n\n    print(\"Prepare report\")\n    print(classification_report(\n        y_true=y_test, y_pred=search.best_estimator_.predict(X_test))\n    )\n\n\nif __name__ == '__main__':\n    main()\n```\n\nRunning this script starts the grid search on the `dask` cluster.\nThis can be monitored on the web dashboard.\nIf you have a running cluster at `x.y.z.w`, you can try it out:\n\n```bash\ndocker run -it --rm -p 8786:8786 dask-example ./gridsearch_cluster_dask.py x.y.z.w\n```\n\n## Yet to be discussed\n\n* You might want to explore `terraform workspace`; this can help you run several clusters from the same directory. 
For example, when running different experiments at the same time.\n* Enable a node with a Jupyter server so the local notebook won't be needed\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdrorata%2Fds-dask_cluster_example","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdrorata%2Fds-dask_cluster_example","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdrorata%2Fds-dask_cluster_example/lists"}