https://github.com/hammerlab/dask-distributed-on-kubernetes
Deploy dask-distributed on google container engine using kubernetes
- Host: GitHub
- URL: https://github.com/hammerlab/dask-distributed-on-kubernetes
- Owner: hammerlab
- License: apache-2.0
- Created: 2016-07-12T17:59:05.000Z (almost 10 years ago)
- Default Branch: master
- Last Pushed: 2019-04-15T08:24:57.000Z (about 7 years ago)
- Last Synced: 2025-06-12T02:06:14.715Z (12 months ago)
- Language: Jupyter Notebook
- Homepage:
- Size: 1.59 MB
- Stars: 40
- Watchers: 11
- Forks: 6
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Running dask.distributed on Kubernetes on Google Container Engine
This small repo gives an example Kubernetes configuration for running [dask.distributed](https://github.com/dask/distributed) on Google Container Engine.
## Start a cluster if needed
If you don't already have a cluster running, use a command like the following to start one (here it is called "daskd-cluster"):
```
gcloud container clusters create daskd-cluster \
--zone us-east1-b \
--num-nodes=2 \
--enable-autoscaling --min-nodes=1 --max-nodes=100 \
--machine-type=n1-highmem-16
```
You should see your cluster:
https://console.cloud.google.com/kubernetes/list
Then run this to set it as the default for your session:
```
gcloud config set container/cluster daskd-cluster
gcloud container clusters get-credentials daskd-cluster
```
## Deploy dask.distributed
Edit [spec.yaml](spec.yaml) to use the Docker image appropriate for your task. You may also want to adjust the CPU and memory requests to match what your task needs.
This will launch a dask.distributed scheduler and one worker:
```
kubectl create -f spec.yaml
```
You can check how many workers are running with:
```
kubectl get pods
```
Now, scale up the deployment. Here we request 100 workers:
```
kubectl scale deployment daskd-worker --replicas=100
```
You can now run `kubectl get pods` again to check when the workers are started.
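If you would rather not re-read the pod list by eye, the check can be scripted. The following is a minimal sketch, not part of this repo: it assumes the default `kubectl get pods` table layout with a `STATUS` column, and that worker pod names start with the deployment name `daskd-worker`. The function names are hypothetical.

```python
import subprocess


def count_running(pods_output, prefix="daskd-worker"):
    """Count pods whose name starts with `prefix` and whose STATUS is Running.

    `pods_output` is the plain-text table printed by `kubectl get pods`.
    """
    lines = pods_output.strip().splitlines()
    if not lines:
        return 0
    header = lines[0].split()
    if "STATUS" not in header:
        return 0
    status_col = header.index("STATUS")
    count = 0
    for line in lines[1:]:
        fields = line.split()
        if len(fields) <= status_col:
            continue
        if fields[0].startswith(prefix) and fields[status_col] == "Running":
            count += 1
    return count


def running_workers_now():
    """Call kubectl (assumed to be on PATH) and count running worker pods."""
    out = subprocess.run(
        ["kubectl", "get", "pods"],
        check=True, capture_output=True, text=True,
    ).stdout
    return count_running(out)
```

Calling `running_workers_now()` in a loop with a short sleep gives a simple way to wait until the requested number of workers is up.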
You can check a pod's stdout/stderr with (replace the name with a pod name from `kubectl get pods`):
```
kubectl logs daskd-scheduler-3680716393-j19xr
```
## Run your analysis
First, get the IP of the scheduler (you want the external IP of `daskd-scheduler`):
```
$ kubectl get service
NAME              CLUSTER-IP    EXTERNAL-IP       PORT(S)    AGE
daskd-scheduler   10.3.249.60   104.196.185.187   8786/TCP   4m
kubernetes        10.3.240.1    <none>            443/TCP    17h
```
For scripting, here's a one-liner for getting the IP:
```
DASK_IP=$(kubectl get service | grep daskd-scheduler | tr -s ' ' | cut -d ' ' -f 3)
```
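The `grep`/`cut` one-liner depends on the column layout of the table. As an alternative, here is a hedged sketch that reads the JSON form of the service instead. It assumes the service is of type LoadBalancer, so the address appears under `status.loadBalancer.ingress`; the function names are hypothetical and not part of this repo.

```python
import json
import subprocess


def external_ip(service_json):
    """Return the first load-balancer IP in a Service JSON document, or None.

    None means no external IP has been assigned yet (still pending).
    """
    service = json.loads(service_json)
    ingress = service.get("status", {}).get("loadBalancer", {}).get("ingress") or []
    for entry in ingress:
        if entry.get("ip"):
            return entry["ip"]
    return None


def scheduler_address(name="daskd-scheduler", port=8786):
    """Ask kubectl (assumed to be on PATH) for the service and build 'ip:port'."""
    out = subprocess.run(
        ["kubectl", "get", "service", name, "-o", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    ip = external_ip(out)
    if ip is None:
        raise RuntimeError("no external IP assigned to service %r yet" % name)
    return "%s:%d" % (ip, port)
```

The string returned by `scheduler_address()` has the same `ip:port` form that is passed to the dask client.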
When you instantiate your dask Executor, just pass in the IP and port:
```python
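from math import sqrt

from dask import delayed
# Note: Executor was renamed to Client in later dask.distributed releases.
from dask.distributed import Executor

# Connect to the scheduler's external IP and port.
client = Executor("104.196.185.187:8786")
# Build one lazy task per input; nothing runs until compute() is called.
tasks = [delayed(sqrt)(i) for i in range(100)]
# sync=True blocks and returns the computed results rather than futures.
results = client.compute(tasks, sync=True)
print(results)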
```
## Tearing it down
When you're done, shut down the service and cluster:
```
kubectl delete -f spec.yaml
gcloud container clusters delete daskd-cluster
```
## Running a benchmark
We also include a simple [benchmark](benchmarking/benchmark.py) script that tests the performance of the cluster with varying numbers of workers (it issues `kubectl` calls itself to change the number of workers). See the script for details. Here's an example invocation, run from the `benchmarking` directory:
```
DASK_IP=$(kubectl get service | grep daskd-scheduler | tr -s ' ' | cut -d ' ' -f 3)
python benchmark.py \
--tasks 5000 \
--task-time .05 \
--dask-scheduler $DASK_IP:8786 \
--jobs-range 200 800 200 \
--replicas 1 \
--out results2.csv
```
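When reading the results, it can help to have an ideal-scaling baseline to compare against. The sketch below is an illustration only and is not taken from `benchmark.py`: it assumes `--tasks` is the number of tasks, `--task-time` is the duration of each task in seconds, each worker runs one task at a time, and scheduling and network overhead are zero. The function name is hypothetical.

```python
import math


def ideal_wall_time(tasks, task_time, workers):
    """Lower bound on wall-clock seconds for `tasks` equal-length tasks.

    Assumes one task per worker at a time and no scheduling overhead.
    """
    if workers < 1:
        raise ValueError("workers must be at least 1")
    # Tasks run in rounds; the last round may be only partly full.
    rounds = math.ceil(tasks / workers)
    return rounds * task_time


if __name__ == "__main__":
    for workers in (1, 10, 100):
        seconds = ideal_wall_time(5000, 0.05, workers)
        print("%d workers: about %.2f s at best" % (workers, seconds))
```

Measured times above this baseline show how much overhead the cluster adds at a given worker count.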