
# Dask Executor Scheduler

A Dask scheduler that uses a Python `concurrent.futures.Executor` to run tasks.

The motivation for building this was to let Dask use serverless cloud functions to execute tasks.
Using serverless cloud functions allows scaling to thousands of concurrent workers, with no cluster to set up and manage.
This code has been used with [Pywren](https://github.com/pywren); see the instructions below.

The implementation is fairly naive - tasks are placed on an in-memory queue and processed by the [executor](https://docs.python.org/3/library/concurrent.futures.html#concurrent.futures.Executor) in batches. Tasks are accumulated in a batch until they reach a certain size, or a timeout occurs - whichever happens first.
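The batching loop can be sketched as follows. This is a minimal illustration of the size-or-timeout idea, not the project's actual code; `next_batch`, `batch_size`, and `timeout` are illustrative names:

```python
import queue
import time
from concurrent.futures import ThreadPoolExecutor

def next_batch(task_queue, batch_size=4, timeout=0.5):
    """Collect tasks until the batch is full or the timeout expires,
    whichever happens first."""
    batch = []
    deadline = time.monotonic() + timeout
    while len(batch) < batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # timeout reached: submit whatever we have
        try:
            batch.append(task_queue.get(timeout=remaining))
        except queue.Empty:
            break  # queue drained for now
    return batch

# Illustrative use: square six queued "tasks", one batch at a time.
q = queue.Queue()
for i in range(6):
    q.put(i)

with ThreadPoolExecutor() as executor:
    results = []
    while not q.empty():
        batch = next_batch(q)
        results.extend(executor.map(lambda x: x * x, batch))

print(results)  # [0, 1, 4, 9, 16, 25]
```

With `batch_size=4`, the six tasks above are processed as a batch of four followed by a batch of two (the second batch is released by the timeout rather than by filling up).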

The tasks are generated by the Dask local scheduler, so there is no guarantee that they will be produced in an order that works well for this style of execution. However, batch-style parallel processing is generally a good fit for this scheduler.

Bookkeeping tasks (i.e. those that don't do any real work) are executed locally.

For testing, it's useful to use a [ThreadPoolExecutor](https://docs.python.org/3/library/concurrent.futures.html#threadpoolexecutor). This is the default if no executor is specified.
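Any object implementing the standard [`Executor`](https://docs.python.org/3/library/concurrent.futures.html#concurrent.futures.Executor) interface is interchangeable here, since the scheduler only needs `submit`. A small demonstration of that interface (a sketch, independent of this project's API):

```python
from concurrent.futures import Executor, ThreadPoolExecutor

def run_tasks(executor: Executor, tasks):
    """Submit (callable, args) pairs to any Executor and gather results in order."""
    futures = [executor.submit(fn, *args) for fn, args in tasks]
    return [f.result() for f in futures]

tasks = [(pow, (2, 10)), (sum, ([1, 2, 3],)), (len, ("dask",))]
with ThreadPoolExecutor(max_workers=2) as ex:
    print(run_tasks(ex, tasks))  # [1024, 6, 4]
```

Swapping `ThreadPoolExecutor` for a serverless-backed executor such as Pywren's requires no change to the calling code, which is what makes the `Executor` abstraction a good fit.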

Have a look in the examples directory to see how to use the scheduler.

### Upstream discussion/implementation

See also [dask#6220](https://github.com/dask/dask/issues/6220) for discussion about including this in Dask; and [dask#6322](https://github.com/dask/dask/pull/6322) for an implementation.

### Installation

```bash
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
pip install -e .
```

Or using Conda (which makes Zarr easier to install):

```bash
conda env create -f environment.yml
conda activate dask_executor_scheduler
pip install -e .
```

### Running locally

Local thread pool:

```bash
python examples/threadpool_executor.py
```

Pywren using a local executor:

```bash
python examples/pywren_local_executor.py
```

### Configuring Pywren for Google Cloud

I've created a branch of `pywren-ibm-cloud` with support for Google Cloud Storage and Google Cloud Run here: https://github.com/tomwhite/pywren-ibm-cloud.

Edit your `~/.pywren_config` file as follows, where `<bucket-name>`, `<project-id>`, and `<region>` are placeholders for the name of a newly-created bucket, your Google Cloud project, and its region:

```yaml
pywren:
    storage_bucket: <bucket-name>
    storage_backend: gcsfs
    compute_backend: cloudrun

gcsfs:
    project_id: <project-id>

cloudrun:
    project_id: <project-id>
    region: <region>
```

Run using the Cloud Run executor:

```bash
python examples/pywren_cloudrun_executor.py
```

### Pywren runtimes

The default runtime will be built automatically the first time you run Pywren. To run the examples that use Zarr you will need to build a custom conda runtime
(since installing Zarr via pip requires compiling numcodecs). Note that this requires https://github.com/pywren/pywren-ibm-cloud to be checked out in the parent directory.

```bash
PROJECT_ID=...
PYWREN_LOGLEVEL=DEBUG pywren-ibm-cloud runtime build -f ../pywren-ibm-cloud/runtime/cloudrun/Dockerfile.conda37 "$PROJECT_ID/pywren-cloudrun-conda-v37:latest"
```

You can run this repeatedly to rebuild the runtime. You can create (or update) the Cloud Run function that uses the runtime with:

```bash
pywren-ibm-cloud runtime create "$PROJECT_ID/pywren-cloudrun-conda-v37:latest"
```

The full docs on runtimes are here: https://github.com/pywren/pywren-ibm-cloud/tree/master/runtime

### Example: Rechunking Zarr files

Rechunking Zarr files is a common but surprisingly difficult problem to get right. [This thread](https://discourse.pangeo.io/t/best-practices-to-go-from-1000s-of-netcdf-files-to-analyses-on-a-hpc-cluster/588) has an excellent discussion of the problem, along with many suggested approaches and solutions.

The [rechunker](https://github.com/pangeo-data/rechunker) library is a general-purpose solution, and one well suited to Pywren, since the Dask graph is small and the IO can be offloaded to the cloud without starting a dedicated Dask cluster.

The examples directory has a few examples of running rechunker on Zarr files using Pywren.

To run them you will need to create a conda runtime as explained in the previous section, and you will need to create a GCS bucket for the Zarr files.

Run using local files and local compute (local Dask and Pywren):

```bash
python examples/rechunk_local_storage_local_compute.py delete
python examples/rechunk_local_storage_local_compute.py create
python examples/rechunk_local_storage_local_compute.py rechunk
```

Run using Cloud storage and compute:

```bash
PROJECT_ID=...
BUCKET=...
python examples/rechunk_cloud_storage_cloud_compute.py delete $PROJECT_ID $BUCKET
python examples/rechunk_cloud_storage_cloud_compute.py create $PROJECT_ID $BUCKET
python examples/rechunk_cloud_storage_cloud_compute.py rechunk $PROJECT_ID $BUCKET
```

You can inspect the files in the bucket using the regular CLI tools or the cloud console.

Delete the files from the bucket after you have finished:

```bash
python examples/rechunk_cloud_storage_cloud_compute.py delete $PROJECT_ID $BUCKET
```

### Related projects

The idea for this came from the work I did in [Zappy](https://github.com/lasersonlab/zappy) to run NumPy processing on Pywren.