{"id":16619407,"url":"https://github.com/tomwhite/dask-executor-scheduler","last_synced_at":"2025-07-24T23:34:21.697Z","repository":{"id":66918106,"uuid":"263943661","full_name":"tomwhite/dask-executor-scheduler","owner":"tomwhite","description":"A Dask scheduler that uses a Python concurrent.futures.Executor to run tasks","archived":false,"fork":false,"pushed_at":"2020-08-07T11:44:37.000Z","size":17,"stargazers_count":7,"open_issues_count":1,"forks_count":2,"subscribers_count":3,"default_branch":"master","last_synced_at":"2025-04-09T11:22:06.918Z","etag":null,"topics":["dask","pywren","serverless"],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/tomwhite.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-05-14T14:51:15.000Z","updated_at":"2024-08-28T10:19:21.000Z","dependencies_parsed_at":"2023-05-14T00:45:14.519Z","dependency_job_id":null,"html_url":"https://github.com/tomwhite/dask-executor-scheduler","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/tomwhite/dask-executor-scheduler","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tomwhite%2Fdask-executor-scheduler","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tomwhite%2Fdask-executor-scheduler/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tomwhite%2Fdask-executor-scheduler/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/
tomwhite%2Fdask-executor-scheduler/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/tomwhite","download_url":"https://codeload.github.com/tomwhite/dask-executor-scheduler/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tomwhite%2Fdask-executor-scheduler/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":266922848,"owners_count":24006984,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-07-24T02:00:09.469Z","response_time":99,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["dask","pywren","serverless"],"created_at":"2024-10-12T02:25:20.001Z","updated_at":"2025-07-24T23:34:21.686Z","avatar_url":"https://github.com/tomwhite.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Dask Executor Scheduler\n\nA Dask scheduler that uses a Python _concurrent.futures.Executor_ to run tasks.\n\nThe motivation for building this was to get Dask to use serverless cloud functions for executing tasks.\nUsing serverless cloud functions allows scaling to thousands of concurrent workers, with no cluster to set up and manage.\nThis code has been used with [Pywren](https://github.com/pywren); see the instructions below.\n\nThe implementation is fairly naive: tasks are placed on an in-memory queue and processed by the 
[executor](https://docs.python.org/3/library/concurrent.futures.html#concurrent.futures.Executor) in batches. Tasks are accumulated in a batch until they reach a certain size, or a timeout occurs - whichever happens first.\n\nThe tasks are generated by the Dask local scheduler, so there is no guarantee that they will be produced in an order that works well for this style of execution. However, batch-style parallel processing is generally a good fit for this scheduler.\n\nBookkeeping tasks (i.e. those that don't do any real work) are executed locally.\n\nFor testing, it's useful to use a [ThreadPoolExecutor](https://docs.python.org/3/library/concurrent.futures.html#threadpoolexecutor). This is the default if no executor is specified.\n\nHave a look in the examples directory to see how to use the scheduler.\n\n### Upstream discussion/implementation\n\nSee also [dask#6220](https://github.com/dask/dask/issues/6220) for discussion about including this in Dask; and [dask#6322](https://github.com/dask/dask/pull/6322) for an implementation.\n\n### Installation\n\n```bash\npython3 -m venv venv\nsource venv/bin/activate\npip install -r requirements.txt\npip install -e .\n```\n\nOr using Conda (easier to install Zarr):\n\n```bash\nconda env create -f environment.yml \nconda activate dask_executor_scheduler\npip install -e .\n```\n\n### Running locally\n\nLocal thread pool:\n\n```bash\npython examples/threadpool_executor.py\n```\n\nPywren using a local executor:\n\n```bash\npython examples/pywren_local_executor.py\n```\n\n### Configuring Pywren for Google Cloud\n\nI've created a branch of `pywren-ibm-cloud` with support for Google Cloud Storage and Google Cloud Run here: https://github.com/tomwhite/pywren-ibm-cloud.\n\nEdit your _~/.pywren_config_ file as follows, where `\u003cBUCKET\u003e` is the name of a newly-created bucket:\n\n```\npywren:\n    storage_bucket: \u003cBUCKET\u003e\n    storage_backend: gcsfs\n    compute_backend: cloudrun\n\ngcsfs:\n    project_id: 
\u003cPROJECT_ID\u003e\n\ncloudrun:\n    project_id: \u003cPROJECT_ID\u003e\n    region: \u003cREGION\u003e\n```\n\nRun using the Cloud Run executor:\n\n```bash\npython examples/pywren_cloudrun_executor.py\n```\n\n### Pywren runtimes\n\nThe default runtime will be automatically built for you when you first run Pywren. To run examples using Zarr you will need to build a custom conda runtime\n(since zarr installation via pip requires compilation of numcodecs). Note that this requires https://github.com/pywren/pywren-ibm-cloud to be checked out in the parent directory.\n\n```bash\nPROJECT_ID=...\nPYWREN_LOGLEVEL=DEBUG pywren-ibm-cloud runtime build -f ../pywren-ibm-cloud/runtime/cloudrun/Dockerfile.conda37 \"$PROJECT_ID/pywren-cloudrun-conda-v37:latest\"\n```\n\nYou can run this repeatedly to rebuild the runtime. You can create (or update) the Cloud Run function that uses the runtime with:\n\n```bash\npywren-ibm-cloud runtime create \"$PROJECT_ID/pywren-cloudrun-conda-v37:latest\"\n```\n\nThe full docs on runtimes are here: https://github.com/pywren/pywren-ibm-cloud/tree/master/runtime\n\n### Example: Rechunking Zarr files\n\nRechunking Zarr files is a common but surprisingly difficult problem to get right. 
[This thread](https://discourse.pangeo.io/t/best-practices-to-go-from-1000s-of-netcdf-files-to-analyses-on-a-hpc-cluster/588) has an excellent discussion of the problem, and lots of suggested approaches and solutions.\n\nThe [rechunker](https://github.com/pangeo-data/rechunker) library is a general-purpose solution, and one that is well suited to Pywren, since the Dask graph is small and the IO can be offloaded to the cloud without starting a dedicated Dask cluster.\n\nThe examples directory has a few examples of running rechunker on Zarr files using Pywren.\n\nTo run them you will need to create a conda runtime as explained in the previous section, and you will need to create a GCS bucket for the Zarr files.\n\nRun using local files and local compute (local Dask and Pywren):\n\n```bash\npython examples/rechunk_local_storage_local_compute.py delete\npython examples/rechunk_local_storage_local_compute.py create\npython examples/rechunk_local_storage_local_compute.py rechunk\n```\n\nRun using Cloud storage and compute:\n\n```bash\nPROJECT_ID=...\nBUCKET=...\npython examples/rechunk_cloud_storage_cloud_compute.py delete $PROJECT_ID $BUCKET\npython examples/rechunk_cloud_storage_cloud_compute.py create $PROJECT_ID $BUCKET\npython examples/rechunk_cloud_storage_cloud_compute.py rechunk $PROJECT_ID $BUCKET\n```\n\nYou can inspect the files in the bucket using regular CLI tools or the cloud console.\n\nDelete the files from the bucket after you have finished:\n\n```bash\npython examples/rechunk_cloud_storage_cloud_compute.py delete $PROJECT_ID $BUCKET\n```\n\n### Related projects\n\nThe idea for this came from the work I did in [Zappy](https://github.com/lasersonlab/zappy) to run NumPy processing on 
Pywren.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftomwhite%2Fdask-executor-scheduler","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftomwhite%2Fdask-executor-scheduler","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftomwhite%2Fdask-executor-scheduler/lists"}