{"id":19091924,"url":"https://github.com/allencellmodeling/aics_dask_utils","last_synced_at":"2025-08-06T16:13:33.813Z","repository":{"id":57408655,"uuid":"258364603","full_name":"AllenCellModeling/aics_dask_utils","owner":"AllenCellModeling","description":"Utility functions commonly used by AICS projects for interacting with Dask","archived":false,"fork":false,"pushed_at":"2023-08-10T17:20:20.000Z","size":8179,"stargazers_count":1,"open_issues_count":2,"forks_count":0,"subscribers_count":3,"default_branch":"master","last_synced_at":"2025-02-03T05:15:43.223Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"https://allencellmodeling.github.io/aics_dask_utils","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/AllenCellModeling.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-04-24T00:31:07.000Z","updated_at":"2023-08-01T19:27:15.000Z","dependencies_parsed_at":"2025-01-02T23:13:35.508Z","dependency_job_id":"0c7877df-ec61-43f0-b0ee-3bf1b7952c6b","html_url":"https://github.com/AllenCellModeling/aics_dask_utils","commit_stats":{"total_commits":6,"total_committers":1,"mean_commits":6.0,"dds":0.0,"last_synced_commit":"5bef90bd6d6f00bf5a5927fe1fce1cf4ca8d793c"},"previous_names":[],"tags_count":6,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AllenCellModeling%2Faics_dask_utils","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AllenCellModeling%2Faics_dask_utils/tags","releases_url":
"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AllenCellModeling%2Faics_dask_utils/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AllenCellModeling%2Faics_dask_utils/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/AllenCellModeling","download_url":"https://codeload.github.com/AllenCellModeling/aics_dask_utils/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":240139489,"owners_count":19754123,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-09T03:17:34.240Z","updated_at":"2025-02-22T07:27:38.774Z","avatar_url":"https://github.com/AllenCellModeling.png","language":"Python","readme":"# AICS Dask Utils\n\n[![Build Status](https://github.com/AllenCellModeling/aics_dask_utils/workflows/Build%20Master/badge.svg)](https://github.com/AllenCellModeling/aics_dask_utils/actions)\n[![Documentation](https://github.com/AllenCellModeling/aics_dask_utils/workflows/Documentation/badge.svg)](https://AllenCellModeling.github.io/aics_dask_utils)\n[![Code Coverage](https://codecov.io/gh/AllenCellModeling/aics_dask_utils/branch/master/graph/badge.svg)](https://codecov.io/gh/AllenCellModeling/aics_dask_utils)\n\nDocumentation related to Dask, Distributed, and related packages.\nUtility functions commonly used by AICS projects.\n\n---\n\n## Features\n* Distributed handler to manage various debugging or cluster configurations\n* Documentation on example cluster deployments\n\n## Basics\nBefore we jump into quick starts there are some 
basic definitions to understand.\n\n#### Task\nA task is a single, self-contained function call to be processed. Simple enough. However, what is relevant to\nAICS is that when using `aicsimageio` (and / or `dask.array.Array`), your image (or\n`dask.array.Array`) is split up into _many_ tasks. How many depends on the image reader\nand the size of the file you are reading, but in general it is safe to assume that each\nimage you read is split into many thousands of tasks. If you want to see how many tasks your\nimage is split into, you can either compute:\n\n1. Pseudo-code: `sum(2 * size(channel) for channel if channel not in [\"Y\", \"X\"])`\n2. Dask graph length: `len(AICSImage.dask_data.__dask_graph__())`\n\n#### Map\nApply a given function to each element of the provided iterables, which are used as the\nfunction's parameters. Given `lambda x: x + 1` and `[1, 2, 3]`, the result of `map(func, *iterables)` in this\ncase would be `[2, 3, 4]`. Usually, a `map` operation hands you back an iterable of `future`\nobjects. The results from the map operation are not\nguaranteed to come back in the order of the input iterable: operations are started as\nresources become available, and item-to-item variance may result in a different output\nordering.\n\n#### Future\nAn object that will become available but is currently not defined. There is no guarantee\nthat the resolved object is a valid result rather than an error, so you should handle errors once the\nfuture's state has resolved (usually this means after a `gather` operation).\n\n#### Gather\nBlock the process from moving forward until all futures are resolved. 
Control flow here\nwould mean that you could potentially generate thousands of futures and keep moving on\nlocally while those futures slowly resolve, but if you ever want a hard stop and wait for\nsome set of futures to complete, you would need to gather them.\n\n##### Other Comments\nDask tries to mirror the standard library `concurrent.futures` wherever possible, which\nis what allows this library to provide simple wrappers around Dask for easy\ndebugging: we are simply swapping out `distributed.Client.map` for\n`concurrent.futures.ThreadPoolExecutor.map`, for example. If at any point in your code\nyou don't want to use `dask` for some reason or another, it is equally valid to use\n`concurrent.futures.ThreadPoolExecutor` or `concurrent.futures.ProcessPoolExecutor`.\n\n### Basic Mapping with Distributed Handler\nIf you have an iterable (or iterables) that would result in fewer than hundreds of\nthousands of tasks, you can simply use the normal `map` provided by the\n`DistributedHandler.client`.\n\n**Important Note:** Notice, \"... iterable that would _result_ in fewer than hundreds\nof thousands of tasks...\". This is important because of what happens when you try to `map`\nover a thousand image paths, each of which spawns an `AICSImage` object: each one adds\nthousands more tasks to the scheduler to complete. 
This will break, and you should look\nto [Large Iterable Batching](#large-iterable-batching) instead.\n\n```python\nfrom aics_dask_utils import DistributedHandler\n\n# `None` address provided means use local machine threads\nwith DistributedHandler(None) as handler:\n    futures = handler.client.map(\n        lambda x: x + 1,\n        [1, 2, 3]\n    )\n\n    results = handler.gather(futures)\n\nfrom distributed import LocalCluster\ncluster = LocalCluster()\n\n# Actual address provided means use the dask scheduler\nwith DistributedHandler(cluster.scheduler_address) as handler:\n    futures = handler.client.map(\n        lambda x: x + 1,\n        [1, 2, 3]\n    )\n\n    results = handler.gather(futures)\n```\n\n### Large Iterable Batching\nIf you have an iterable (or iterables) that would result in more than hundreds of\nthousands of tasks, you should use `handler.batched_map` to reduce the load on the\nclient. This will batch your requests rather than send them all at once.\n\n```python\nfrom aics_dask_utils import DistributedHandler\n\n# `None` address provided means use local machine threads\nwith DistributedHandler(None) as handler:\n    results = handler.batched_map(\n        lambda x: x + 1,\n        range(int(1e9)) # 1 billion; `range` requires an int\n    )\n\nfrom distributed import LocalCluster\ncluster = LocalCluster()\n\n# Actual address provided means use the dask scheduler\nwith DistributedHandler(cluster.scheduler_address) as handler:\n    results = handler.batched_map(\n        lambda x: x + 1,\n        range(int(1e9)) # 1 billion; `range` requires an int\n    )\n```\n\n**Note:** Notice that there is no `handler.gather` call after `batched_map`. 
This is\nbecause `batched_map` gathers results at the end of each batch rather than simply\nreturning futures.\n\n## Installation\n**Stable Release:** `pip install aics_dask_utils`\u003cbr\u003e\n**Development Head:** `pip install git+https://github.com/AllenCellModeling/aics_dask_utils.git`\n\n## Documentation\nFor full package documentation please visit\n[AllenCellModeling.github.io/aics_dask_utils](https://AllenCellModeling.github.io/aics_dask_utils).\n\n## Development\nSee [CONTRIBUTING.md](CONTRIBUTING.md) for information related to developing the code.\n\n## Additional Comments\nThis README, provided tooling, and documentation are not meant to be all-encompassing\nof the various operations you can do with `dask` and other similar computing systems.\nFor further reading, go to [dask.org](https://dask.org/).\n\n**Free software: Allen Institute Software License**\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fallencellmodeling%2Faics_dask_utils","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fallencellmodeling%2Faics_dask_utils","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fallencellmodeling%2Faics_dask_utils/lists"}