{"id":13738378,"url":"https://github.com/saturncloud/dask-pytorch-ddp","last_synced_at":"2025-12-14T07:48:25.661Z","repository":{"id":51795511,"uuid":"295251516","full_name":"saturncloud/dask-pytorch-ddp","owner":"saturncloud","description":"dask-pytorch-ddp is a Python package that makes it easy to train PyTorch models on dask clusters using distributed data parallel. ","archived":false,"fork":false,"pushed_at":"2021-04-05T21:38:30.000Z","size":66,"stargazers_count":59,"open_issues_count":5,"forks_count":9,"subscribers_count":4,"default_branch":"main","last_synced_at":"2025-09-25T15:11:14.046Z","etag":null,"topics":["computer-vision","dask","deep-learning","distributed-computing","machine-learning","nlp","pytorch"],"latest_commit_sha":null,"homepage":"https://saturncloud.io/docs/examples/python/pytorch/qs-03-pytorch-gpu-dask-single-model/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"bsd-3-clause","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/saturncloud.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2020-09-13T22:49:53.000Z","updated_at":"2025-01-19T15:26:14.000Z","dependencies_parsed_at":"2022-08-20T03:11:48.846Z","dependency_job_id":null,"html_url":"https://github.com/saturncloud/dask-pytorch-ddp","commit_stats":null,"previous_names":[],"tags_count":6,"template":false,"template_full_name":null,"purl":"pkg:github/saturncloud/dask-pytorch-ddp","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/saturncloud%2Fdask-pytorch-ddp","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/saturncloud%2Fdask-pytorch-ddp/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/satu
rncloud%2Fdask-pytorch-ddp/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/saturncloud%2Fdask-pytorch-ddp/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/saturncloud","download_url":"https://codeload.github.com/saturncloud/dask-pytorch-ddp/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/saturncloud%2Fdask-pytorch-ddp/sbom","scorecard":{"id":801595,"data":{"date":"2025-08-11","repo":{"name":"github.com/saturncloud/dask-pytorch-ddp","commit":"1dac8c60e3574e99d2b2c79d403f3e8d1f1984fc"},"scorecard":{"version":"v5.2.1-40-gf6ed084d","commit":"f6ed084d17c9236477efd66e5b258b9d4cc7b389"},"score":4.4,"checks":[{"name":"Binary-Artifacts","score":10,"reason":"no binaries found in the repo","details":null,"documentation":{"short":"Determines if the project has generated executable (binary) artifacts in the source repository.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#binary-artifacts"}},{"name":"Code-Review","score":6,"reason":"Found 6/9 approved changesets -- score normalized to 6","details":null,"documentation":{"short":"Determines if the project requires human code review before pull requests (aka merge requests) are merged.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#code-review"}},{"name":"Maintained","score":0,"reason":"0 commit(s) and 0 issue activity found in the last 90 days -- score normalized to 0","details":null,"documentation":{"short":"Determines if the project is \"actively maintained\".","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#maintained"}},{"name":"Dangerous-Workflow","score":10,"reason":"no dangerous workflow patterns detected","details":null,"documentation":{"short":"Determines if the project's GitHub Action workflows avoid dangerous 
patterns.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#dangerous-workflow"}},{"name":"CII-Best-Practices","score":0,"reason":"no effort to earn an OpenSSF best practices badge detected","details":null,"documentation":{"short":"Determines if the project has an OpenSSF (formerly CII) Best Practices Badge.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#cii-best-practices"}},{"name":"Fuzzing","score":0,"reason":"project is not fuzzed","details":["Warn: no fuzzer integrations found"],"documentation":{"short":"Determines if the project uses fuzzing.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#fuzzing"}},{"name":"Token-Permissions","score":0,"reason":"detected GitHub workflow tokens with excessive permissions","details":["Warn: no topLevel permission defined: .github/workflows/main.yml:1","Warn: no topLevel permission defined: .github/workflows/publish-to-pypi.yml:1","Info: no jobLevel write permissions found"],"documentation":{"short":"Determines if the project's workflows follow the principle of least privilege.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#token-permissions"}},{"name":"Pinned-Dependencies","score":0,"reason":"dependency not pinned by hash detected -- score normalized to 0","details":["Warn: GitHub-owned GitHubAction not pinned by hash: .github/workflows/main.yml:30: update your workflow using https://app.stepsecurity.io/secureworkflow/saturncloud/dask-pytorch-ddp/main.yml/main?enable=pin","Warn: third-party GitHubAction not pinned by hash: .github/workflows/main.yml:33: update your workflow using https://app.stepsecurity.io/secureworkflow/saturncloud/dask-pytorch-ddp/main.yml/main?enable=pin","Warn: GitHub-owned GitHubAction not pinned by hash: .github/workflows/publish-to-pypi.yml:11: update your workflow using 
https://app.stepsecurity.io/secureworkflow/saturncloud/dask-pytorch-ddp/publish-to-pypi.yml/main?enable=pin","Warn: GitHub-owned GitHubAction not pinned by hash: .github/workflows/publish-to-pypi.yml:13: update your workflow using https://app.stepsecurity.io/secureworkflow/saturncloud/dask-pytorch-ddp/publish-to-pypi.yml/main?enable=pin","Warn: third-party GitHubAction not pinned by hash: .github/workflows/publish-to-pypi.yml:31: update your workflow using https://app.stepsecurity.io/secureworkflow/saturncloud/dask-pytorch-ddp/publish-to-pypi.yml/main?enable=pin","Warn: pipCommand not pinned by hash: .github/workflows/main.yml:40","Warn: pipCommand not pinned by hash: .github/workflows/main.yml:46","Warn: pipCommand not pinned by hash: .github/workflows/main.yml:53","Warn: pipCommand not pinned by hash: .github/workflows/publish-to-pypi.yml:18","Info:   0 out of   3 GitHub-owned GitHubAction dependencies pinned","Info:   0 out of   2 third-party GitHubAction dependencies pinned","Info:   0 out of   4 pipCommand dependencies pinned"],"documentation":{"short":"Determines if the project has declared and pinned the dependencies of its build process.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#pinned-dependencies"}},{"name":"Vulnerabilities","score":10,"reason":"0 existing vulnerabilities detected","details":null,"documentation":{"short":"Determines if the project has open, known unfixed vulnerabilities.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#vulnerabilities"}},{"name":"License","score":10,"reason":"license file detected","details":["Info: project has a license file: LICENSE:0","Info: FSF or OSI recognized license: BSD 3-Clause \"New\" or \"Revised\" License: LICENSE:0"],"documentation":{"short":"Determines if the project has defined a 
license.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#license"}},{"name":"Signed-Releases","score":-1,"reason":"no releases found","details":null,"documentation":{"short":"Determines if the project cryptographically signs release artifacts.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#signed-releases"}},{"name":"Packaging","score":-1,"reason":"packaging workflow not detected","details":["Warn: no GitHub/GitLab publishing workflow detected."],"documentation":{"short":"Determines if the project is published as a package that others can easily download, install, easily update, and uninstall.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#packaging"}},{"name":"Security-Policy","score":0,"reason":"security policy file not detected","details":["Warn: no security policy file detected","Warn: no security file to analyze","Warn: no security file to analyze","Warn: no security file to analyze"],"documentation":{"short":"Determines if the project has published a security policy.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#security-policy"}},{"name":"Branch-Protection","score":-1,"reason":"internal error: error during branchesHandler.setup: internal error: githubv4.Query: Resource not accessible by integration","details":null,"documentation":{"short":"Determines if the default and release branches are protected with GitHub's branch protection settings.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#branch-protection"}},{"name":"SAST","score":0,"reason":"SAST tool is not run on all commits -- score normalized to 0","details":["Warn: 0 commits out of 28 are checked with a SAST tool"],"documentation":{"short":"Determines if the project uses static code 
analysis.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#sast"}}]},"last_synced_at":"2025-08-23T10:38:18.255Z","repository_id":51795511,"created_at":"2025-08-23T10:38:18.255Z","updated_at":"2025-08-23T10:38:18.255Z"},"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":27722163,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-12-14T02:00:11.348Z","response_time":56,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["computer-vision","dask","deep-learning","distributed-computing","machine-learning","nlp","pytorch"],"created_at":"2024-08-03T03:02:20.575Z","updated_at":"2025-12-14T07:48:25.600Z","avatar_url":"https://github.com/saturncloud.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":"# dask-pytorch-ddp\n\n\u003c!-- ![GitHub Actions](https://github.com/saturncloud/dask-pytorch-ddp/workflows/GitHub%20Actions/badge.svg) [![PyPI Version](https://img.shields.io/pypi/v/prefect-saturn.svg)](https://pypi.org/project/prefect-saturn) --\u003e\n\n`dask-pytorch-ddp` is a Python package that makes it easy to train PyTorch models on Dask clusters using distributed data parallel.  
The intended scope of the project is:\n- bootstrapping PyTorch workers on top of a Dask cluster\n- using distributed data stores (e.g., S3) as normal PyTorch datasets\n- providing mechanisms for tracking and logging intermediate results, training statistics, and checkpoints.\n\nAt this point, the library and the examples provided are tailored to computer vision tasks, but the library is intended to be useful for any sort of PyTorch task. The only thing really specific to image processing is the `S3ImageFolder` dataset class. Implementing a PyTorch dataset (assuming map-style random access) for data other than images currently requires implementing `__getitem__(self, idx: int)` and `__len__(self)`. We plan to add more varied examples for other use cases in the future, and welcome PRs extending functionality.\n\n## Typical non-dask workflow\n\nA typical example of non-dask PyTorch usage is as follows:\n\n### Loading Data\nCreate a dataset (`ImageFolder`), and wrap it in a `DataLoader`.\n\n```python\nimport numpy as np\nfrom torch.utils.data import DataLoader, SubsetRandomSampler\nfrom torchvision import transforms\nfrom torchvision.datasets import ImageFolder\n\ntransform = transforms.Compose([\n    transforms.Resize(256),\n    transforms.CenterCrop(250),\n    transforms.ToTensor()\n])\n\n# `path` (the image directory) and `num` (the split size) must be defined beforehand\nwhole_dataset = ImageFolder(path, transform=transform)\n\nbatch_size = 100\nnum_workers = 64\nindices = list(range(len(whole_dataset)))\nnp.random.shuffle(indices)\ntrain_idx = indices[:num]\ntest_idx = indices[num:num+num]\n\ntrain_sampler = SubsetRandomSampler(train_idx)\ntrain_loader = DataLoader(whole_dataset, sampler=train_sampler, batch_size=batch_size, num_workers=num_workers)\n```\n\n### Training a Model\nLoop over the dataset, and train the model by stepping the optimizer.\n\n```python\nimport torch\nfrom torch import nn, optim\nfrom torchvision import models\n\ndevice = torch.device(0)\nnet = models.resnet18(pretrained=False)\nmodel = net.to(device)\ndevice_ids = [0]\n\ncriterion = nn.CrossEntropyLoss().cuda()\nlr = 0.001\noptimizer = optim.SGD(model.parameters(), lr=lr, momentum=0.9)\ncount = 0\nfor epoch in range(n_epochs):\n    model.train()  # Set model to training mode\n    for inputs, labels in train_loader:\n        inputs = 
inputs.to(device)\n        labels = labels.to(device)\n        outputs = model(inputs)\n        _, preds = torch.max(outputs, 1)\n        loss = criterion(outputs, labels)\n\n        # zero the parameter gradients\n        optimizer.zero_grad()\n        loss.backward()\n        optimizer.step()\n        count += 1\n```\n\n## Now on Dask\n\nWith `dask_pytorch_ddp` and PyTorch Distributed Data Parallel, we can train on multiple workers as follows:\n\n### Loading Data\nLoad the dataset from S3, and explicitly set the multiprocessing context (Dask defaults to spawn, but PyTorch is generally configured to use fork).\n\n```python\nimport multiprocessing as mp\n\nimport torch\nfrom dask_pytorch_ddp.data import S3ImageFolder\n\nwhole_dataset = S3ImageFolder(bucket, prefix, transform=transform)\ntrain_loader = torch.utils.data.DataLoader(\n    whole_dataset,\n    sampler=train_sampler,\n    batch_size=batch_size,\n    num_workers=num_workers,\n    multiprocessing_context=mp.get_context('fork')\n)\n```\n\n### Training in Parallel\n\nWrap the training loop in a function and add metrics logging (not necessary, but very useful).  
Convert the model into a PyTorch Distributed Data Parallel (`DDP`) model, which knows how to sync gradients across workers.\n\n```python\nimport datetime\nimport uuid\nimport pickle\nimport logging\nimport json\n\nimport torch.distributed as dist\nfrom torch.nn.parallel import DistributedDataParallel as DDP\nfrom dask_pytorch_ddp.results import DaskResultsHandler\n\n\nkey = uuid.uuid4().hex\nrh = DaskResultsHandler(key)\n\ndef run_transfer_learning(bucket, prefix, samplesize, n_epochs, batch_size, num_workers, train_sampler):\n    worker_rank = int(dist.get_rank())\n    device = torch.device(0)\n    net = models.resnet18(pretrained=False)\n    model = net.to(device)\n    model = DDP(model, device_ids=[0])\n\n    criterion = nn.CrossEntropyLoss().cuda()\n    lr = 0.001\n    optimizer = optim.SGD(model.parameters(), lr=lr, momentum=0.9)\n    whole_dataset = S3ImageFolder(bucket, prefix, transform=transform)\n\n    train_loader = torch.utils.data.DataLoader(\n        whole_dataset,\n        sampler=train_sampler,\n        batch_size=batch_size,\n        num_workers=num_workers,\n        multiprocessing_context=mp.get_context('fork')\n    )\n\n    count = 0\n    for epoch in range(n_epochs):\n        # This example runs only a training phase in each epoch\n        model.train()  # Set model to training mode\n        for inputs, labels in train_loader:\n            dt = datetime.datetime.now().isoformat()\n            inputs = inputs.to(device)\n            labels = labels.to(device)\n            outputs = model(inputs)\n            _, preds = torch.max(outputs, 1)\n            loss = criterion(outputs, labels)\n\n            # zero the parameter gradients\n            optimizer.zero_grad()\n            loss.backward()\n            optimizer.step()\n            count += 1\n\n            # statistics\n            rh.submit_result(\n                f\"worker/{worker_rank}/data-{dt}.json\",\n                json.dumps({'loss': loss.item(), 'epoch': epoch, 'count': count, 'worker': worker_rank})\n            )\n            # every 100 steps, worker 0 also submits a checkpoint of the model weights\n            if (count % 100) == 0 and worker_rank == 0:\n                rh.submit_result(f\"checkpoint-{dt}.pkl\", 
pickle.dumps(model.state_dict()))\n\n```\n\n## How does it work?\n\n`dask-pytorch-ddp` is largely a wrapper around existing PyTorch functionality.  `torch.distributed` provides infrastructure for [Distributed Data Parallel](https://pytorch.org/tutorials/intermediate/ddp_tutorial.html) (DDP).\n\nIn DDP, you create N workers, and the 0th worker is the \"master\" that coordinates the synchronization of buffers and gradients.  In SGD, gradients are normally averaged over all data points in a batch.  By running batches on multiple workers and averaging the gradients, DDP enables you to run SGD with a much bigger effective batch size (`N * batch_size`).\n\n`dask-pytorch-ddp` sets some environment variables to configure the \"master\" host and port, then calls `init_process_group` before training and `destroy_process_group` after training.  This is the same process normally done manually by the data scientist.\n\n### Multi-GPU machines\n`dask_cuda_worker` automatically rotates `CUDA_VISIBLE_DEVICES` for each worker it creates (typically one per GPU).  As a result, your PyTorch code should always start with the 0th GPU.\n\nFor example, if I have an 8-GPU machine, the 3rd worker will have `CUDA_VISIBLE_DEVICES` set to `2,3,4,5,6,7,0,1`.  On that worker, if I call `torch.device(0)`, I will get GPU 2.\n\n## What else?\n\n`dask-pytorch-ddp` also implements an S3-based `ImageFolder`.  More distributed-friendly datasets are planned.  `dask-pytorch-ddp` also implements a basic results aggregation framework so that it is easy to collect training metrics across different workers.  Currently, only `DaskResultsHandler`, which leverages [Dask pub-sub communication protocols][1], is implemented, but an S3-based result handler is planned.\n\n[1]:https://docs.dask.org/en/latest/futures.html#publish-subscribe\n\n## Some Notes\n\nDask generally spawns processes.  PyTorch generally forks.  
When using a multiprocessing-enabled data loader, it is a good idea to pass the `fork` multiprocessing context to force the data loader to fork its worker processes.\n\nSome Dask deployments do not permit spawning processes.  To override this, you can change the [distributed.worker.daemon](https://docs.dask.org/en/latest/configuration-reference.html#distributed.worker.daemon) setting.\n\nEnvironment variables are a convenient way to do this:\n\n```\nDASK_DISTRIBUTED__WORKER__DAEMON=False\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsaturncloud%2Fdask-pytorch-ddp","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsaturncloud%2Fdask-pytorch-ddp","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsaturncloud%2Fdask-pytorch-ddp/lists"}