{"id":19066151,"url":"https://github.com/epfml/relaysgd","last_synced_at":"2025-04-28T12:28:55.431Z","repository":{"id":69549054,"uuid":"400097303","full_name":"epfml/relaysgd","owner":"epfml","description":"Code for the paper “RelaySum for Decentralized Deep Learning on Heterogeneous Data”","archived":false,"fork":false,"pushed_at":"2023-04-21T07:54:29.000Z","size":790,"stargazers_count":9,"open_issues_count":0,"forks_count":2,"subscribers_count":7,"default_branch":"main","last_synced_at":"2025-04-18T16:16:32.728Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/epfml.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2021-08-26T08:31:42.000Z","updated_at":"2024-11-12T04:59:18.000Z","dependencies_parsed_at":"2023-05-15T22:30:12.932Z","dependency_job_id":null,"html_url":"https://github.com/epfml/relaysgd","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/epfml%2Frelaysgd","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/epfml%2Frelaysgd/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/epfml%2Frelaysgd/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/epfml%2Frelaysgd/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/epfml","download_url":"https://codeload.github.com/epfml/relaysgd/tar.gz/re
fs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":251313101,"owners_count":21569359,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-09T00:54:45.780Z","updated_at":"2025-04-28T12:28:55.424Z","avatar_url":"https://github.com/epfml.png","language":"Jupyter Notebook","readme":"# RelaySGD\n\nRelaySum for Decentralized Deep Learning on Heterogeneous Data\n\nAbstract: In decentralized machine learning, workers compute model updates on their local data. Because the workers only communicate with a few neighbors without central coordination, these updates propagate progressively over the network.\nThis paradigm enables distributed training on networks without all-to-all connectivity, helping to protect data privacy as well as to reduce the communication cost of distributed training in data centers.\nA key challenge, primarily in decentralized deep learning, remains the handling of differences between the workers' local data distributions.\nTo tackle this challenge, we introduce the RelaySum mechanism for information propagation in decentralized learning.\nRelaySum uses spanning trees to distribute information exactly uniformly across all workers with finite delays depending on the distance between nodes.\nIn contrast, the typical gossip averaging mechanism only distributes data uniformly asymptotically while using the same communication volume per step as RelaySum.\nWe prove that RelaySGD, based on this mechanism, is independent of data heterogeneity and scales to many workers, enabling highly accurate decentralized deep learning on heterogeneous data.\n\n- 
[Paper](https://papers.nips.cc/paper/2021/file/ebbdfea212e3a756a1fded7b35578525-Paper.pdf)\n- [Slides](https://thijs.link/relaysgd-slides/index.html#0)\n\n## Assumed environment\n\n- Python 3.7 from Anaconda 2021.05 (with numpy, pandas, matplotlib, seaborn)\n- PyTorch 1.8.1\n- NetworkX 2.4\n\nWe ran deep learning experiments with PyTorch Distributed using MPI. [environment/files/setup.sh](environment/files/setup.sh) describes our runtime environment and the code we used to compile PyTorch with MPI support. For 16-worker CIFAR-10 experiments, we used 4 Nvidia K80 GPUs on Google Cloud, each with 4 worker processes.\n\n## Code organization\n\n- The entrypoint for our deep learning code is [train.py](train.py).\n- You need to start multiple instances of [train.py](train.py) manually, e.g. through MPI, Slurm, or a simple script such as [dispatch.py](dispatch.py).\n- It can run different experiments based on its global `config` variable.\n- All the `config`s used in our experiments are listed in the scheduling code under [experiments](experiments).\n- The __RelaySGD__ algorithm is implemented starting from line 225 in [algorithms.py](algorithms.py).\n- The __RelaySum__ communication mechanism is in [utils/communication.py](utils/communication.py), starting from line 85.\n- Hyperparameters (`config` overrides) used in our experiments can be found in the [experiments/](experiments) directory.\n\n## Academic version of the algorithm\n\nThe real implementation of RelaySGD is at line 225 in [algorithms.py](algorithms.py). 
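As a hedged, self-contained sketch of that per-worker view: the `Worker` class, the inbox-based message delivery, and the scalar quadratic objectives below are our own illustrative constructions, not the repository's API. Each worker exchanges only a `(message, count)` pair with its spanning-tree neighbors and then applies the RelaySGD averaging step:

```python
class Worker:
    """One node of the spanning tree; it only talks to its tree neighbors."""

    def __init__(self, rank, state, grad_fn, num_workers):
        self.rank = rank
        self.state = state          # local model (a scalar, for simplicity)
        self.grad_fn = grad_fn      # gradient of this worker's local objective
        self.num_workers = num_workers
        self.neighbors = []         # tree neighbors, wired up by the caller
        self.inbox = {}             # neighbor rank -> (message, count)

    def local_update(self, lr):
        self.state -= lr * self.grad_fn(self.state)

    def outgoing(self, dst):
        # Relay own state plus everything received from the *other* neighbors;
        # information is never sent back in the direction it came from.
        others = [n for n in self.neighbors if n.rank != dst.rank]
        msg = self.state + sum(self.inbox.get(n.rank, (0.0, 0))[0] for n in others)
        cnt = 1 + sum(self.inbox.get(n.rank, (0.0, 0))[1] for n in others)
        return msg, cnt

    def average(self):
        # RelaySGD averaging: each message stands in for `cnt` remote workers.
        cnt = sum(c for _, c in self.inbox.values())
        msg = sum(m for m, _ in self.inbox.values())
        self.state = (self.state * (self.num_workers - cnt) + msg) / self.num_workers


def relaysgd_chain(targets, lr=0.1, num_steps=400):
    """RelaySGD on a chain, with local objectives f_i(x) = (x - t_i)^2 / 2."""
    n = len(targets)
    workers = [Worker(i, 0.0, lambda x, t=t: x - t, n) for i, t in enumerate(targets)]
    for i, w in enumerate(workers):  # chain topology: worker i <-> worker i+1
        w.neighbors = [workers[j] for j in (i - 1, i + 1) if 0 <= j < n]
    for _ in range(num_steps):
        for w in workers:
            w.local_update(lr)
        # Build all messages from last round's inboxes, then deliver them.
        mail = {(w.rank, d.rank): w.outgoing(d) for w in workers for d in w.neighbors}
        for w in workers:
            w.inbox = {nb.rank: mail[nb.rank, w.rank] for nb in w.neighbors}
        for w in workers:
            w.average()
    return [w.state for w in workers]
```

On a chain of three workers with heterogeneous targets 0, 1, and 2, all states approach 1.0, the optimum of the average objective: because RelaySum distributes information exactly uniformly, the limit does not depend on the data heterogeneity.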
\nIn the real version, the code represents a single worker, and communication is explicit.\nBelow, we include an ‘academic’ version of the algorithm that simulates all workers in a single process.\n\n```python\nfrom collections import defaultdict\nfrom typing import Mapping, NamedTuple\nimport torch\n\ndef relaysgd(task, world, learning_rate, num_steps):\n    state: torch.Tensor = task.init_state()  # shape [num_workers, ...]\n\n    class Edge(NamedTuple):\n        src: int\n        dst: int\n\n    # Initialize worker's memory, one entry for each edge in the network\n    messages: Mapping[Edge, float] = defaultdict(float)  # default value 0.0\n    counts: Mapping[Edge, int] = defaultdict(int)\n\n    for step in range(num_steps):\n        # Execute a model update on each worker\n        g: torch.Tensor = task.grad(state)  # shape [num_workers, ...]\n        state = state - learning_rate * g\n\n        # Send messages\n        new_messages = defaultdict(float)\n        new_counts = defaultdict(int)\n\n        for worker in world.workers:\n            neighbors = world.neighbors(worker)\n            for neighbor in neighbors:\n                new_messages[worker, neighbor] = (\n                    state[worker] +\n                    sum(messages[n, worker] for n in neighbors if n != neighbor)\n                )\n                new_counts[worker, neighbor] = (\n                    1 + \n                    sum(counts[n, worker] for n in neighbors if n != neighbor)\n                )\n\n        messages = new_messages\n        counts = new_counts\n\n        # Apply RelaySGD averaging\n        for worker in world.workers:\n            neighbors = world.neighbors(worker)\n            num_messages = sum(counts[n, worker] for n in neighbors)\n            state[worker] = (\n                state[worker] * (world.num_workers - num_messages) \n                + sum(messages[n, worker] for n in neighbors)\n            ) / world.num_workers\n```\n\n## Paper figures\n\nThe 
directory [paper-figures](paper-figures) contains the scripts used to generate all tables and figures in the paper and appendix.\n\nA few files that might be of interest:\n- [paper-figures/algorithms.py](paper-figures/algorithms.py) contains simulation code for decentralized learning on a single node. It has implementations of many algorithms (RelaySGD, RelaySGD/Grad, D2, DPSGD, Gradient tracking).\n- [paper-figures/random_quadratics.py](paper-figures/random_quadratics.py) implements the synthetic functions we use to test the algorithms (Appendix B.4).\n- [paper-figures/tuning.py](paper-figures/tuning.py) contains the logic we use to automatically tune learning rates for experiments with random quadratics.\n\n## Reference\n\nIf you use this code, please cite the following paper:\n\n```\n@inproceedings{vogels2021relaysum,\n  title={Relaysum for decentralized deep learning on heterogeneous data},\n  author={Vogels, Thijs and He, Lie and Koloskova, Anastasia and Karimireddy, Sai Praneeth and Lin, Tao and Stich, Sebastian U and Jaggi, Martin},\n  booktitle={Thirty-Fifth Conference on Neural Information Processing Systems},\n  year={2021}\n}\n```\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fepfml%2Frelaysgd","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fepfml%2Frelaysgd","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fepfml%2Frelaysgd/lists"}