{"id":13605419,"url":"https://github.com/MachineLearningSystem/CAGNET","last_synced_at":"2025-04-12T05:33:00.710Z","repository":{"id":185461691,"uuid":"515138356","full_name":"MachineLearningSystem/CAGNET","owner":"MachineLearningSystem","description":null,"archived":false,"fork":true,"pushed_at":"2022-07-14T21:15:34.000Z","size":20761,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"master","last_synced_at":"2024-11-07T10:41:27.107Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":null,"has_issues":false,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":"PASSIONLab/CAGNET","license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/MachineLearningSystem.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2022-07-18T10:32:30.000Z","updated_at":"2022-06-19T04:25:06.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/MachineLearningSystem/CAGNET","commit_stats":null,"previous_names":["machinelearningsystem/cagnet"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MachineLearningSystem%2FCAGNET","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MachineLearningSystem%2FCAGNET/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MachineLearningSystem%2FCAGNET/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MachineLearningSystem%2FCAGNET/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/MachineLearningSystem","download_url":"https://codeload.github.com/MachineLearningSystem/CAGNET/tar
.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248524254,"owners_count":21118609,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-01T19:00:58.519Z","updated_at":"2025-04-12T05:33:00.416Z","avatar_url":"https://github.com/MachineLearningSystem.png","language":null,"readme":"# CAGNET: Communication-Avoiding Graph Neural nETworks\n\n## Description\n\nCAGNET is a family of parallel algorithms for training GNNs that can asymptotically reduce communication compared to previous parallel GNN training methods. CAGNET algorithms are based on 1D, 1.5D, 2D, and 3D sparse-dense matrix multiplication, and are implemented with `torch.distributed` on GPU-equipped clusters. 
We also implement these parallel algorithms on a 2-layer GCN.\n\n\nFor more information, please read our ACM/IEEE SC'20 paper [Reducing Communication in Graph Neural Network Training](https://arxiv.org/pdf/2005.03300.pdf).\n\n**Contact:** Alok Tripathy (\u003calokt@berkeley.edu\u003e)\n\n## Dependencies\n- Python 3.6.10\n- PyTorch 1.3.1\n- PyTorch Geometric (PyG) 1.3.2\n- CUDA 10.1\n- GCC 6.4.0\n\nOn OLCF Summit, all of these dependencies can be accessed with the following\n```bash\nmodule load cuda # CUDA 10.1\nmodule load gcc # GCC 6.4.0\nmodule load ibm-wml-ce/1.7.0-3 # PyTorch 1.3.1, Python 3.6.10\n\n# PyG and its dependencies\nconda create --name gnn --clone ibm-wml-ce-1.7.0-3\nconda activate gnn\npip install --no-cache-dir torch-scatter==1.4.0\npip install --no-cache-dir torch-sparse==0.4.3\npip install --no-cache-dir torch-cluster==1.4.5\npip install --no-cache-dir torch-geometric==1.3.2\n```\n\n## Compiling\n\nThis code uses C++ extensions. To compile these, run\n\n```bash\ncd sparse-extension\npython setup.py install\n```\n\n## Documentation\n\nEach algorithm in CAGNET is implemented in a separate file.\n- `gcn_distr.py` : 1D algorithm\n- `gcn_distr_15d.py` : 1.5D algorithm\n- `gcn_distr_2d.py` : 2D algorithm\n- `gcn_distr_3d.py` : 3D algorithm\n\nEach file also has the following flags:\n\n- `--accperrank \u003cint\u003e` : Number of GPUs on each node\n- `--epochs \u003cint\u003e`  : Number of epochs to run training\n- `--graphname \u003cReddit/Amazon/subgraph3\u003e` : Graph dataset to run training on\n- `--timing \u003cTrue/False\u003e` : Enable timing barriers to time phases in training\n- `--midlayer \u003cint\u003e` : Number of activations in the hidden layer\n- `--runcount \u003cint\u003e` : Number of times to run training\n- `--normalization \u003cTrue/False\u003e` : Normalize adjacency matrix in preprocessing\n- `--activations \u003cTrue/False\u003e` : Enable activation functions between layers\n- `--accuracy \u003cTrue/False\u003e` : Compute and 
print accuracy metrics (Reddit only)\n- `--replication \u003cint\u003e` : Replication factor (1.5D algorithm only)\n- `--download \u003cTrue/False\u003e` : Download the Reddit dataset\n\nSome of these flags do not currently exist for the 3D algorithm.\n\nAmazon/Protein datasets must exist as COO files in `../data/\u003cgraphname\u003e/processed/`, compressed with pickle. \nFor Reddit, PyG handles downloading and accessing the dataset (see below).\n\n## Running on OLCF Summit (example)\n\nTo run the CAGNET 1.5D algorithm on Reddit with\n- 16 processes\n- 100 epochs\n- 16 hidden layer activations\n- 2-factor replication\n\nrun the following command to download the Reddit dataset:\n\n`python gcn_distr_15d.py --graphname=Reddit --download=True`\n\nThis will download Reddit into `../data`. After downloading the Reddit dataset, run the following command to run training\n\n`ddlrun -x WORLD_SIZE=16 -x MASTER_ADDR=$(echo $LSB_MCPU_HOSTS | cut -d \" \" -f 3) -x MASTER_PORT=1234 -accelerators 6 python gcn_distr_15d.py --accperrank=6 --epochs=100 --graphname=Reddit --timing=False --midlayer=16 --runcount=1 --replication=2`\n\n## Citation\n\nTo cite CAGNET, please refer to:\n\n\u003e Alok Tripathy, Katherine Yelick, Aydın Buluç. Reducing Communication in Graph Neural Network Training. Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis (SC’20), 2020.\n","funding_links":[],"categories":["Paper-Code"],"sub_categories":["GNN"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FMachineLearningSystem%2FCAGNET","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FMachineLearningSystem%2FCAGNET","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FMachineLearningSystem%2FCAGNET/lists"}