{"id":17498138,"url":"https://github.com/alibaba/graphlearn-for-pytorch","last_synced_at":"2025-05-16T16:01:41.377Z","repository":{"id":148982388,"uuid":"620111179","full_name":"alibaba/graphlearn-for-pytorch","owner":"alibaba","description":"A GPU-accelerated graph learning library for PyTorch, facilitating the scaling of GNN training and inference.","archived":false,"fork":false,"pushed_at":"2025-04-10T11:23:58.000Z","size":1554,"stargazers_count":131,"open_issues_count":6,"forks_count":40,"subscribers_count":7,"default_branch":"main","last_synced_at":"2025-04-12T14:57:07.975Z","etag":null,"topics":["deep-learning","distributed","gpu","graph-neural-networks","pytorch"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/alibaba.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-03-28T03:38:06.000Z","updated_at":"2025-04-10T11:24:02.000Z","dependencies_parsed_at":null,"dependency_job_id":"afba7c5c-75b4-4807-ae59-b97f89c6927a","html_url":"https://github.com/alibaba/graphlearn-for-pytorch","commit_stats":{"total_commits":121,"total_committers":12,"mean_commits":"10.083333333333334","dds":0.6776859504132231,"last_synced_commit":"bebf64f92d942a24f9bd1d9f8cf27fd626155917"},"previous_names":[],"tags_count":6,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/alibaba%2Fgraphlearn-for-pytorch","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/alibaba%2Fgraphlearn-for-pytorch/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/alibaba%2Fgraphlearn-for-pytorch/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/alibaba%2Fgraphlearn-for-pytorch/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/alibaba","download_url":"https://codeload.github.com/alibaba/graphlearn-for-pytorch/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254100357,"owners_count":22014862,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["deep-learning","distributed","gpu","graph-neural-networks","pytorch"],"created_at":"2024-10-19T16:10:24.149Z","updated_at":"2025-05-16T16:01:41.313Z","avatar_url":"https://github.com/alibaba.png","language":"Python","readme":"[![GLT-pypi](https://img.shields.io/pypi/v/graphlearn-torch.svg)](https://pypi.org/project/graphlearn-torch/)\n[![docs](https://img.shields.io/badge/docs-latest-brightgreen.svg)](https://graphlearn-torch.readthedocs.io/en/latest/)\n[![GLT 
CI](https://github.com/alibaba/graphlearn-for-pytorch/workflows/GLT%20CI/badge.svg)](https://github.com/alibaba/graphlearn-for-pytorch/actions)\n[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://github.com/alibaba/graphlearn-for-pytorch/blob/main/LICENSE)\n\n**GraphLearn-for-PyTorch (GLT)** is a graph learning library for PyTorch that makes\ndistributed GNN training and inference easy and efficient. It leverages the\npower of GPUs to accelerate graph sampling and utilizes UVA to reduce the\nconversion and copying of vertex and edge features. For large-scale graphs,\nit supports distributed training on multiple GPUs or multiple machines through\nfast distributed sampling and feature lookup. Additionally, it provides flexible\ndeployment for distributed training to meet different requirements.\n\n\n- [Highlighted Features](#highlighted-features)\n- [Architecture Overview](#architecture-overview)\n- [Installation](#installation)\n  - [Requirements](#requirements)\n  - [Pip Wheels](#pip-wheels)\n  - [Build from source](#build-from-source)\n    - [Install Dependencies](#install-dependencies)\n    - [Python](#python)\n    - [C++](#c)\n- [Quick Tour](#quick-tour)\n  - [Accelerating PyG model training on a single GPU](#accelerating-pyg-model-training-on-a-single-gpu)\n  - [Distributed training](#distributed-training)\n- [License](#license)\n\n## Highlighted Features\n* **GPU acceleration**\n\n  GLT provides both CPU-based and GPU-based graph operators such\n  as neighbor sampling, negative sampling, and feature lookup. For GPU training,\n  GPU-based graph operations accelerate the computation and reduce data movement\n  by a considerable amount.\n\n* **Scalable and efficient distributed training**\n\n  For distributed training, we implement multi-processing asynchronous sampling,\n  pinned memory buffers, hot feature caching, and use fast networking\n  technologies (PyTorch RPC with RDMA support) to speed up distributed sampling\n  and reduce communication. As a result, GLT achieves high\n  scalability and supports graphs with billions of edges.\n\n* **Easy-to-use API**\n\n  Most of GLT's APIs are compatible with PyG/PyTorch,\n  so you only need to modify a few lines of PyG code to get the\n  acceleration for your program. GLT-specific APIs\n  are compatible with PyTorch, and complete documentation\n  and usage examples are available.\n\n* **Large-scale real-world GNN models**\n\n  We focus on real-world scenarios and provide distributed GNN training examples\n  on large-scale graphs. Since GLT is compatible with PyG,\n  you can use almost any PyG model as the base model. We will also continue to\n  provide models that have proven effective in industrial scenarios.\n\n* **Easy to extend**\n\n  GLT directly uses PyTorch C++ Tensors and is easy to extend just\n  like PyTorch. There are no extra restrictions for CPU-based or CUDA-based graph\n  operators, and adding a new one is straightforward. For distributed\n  operations, you can write a new one in Python using PyTorch RPC (see the sketch after this list).\n\n* **Flexible deployment**\n\n  The graph engine (graph operators) and the PyTorch engine (PyTorch nn modules) can be\n  deployed either co-located or separately on different machines. This\n  flexibility enables you to deploy GLT in your own environment\n  or embed it into your project easily.\n
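\nAs promised in the *Easy to extend* bullet, below is a minimal, self-contained sketch of writing a distributed operation in pure Python with PyTorch RPC. It is not GLT code: the worker name, the toy feature store, and `lookup_features` are made up for this illustration; GLT's own distributed operators are more elaborate.\n\n```python\n# Illustrative only: a hypothetical remote feature lookup built on PyTorch RPC.\nimport os\nimport torch\nimport torch.distributed.rpc as rpc\n\n# A toy 'feature store' living on this worker.\n_local_features = torch.arange(20, dtype=torch.float32).reshape(10, 2)\n\ndef lookup_features(vertex_ids):\n  # Runs on the callee worker and returns the requested feature rows.\n  return _local_features[vertex_ids]\n\nif __name__ == '__main__':\n  os.environ.setdefault('MASTER_ADDR', 'localhost')\n  os.environ.setdefault('MASTER_PORT', '29512')\n  # A single-process RPC world keeps the sketch runnable end to end.\n  rpc.init_rpc('worker0', rank=0, world_size=1)\n  feats = rpc.rpc_sync('worker0', lookup_features, args=(torch.tensor([0, 3, 7]),))\n  print(feats)\n  rpc.shutdown()\n```\n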
\n## Architecture Overview\n\u003cp align=\"center\"\u003e\n  \u003cimg width=\"60%\" src=docs/figures/arch.png /\u003e\n\u003c/p\u003e\n\n\nThe main goal of GLT is to leverage hardware resources like GPU/NVLink/RDMA and the\ncharacteristics of GNN models to accelerate end-to-end GNN training in both\nsingle-machine and distributed environments.\n\nIn the case of multi-GPU training, graph sampling and CPU-GPU data transfer\ncan easily become the major performance bottlenecks. To speed up graph sampling\nand feature lookup, GLT implements Unified Tensor Storage to unify the\nmemory management of CPU and GPU. Based on this storage, GLT supports\nboth CPU-based and GPU-based graph operators such as neighbor sampling,\nnegative sampling, feature lookup, subgraph sampling, etc.\nTo alleviate the CPU-GPU data transfer overhead incurred by feature collection,\nGLT supports caching the features of hot vertices in GPU memory\nand accessing the remaining feature data (stored in pinned memory) via UVA.\nWe further utilize the high-speed NVLink between GPUs to expand the capacity of\nthe GPU cache.\n\nAs for distributed training, to prevent remote data access from blocking\nthe progress of model training, GLT implements an efficient RPC framework on top\nof PyTorch RPC and adopts asynchronous graph sampling and feature lookup operations\nto hide the network latency and boost the end-to-end training throughput.\n\nTo lower the learning curve for PyG users,\nthe APIs of key abstractions in GLT, such as the dataset and dataloader,\nare designed to be compatible with PyG. Thus PyG users can\ntake full advantage of GLT's acceleration capabilities by modifying\nonly a few lines of code.\n\nFor model training, GLT supports different models to fit different scales of\nreal-world graphs. It allows users to collocate model training and graph\nsampling (including feature lookup) in the same process, or to separate them into\ndifferent processes or even different machines.\nWe provide two examples to illustrate the training process on\nsmall graphs: a [single GPU training example](examples/train_sage_ogbn_products.py)\nand a [multi-GPU training example](examples/multi_gpu/). For large-scale graphs,\nGLT separates the sampling and training processes for\nasynchronous and parallel acceleration, and supports deploying the sampling\nand training processes on the same or different machines. Examples of\ndistributed training can be found in the [distributed examples](examples/distributed/).\n\n## Installation\n\n### Requirements\n- CUDA\n- python\u003e=3.6\n- torch (PyTorch)\n- torch_geometric, torch_scatter, torch_sparse. Please refer to [PyG](https://github.com/pyg-team/pytorch_geometric) for installation.\n\n### Pip Wheels\n\n```\n# glibc\u003e=2.14, torch\u003e=1.13\npip install graphlearn-torch\n```\n\n### Build from source\n\n#### Install Dependencies\n```shell\ngit submodule update --init\nsh install_dependencies.sh\n```\n\n#### Python\n1. Build\n``` shell\npython setup.py bdist_wheel\npip install dist/*\n```\n\nBuild in CPU-only mode:\n``` shell\nWITH_CUDA=OFF python setup.py bdist_wheel\npip install dist/*\n```\n\n2. Unit tests\n``` shell\nsh scripts/run_python_ut.sh\n```\n\n#### C++\nIf you only need to test the C++ operations, you can build just the C++ part.\n\n1. Build\n``` shell\ncmake .\nmake -j\n```\n2. Unit tests\n``` shell\nsh scripts/run_cpp_ut.sh\n```\n
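\nAfter a pip or source install of the Python package, a quick smoke test (an optional check, not one of the repository's scripts) is simply to import it:\n\n```python\n# Optional smoke test: check that the Python package imports correctly.\nimport graphlearn_torch as glt\nprint(glt.__name__)\n```\n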
\n## Quick Tour\n\n### Accelerating PyG model training on a single GPU\n\nLet's take PyG's [GraphSAGE on OGBN-Products](https://github.com/pyg-team/pytorch_geometric/blob/master/examples/ogbn_products_sage.py)\nas an example: you only need to replace PyG's `torch_geometric.loader.NeighborSampler`\nwith [`graphlearn_torch.loader.NeighborLoader`](graphlearn_torch.loader.NeighborLoader)\nto benefit from the acceleration of model training using GLT.\n\n```python\nimport torch\nimport graphlearn_torch as glt\nimport os.path as osp\n\nfrom ogb.nodeproppred import PygNodePropPredDataset\n\n# PyG's original code for preparing the ogbn-products dataset.\nroot = osp.join(osp.dirname(osp.realpath(__file__)), '..', 'data', 'products')\ndataset = PygNodePropPredDataset('ogbn-products', root)\nsplit_idx = dataset.get_idx_split()\ndata = dataset[0]\n\n# Enabling GLT acceleration on PyG only requires replacing\n# PyG's NeighborSampler with the following code.\nglt_dataset = glt.data.Dataset()\nglt_dataset.build(edge_index=data.edge_index,\n                  feature_data=data.x,\n                  sort_func=glt.data.sort_by_in_degree,\n                  split_ratio=0.2,\n                  label=data.y,\n                  device=0)\ntrain_loader = glt.loader.NeighborLoader(glt_dataset,\n                                         [15, 10, 5],\n                                         split_idx['train'],\n                                         batch_size=1024,\n                                         shuffle=True,\n                                         drop_last=True,\n                                         as_pyg_v1=True)\n```\n\nThe complete example can be found in [`examples/train_sage_ogbn_products.py`](examples/train_sage_ogbn_products.py).\n\n\u003cdetails\u003e\n\nWhile building the `glt_dataset`, the GPU where the graph sampling operations\nare performed is specified by the parameter `device`. By default, the graph topology is stored\nin pinned memory for ZERO-COPY access. Users can also choose to store the graph\ntopology in GPU memory by specifying `graph_mode='CUDA'` in [`graphlearn_torch.data.Dataset.build`](graphlearn_torch.data.Dataset.build).\nThe `split_ratio` determines the fraction of feature data to be cached in GPU memory.\nBy default, GLT sorts the vertices in descending order of vertex in-degree\nand selects the vertices with higher in-degree for feature caching. The default sort\nfunction used as the input parameter for\n[`graphlearn_torch.data.Dataset.build`](graphlearn_torch.data.Dataset.build) is\n[`graphlearn_torch.data.reorder.sort_by_in_degree`](graphlearn_torch.data.reorder.sort_by_in_degree).\nUsers can also customize their own sort functions with compatible APIs.\n\u003c/details\u003e\n
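\nTo make the caching policy above concrete, here is a small plain-PyTorch sketch (not the GLT API) of what selecting the hottest `split_ratio` fraction of vertices by in-degree means; the function and variable names are illustrative.\n\n```python\n# Illustrative only: picking the hottest vertices for GPU feature caching,\n# following the default policy described above (descending in-degree).\nimport torch\n\ndef select_hot_vertices(edge_index, num_nodes, split_ratio=0.2):\n  # In-degree of each vertex: how often it appears as a destination.\n  in_degree = torch.bincount(edge_index[1], minlength=num_nodes)\n  # Sort vertices by in-degree, descending, and keep the top split_ratio fraction.\n  order = torch.argsort(in_degree, descending=True)\n  return order[:int(split_ratio * num_nodes)]\n\n# Toy graph with 6 vertices: with split_ratio=0.5, the 3 most-referenced destinations\n# would be cached in GPU memory; the rest stay in pinned memory and are read via UVA.\nedge_index = torch.tensor([[0, 1, 2, 3, 4, 5, 0, 1],\n                           [2, 2, 3, 2, 3, 3, 4, 4]])\nprint(select_hot_vertices(edge_index, num_nodes=6, split_ratio=0.5))\n```\n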
\n### Distributed training\n\nDistributed training with PyTorch DDP usually involves the following steps.\n\nFirst, load the graph and features from the partitions.\n```python\nimport os\nimport torch\nimport os.path as osp\nimport graphlearn_torch as glt\n\n# Load the partitions and create a distributed dataset.\n# Partitions are generated by the following script:\n# `python partition_ogbn_dataset.py --dataset=ogbn-products --num_partitions=2`\n\nroot = osp.join(osp.dirname(osp.realpath(__file__)), '..', '..', 'data', 'products')\nglt_dataset = glt.distributed.DistDataset()\nglt_dataset.load(\n  num_partitions=2,\n  partition_idx=int(os.environ['RANK']),\n  graph_dir=osp.join(root, 'ogbn-products-graph-partitions'),\n  feature_dir=osp.join(root, 'ogbn-products-feature-partitions'),\n  label_file=osp.join(root, 'ogbn-products-label', 'label.pt') # whole label\n)\ntrain_idx = torch.load(osp.join(root, 'ogbn-products-train-partitions',\n                                'partition' + str(os.environ['RANK']) + '.pt'))\n```\n\nSecond, create a distributed neighbor loader based on the dataset above.\n```python\n# Distributed neighbor loader.\ntrain_loader = glt.distributed.DistNeighborLoader(\n  data=glt_dataset,\n  num_neighbors=[15, 10, 5],\n  input_nodes=train_idx,\n  batch_size=batch_size,\n  drop_last=True,\n  collect_features=True,\n  to_device=torch.device(rank % torch.cuda.device_count()),\n  worker_options=glt.distributed.MpDistSamplingWorkerOptions(\n    num_workers=nsampling_proc_per_train,\n    worker_devices=[torch.device('cuda', (i + rank) % torch.cuda.device_count())\n                    for i in range(nsampling_proc_per_train)],\n    worker_concurrency=4,\n    master_addr='localhost',\n    master_port=12345, # must differ from the port used for PyTorch training.\n    channel_size='2GB',\n    pin_memory=True\n  )\n)\n```\n\nFinally, define the DDP model and run the training loop.\n```python\nimport torch.distributed as dist\nimport torch.nn.functional as F\nfrom torch.nn.parallel import DistributedDataParallel\nfrom torch_geometric.nn import GraphSAGE\n\n# DDP model.\nmodel = GraphSAGE(\n  in_channels=num_features,\n  hidden_channels=256,\n  num_layers=3,\n  out_channels=num_classes,\n).to(rank)\nmodel = DistributedDataParallel(model, device_ids=[rank])\noptimizer = torch.optim.Adam(model.parameters(), lr=0.01)\n# Training loop.\nfor epoch in range(0, epochs):\n  model.train()\n  for batch in train_loader:\n    optimizer.zero_grad()\n    out = model(batch.x, batch.edge_index)[:batch.batch_size].log_softmax(dim=-1)\n    loss = F.nll_loss(out, batch.y[:batch.batch_size])\n    loss.backward()\n    optimizer.step()\n  dist.barrier()\n```\n\nThe launch commands for 2 nodes, each with 2 GPUs, are as follows:\n```shell\n# node 0:\nCUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch --use_env --nnodes=2 --node_rank=0 --master_addr=xxx dist_train_sage_supervised.py\n\n# node 1:\nCUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch --use_env --nnodes=2 --node_rank=1 --master_addr=xxx dist_train_sage_supervised.py\n```\n
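\nThe snippets above reference names such as `rank`, `batch_size`, `nsampling_proc_per_train`, `epochs`, `num_features`, and `num_classes` that are defined by the surrounding training script. Below is a minimal sketch of how such a setup is typically derived from the environment exported by the launch commands above; the concrete values are only examples, and the linked full example is authoritative.\n\n```python\n# Illustrative setup for the snippets above; values and names are examples,\n# not a verbatim copy of the repository's script.\nimport os\nimport torch\nimport torch.distributed as dist\n\n# torch.distributed.launch/torchrun export RANK and WORLD_SIZE for every process.\nrank = int(os.environ['RANK'])\nworld_size = int(os.environ['WORLD_SIZE'])\n\n# One training process per GPU; NCCL backend for DDP gradient synchronization.\ntorch.cuda.set_device(rank % torch.cuda.device_count())\ndist.init_process_group('nccl', rank=rank, world_size=world_size)\n\n# Hyper-parameters referenced by the loader and model snippets above.\nbatch_size = 1024\nnsampling_proc_per_train = 2\nepochs = 10\nnum_features, num_classes = 100, 47  # ogbn-products: 100-dim features, 47 classes\n```\n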
\nFull code can be found in the [distributed training example](examples/distributed/dist_train_sage_supervised.py).\n\n## License\n[Apache License 2.0](LICENSE)","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Falibaba%2Fgraphlearn-for-pytorch","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Falibaba%2Fgraphlearn-for-pytorch","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Falibaba%2Fgraphlearn-for-pytorch/lists"}