{"id":13958730,"url":"https://github.com/quiver-team/torch-quiver","last_synced_at":"2025-04-04T17:10:32.360Z","repository":{"id":40723033,"uuid":"288162895","full_name":"quiver-team/torch-quiver","owner":"quiver-team","description":"PyTorch Library for Low-Latency, High-Throughput Graph Learning on GPUs.","archived":false,"fork":false,"pushed_at":"2023-08-17T12:28:05.000Z","size":5193,"stargazers_count":299,"open_issues_count":21,"forks_count":36,"subscribers_count":12,"default_branch":"main","last_synced_at":"2025-03-28T16:08:08.258Z","etag":null,"topics":["distributed-computing","geometric-deep-learning","gpu-acceleration","graph-learning","graph-neural-networks","pytorch"],"latest_commit_sha":null,"homepage":"https://torch-quiver.readthedocs.io/en/latest/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/quiver-team.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-08-17T11:32:35.000Z","updated_at":"2025-03-05T14:50:54.000Z","dependencies_parsed_at":"2024-11-28T02:32:42.141Z","dependency_job_id":"6d2a5265-ae5b-441b-bc57-377d098bcddc","html_url":"https://github.com/quiver-team/torch-quiver","commit_stats":{"total_commits":125,"total_committers":19,"mean_commits":6.578947368421052,"dds":0.6719999999999999,"last_synced_commit":"0592669225954e46d9c57c0ca121c365af47b57f"},"previous_names":[],"tags_count":4,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/quiver-team%2Ftorch-quiver","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/Gi
tHub/repositories/quiver-team%2Ftorch-quiver/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/quiver-team%2Ftorch-quiver/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/quiver-team%2Ftorch-quiver/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/quiver-team","download_url":"https://codeload.github.com/quiver-team/torch-quiver/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247217222,"owners_count":20903009,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["distributed-computing","geometric-deep-learning","gpu-acceleration","graph-learning","graph-neural-networks","pytorch"],"created_at":"2024-08-08T13:01:49.756Z","updated_at":"2025-04-04T17:10:32.341Z","avatar_url":"https://github.com/quiver-team.png","language":"Python","funding_links":[],"categories":["图机器学习库"],"sub_categories":["网络服务_其他"],"readme":"[pypi-image]: https://badge.fury.io/py/torch-geometric.svg\n[pypi-url]: https://pypi.org/project/torch-quiver/\n\n\u003cp align=\"center\"\u003e\n  \u003cimg height=\"150\" src=\"docs/multi_medias/imgs/quiver-logo-min.png\" /\u003e\n\u003c/p\u003e\n\n--------------------------------------------------------------------------------\n\nQuiver is a distributed graph learning library for [PyTorch Geometric](https://github.com/pyg-team/pytorch_geometric) (PyG). 
The goal of Quiver is to make distributed graph learning easy to use and high-performance.\n\n[![Documentation Status](https://readthedocs.org/projects/torch-quiver/badge/?version=latest)](https://torch-quiver.readthedocs.io/en/latest/?badge=latest)\n\n--------------------------------------------------------------------------------\n\n## Release 0.2.0 is out!\n\nIn the latest release, `torch-quiver==0.2.0`, we have added support for efficient GNN serving and faster feature collection.\n\n### High-throughput \u0026 Low-latency GNN Serving\n\nQuiver now supports efficient GNN serving. The serving API is simple and easy to use. For example, the following code snippet shows how to use Quiver to serve a GNN model:\n\n\u003c!-- TODO: complete code --\u003e\n\n```python\nfrom torch_geometric.datasets import Reddit\nfrom torch.multiprocessing import Queue\nfrom quiver import RequestBatcher, HybridSampler, InferenceServer\n\n\n# Define the dataset\ndataset = Reddit(...)\n\n# Instantiate the request batcher component\nrequest_batcher = RequestBatcher(stream_input_queue, ...)\n# batched_request_queue_list = [cpu_batched_request_queue_list, gpu_batched_request_queue_list]\nbatched_queue_list = request_batcher.batched_request_queue_list() \n\n# Instantiate the sampler component\nhybrid_sampler = HybridSampler(dataset, batched_queue_list, ...)\n# sampled_request_queue_list = [cpu_sampled_request_queue_list, gpu_sampled_request_queue_list]\nsampled_queue_list = hybrid_sampler.sampled_request_queue_list()\nhybrid_sampler.start()\n\n# Instantiate the inference server component\nserver = InferenceServer(model_path, dataset, sampled_queue_list, ...)\n# result_queue_list = [Queue, ..., Queue]\nresult_queue_list = server.result_queue_list() \n\nserver.start()\n```\n\nA full example that uses Quiver to serve a GNN model with the Reddit dataset on a single machine can be found [here](https://github.com/quiver-team/torch-quiver/blob/main/examples/serving/reddit/reddit_serving.py).\n\n### Test 
Serving\n\n```cmd\n$ cd examples/serving/reddit\n$ python prepare_data.py\n$ python reddit_serving.py\n```\n\n### Key Idea\n\nQuiver's key idea is to exploit **workload metrics** to predict the irregular computation of GNN requests and to govern the use of GPUs for graph sampling and feature aggregation: (1) for graph sampling, Quiver calculates the **probabilistic sampled graph size**, a metric that predicts the degree of parallelism in graph sampling. Quiver uses this metric to assign sampling tasks to GPUs only when GPU sampling outperforms CPU-based sampling; and (2) for feature aggregation, Quiver relies on the **feature access probability** to decide which features to partition and replicate across a distributed GPU NUMA topology. Quiver achieves up to 35$\\times$ lower latency and 8$\\times$ higher throughput compared to state-of-the-art GNN approaches (DGL and PyG).\n\nThe figure below shows a benchmark that compares the serving performance of Quiver, PyG (2.0.3) and [DGL](https://github.com/dmlc/dgl) (1.0.2) on a 2-GPU server that runs the [Reddit with GraphSage](http://snap.stanford.edu/graphsage/) workload. \n\n![Throughput vs. Latency of GNN request serving](docs/serving/tp99.png)\n\n---\n\n## Why Quiver?\n\n----\nThe primary motivation for this project is to make it easy to take a PyG program and scale it across many GPUs and CPUs. In a typical scenario, users develop graph learning programs efficiently with PyG's easy-to-use APIs and rely on Quiver to run these programs at large scale. To make such scaling effective, Quiver has several novel features:\n\u003c!-- \nIf you are a GNN researcher or a `PyG` or `DGL` user and graph sampling and feature collection take too much time when training your GNN models, here are some reasons to try out Quiver for your GNN model training. 
--\u003e\n\n* **High performance**: Quiver enables GPUs to be used effectively to accelerate performance-critical graph learning tasks: graph sampling, feature collection and data-parallel training. Quiver thus often significantly outperforms PyG and DGL even with a single GPU (see benchmark results below), especially when processing large-scale datasets and models.\n\n* **High scalability**: Quiver can achieve (super) linear scalability in distributed graph learning. This results from Quiver's novel adaptive data/feature/processor management techniques and effective use of fast networking technologies (e.g., NVLink and RDMA).\n\n\u003c!-- * **Great performance and scalability**: Using the CPU for graph sampling and feature collection not only leads to poor performance, but also leads to poor scalability because of CPU contention. Quiver, however, can achieve much better scalability and can even achieve `super linear scalability` on machines equipped with NVLink. --\u003e\n\n* **Easy to use**: To use Quiver, developers only need to add a few lines of code to existing PyG programs. Quiver is thus easy for PyG users to adopt and to deploy in production clusters.\n\n\u003c!-- * **Easy-to-use and unified API**:\nIntegrating Quiver into your training pipeline in `PyG` or `DGL` is just a matter of changing several lines of code. We've also implemented an IPC mechanism which makes it a piece of cake to use Quiver to speed up your multi-GPU GNN model training (see the next section for a [quick tour](#quick-tour-for-new-users)).  --\u003e\n\n### Faster Feature Aggregation\n\nFeature aggregation is one of the performance bottlenecks of GNN systems. Quiver enables faster feature aggregation with the following techniques:\n\n- Quiver uses the **feature access probability** metric to place popular features strategically on GPUs. 
A primary objective of feature placement is to\nenable GPUs to take advantage of low-latency connectivity,\nsuch as NVLink and InfiniBand, to their peer GPUs. This\nallows GPUs to achieve low-latency access to features when\naggregating features.\n\n- Quiver uses GPU kernels that can leverage efficient one-sided\nreads to access remote features over NVLink/InfiniBand.\n\nMore details of our feature aggregation techniques can be found in our repo [quiver-feature](https://github.com/quiver-team/quiver-feature).\n\n\u003c!-- **Quiver** is a high-performance GNN training add-on which can fully utilize the hardware to achieve the best GNN training performance. By integrating Quiver into your GNN training pipeline with **just several lines of code change**, you can enjoy **much better end-to-end performance** and **much better scalability with multiple GPUs**. You can even achieve **super linear scalability** if your GPUs are connected with NVLink, as Quiver will help you make full use of NVLink. --\u003e\n\nThe chart below shows a benchmark that compares the performance of Quiver, PyG (2.0.1) and [DGL](https://github.com/dmlc/dgl) (0.7.0) on a 4-GPU server that runs the [Open Graph Benchmark](https://ogb.stanford.edu/). \n\n![e2e_benchmark](docs/multi_medias/imgs/benchmark_e2e_performance-min.png)\n\nWe will add multi-node results soon.\n\nFor system design details, see Quiver's [design overview](docs/Introduction_en.md) (Chinese version: [设计简介](docs/Introduction_cn.md)).\n\n\n## Install\n\n----\n### Install Dependencies\n\nBefore installing Quiver, install its dependencies:\n  1. Install [PyTorch](https://pytorch.org/get-started/locally/)\n  2. 
Install [PyG](https://github.com/pyg-team/pytorch_geometric)\n   \n### Pip Install\n\n```cmd\n$ pip install torch-quiver\n```\n\nWe have tested Quiver with the following setup:\n\n* OS: Ubuntu 18.04, Ubuntu 20.04\n* CUDA: 10.2, 11.1\n* GPU: P100, V100, Titan X, A6000\n\n\u003c!-- |     OS        | `cu102` | `cu111` |\n|-------------|---------|---------|\n| **Ubuntu**   | ✅      | ✅      | --\u003e\n\n### Install From Source\n\n```cmd\n$ git clone https://github.com/quiver-team/torch-quiver.git \u0026\u0026 cd torch-quiver\n$ QUIVER_ENABLE_CUDA=1 python setup.py install\n```\n\n### Test Install\n\nYou can download Quiver's examples to test installation:\n\n```cmd\n$ git clone git@github.com:quiver-team/torch-quiver.git \u0026\u0026 cd torch-quiver\n$ python3 examples/pyg/reddit_quiver.py\n```\n\nA successful run should contain the following line:\n\n`Epoch xx, Loss: xx.yy, Approx. Train: xx.yy`\n\n\n\u003c!-- ### Install from source\n\nTo build Quiver from source:\n\n```cmd\n$ git clone git@github.com:quiver-team/torch-quiver.git \u0026\u0026 cd torch-quiver\n$ sh ./install.sh\n``` --\u003e\n\n### Use Quiver with Docker\n\n[Docker](https://www.docker.com/) is the simplest way to use Quiver. Check the [guide](docker/README.md) for details.\n\n\n## Quick Start\n\nTo use Quiver, you need to replace PyG's graph sampler and feature collector with `quiver.Sampler` and `quiver.Feature`. The replacement usually requires only a few changes in existing PyG programs. \n\n### Use Quiver in Single-GPU PyG Scripts\n\nOnly three steps are required to enable Quiver in a single-GPU PyG script:\n\n```python\nimport quiver\n\n...\n\n## Step 1: Replace PyG graph sampler\n# train_loader = NeighborSampler(data.edge_index, ...) 
# Comment out PyG sampler\ntrain_loader = torch.utils.data.DataLoader(train_idx) # Quiver: PyTorch DataLoader\nquiver_sampler = quiver.pyg.GraphSageSampler(quiver.CSRTopo(data.edge_index), sizes=[25, 10]) # Quiver: Graph sampler\n\n...\n\n## Step 2: Replace PyG feature collectors\n# feature = data.x.to(device) # Comment out PyG feature collector\nquiver_feature = quiver.Feature(rank=0, device_list=[0]).from_cpu_tensor(data.x) # Quiver: Feature collector\n\n...\n  \n## Step 3: Train PyG models with Quiver\n# for batch_size, n_id, adjs in train_loader: # Comment out PyG training loop\nfor seeds in train_loader: # Use PyTorch training loop in Quiver\n  n_id, batch_size, adjs = quiver_sampler.sample(seeds)  # Use Quiver graph sampler\n  batch_feature = quiver_feature[n_id]  # Use Quiver feature collector\n  ...\n...\n\n```\n### Use Quiver in Multi-GPU PyG Scripts\n\nTo use Quiver in multi-GPU PyG scripts, we can simply pass `quiver.Feature` and `quiver.Sampler` as arguments to the child processes launched in PyTorch's DDP training, as shown below:\n\n```python\nimport torch.multiprocessing as mp\nimport quiver\n\n# PyG DDP function that trains GNN models\ndef ddp_train(rank, feature, sampler):\n  ...\n\n# Replace PyG graph sampler and feature collector with Quiver's alternatives\nquiver_sampler = quiver.pyg.GraphSageSampler(...)\nquiver_feature = quiver.Feature(...)\n\nmp.spawn(\n    ddp_train,\n    args=(quiver_feature, quiver_sampler), # Pass Quiver components as arguments\n    nprocs=world_size,\n    join=True\n)\n```\n\nA full multi-GPU Quiver example is [here](examples/multi_gpu/pyg/ogb-products/dist_sampling_ogb_products_quiver.py).\n\n### Run Quiver\n\nBelow is an example command that runs the Quiver script `examples/pyg/reddit_quiver.py`:\n\n```cmd\n$ python3 examples/pyg/reddit_quiver.py\n```\n\nQuiver has the same launch command on both single-GPU servers and multi-GPU servers. We will provide multi-node examples soon. 
\n\u003c!-- We are developing an adaptive end-to-end parallelism system in a distributed cluster.  --\u003e\n\n\u003c!-- You can check [our reddit example](examples/pyg/reddit_quiver.py) for details. --\u003e\n\n## Examples\n\nWe provide rich examples to show how to enable Quiver in real-world PyG scripts:\n\n- Enabling Quiver in PyG's single-GPU examples: [ogbn-product](examples/pyg/) and [reddit](examples/pyg/).\n- Enabling Quiver in PyG's multi-GPU examples: [ogbn-product](examples/multi_gpu/pyg/ogb-products/) and [reddit](examples/multi_gpu/pyg/reddit/).\n\n## Documentation\n\nQuiver provides many parameters to optimise the performance of its graph samplers (e.g., GPU-local or CPU-GPU hybrid) and feature collectors (e.g., feature replication/sharding strategies). Check [Documentation](https://torch-quiver.readthedocs.io/en/latest/) for details.\n\n\u003c!-- ## License\n\nQuiver is released under the Apache 2.0 license.  --\u003e\n\n## Community\n\nWe welcome contributors to join the development of Quiver. Quiver is currently maintained by researchers from the [University of Edinburgh](https://www.ed.ac.uk/), [Imperial College London](https://www.imperial.ac.uk/), [Tsinghua University](https://www.tsinghua.edu.cn/en/index.htm) and [University of Waterloo](https://uwaterloo.ca/). 
The development of Quiver has received support from [Alibaba](https://github.com/alibaba) and [Lambda Labs](https://lambdalabs.com/).\n\n## Citation\n\n\u003c!--TODO: complete citation--\u003e\nIf you find the design of Quiver useful or use Quiver in your work, please cite Quiver with the BibTeX entry below:\n```bibtex\n@misc{quiver2023,\n    author = {Zeyuan Tan and Xiulong Yuan and Congjie He and Man-Kit Sit and Guo Li and Xiaoze Liu and Baole Ai and Kai Zeng and Peter Pietzuch and Luo Mai},\n    title = {Quiver: Supporting GPUs for Low-Latency, High-Throughput GNN Serving with Workload Awareness},\n    eprint={2305.10863},\n    year = {2023}\n}\n```\n\n\u003c!-- ## Architecture Overview\nA key reason behind Quiver's high performance is that it provides two key components: `quiver.Feature` and `quiver.Sampler`.\n\nQuiver provides users with a **UVA-Based** (Unified Virtual Addressing Based) graph sampling operator, which supports storing graph topology data in CPU memory and sampling the graph with the GPU. In this way, we not only get performance benefits beyond CPU sampling, but can also process graphs that are too large to host in GPU memory. With UVA, Quiver achieves nearly **20x** the sampling performance of CPU-based graph sampling. Besides `UVA mode`, Quiver also supports a `GPU` sampling mode, which hosts all graph topology data in GPU memory and gives a 40% ~ 50% performance benefit w.r.t. `UVA` sampling.\n\n![uva_sample](docs/multi_medias/imgs/UVA-Sampler.png)\n\n\nA training batch in a GNN also consumes hundreds of MBs of memory, and moving data of this size across CPU memory or between CPU memory and GPU memory takes hundreds of milliseconds. Quiver utilizes the high throughput between page-locked memory and GPU memory, the high throughput of p2p memory access between different GPUs' memory when they are connected with NVLink, and the high throughput of local GPU global memory access to achieve 4-10x higher feature collection throughput compared to the conventional method (i.e. 
use CPU to do sparse feature collection and transfer data to GPU). It partitions data across local GPU memory, other GPUs' memory (if they are connected to the current GPU with NVLink) and CPU page-locked memory. \n\nWe also discovered that node degrees in real graphs often follow a power-law distribution, and nodes with high degree are accessed more often during training and sampling. `quiver.Feature` can also do some preprocessing to ensure that the hottest data are always in GPU memory (the local GPU's memory or another GPU's memory which can be accessed via p2p), which further improves feature collection performance during training.\n\n![feature_collection](docs/multi_medias/imgs/single_device.png)\n\nFor system design details, you can read our [introduction](docs/Introduction_en.md). We also provide a Chinese version: [中文版本系统介绍](docs/Introduction_cn.md) --\u003e\n\n\n\u003c!-- ## Benchmarks\n\nHere we show benchmarks for graph sampling, feature collection and end-to-end training. They are all tested on open datasets.\n\n### Sample benchmark\nQuiver's sampling can be configured to use UVA sampling (`mode='UVA'`) or GPU sampling (`mode='GPU'`), hosting the whole graph structure in CPU memory and GPU memory respectively.\nWe use **S**ampled **E**dges **P**er **S**econd (**SEPS**) as the metric to evaluate sampling performance. **Without storing the graph on GPU, Quiver gets a 20x speedup on real datasets**.\n\n![sample benchmark](docs/multi_medias/imgs/benchmark_img_sample.png)\n\n### Feature collection benchmark\n\nWe constrain each GPU to cache 20% of the feature data. Quiver can achieve **10x throughput** on ogbn-product data compared to CPU feature collection.\n\n![single_device](docs/multi_medias/imgs/benchmark_img_feature_single_device.png)\n\nIf your GPUs are connected with NVLink, Quiver can make full use of it and achieve a **super linear throughput increase**. 
Our test machine has 2 GPUs connected with NVLink and we still constrain each GPU caching 20% percent of feature data(which means 40% feature data are cached on GPU with 2 GPUs), we achieve 4~5x total throughput increase with the second GPU comes in.\n\n![p2p_access](docs/multi_medias/imgs/p2p_access.png)\n\n![super_linear](docs/multi_medias/imgs/super_linear_feature_bench.png)\n\n### End2End training benchmark\n\nWith high performance sampler and feature collection, Quiver not only achieve good performance with single GPU training, but also enjoys good scalability. We modify [PyGs official multi-gpu training example](https://github.com/pyg-team/pytorch_geometric/blob/master/examples/multi_gpu/distributed_sampling.py) to train `ogbn-product`([code file is here](example/multi_gpu/pyg/ogb-products)). By constraining each GPU to cache only 20% of feature data, we can achieve better scalability even compared with placing all of feature data in GPU in PyG. \n\n![e2e_benchmark](docs/multi_medias/imgs/benchmark_e2e_performance.png)\n\nWhen training with multi-GPU and there are no NVLinks between these GPUs, Quiver will use `device_replicate` cache policy by default(you can refer to our [introduction](docs/Introductions_en.md) to learn more about this cache policy). If you have NVLinks, Quiver can make several GPUs share their GPU memory and cache more data to achieve higher feature collection throughput. 
Our test machine has 2 GPUs connected with NVLink and we still constrain each GPU caching 20% percent of feature data(which means 40% feature data are cached on GPU with 2 GPUs), we show our scalability results here:\n\n![](docs/multi_medias/imgs/nvlink_e2e.png) --\u003e\n\n\n\n\u003c!-- ## Note\n\nIf you notice anything unexpected, please open an [issue](https://github.com/quiver-team/torch-quiver/issues) and let us know.\nIf you have any questions or are missing a specific feature, feel free to discuss them with us.\nWe are motivated to constantly make Quiver even better. --\u003e\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fquiver-team%2Ftorch-quiver","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fquiver-team%2Ftorch-quiver","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fquiver-team%2Ftorch-quiver/lists"}