{"id":20674665,"url":"https://github.com/quiver-team/quiver-feature","last_synced_at":"2025-04-19T20:33:49.878Z","repository":{"id":36956137,"uuid":"473634739","full_name":"quiver-team/quiver-feature","owner":"quiver-team","description":"High performance RDMA-based distributed feature collection component for training GNN model on EXTREMELY large graph","archived":false,"fork":false,"pushed_at":"2022-07-03T02:38:17.000Z","size":1491,"stargazers_count":52,"open_issues_count":8,"forks_count":7,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-04-18T09:11:03.520Z","etag":null,"topics":["gnn","graph","high-performance","quiver","rdma"],"latest_commit_sha":null,"homepage":"","language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/quiver-team.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2022-03-24T14:12:07.000Z","updated_at":"2025-04-15T02:10:43.000Z","dependencies_parsed_at":"2022-07-29T08:39:56.330Z","dependency_job_id":null,"html_url":"https://github.com/quiver-team/quiver-feature","commit_stats":null,"previous_names":["quiver-team/quiver_feature"],"tags_count":2,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/quiver-team%2Fquiver-feature","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/quiver-team%2Fquiver-feature/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/quiver-team%2Fquiver-feature/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/quiver-team%2Fquiver-feature/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hos
ts/GitHub/owners/quiver-team","download_url":"https://codeload.github.com/quiver-team/quiver-feature/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":249794785,"owners_count":21326774,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["gnn","graph","high-performance","quiver","rdma"],"created_at":"2024-11-16T21:06:37.753Z","updated_at":"2025-04-19T20:33:49.838Z","avatar_url":"https://github.com/quiver-team.png","language":"C++","funding_links":[],"categories":[],"sub_categories":[],"readme":"[pypi-image]: https://badge.fury.io/py/torch-geometric.svg\n[pypi-url]: https://pypi.org/project/quiver-feature/\n\n\u003cp align=\"center\"\u003e\n  \u003cimg height=\"150\" src=\"https://github.com/quiver-team/torch-quiver/blob/main/docs/multi_medias/imgs/quiver-logo-min.png\" /\u003e\n\u003c/p\u003e\n\n--------------------------------------------------------------------------------\n\nQuiver-Feature is an RDMA-based, high-performance **distributed feature collection component** for **training GNN models on extremely large graphs**. It is built on [Quiver](https://github.com/quiver-team/torch-quiver) and has several novel features:\n\n1. **High Performance**: Quiver-Feature delivers **5-10x the throughput** of feature collection solutions in existing GNN systems such as [DGL](https://github.com/dmlc/dgl) and [PyG](https://github.com/pyg-team/pytorch_geometric).\n\n2. 
**Maximum Hardware Resource Utilization Efficiency**: Quiver-Feature has minimal CPU usage and minimal memory bus traffic, leaving much of the CPU and memory resources for tasks like graph sampling and model training.\n\n3. **Easy to use**: To use Quiver-Feature, developers only need to add a few lines of code to existing PyG/DGL programs. Quiver-Feature is thus easy for PyG/DGL users to adopt and to deploy in production clusters.\n\n![train_gnn_models_on_large_graph](docs/imgs/train_gnn_on_large_graphs.png)\n\n--------------------------------------------------------------------------------\n\n# GPU-centric Data Placement And Zero-Copy Data Access\n\n**`GPU-centric data placement`** and the **`Zero-Copy data access method`** are the two keys to Quiver-Feature's high performance.\n\n**`GPU-Centric Data Placement`:** Quiver-Feature has a unified view of memories across heterogeneous devices and machines. It classifies these memories into 4 memory spaces under a GPU-centric view: **Local HBM** (the current GPU's memory), **Neighbor HBM**, **Local DRAM** (the current machine's CPU memory) and **Remote DRAM** (a remote machine's CPU memory). These 4 memory spaces are connected to each other through PCIe, NVLink, RDMA, etc.\n\n![memory_view](docs/imgs/consistent_memory_view.png)\n\nAccess performance from a GPU is uneven across these memory spaces. Because feature data access frequency during GNN training is also uneven, Quiver-Feature uses an **`application-aware and GPU-Centric data placement algorithm`** to take full advantage of the GPU-centric multi-level memory layers.\n\n**`Zero-Copy Data Access`:** Feature collection in GNN training involves massive data movement across the network, DRAM, PCIe and NVLink, and any extra memory copy hurts end-to-end performance. 
Quiver-Feature uses one-sided communication methods, namely `UVA` for local memory space access (Local HBM, Local DRAM, Neighbor HBM) and `RDMA READ` for remote memory space access (Remote DRAM), achieving zero-copy access with minimal CPU intervention. ([See this document for more RDMA details](docs/rdma_details.md).)\n\n\n**`DistTensorPGAS`:** On top of these memory spaces, Quiver-Feature adopts the **[`PGAS`](https://en.wikipedia.org/wiki/Partitioned_global_address_space) memory model** and implements a 2-dimensional distributed tensor abstraction called `DistTensorPGAS`. Users can use `DistTensorPGAS` just like a local torch.Tensor, for example by querying its `shape` or performing `slicing` operations.\n\n![pgas_tensor](docs/imgs/pgas_tensor_view.png)\n\n\n# Performance Benchmark\n\nAs far as we know, no public GNN system directly supports using RDMA for feature collection. `DGL` uses [TensorPipe](https://github.com/pytorch/tensorpipe) as its RPC backend. [TensorPipe](https://github.com/pytorch/tensorpipe) itself supports RDMA, but `DGL` has not integrated this feature. Since [TensorPipe](https://github.com/pytorch/tensorpipe) is also the [official RPC backend](https://pytorch.org/docs/stable/rpc.html#torch.distributed.rpc.init_rpc) of PyTorch, we compare the feature collection performance of `Quiver-Feature` with that of a `Pytorch-RPC Based Solution`.\n\nWe use 2 machines connected by a 100Gbps InfiniBand (IB) network. We partition the data uniformly and start M GPU training processes on each machine (referred to as `2 Machines 2M GPUs` in the result chart below). We benchmark the feature collection performance of `Quiver-Feature` and the `Pytorch-RPC Based Solution`, and `Quiver-Feature` is 5x better than the `Pytorch-RPC Based Solution` in all settings.\n\n![img](docs/imgs/e2e_feature_collection.png)\n\n# Install\n\n## Install From Source (Recommended For Now)\n1. Install [Quiver](https://github.com/quiver-team/torch-quiver).\n\n2. 
Install Quiver-Feature from source.\n\n        $ git clone git@github.com:quiver-team/quiver-feature\n        $ cd quiver-feature/\n        $ pip install .\n\n## Pip Install\n\n1. Install [Quiver](https://github.com/quiver-team/torch-quiver).\n\n2. Install the `Quiver-Feature` pip package.\n\n        $ pip install quiver-feature\n\nWe have tested Quiver with the following setup:\n\n - OS: Ubuntu 18.04, Ubuntu 20.04\n\n - CUDA: 10.2, 11.1\n\n - GPU: Nvidia P100, V100, Titan X, A6000\n\n## Test Install\n\nYou can download Quiver-Feature's examples to test the installation:\n\n        $ git clone git@github.com:quiver-team/quiver-feature.git\n        $ cd quiver-feature/examples/reddit\n        $ python3 distribute_training.py\n\nA successful run should contain the following line:\n\n`Starting Server With: xxxx`\n\n\n# Quick Start\n\nTo use Quiver-Feature, you need to replace PyG's feature tensors with `quiver_feature.DistTensorPGAS`. This usually requires only a few changes to existing PyG programs, following these 4 steps on each machine:\n\n- Load the feature partition and metadata that belong to the current machine.\n\n- Exchange feature partition metadata with other processes using `quiver_feature.DistHelper`.\n\n- Create a `quiver_feature.DistTensorPGAS` from the local feature partition and metadata.\n\n- Pass the `quiver_feature.DistTensorPGAS` built above as a parameter to each training process for feature collection.\n\nHere is a simple example of using Quiver-Feature in a PyG program. 
You can check the [original script](examples/reddit/distribute_training.py) for more details.\n\n```python\n    import torch.multiprocessing as mp\n    from quiver_feature import DistHelper, DistTensorPGAS\n\n    def train_process(rank, dist_tensor):\n        ...\n        for batch_size, n_id, adjs in train_loader:\n                ...\n                # Using DistTensorPGAS Just Like A torch.Tensor\n                collected_feature = dist_tensor[n_id]\n                ...\n\n    if __name__ == \"__main__\":\n\n        # Step 1: Load Local data partition\n        local_tensor, cached_range, local_range = load_partitioned_data(...)\n\n        # Step 2: Exchange Tensor Endpoints Information\n        dist_helper = DistHelper(...)\n        tensor_endpoints = dist_helper.exchange_tensor_endpoints_info()\n\n        # Step 3: Build DistTensorPGAS from local feature partition\n        dist_tensor = DistTensorPGAS(...)\n\n\n        # Step 4: Spawn Training Processes Using DistTensorPGAS as Parameter\n        mp.spawn(\n                train_process,\n                args=(..., dist_tensor, ...),\n                nprocs=args.device_per_node,\n                join=True\n        )\n        ...\n\n```\n\n# License\n\nQuiver-Feature is licensed under the Apache License, Version 2.0.\n\n# Citation\nIf you use Quiver-Feature in your publication, please cite it using the following BibTeX entry.\n\n    @Misc{Quiver-Feature,\n        institution = {Quiver Team},\n        title =  {Quiver-Feature: A High Performance Feature Collection Component For Training GNN On Extremely Large Graphs},\n        howpublished = {\\url{https://github.com/quiver-team/quiver-feature}},\n        year = {2022}\n    }","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fquiver-team%2Fquiver-feature","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fquiver-team%2Fquiver-feature","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fquiver-team%2Fquiver-feature/lists"}