https://github.com/veritasyin/subg_acc
SubG is a C/OpenMP-based library for accelerating subgraph operations in Python.
- Host: GitHub
- URL: https://github.com/veritasyin/subg_acc
- Owner: VeritasYin
- License: bsd-2-clause
- Created: 2021-06-22T18:42:50.000Z (about 4 years ago)
- Default Branch: master
- Last Pushed: 2024-12-31T02:04:56.000Z (6 months ago)
- Last Synced: 2025-06-26T01:02:15.206Z (about 22 hours ago)
- Topics: graph-representation-learning, parallel-computing, scalable-graph-learning, subgraph
- Language: C
- Homepage:
- Size: 1.68 MB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# **SubG**: Subgraph Operation Accelerator
The `SubG` package is an extension library based on C and OpenMP to accelerate subgraph operations for building structural features and subgraph-based graph representation learning (SGRL).
Following the principle of algorithm-system co-design, subgraph queries (e.g., ego-networks in canonical SGRL) for target links, motifs, and higher-order patterns can be decomposed into node-level intermediate results (e.g., a collection of walks from `walk_sampler` in [SUREL](https://arxiv.org/abs/2202.13538), or a set of nodes from `gset_sampler` in [SUREL+](https://github.com/VeritasYin/SUREL_Plus/blob/main/manuscript/SUREL_Plus_Full.pdf)). Joining these results yields a proxy for the subgraph, and the node-level results can be reused across different queries.
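The reuse idea above can be illustrated with a minimal pure-Python sketch. It does not call `SubG`; `sample_node_set`, `node_result`, and the toy graph are hypothetical stand-ins, and the union of two node sets is only a simplified proxy for the joined subgraph.

```python
import random

def sample_node_set(adj, seed_node, num_walks, num_steps, rng):
    """Hypothetical node-level sampler: the set of nodes hit by random walks."""
    visited = {seed_node}
    for _ in range(num_walks):
        cur = seed_node
        for _ in range(num_steps):
            if not adj[cur]:
                break  # dead end: stop this walk early
            cur = rng.choice(adj[cur])
            visited.add(cur)
    return visited

# Toy undirected graph as adjacency lists (assumed example data).
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
rng = random.Random(0)
cache = {}  # node -> sampled node set, computed once per node

def node_result(node):
    """Return the cached node-level result, sampling only on first use."""
    if node not in cache:
        cache[node] = sample_node_set(adj, node, num_walks=4, num_steps=2, rng=rng)
    return cache[node]

# Three link queries touch six endpoints but only three distinct nodes.
queries = [(0, 1), (0, 3), (1, 3)]
proxies = {(u, v): node_result(u) | node_result(v) for u, v in queries}

print(len(cache))  # number of sampler runs actually performed
```

Each node is sampled once, however many queries it appears in, which is the saving that the node-level decomposition aims for.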
Currently, `SubG` consists of the following methods for efficient and scalable implementation of SGRLs:
- `gset_sampler`: node set sampling with a structure encoder of landing probability (LP)
- `walk_sampler`: walk sampling with a relative positional encoder (RPE)
- `batch_sampler`: query sampling (a group of nodes) for mini-batch training of link prediction
- `walk_join`: online joining of node-level walks to construct the proxy of a subgraph for given queries (e.g. a link query $Q= \lbrace u,v \rbrace$ $\to$ join sampled walks of node $u$ and $v$ as $\mathcal{G}_{Q} = \lbrace W_u \uplus W_v \rbrace$)

## Update
**Dec. 30, 2024**:
* Release v2.3 with bug fixes and improved memory efficiency.
* Support pip install and macOS.

**Feb. 25, 2023**:
* Release v2.2 with more robust memory management for allocation, release, and indexing (billion edges).
* Add bitwise-based hash for encoding structural features.
* Add test cases and a script for measuring wall time.

**Jan. 29, 2023**:
* Release v2.1 with refactored code base.
* More robust memory access with a buffer for the set sampler on large graphs (million nodes).

**Jan. 28, 2023**:
* Release v2.0 with the walk-based set sampler `gset_sampler`.

## Requirements
(Other versions may work, but are untested.)

- python >= 3.8
- numpy >= 1.17
- gcc >= 8.4
- openmp (on macOS, install llvm-openmp via Conda)

## Installation
```
pip install .
```

## Functions
### walk_sampler
```
subg.walk_sampler(indptr, indices, query, num_walks, num_steps)
-> (numpy.array [n, num_walks*(num_steps+1)], n * (numpy.array [?], numpy.array [?,num_steps+1]))
```

Sample a collection of paths for each node in `query` (of size `n`) by running `num_walks` random walks of `num_steps` steps each on the input graph in CSR format (`indptr`, `indices`), and encode the landing probability at each step of all nodes in the sampled set as structural features of the seed node.
For usage examples, see [test.py](https://github.com/VeritasYin/subg_acc/blob/master/test/test.py).
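As a rough illustration of the description above, the following pure-Python sketch builds CSR arrays for a toy graph and samples walks with per-step landing counts. It is a reference sketch under stated assumptions, not the library's C/OpenMP implementation: `build_csr` and `sample_walks` are hypothetical names, plain lists stand in for NumPy arrays, and raw counts are used in place of normalized probabilities.

```python
import random

def build_csr(num_nodes, edges):
    """Build CSR arrays (indptr, indices) from an undirected edge list."""
    neighbors = [[] for _ in range(num_nodes)]
    for u, v in edges:
        neighbors[u].append(v)
        neighbors[v].append(u)
    indptr, indices = [0], []
    for nbrs in neighbors:
        indices.extend(sorted(nbrs))
        indptr.append(len(indices))  # row i spans indices[indptr[i]:indptr[i+1]]
    return indptr, indices

def sample_walks(indptr, indices, query, num_walks, num_steps, seed=0):
    """Uniform random walks per query node, plus landing counts per step."""
    rng = random.Random(seed)
    walks, encodings = [], []
    for q in query:
        flat, counts = [], {}  # counts: node -> hits at steps 0..num_steps
        for _ in range(num_walks):
            cur, walk = q, [q]
            for _ in range(num_steps):
                start, end = indptr[cur], indptr[cur + 1]
                if end > start:  # an isolated node stays in place
                    cur = indices[rng.randrange(start, end)]
                walk.append(cur)
            for step, node in enumerate(walk):
                counts.setdefault(node, [0] * (num_steps + 1))[step] += 1
            flat.extend(walk)
        walks.append(flat)  # length num_walks * (num_steps + 1)
        encodings.append(counts)
    return walks, encodings

indptr, indices = build_csr(4, [(0, 1), (0, 2), (1, 2), (2, 3)])
walks, encodings = sample_walks(indptr, indices, query=[0, 3], num_walks=3, num_steps=2)
print(indptr, indices)
print(walks[0], encodings[0])
```

Each row of `walks` has `num_walks * (num_steps + 1)` entries, matching the first shape in the signature above.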
#### Parameters
* **indptr** *(np.array)* - Index pointer array of the adjacency matrix in CSR format.
* **indices** *(np.array)* - Index array of the adjacency matrix in CSR format.
* **query** *(np.array / list)* - Nodes to sample from.
* **num_walks** *(int)* - The number of random walks.
* **num_steps** *(int)* - The number of steps in a walk.
* **nthread** *(int, optional)* - The number of threads.
* **seed** *(int, optional)* - Random seed.

#### Returns
* **walks** *(np.array)* - Sampled walks $W_q$ for each node in `query`.
* **rpes** *(np.array, np.array)* - The unique node set of the sampled walks for each node in `query`, and the corresponding structural encodings.

### gset_sampler
```
subg.gset_sampler(indptr, indices, query, num_walks, num_steps)
-> (numpy.array [n], numpy.array [2,?], numpy.array [?,num_steps+1])
```

Sample a node set for each node in `query` (of size `n`) by running `num_walks` random walks of `num_steps` steps each on the input graph in CSR format (`indptr`, `indices`), and encode the landing probability at each step of all nodes in the sampled set as structural features of the seed node.
For usage examples, see [test.py](https://github.com/VeritasYin/subg_acc/blob/master/test/test.py).
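The sketch below shows one way the three outputs could fit together: a set size per query node, a node-to-row mapping, and a table of unique encodings. It is a pure-Python approximation under stated assumptions, not the library's implementation: `sample_node_sets` is a hypothetical name, landing counts stand in for probabilities, and the CSR arrays describe an assumed toy graph with edges 0-1, 0-2, 1-2, and 2-3.

```python
import random

def sample_node_sets(indptr, indices, query, num_walks, num_steps, seed=0):
    """Sample a node set per query node and compress duplicate encodings."""
    rng = random.Random(seed)
    nsize, remap_nodes, remap_rows = [], [], []
    enc, row_of = [], {}  # unique encoding rows and their lookup table
    for q in query:
        # The seed node is hit by every walk at step 0.
        counts = {q: [num_walks] + [0] * num_steps}
        for _ in range(num_walks):
            cur = q
            for step in range(1, num_steps + 1):
                start, end = indptr[cur], indptr[cur + 1]
                if end > start:  # an isolated node stays in place
                    cur = indices[rng.randrange(start, end)]
                counts.setdefault(cur, [0] * (num_steps + 1))[step] += 1
        nsize.append(len(counts))
        for node, vec in counts.items():
            key = tuple(vec)
            if key not in row_of:  # store each distinct encoding only once
                row_of[key] = len(enc)
                enc.append(vec)
            remap_nodes.append(node)
            remap_rows.append(row_of[key])
    return nsize, [remap_nodes, remap_rows], enc

# CSR arrays of the assumed toy graph.
indptr = [0, 2, 4, 7, 8]
indices = [1, 2, 0, 2, 0, 1, 3, 2]
nsize, remap, enc = sample_node_sets(indptr, indices, query=[3, 0], num_walks=4, num_steps=2)
print(nsize, remap, enc)
```

Different nodes often share the same encoding, so storing only distinct rows in `enc` and pointing to them from `remap` keeps the output compact.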
#### Parameters
* **indptr** *(np.array)* - Index pointer array of the adjacency matrix in CSR format.
* **indices** *(np.array)* - Index array of the adjacency matrix in CSR format.
* **query** *(np.array / list)* - Nodes to sample from.
* **num_walks** *(int)* - The number of random walks.
* **num_steps** *(int)* - The number of steps in a walk.
* **bucket** *(int, optional)* - The buffer size for sampled neighbors per node.
* **nthread** *(int, optional)* - The number of threads.
* **seed** *(int, optional)* - Random seed.

#### Returns
* **nsize** *(np.array)* - The size of sampled set for each node in `query`.
* **remap** *(np.array)* - Pairs of node id and the index of its associated structural encoding in the `enc` array.
* **enc** *(np.array)* - The compressed (unique) encoding of structural features.

### walk_join
```
subg.walk_join(walks, indices, query)
-> (numpy.array [2,n*num_walks*(num_steps+1)*2])
```
Join the sampled walks for nodes in each `query` (size of `n`). For a link query $Q= \lbrace u,v \rbrace$, `walk_join` returns the indices of structural features for $u$ as $W_{u|u} \bigoplus W_{u|v}$ and for $v$ as $W_{v|u} \bigoplus W_{v|v}$.

For usage examples, see [test.py](https://github.com/VeritasYin/subg_acc/blob/master/test/test.py).
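One possible reading of the join, sketched in pure Python: for every position in the walks of $u$ and $v$, look up the feature index of the visited node relative to $u$ and relative to $v$. This is an interpretation, not the library's code. `join_walks`, `feat_index`, the `missing` placeholder, and the example data are all hypothetical.

```python
def join_walks(walks, feat_index, query, missing=-1):
    """Pair each walk position with feature indices relative to both endpoints.

    walks[s]      -- flattened walks sampled from seed node s
    feat_index[s] -- maps a node to its feature row relative to seed s
    missing       -- placeholder when a node was never reached from a seed
    """
    joined = [[], []]  # row 0: relative to u, row 1: relative to v
    for u, v in query:
        for seed in (u, v):
            for node in walks[seed]:
                joined[0].append(feat_index[u].get(node, missing))
                joined[1].append(feat_index[v].get(node, missing))
    return joined

# Assumed example: num_walks = 2, num_steps = 2, so 6 positions per seed.
walks = {0: [0, 1, 2, 0, 2, 3], 1: [1, 0, 2, 1, 2, 1]}
feat_index = {0: {0: 0, 1: 1, 2: 2, 3: 3}, 1: {1: 4, 0: 5, 2: 6}}
joined = join_walks(walks, feat_index, query=[(0, 1)])
print(joined)
```

Under this reading, each of the two rows has `n * num_walks * (num_steps + 1) * 2` entries, which agrees with the output shape in the signature above.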
#### Parameters
* **walks** *(np.array)* - Sampled walks to be joined.
* **indices** *(np.array)* - Index array of the adjacency matrix in CSR format.
* **query** *(np.array / list)* - Queries (groups of nodes) whose sampled walks are joined.
* **nthread** *(int, optional)* - The number of threads.

#### Returns
* **join_walk** *(np.array)* - The indices of structural features attached to the joint walks of given queries.