<div align="center">

# GraphAny: Fully-inductive Node Classification on Arbitrary Graphs #

[![pytorch](https://img.shields.io/badge/PyTorch_2.1+-ee4c2c?logo=pytorch&logoColor=white)](https://pytorch.org/get-started/locally/)
[![lightning](https://img.shields.io/badge/-Lightning_2.2+-792ee5?logo=pytorchlightning&logoColor=white)](https://pytorchlightning.ai/)
[![pyg](https://img.shields.io/badge/PyG_2.4+-3C2179?logo=pyg&logoColor=#3C2179)](https://pytorch-geometric.readthedocs.io/en/latest/install/installation.html)
[![arxiv](http://img.shields.io/badge/arxiv-2405.20445-blue.svg)](http://arxiv.org/abs/2405.20445)
[![hydra](https://img.shields.io/badge/Config-Hydra_1.3-89b8cd)](https://hydra.cc/)
![license](https://img.shields.io/badge/License-MIT-green.svg?labelColor=gray)

</div>

Original PyTorch implementation of [GraphAny].

Authored by [Jianan Zhao], [Zhaocheng Zhu],
[Mikhail Galkin], [Hesham Mostafa], [Michael Bronstein],
and [Jian Tang].

[Jianan Zhao]: https://andyjzhao.github.io/
[Zhaocheng Zhu]: https://kiddozhu.github.io
[Mikhail Galkin]: https://migalkin.github.io/
[Hesham Mostafa]: https://www.linkedin.com/in/hesham-mostafa-79ba93237
[Michael Bronstein]: https://www.cs.ox.ac.uk/people/michael.bronstein/
[Jian Tang]: https://jian-tang.com/
[GraphAny]: https://openreview.net/pdf?id=1Qpt43cqhg

## Overview ##

![Fully-Inductive Model on Node Classification](assets/fully_ind_node_cla.png)

GraphAny is a fully-inductive model for node classification. A single trained GraphAny
model performs node classification on any graph with any feature and label
space. Performance-wise, averaged over 30+ graphs, a single trained GraphAny model **_in inference mode_**
outperforms many transductive (supervised) models (e.g., MLP, GCN, and GAT)
trained specifically for each graph. Following the pretrain-inference paradigm of
foundation models, you can train from scratch and run inference on all datasets
as shown in [Training GraphAny from Scratch](#training-graphany-from-scratch).

This repository is based on PyTorch 2.1, PyTorch Lightning 2.2, PyG 2.4, DGL 2.1, and Hydra 1.3.

## Environment Setup ##

Our experiments are designed to run on both GPU and CPU platforms.
A GPU with 16 GB of memory is sufficient to handle all 31 datasets, and we have
also tested the setup on a single CPU (specifically, an M1 MacBook).

To configure your environment, use the following commands based on your setup:

```bash
# For setups with a GPU (requires CUDA 11.8):
conda env create -f environment.yaml
# For setups using a CPU (tested on macOS with an M1 chip):
conda env create -f environment_cpu.yaml
```

## File Structure ##

```
├── README.md
├── checkpoints
├── configs
│   ├── data.yaml
│   ├── main.yaml
│   └── model.yaml
├── environment.yaml
├── environment_cpu.yaml
└── graphany
    ├── __init__.py
    ├── data.py
    ├── model.py
    ├── run.py
    └── utils
```

## Reproduce Our Results ##

### Training GraphAny from Scratch ###

This section details how to train GraphAny on one dataset (Cora, Wisconsin,
Arxiv, or Product) and evaluate it on all 31 datasets. You can reproduce our
results with the commands below; the corresponding checkpoints are saved in the
`checkpoints/` folder.

```bash
cd path/to/this/repo
# Reproduce GraphAny-Cora: test_acc = 66.98 for seed 0
python graphany/run.py dataset=CoraXAll total_steps=500 n_hidden=64 n_mlp_layer=1 entropy=2 n_per_label_examples=5
# Reproduce GraphAny-Wisconsin: test_acc = 67.36 for seed 0
python graphany/run.py dataset=WisXAll total_steps=1000 n_hidden=32 n_mlp_layer=2 entropy=1 n_per_label_examples=5
# Reproduce GraphAny-Arxiv: test_acc = 67.58 for seed 0
python graphany/run.py dataset=ArxivXAll total_steps=1000 n_hidden=128 n_mlp_layer=2 entropy=1 n_per_label_examples=3
# Reproduce GraphAny-Product: test_acc = 67.77 for seed 0
python graphany/run.py dataset=ProdXAll total_steps=1000 n_hidden=128 n_mlp_layer=2 entropy=1 n_per_label_examples=3
```

### Inference Using Pre-trained Checkpoints ###

Once trained, GraphAny can perform inference on any graph.
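Under the hood, loading a checkpoint follows the standard PyTorch state-dict pattern. The snippet below is a generic sketch with a stand-in model, not GraphAny's actual architecture (which lives in `graphany/model.py`); the supported workflow is the `prev_ckpt` override shown in Step 3 below.

```python
import torch
import torch.nn as nn

# Stand-in model: NOT GraphAny's architecture, purely illustrative.
model = nn.Sequential(nn.Linear(5, 64), nn.ReLU(), nn.Linear(64, 5))

# A checkpoint is essentially a mapping from parameter names to tensors.
torch.save(model.state_dict(), "demo_ckpt.pt")

# Loading for inference: restore weights, switch to eval mode,
# and disable gradient tracking.
model.load_state_dict(torch.load("demo_ckpt.pt", map_location="cpu"))
model.eval()
with torch.no_grad():
    out = model(torch.randn(3, 5))  # batch of 3 dummy inputs
```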
You can use our trained checkpoints to run inference on your own graph easily. Here, we
showcase an example of loading a GraphAny model trained on Arxiv and performing
inference on Cora and Citeseer.

**Step 1**: Define your custom combined dataset config in `configs/data.yaml`:

```yaml
# configs/data.yaml
_dataset_lookup:
  # Train on Arxiv, inference on Cora and Citeseer
  CoraCiteInference:
    train: [ Arxiv ]
    eval: [ Cora, Citeseer ]
```

**Step 2** _(optional)_: Define your dataset processing logic in `graphany/data.py`.
This step is necessary only if you are not using our pre-processed data. If you
use our provided datasets, skip this step and proceed directly to Step 3.

**Step 3**: Run inference with the pre-trained model:

```bash
python graphany/run.py prev_ckpt=checkpoints/graph_any_arxiv.pt total_steps=0 dataset=CoraCiteInference
# ind/cora_test_acc 79.4 ind/cite_test_acc 68.4
```

<details>
<summary>Example Output Log</summary>
<pre><code># Training Logs
CRITICAL {
'ind/cora_val_acc': 75.4,
'ind/cite_val_acc': 70.4,
'val_acc': 72.9,
'trans_val_acc': nan,  # Not applicable, as Arxiv is not included in the evaluation set
'ind_val_acc': 72.9,
'heldout_val_acc': 70.4,
'ind/cora_test_acc': 79.4,
'ind/cite_test_acc': 68.4,
'test_acc': 73.9,
'trans_test_acc': nan,
'ind_test_acc': 73.9,
'heldout_test_acc': 68.4
}
INFO Finished main at 06-01 05:07:49, running time = 2.52s.
</code></pre>

Note: The `trans_test_acc` field is not applicable since Arxiv is not among the evaluation datasets.
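To make the grouping concrete, the aggregate numbers in this log can be reproduced from the per-dataset accuracies as in the sketch below. This is illustrative bookkeeping only, not the repository's implementation; the dataset names and the `nan`-for-empty-group convention are assumptions based on the log above.

```python
from statistics import mean

# Per-dataset test accuracies from the run above (train: Arxiv; eval: Cora, Citeseer).
test_acc = {"cora": 79.4, "cite": 68.4}
train_datasets = {"arxiv"}
trans_datasets = {"arxiv", "product", "cora", "wisconsin"}  # default _trans_datasets

def group_mean(names):
    # Average over the datasets in `names` that were actually evaluated.
    vals = [test_acc[n] for n in names if n in test_acc]
    return round(mean(vals), 1) if vals else float("nan")

overall = group_mean(test_acc)                                        # test_acc: 73.9
trans   = group_mean(train_datasets)                                  # trans_test_acc: nan
ind     = group_mean(n for n in test_acc if n not in train_datasets)  # ind_test_acc: 73.9
heldout = group_mean(n for n in test_acc if n not in trans_datasets)  # heldout_test_acc: 68.4
```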
Additionally, the heldout accuracies are calculated by excluding the datasets
specified as transductive in `configs/data.yaml` (default setting:
`_trans_datasets: [Arxiv, Product, Cora, Wisconsin]`). To use the heldout
metrics correctly, adjust the transductive datasets in your configuration to
reflect your specific inductive split settings.
</details>

## Configuration Details ##
We use [Hydra](https://hydra.cc/docs/intro/) to manage the configuration. The
configs are organized in three files under the `configs/` directory:

### `main.yaml` ###
Settings for experiments, including random seed, wandb, path, hydra, and
logging configs.

### `data.yaml` ###
This file contains settings for datasets, including preprocessing specifications,
metadata, and lookup configurations. Here is an overview of the key elements:

<details>

#### Dataset Preprocessing Options ####
- `preprocess_device: gpu` — Specifies the device for computing the propagated features $\boldsymbol{F}$. Set to `cpu` if your GPU memory is below 32 GB.
- `add_self_loop: false` — Specifies whether to add self-loops to the nodes in the
  graph.
- `to_bidirected: true` — If set to true, edges are made bidirectional.
- `n_hops: 2` — Defines the maximum number of hops of message passing.
  In our experiments, besides Linear, we use LinearSGC1, LinearSGC2, LinearHGC1,
  and LinearHGC2, which aggregate information from up to 2 hops of message passing.

#### Train and Evaluation Dataset Lookup ####
- The datasets for training and evaluation are selected dynamically from the
  command-line arguments by looking them up in the `_dataset_lookup` configuration.
- Example: Using `dataset=CoraXAll` sets `train_datasets` to `[Cora]` and
  `eval_datasets` to all datasets (31 in total).

```yaml
train_datasets: ${oc.select:_dataset_lookup.${dataset}.train,${dataset}}
eval_datasets: ${oc.select:_dataset_lookup.${dataset}.eval,${dataset}}
_dataset_lookup:
  CoraXAll:
    train: [Cora]
    eval: ${_all_datasets}
```

Please define your own dataset combinations in `_dataset_lookup` if desired.

#### Detailed Dataset Configurations ####
The dataset metadata stores, for each dataset, the loading interface ([DGL],
[PyG], [OGB], or [Heterophilous]) and its alias (e.g., `Planetoid.Cora`). The
statistics are provided in a comment with the format 'n_nodes, n_edges,
n_feat_dim, n_labels'. For example:

[DGL]: https://docs.dgl.ai/en/2.0.x/api/python/dgl.data.html#node-prediction-datasets
[PyG]: https://pytorch-geometric.readthedocs.io/en/latest/modules/datasets.html
[OGB]: https://ogb.stanford.edu/docs/nodeprop/
[Heterophilous]: https://arxiv.org/abs/2302.11640

```yaml
_ds_meta_data:
  Arxiv: ogb, ogbn-arxiv # 168,343 1,166,243 100 40
  Cora: pyg, Planetoid.Cora # 2,708 10,556 1,433 7
```
</details>

### `model.yaml` ###
This file contains the settings for models and training.

<details>

GraphAny leverages **_interactions between predictions_** as input features for an
MLP that computes inductive attention scores. These inputs are termed "**_feature
channels_**" and are defined in the configuration file as `feat_chn`.
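For intuition, the five channels amount to applying fixed graph filters to the node features: the identity (Linear), one or two steps of low-pass propagation with a normalized adjacency (LinearSGC1/2), and one or two steps of its high-pass complement (LinearHGC1/2). The dense toy sketch below illustrates this under those assumptions; it is not the repository's preprocessing code, which handles normalization, self-loops, and sparse operations in `graphany/data.py`.

```python
import numpy as np

def sym_norm_adj(A):
    """Symmetrically normalized adjacency D^{-1/2} A D^{-1/2}."""
    d = A.sum(axis=1)
    d_inv_sqrt = np.zeros_like(d)
    nz = d > 0
    d_inv_sqrt[nz] = d[nz] ** -0.5
    return d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]

# Toy graph: a 3-node path with 2-dimensional node features.
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
X = np.arange(6, dtype=float).reshape(3, 2)

A_hat = sym_norm_adj(A)
I = np.eye(3)

channels = {
    "X":  X,                               # Linear: raw features
    "L1": A_hat @ X,                       # LinearSGC1: 1-hop low-pass
    "L2": A_hat @ (A_hat @ X),             # LinearSGC2: 2-hop low-pass
    "H1": (I - A_hat) @ X,                 # LinearHGC1: 1-hop high-pass
    "H2": (I - A_hat) @ ((I - A_hat) @ X), # LinearHGC2: 2-hop high-pass
}
```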
Subsequently, the outputs of the LinearGNNs, referred to as "**_prediction
channels_**", are combined using the inductive attention scores and are defined
as `pred_chn` in the configuration file. The default settings are:

```yaml
feat_chn: X+L1+L2+H1+H2 # X=Linear, L1=LinearSGC1, L2=LinearSGC2, H1=LinearHGC1, H2=LinearHGC2
pred_chn: X+L1+L2 # H1 and H2 channels are masked to speed up convergence.
```

Note that the feature channels and prediction channels need not be identical.
Empirically, masking LinearHGC1 and LinearHGC2 leads to faster convergence and
marginally better results (see Table 2, Figure 1, and Figure 5). For the
attention visualizations in Figure 6, all five channels
(`pred_chn=X+L1+L2+H1+H2`) are employed, demonstrating GraphAny's capability to
learn inductive attention that identifies the critical channels for unseen graphs.

Other model parameters and default values:
```yaml
# The entropy used to normalize the distance features (conditional Gaussian distribution).
# The standard deviation of the conditional Gaussian is determined dynamically via binary search; defaults to 1.
entropy: 1
attn_temp: 5 # The temperature for attention normalization
n_hidden: 128 # The hidden dimension of the MLP
n_mlp_layer: 2 # The number of MLP layers
```
</details>


## Bring Your Own Dataset ##

<details>
<summary>
We support three major sources of graph dataset interfaces:
<a href="https://docs.dgl.ai/en/2.0.x/api/python/dgl.data.html#node-prediction-datasets">DGL</a>,
<a href="https://pytorch-geometric.readthedocs.io/en/latest/modules/datasets.html">PyG</a>, and
<a href="https://ogb.stanford.edu/docs/nodeprop/">OGB</a>.
If you are interested in adding your own dataset, here is how we integrated the cleaned
Texas dataset processed by <a href="https://arxiv.org/abs/2302.11640">this paper</a>.
<i>The original Texas dataset contains 5 classes, one of which has only a single node,
which makes using that class for training and evaluation meaningless.</i>
</summary>

In the example below, we demonstrate how to add a dataset called "Texas" with 4
classes from a new data source termed `heterophilous`.

**Step 1**: Update `configs/data.yaml`:

First, define your dataset's metadata.

```yaml
# configs/data.yaml
_ds_meta_data: # key: dataset name, value: data_source, alias
  Texas: heterophilous, texas_4_classes
```

The `data_source` is set to 'heterophilous', which is handled differently from
the other sources ('pyg', 'dgl', 'ogb').

Additionally, update `_dataset_lookup` with a new setting:

```yaml
# configs/data.yaml
_dataset_lookup:
  Debug:
    train: [ Wisconsin ]
    eval: [ Texas ]
```

**Step 2**: Implement the dataset interface:

Implement `load_heterophilous_dataset` in `data.py` to download and process the dataset.

```python
import numpy as np
import torch
import dgl

from graphany.data import download_url


def load_heterophilous_dataset(url, raw_dir):
    # Download the .npz archive and convert it to DGL graph format.
    download_path = download_url(url, raw_dir)
    data = np.load(download_path)
    node_features = torch.tensor(data['node_features'])
    labels = torch.tensor(data['node_labels'])
    edges = torch.tensor(data['edges'])

    graph = dgl.graph((edges[:, 0], edges[:, 1]),
                      num_nodes=len(node_features), idtype=torch.int32)
    num_classes = len(labels.unique())
    train_mask = torch.tensor(data['train_mask'])
    val_mask = torch.tensor(data['val_mask'])
    test_mask = torch.tensor(data['test_mask'])

    return graph, labels, num_classes, node_features, train_mask, val_mask, test_mask
```

**Step 3**: Update the `GraphDataset` class in `data.py`:

Modify the initialization and dataset-loading functions:

```python
# In GraphDataset.__init__():
if self.data_source in ['dgl', 'pyg', 'ogb']:
    pass  # Code for other data sources omitted for brevity
elif self.data_source == 'heterophilous':
    target = '.data.load_heterophilous_dataset'
    url = f'https://example.com/data/{ds_alias}.npz'
    ds_init_args = {
        '_target_': target,
        'raw_dir': f'{cfg.dirs.data_storage}{self.data_source}/',
        'url': url,
    }
else:
    raise NotImplementedError(f'Unsupported data source: {self.data_source}')

# In GraphDataset.load_dataset():
from hydra.utils import instantiate

def load_dataset(self, data_init_args):
    dataset = instantiate(data_init_args)
    if self.data_source in ['dgl', 'pyg', 'ogb']:
        pass  # Code for other data sources omitted for brevity
    elif self.data_source == 'heterophilous':
        g, label, num_class, feat, train_mask, val_mask, test_mask = dataset
    # Rest of the code omitted for brevity
```

You can now run the code with the following commands:

```bash
# Training from scratch
python graphany/run.py dataset=Debug total_steps=500
# Inference using an existing checkpoint
python graphany/run.py prev_ckpt=checkpoints/graph_any_wisconsin.pt dataset=Debug total_steps=0
```
</details>

## Using Wandb for Enhanced Visualization ##

We recommend using [Weights & Biases](https://wandb.ai/) (wandb) for advanced
visualization capabilities. As an example, consider the visualizations for the
GraphAny-Arxiv project shown below, which illustrate the validation accuracy
across different dataset categories:
- **Transductive**: the training dataset (i.e., Arxiv)
- **Heldout**: 27 datasets (all except Cora, Wisconsin, Arxiv, and Product)
- **Inductive**: 30 datasets (all except Arxiv)
- **Overall**: all 31 datasets

![wandb_training_curve](assets/wandb_training_curve.png)

By default, wandb integration is disabled. To enable and configure wandb, add
the following overrides to your run command, substituting `YourOwnWandbEntity`
with your actual Weights & Biases entity name:

```bash
python graphany/run.py use_wandb=true wandb_proj=GraphAny wandb_entity=YourOwnWandbEntity
```

This setup allows you to track and visualize metrics dynamically.

## Citation ##
If you find this codebase useful in your research, please cite the paper.

```bibtex
@inproceedings{zhao2025graphany,
  title     = {Fully-inductive Node Classification on Arbitrary Graphs},
  author    = {Jianan Zhao and Zhaocheng Zhu and Mikhail Galkin and Hesham Mostafa and Michael Bronstein and Jian Tang},
  booktitle = {International Conference on Learning Representations},
  year      = {2025}
}
```