{"id":20216081,"url":"https://github.com/thudm/gcc","last_synced_at":"2025-04-06T07:14:45.333Z","repository":{"id":37646966,"uuid":"272621068","full_name":"THUDM/GCC","owner":"THUDM","description":"GCC: Graph Contrastive Coding for Graph Neural Network Pre-Training @ KDD 2020","archived":false,"fork":false,"pushed_at":"2023-07-06T21:59:36.000Z","size":577,"stargazers_count":326,"open_issues_count":12,"forks_count":54,"subscribers_count":12,"default_branch":"master","last_synced_at":"2025-03-30T06:07:23.398Z","etag":null,"topics":["contrastive-learning","graph-neural-networks","pre-training"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/THUDM.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2020-06-16T05:51:15.000Z","updated_at":"2025-03-11T05:42:30.000Z","dependencies_parsed_at":"2022-07-12T16:42:15.505Z","dependency_job_id":"6ef6c977-4cc9-4aa0-8cff-1532fb1276e9","html_url":"https://github.com/THUDM/GCC","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/THUDM%2FGCC","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/THUDM%2FGCC/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/THUDM%2FGCC/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/THUDM%2FGCC/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/THUDM","download_url":"https://codeload.github.com/THUDM/GCC/tar.gz/refs/heads/master","host":{"name":"GitHub","ur
l":"https://github.com","kind":"github","repositories_count":247445681,"owners_count":20939961,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["contrastive-learning","graph-neural-networks","pre-training"],"created_at":"2024-11-14T06:26:14.922Z","updated_at":"2025-04-06T07:14:45.316Z","avatar_url":"https://github.com/THUDM.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cp align=\"center\"\u003e\n  \u003cimg src=\"fig.png\" width=\"500\"\u003e\n  \u003cbr /\u003e\n  \u003cbr /\u003e\n  \u003ca href=\"https://github.com/THUDM/GCC/blob/master/LICENSE\"\u003e\u003cimg alt=\"License\" src=\"https://img.shields.io/github/license/THUDM/GCC\" /\u003e\u003c/a\u003e\n  \u003ca href=\"https://github.com/ambv/black\"\u003e\u003cimg alt=\"Code Style\" src=\"https://img.shields.io/badge/code%20style-black-000000.svg\" /\u003e\u003c/a\u003e\n\u003c/p\u003e\n\n-------------------------------------\n\n# GCC: Graph Contrastive Coding for Graph Neural Network Pre-Training\n\nOriginal implementation for paper [GCC: Graph Contrastive Coding for Graph Neural Network Pre-Training](https://arxiv.org/abs/2006.09963).\n\nGCC is a **contrastive learning** framework that implements unsupervised structural graph representation pre-training and achieves state-of-the-art on 10 datasets on 3 graph mining tasks.\n\n- [GCC: Graph Contrastive Coding for Graph Neural Network Pre-Training](#gcc-graph-contrastive-coding-for-graph-neural-network-pre-training)\n  - [Installation](#installation)\n    - [Requirements](#requirements)\n  - [Quick Start](#quick-start)\n  
  - [Pretraining](#pretraining)\n      - [Pre-training datasets](#pre-training-datasets)\n      - [E2E](#e2e)\n      - [MoCo](#moco)\n      - [Download Pretrained Models](#download-pretrained-models)\n    - [Downstream Tasks](#downstream-tasks)\n      - [Downstream datasets](#downstream-datasets)\n      - [Node Classification](#node-classification)\n        - [Unsupervised (Table 2 freeze)](#unsupervised-table-2-freeze)\n        - [Supervised (Table 2 full)](#supervised-table-2-full)\n      - [Graph Classification](#graph-classification)\n        - [Unsupervised (Table 3 freeze)](#unsupervised-table-3-freeze)\n        - [Supervised (Table 3 full)](#supervised-table-3-full)\n      - [Similarity Search (Table 4)](#similarity-search-table-4)\n  - [❗ Common Issues](#-common-issues)\n  - [Citing GCC](#citing-gcc)\n  - [Acknowledgements](#acknowledgements)\n\n## Installation\n\n### Requirements\n\n- Linux with Python ≥ 3.6\n- [PyTorch ≥ 1.4.0](https://pytorch.org/)\n- [0.5 \u003e DGL ≥ 0.4.3](https://www.dgl.ai/pages/start.html)\n- `pip install -r requirements.txt`\n- Install [RDKit](https://www.rdkit.org/docs/Install.html) with `conda install -c conda-forge rdkit=2019.09.2`.\n\n## Quick Start\n\n\u003c!--\n## How to process data\n\n```\npython x2dgl.py --graph-dir data_bin/kdd17 --save-file data_bin/dgl/graphs.bin\n```\n--\u003e\n\n### Pretraining\n\n#### Pre-training datasets\n\n```bash\npython scripts/download.py --url https://drive.google.com/open?id=1JCHm39rf7HAJSp-1755wa32ToHCn2Twz --path data --fname small.bin\n# For regions where Google is not accessible, use\n# python scripts/download.py --url https://cloud.tsinghua.edu.cn/f/b37eed70207c468ba367/?dl=1 --path data --fname small.bin\n```\n\n#### E2E\n\nPretrain E2E with `K = 255`:\n\n```bash\nbash scripts/pretrain.sh \u003cgpu\u003e --batch-size 256\n```\n\n#### MoCo\n\nPretrain MoCo with `K = 16384; m = 0.999`:\n\n```bash\nbash scripts/pretrain.sh \u003cgpu\u003e --moco --nce-k 16384\n```\n\n#### Download 
Pretrained Models\n\nInstead of pretraining from scratch, you can download our pretrained models.\n\n```bash\npython scripts/download.py --url https://drive.google.com/open?id=1lYW_idy9PwSdPEC7j9IH5I5Hc7Qv-22- --path saved --fname pretrained.tar.gz\n# For regions where Google is not accessible, use\n# python scripts/download.py --url https://cloud.tsinghua.edu.cn/f/cabec37002a9446d9b20/?dl=1 --path saved --fname pretrained.tar.gz\n```\n\n### Downstream Tasks\n\n#### Downstream datasets\n\n```bash\npython scripts/download.py --url https://drive.google.com/open?id=12kmPV3XjVufxbIVNx5BQr-CFM9SmaFvM --path data --fname downstream.tar.gz\n# For regions where Google is not accessible, use\n# python scripts/download.py --url https://cloud.tsinghua.edu.cn/f/2535437e896c4b73b6bb/?dl=1 --path data --fname downstream.tar.gz\n```\n\nGenerate embeddings on multiple datasets with\n\n```bash\nbash scripts/generate.sh \u003cgpu\u003e \u003cload_path\u003e \u003cdataset_1\u003e \u003cdataset_2\u003e ...\n```\n\nFor example:\n\n```bash\nbash scripts/generate.sh 0 saved/Pretrain_moco_True_dgl_gin_layer_5_lr_0.005_decay_1e-05_bsz_32_hid_64_samples_2000_nce_t_0.07_nce_k_16384_rw_hops_256_restart_prob_0.8_aug_1st_ft_False_deg_16_pos_32_momentum_0.999/current.pth usa_airport kdd imdb-binary\n```\n\n#### Node Classification\n\n##### Unsupervised (Table 2 freeze)\n\nRun baselines on multiple datasets with `bash scripts/node_classification/baseline.sh \u003chidden_size\u003e \u003cbaseline:prone/graphwave\u003e usa_airport h-index`.\n\nEvaluate GCC on multiple datasets:\n\n```bash\nbash scripts/generate.sh \u003cgpu\u003e \u003cload_path\u003e usa_airport h-index\nbash scripts/node_classification/ours.sh \u003cload_path\u003e \u003chidden_size\u003e usa_airport h-index\n```\n\n##### Supervised (Table 2 full)\n\nFinetune GCC on multiple datasets:\n\n```bash\nbash scripts/finetune.sh \u003cload_path\u003e \u003cgpu\u003e usa_airport\n```\n\nNote this finetunes the whole network and will take 
much longer than the frozen experiments above.\n\n#### Graph Classification\n\n##### Unsupervised (Table 3 freeze)\n\n```bash\nbash scripts/generate.sh \u003cgpu\u003e \u003cload_path\u003e imdb-binary imdb-multi collab rdt-b rdt-5k\nbash scripts/graph_classification/ours.sh \u003cload_path\u003e \u003chidden_size\u003e imdb-binary imdb-multi collab rdt-b rdt-5k\n```\n\n##### Supervised (Table 3 full)\n\n```bash\nbash scripts/finetune.sh \u003cload_path\u003e \u003cgpu\u003e imdb-binary\n```\n\n#### Similarity Search (Table 4)\n\nRun baseline (graphwave) on multiple datasets with `bash scripts/similarity_search/baseline.sh \u003chidden_size\u003e graphwave kdd_icdm sigir_cikm sigmod_icde`.\n\nRun GCC:\n\n```bash\nbash scripts/generate.sh \u003cgpu\u003e \u003cload_path\u003e kdd icdm sigir cikm sigmod icde\nbash scripts/similarity_search/ours.sh \u003cload_path\u003e \u003chidden_size\u003e kdd_icdm sigir_cikm sigmod_icde\n```\n\n## ❗ Common Issues\n\n\u003cdetails\u003e\n\u003csummary\u003e\n\"XXX file not found\" when running pretraining/downstream tasks.\n\u003c/summary\u003e\n\u003cbr/\u003e\nPlease make sure you've downloaded the pretraining dataset or downstream task datasets according to GETTING_STARTED.md.\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003e\nServer crashes/hangs after launching pretraining experiments.\n\u003c/summary\u003e\n\u003cbr/\u003e\nIn addition to GPU, our pretraining stage requires substantial compute resources, including CPU and RAM. If this happens, it usually means the CPU/RAM is exhausted on your machine. You can decrease `--num-workers` (number of dataloaders using CPU) and `--num-copies` (number of dataset copies residing in RAM). For the lowest-resource profile, try `--num-workers 1 --num-copies 1`.\n\nIf this still fails, please upgrade your machine :). 
In the meantime, you can still download our pretrained model and evaluate it on downstream tasks.\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003e\nHaving difficulty installing RDKit.\n\u003c/summary\u003e\n\u003cbr/\u003e\nSee the P.S. section in [this](https://github.com/THUDM/GCC/issues/12#issue-752080014) post.\n\u003c/details\u003e\n\n## Citing GCC\n\nIf you use GCC in your research or wish to refer to the baseline results, please use the following BibTeX entry.\n\n```\n@article{qiu2020gcc,\n  title={GCC: Graph Contrastive Coding for Graph Neural Network Pre-Training},\n  author={Qiu, Jiezhong and Chen, Qibin and Dong, Yuxiao and Zhang, Jing and Yang, Hongxia and Ding, Ming and Wang, Kuansan and Tang, Jie},\n  journal={arXiv preprint arXiv:2006.09963},\n  year={2020}\n}\n```\n\n## Acknowledgements\n\nPart of this code is inspired by Yonglong Tian et al.'s [CMC: Contrastive Multiview Coding](https://github.com/HobbitLong/CMC).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fthudm%2Fgcc","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fthudm%2Fgcc","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fthudm%2Fgcc/lists"}