{"id":17717987,"url":"https://github.com/chaitjo/gated-graph-transformers","last_synced_at":"2025-07-16T06:39:08.290Z","repository":{"id":111513440,"uuid":"321279673","full_name":"chaitjo/gated-graph-transformers","owner":"chaitjo","description":"Transformers are Graph Neural Networks!","archived":false,"fork":false,"pushed_at":"2020-12-27T13:57:45.000Z","size":2295,"stargazers_count":54,"open_issues_count":0,"forks_count":7,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-05-07T14:52:59.679Z","etag":null,"topics":["deep-learning","graph-neural-networks","graph-representation-learning","pytorch","transformers"],"latest_commit_sha":null,"homepage":"https://thegradient.pub/transformers-are-graph-neural-networks/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/chaitjo.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-12-14T08:21:40.000Z","updated_at":"2025-03-19T15:40:51.000Z","dependencies_parsed_at":"2023-03-13T13:39:41.239Z","dependency_job_id":null,"html_url":"https://github.com/chaitjo/gated-graph-transformers","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/chaitjo/gated-graph-transformers","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chaitjo%2Fgated-graph-transformers","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chaitjo%2Fgated-graph-transformers/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chaitjo%2Fgated-graph-transformer
s/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chaitjo%2Fgated-graph-transformers/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/chaitjo","download_url":"https://codeload.github.com/chaitjo/gated-graph-transformers/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chaitjo%2Fgated-graph-transformers/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":265488905,"owners_count":23775204,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["deep-learning","graph-neural-networks","graph-representation-learning","pytorch","transformers"],"created_at":"2024-10-25T14:33:30.232Z","updated_at":"2025-07-16T06:39:08.281Z","avatar_url":"https://github.com/chaitjo.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# :rocket: Gated Graph Transformers\n\n\u003e**Gated Graph Transformers** for graph-level property prediction, *i.e.* graph classification and regression.\n\u003e\n\u003eAssociated article: [*Transformers are Graph Neural Networks*](https://thegradient.pub/transformers-are-graph-neural-networks/), by [Chaitanya K. Joshi](http://www.chaitjo.com/), published with [*The Gradient*](https://thegradient.pub/).\n\nThis repository is a continuously updated personal project to build intuitions about and track progress in **Graph Representation Learning** research. 
\nI aim to develop the most universal and powerful model which unifies state-of-the-art architectures from **Graph Neural Networks** and **Transformers**, without incorporating domain-specific tricks.\n\n![Gated Graph Transformer](gated-graph-transformer.png)\n\n## Key Architectural Ideas\n\n### :robot: **Deep, Residual Transformer Backbone** \n- As the backbone architecture, I borrow the [two-sub-layered, pre-normalization variant](https://arxiv.org/abs/2002.04745) of Transformer encoders that has emerged as the standard in the NLP community, e.g. [GPT-3](https://arxiv.org/abs/2005.14165). Each Transformer block consists of a **message-passing sub-layer** followed by a **node-wise feedforward sub-layer**. The graph convolution is described later. \n- The feedforward sub-layer projects node embeddings to an *absurdly* large dimension, passes them through a non-linear activation function, applies dropout, and reduces back to the original embedding dimension.\n- The Transformer backbone enables training very **deep** and extremely **overparameterized** models. Overparameterization is [important for performance in NLP](https://arxiv.org/abs/1910.10683) and other combinatorially large domains, but was previously not possible for GNNs trained on small graph classification datasets. Coupled with unique node positional encodings (described later) and the feedforward sub-layer, overparameterization ensures that our GNN is **Turing Universal** (based on A. 
Loukas's recent insightful work, including [this paper](https://arxiv.org/abs/1907.03199)).\n\n---\n  \n### :envelope: **Anisotropic Graph Convolutions**\n  \n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"anisotropic-graphconv.PNG\"\u003e\u003cbr\u003e\n  \u003ci\u003eSource: \u003ca href=\"https://openaccess.thecvf.com/content_cvpr_2018/html/Wang_Deep_Parametric_Continuous_CVPR_2018_paper.html\"\u003e'Deep Parametric Continuous Convolutional Neural Networks'\u003c/a\u003e, Wang et al., 2018\u003c/i\u003e\n\u003c/p\u003e\n\n- As the graph convolution layer, I use the [**Gated Graph Convolution**](https://arxiv.org/abs/1711.07553) with a **dense attention mechanism**, which we found to be the best performing graph convolution in [Benchmarking GNNs](https://arxiv.org/abs/2003.00982). Intuitively, Gated GraphConv [generalizes directional CNN filters](https://arxiv.org/abs/1905.01289) for 2D images to arbitrary graphs by learning a **weighted aggregation over the local neighbors** of each node. It upgrades the node-to-node attention mechanism from [GATs](https://arxiv.org/abs/1710.10903) and [MoNet](https://arxiv.org/abs/1611.08402) (i.e. one attention weight per node pair) to consider dense feature-to-feature attention (i.e. *d* attention weights for pairs of *d*-dimensional node embeddings).\n- Another intuitive motivation for the Gated GraphConv is as a **learnable directional diffusion process** over the graph, or as a **coupled PDE over node and edge features** in the graph. Gated GraphConv makes the diffusion process/neighborhood aggregation anisotropic or directional, **countering [oversmoothing/oversquashing](https://arxiv.org/abs/2006.05205)** of features and enabling deeper models.\n- This graph convolution was originally proposed as [a sentence encoder for NLP](https://arxiv.org/abs/1703.04826) and further developed at NTU for [molecule generation](https://arxiv.org/abs/1906.03412) and [combinatorial optimization](https://arxiv.org/abs/1906.01227). 
Evidently, I am partial to this idea. At the same time, it is worth noting that anisotropic local aggregations and generalizations of directed CNN filters have demonstrated strong performance across a myriad of applications, including [**3D point clouds**](https://arxiv.org/abs/1904.07601), [drug discovery](https://pubs.acs.org/doi/abs/10.1021/acs.jcim.9b00237), [**material science**](https://openreview.net/forum?id=K3qa-sMHpQX), and [programming languages](https://arxiv.org/abs/1906.12192).\n\n---\n  \n### :arrows_counterclockwise: **Graph Positional Encodings** \n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"laplacian-eigenvectors.PNG\"\u003e\u003cbr\u003e\n  \u003ci\u003eSource: \u003ca href=\"https://arxiv.org/abs/1611.08097\"\u003e'Geometric Deep Learning: Going beyond Euclidean Data'\u003c/a\u003e, Bronstein et al., 2017\u003c/i\u003e\n\u003c/p\u003e\n\n- I use the top-*k* non-trivial **Laplacian Eigenvectors** as unique node identifiers to inject structural/positional priors into the Transformer backbone. Laplacian Eigenvectors are a generalization of sinusoidal positional encodings from the original Transformers, and were concurrently proposed in the Benchmarking GNNs, [EigenGNNs](https://arxiv.org/abs/2006.04330), and [GCC](https://arxiv.org/abs/2006.09963) papers.\n- Randomly flipping the sign of Laplacian Eigenvectors during training (due to symmetry) can be seen as an additional **data augmentation** or **regularization technique**, helping delay overfitting to training patterns. 
Going further, the [Directional Graph Networks](https://arxiv.org/abs/2010.02863) paper presents a more principled approach for using Laplacian Eigenvectors.\n\n---\n\nSome ideas still in the pipeline include:\n\n- **Graph-specific Normalization** - Originally motivated in Benchmarking GNNs as 'graph size normalization', there have been several subsequent graph-specific normalization techniques such as [GraphNorm](https://arxiv.org/abs/2009.03294) and [MessageNorm](https://arxiv.org/abs/2006.07739), aiming to replace or augment standard Batch Normalization. Intuitively, there is room for improvement as BatchNorm flattens mini-batches of graphs instead of accounting for the underlying graph structure.\n\n- **Theoretically Expressive Aggregation** - There are several exciting ideas aiming to bridge the gap between theoretical expressive power, computational feasibility, and generalization capacity for GNNs: [PNA-style](https://arxiv.org/abs/2004.05718) multi-head aggregation and scaling, generalized aggregators from [DeeperGCNs](https://arxiv.org/abs/2006.07739), pre-computing structural motifs as in [GSN](https://arxiv.org/abs/2006.09252), etc.\n\n- **Virtual Node and Low Rank Global Attention** - After the message-passing step, the [virtual node trick](https://arxiv.org/abs/1905.12265) adds messages to and from a virtual/super node connected to all graph nodes. [LRGA](https://arxiv.org/abs/2006.07846) comes with additional theoretical motivations but does something similar. 
Intuitively, these techniques enable modelling long range or latent interactions in graphs and counter the oversquashing problem with deeper networks.\n\n- **General Purpose Pre-training** - It isn't truly a Transformer unless it's pre-trained on hundreds of GPUs for thousands of hours...but general purpose pre-training for graph representation learning remains an open question!\n\n## Installation and Usage\n```bash\n# Create new Anaconda environment\nconda create -n new-env python=3.7\nconda activate new-env\n# Install PyTorch 1.6 for CUDA 10.x\nconda install pytorch=1.6 cudatoolkit=10.x -c pytorch\n# Install DGL for CUDA 10.x\nconda install -c dglteam dgl-cuda10.x\n# Install other dependencies\nconda install tqdm scikit-learn pandas urllib3 tensorboard\npip install -U ogb\n\n# Train GNNs on ogbg-mol* datasets\npython main_mol.py --dataset [ogbg-molhiv/ogbg-molpcba] --gnn [gated-gcn/gcn/mlp]\n\n# Prepare submission for OGB leaderboards\nbash scripts/ogbg-mol*.sh\n\n# Collate results for submission\npython submit.py --dataset [ogbg-molhiv/ogbg-molpcba] --expt [path-to-logs]\n```\n\nNote: The code was tested on Ubuntu 16.04, using Python 3.6, PyTorch 1.6 and CUDA 10.1.\n\n## Citation\n```\n@article{joshi2020transformers,\n  author = {Joshi, Chaitanya K},\n  title = {Transformers are Graph Neural Networks},\n  journal = {The Gradient},\n  year = {2020},\n  howpublished = {\\url{https://thegradient.pub/transformers-are-graph-neural-networks/ } },\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fchaitjo%2Fgated-graph-transformers","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fchaitjo%2Fgated-graph-transformers","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fchaitjo%2Fgated-graph-transformers/lists"}