{"id":28700497,"url":"https://github.com/deepgraphlearning/siamdiff","last_synced_at":"2025-07-28T08:07:47.522Z","repository":{"id":179779896,"uuid":"662828627","full_name":"DeepGraphLearning/SiamDiff","owner":"DeepGraphLearning","description":"Code for Pre-training Protein Encoder via Siamese Sequence-Structure Diffusion Trajectory Prediction (https://arxiv.org/abs/2301.12068)","archived":false,"fork":false,"pushed_at":"2023-07-09T02:25:27.000Z","size":1952,"stargazers_count":40,"open_issues_count":4,"forks_count":6,"subscribers_count":4,"default_branch":"main","last_synced_at":"2025-06-14T11:08:10.148Z","etag":null,"topics":["pre-training","protein","protein-protein-interaction","protein-representation-learning"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/DeepGraphLearning.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2023-07-06T01:29:57.000Z","updated_at":"2025-04-25T06:55:06.000Z","dependencies_parsed_at":null,"dependency_job_id":"d0fa2157-c3f7-491a-92f9-79374126598c","html_url":"https://github.com/DeepGraphLearning/SiamDiff","commit_stats":null,"previous_names":["deepgraphlearning/siamdiff"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/DeepGraphLearning/SiamDiff","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DeepGraphLearning%2FSiamDiff","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DeepGraphLearning%2FSiamDiff/tags","releases_url":"https://repos.ecosyste.ms/api/v1/ho
sts/GitHub/repositories/DeepGraphLearning%2FSiamDiff/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DeepGraphLearning%2FSiamDiff/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/DeepGraphLearning","download_url":"https://codeload.github.com/DeepGraphLearning/SiamDiff/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DeepGraphLearning%2FSiamDiff/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":267482004,"owners_count":24094508,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-07-28T02:00:09.689Z","response_time":68,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["pre-training","protein","protein-protein-interaction","protein-representation-learning"],"created_at":"2025-06-14T11:08:06.667Z","updated_at":"2025-07-28T08:07:47.512Z","avatar_url":"https://github.com/DeepGraphLearning.png","language":"Python","readme":"# SiamDiff: Siamese Diffusion Trajectory Prediction\n\n\nThis is the official codebase of the paper\n\n**Pre-Training Protein Encoder via Siamese Sequence-Structure Diffusion Trajectory Prediction**\n[[ArXiv](https://arxiv.org/abs/2301.12068)]\n\n[Zuobai Zhang*](https://oxer11.github.io/), [Minghao Xu*](https://chrisallenming.github.io/), [Aurelie Lozano](https://researcher.watson.ibm.com/researcher/view.php?person=us-aclozano), [Vijil 
Chenthamarakshan](https://researcher.watson.ibm.com/researcher/view.php?person=us-ecvijil), [Payel Das](https://researcher.watson.ibm.com/researcher/view.php?person=us-daspa), [Jian Tang](https://jian-tang.com/)\n\n## Overview\n\n*Siamese Diffusion Trajectory Prediction (**SiamDiff**)* is a diffusion-based pre-training algorithm for protein structure encoders.\nThe method performs diffusion on both protein sequences and structures and learns effective representations based on mutual denoising between two siamese diffusion trajectories.\nIt achieves large improvements on a diverse set of downstream tasks, including function annotation, protein-protein interaction prediction, mutational effect prediction, residue structural role modeling, and protein structure ranking.\nAmong all existing pre-training algorithms, SiamDiff is the only one that can consistently deliver large improvements on all the tasks.\n\n![SiamDiff](./asset/SiamDiff.png)\n\nThis codebase is based on PyTorch and [TorchDrug] ([TorchProtein](https://torchprotein.ai)). \nIt supports training and inference with multiple GPUs.\nThe basic implementation of GearNet and datasets can be found in the [docs](https://torchdrug.ai/docs/) of TorchDrug and the step-by-step [tutorials](https://torchprotein.ai/tutorials) in TorchProtein.\n\n[TorchDrug]: https://github.com/DeepGraphLearning/torchdrug\n\n## Installation\n\nYou may install the dependencies via either conda or pip. 
Generally, SiamDiff works\nwith Python 3.7/3.8 and PyTorch version \u003e= 1.8.0.\n\n### From Conda\n\n```bash\nconda install torchdrug pytorch=1.8.0 cudatoolkit=11.1 -c milagraph -c pytorch-lts -c pyg -c conda-forge\nconda install easydict pyyaml -c conda-forge\npip install atom3d\n```\n\n### From Pip\n\n```bash\npip install torch==1.8.0+cu111 -f https://download.pytorch.org/whl/lts/1.8/torch_lts.html\npip install torchdrug\npip install easydict pyyaml atom3d\n```\n\n\n## Reproduction\n\n### Training From Scratch\n\nTo reproduce the results of GearNet-Edge on Atom3D and EC prediction, use the following commands. \nAll the datasets except PIP are downloaded automatically by the code.\nThe first run takes longer because the datasets must be preprocessed.\n\n```bash\n# Run GearNet-Edge on the MSP dataset with 1 gpu\npython script/run_1gpu.py -c config/atom/msp_gearnet.yaml\n\n# Run GearNet-Edge on the PSR dataset with 1 gpu\npython script/run_1gpu.py -c config/atom/psr_gearnet.yaml\n\n# First download and unzip the preprocessed PIP dataset from Atom3D,\n# then run GearNet-Edge on the dataset with 1 gpu\nwget https://zenodo.org/record/4911102/files/PPI-DIPS-split.tar.gz -P ~/scratch/protein-datasets/PPI-DIPS-split/\ntar -zxvf ~/scratch/protein-datasets/PPI-DIPS-split/PPI-DIPS-split.tar.gz -C ~/scratch/protein-datasets/PPI-DIPS-split/\npython script/run_1gpu.py -c config/atom/pip_gearnet.yaml\n\n# Since the RES dataset is large, we run GearNet-Edge with 4 gpus\npython -m torch.distributed.launch --nproc_per_node=4 script/run_4gpu.py -c config/atom/res_gearnet.yaml\n```\n\nBesides atom-level tasks, we also provide residue-level evaluation on the MSP, PSR and EC datasets. 
All these models are run with 4 gpus.\n```bash\npython -m torch.distributed.launch --nproc_per_node=4 script/run_4gpu.py -c config/res/msp_gearnet.yaml\n\npython -m torch.distributed.launch --nproc_per_node=4 script/run_4gpu.py -c config/res/psr_gearnet.yaml\n\npython -m torch.distributed.launch --nproc_per_node=4 script/run_4gpu.py -c config/res/ec_gearnet.yaml\n```\n\n### Pre-training and Fine-tuning\nBy default, we use the AlphaFold Database for pre-training.\nTo pre-train GearNet-Edge with SiamDiff, use the following commands. \nSimilarly, all the datasets are downloaded automatically and preprocessed the first time you run the code.\n\nThe pre-training is divided into two stages: a large-noise stage and a small-noise stage.\n```bash\n# The first-stage pre-training with SiamDiff\npython -m torch.distributed.launch --nproc_per_node=4 script/pretrain.py -c config/pretrain/gearnet_1st.yaml\n\n# The second-stage pre-training with SiamDiff\n# \u003cpath_to_ckpt\u003e is the path to the checkpoint from the first-stage pre-training\npython -m torch.distributed.launch --nproc_per_node=4 script/pretrain.py -c config/pretrain/gearnet_2st.yaml --ckpt \u003cpath_to_ckpt\u003e\n```\n\nAfter pre-training, you can load the model weights from the saved checkpoint via the `--ckpt` argument and then fine-tune the model on downstream tasks.\n\n```bash\n# Fine-tune the pre-trained model on the PIP dataset\n# \u003cpath_to_ckpt\u003e is the path to the checkpoint after two-stage pre-training\npython script/run_1gpu.py -c config/atom/pip_gearnet.yaml --ckpt \u003cpath_to_ckpt\u003e\n```\n\nSimilar commands can be used for residue-level pre-training.\n```bash\n# Two-stage pre-training with SiamDiff\npython -m torch.distributed.launch --nproc_per_node=4 script/pretrain.py -c config/pretrain/res_gearnet_1st.yaml\n\npython -m torch.distributed.launch --nproc_per_node=4 script/pretrain.py -c config/pretrain/res_gearnet_2st.yaml --ckpt \u003cpath_to_ckpt\u003e\n\n# Fine-tune 
the pre-trained model on the EC dataset\npython -m torch.distributed.launch --nproc_per_node=4 script/run_4gpu.py -c config/res/ec_gearnet.yaml --ckpt \u003cpath_to_ckpt\u003e\n```\n\nWe provide the two-stage pre-trained model weights below.\n\n| Model | Config | Ckpt |\n| ---- | :----: | :----: |\n| GearNet-Edge (atom) | [config1](./config/pretrain/gearnet_1st.yaml), [config2](./config/pretrain/gearnet_2nd.yaml) | [ckpt](https://www.dropbox.com/scl/fi/hvtmqqfr6bvz8y2wrdrph/siamdiff_gearnet_res.pth?rlkey=daeanrcqk9b0erw9ot04932c6\u0026dl=0) |\n| GearNet-Edge (residue) | [config1](./config/pretrain/res_gearnet_1st.yaml), [config2](./config/pretrain/res_gearnet_2nd.yaml) | [ckpt](https://www.dropbox.com/scl/fi/njhq7lqrdn2bnvk0wxwz2/siamdiff_gearnet_atom.pth?rlkey=h78tif5a0pwq6mmw7atp7962v\u0026dl=0) |\n\nWe provide the hyperparameters for each experiment in configuration files.\nAll the configuration files can be found under `config/`.\nWe list some important configuration hyperparameters here:\n| Config | Meaning |\n| :---- | :---- |\n| engine.gpus | which GPU(s) to use for training; if set to `null`, use the CPU instead |\n| engine.batch_size | the batch size for training on each GPU |\n| train.train_time | the maximum time for training per epoch |\n| train.val_time | the maximum time for validation per epoch |\n| train.test_time | the maximum time for testing per epoch |\n| model_checkpoint | the path to a model checkpoint |\n| save_interval | save the pre-trained model every `save_interval` epochs |\n| save_model | whether to save the model (encoder) for downstream tasks or save the task (encoder + prediction head) for next-stage pre-training; if `True`, save the model; otherwise, save the task |\n| task.SiamDiff.use_MI | whether to use mutual information maximization; if `True`, use SiamDiff; otherwise, use DiffPreT |\n\nDetails of model hyperparameters can be found in the docstrings.\n\n## Results\nHere are the results of GearNet-Edge on all benchmark 
tasks.\n**Note that since the PSR and MSP datasets are quite small, their results typically have large variance, so we cannot guarantee that the absolute performance will be reproduced consistently on different machines. However, the improvements of pre-training with SiamDiff over un-pretrained models should be observable.**\nThe performance on downstream tasks is very sensitive to hyperparameters, so please carefully follow our configs for reproduction.\nMore detailed results are listed in the paper.\n\n![Atom](./asset/atom_result.png)\n![Residue](./asset/residue_result.png)\n\n## Citation\nIf you find this codebase useful in your research, please cite the following paper.\n\n```bibtex\n@article{zhang2023siamdiff,\n  title={Pre-Training Protein Encoder via Siamese Sequence-Structure Diffusion Trajectory Prediction},\n  author={Zhang, Zuobai and Xu, Minghao and Lozano, Aur{\\'e}lie and Chenthamarakshan, Vijil and Das, Payel and Tang, Jian},\n  journal={arXiv preprint arXiv:2301.12068},\n  year={2023}\n}\n```\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdeepgraphlearning%2Fsiamdiff","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdeepgraphlearning%2Fsiamdiff","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdeepgraphlearning%2Fsiamdiff/lists"}