{"id":28700516,"url":"https://github.com/deepgraphlearning/protst","last_synced_at":"2025-07-02T17:33:12.633Z","repository":{"id":178378936,"uuid":"661758349","full_name":"DeepGraphLearning/ProtST","owner":"DeepGraphLearning","description":"[ICML-23 ORAL] ProtST: Multi-Modality Learning of Protein Sequences and Biomedical Texts","archived":false,"fork":false,"pushed_at":"2023-10-16T11:28:44.000Z","size":564,"stargazers_count":97,"open_issues_count":6,"forks_count":7,"subscribers_count":4,"default_branch":"main","last_synced_at":"2025-06-14T11:08:13.902Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/DeepGraphLearning.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2023-07-03T15:20:16.000Z","updated_at":"2025-05-04T07:33:30.000Z","dependencies_parsed_at":"2023-10-16T20:57:37.994Z","dependency_job_id":null,"html_url":"https://github.com/DeepGraphLearning/ProtST","commit_stats":null,"previous_names":["deepgraphlearning/protst"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/DeepGraphLearning/ProtST","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DeepGraphLearning%2FProtST","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DeepGraphLearning%2FProtST/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DeepGraphLearning%2FProtST/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DeepGraphLearning%2FProtST/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owne
rs/DeepGraphLearning","download_url":"https://codeload.github.com/DeepGraphLearning/ProtST/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DeepGraphLearning%2FProtST/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":259884740,"owners_count":22926456,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-06-14T11:08:08.353Z","updated_at":"2025-07-02T17:33:12.606Z","avatar_url":"https://github.com/DeepGraphLearning.png","language":"Python","readme":"# ProtST: Multi-Modality Learning of Protein Sequences and Biomedical Texts\n\nProtST is an advanced pretraining framework for protein sequence understanding and prediction, as introduced in our [ICML2023 oral paper](https://arxiv.org/abs/2301.12040). It is designed to enhance protein sequence pre-training and understanding by integrating protein functions and other important properties through biomedical texts.\n\nThe effectiveness and superiority of ProtST-induced PLMs over previous ones are demonstrated on diverse representation learning downstream tasks and zero-shot predictions. It also enables functional protein retrieval from large-scale databases even without any function annotation, as illustrated below.\n\n![ProtST](asset/framework.png)\n\n# Installation #\n\nYou may install the dependencies of TorchProtein and ProtST as below. 
\nGenerally, they work with Python 3.7/3.8 and PyTorch version \u003e= 1.8.0; the commands below use Python 3.9 and PyTorch 2.0.0.\n\n```bash\nconda create -n protein python=3.9\nconda activate protein\n\nconda install pytorch==2.0.0 pytorch-cuda=11.7 -c pytorch -c nvidia\nconda install torchdrug pytorch-sparse pytorch-scatter pytorch-cluster -c pytorch -c pyg -c milagraph\n\nconda install scikit-learn pandas decorator ipython networkx tqdm matplotlib -y\nconda install fair-esm transformers easydict pyyaml lmdb -c conda-forge\n```\n\n# Pre-trained Model Zoo\n\n|      Model      |                      Config                      |                                  Ckpt                                   |\n|:---------------:|:------------------------------------------------:|:-----------------------------------------------------------------------:|\n|  ProtST-ESM-1b  |   [config](config/pretrain/pretrain_esm.yaml)    |    [ckpt](https://protsl.s3.us-east-2.amazonaws.com/checkpoints/protst_esm1b.pth)    |\n|  ProtST-ESM-2   |   [config](config/pretrain/pretrain_esm.yaml)    |    [ckpt](https://protsl.s3.us-east-2.amazonaws.com/checkpoints/protst_esm2.pth)    |\n| ProtST-ProtBert | [config](config/pretrain/pretrain_protbert.yaml) |    [ckpt](https://protsl.s3.us-east-2.amazonaws.com/checkpoints/protst_protbert.pth)    |\n\n# Usage\n\nTo reproduce all the experiments in ProtST, we provide all the necessary configuration files at `config/.../*.yaml`, which are categorized by dataset, model architecture, and hyperparameters. 
When running experiments, specify the configuration file with the `--config` argument and supply all the required arguments marked by `{{ }}` in that configuration file.\n\nNote that all datasets are downloaded automatically by the code. If you are using clusters without an Internet connection, please run `python ./script/prepare_all_datasets.py` to cache the datasets in advance.\n\n## Pre-training\n\nBy default, we pretrain 3 different PLM backbones (ESM-1b, ESM-2 and ProtBert) using 4 V100 GPUs with the following commands. Note that two versions of the text encoder are available: PubMedBERT trained on abstracts only (`PubMedBERT-abs`) and PubMedBERT trained on full papers (`PubMedBERT-full`).\n\n```bash\nalias python4proc='python -m torch.distributed.launch --nproc_per_node=4'\n\n# pretrain ESM-1b\npython4proc script/run_pretrain.py --config ./config/pretrain/pretrain_esm.yaml --protein_model ESM-1b --text_model PubMedBERT-abs\n\n# pretrain ESM-2\npython4proc script/run_pretrain.py --config ./config/pretrain/pretrain_esm.yaml --protein_model ESM-2-650M --text_model PubMedBERT-abs\n\n# pretrain ProtBert\npython4proc script/run_pretrain.py --config ./config/pretrain/pretrain_protbert.yaml --text_model PubMedBERT-abs\n```\n\n## Downstream Tasks: Representation Learning\n\nFor representation learning, we verify our pre-trained multimodal PLMs on 11 standard benchmarks for protein localization prediction, fitness landscape prediction and protein function annotation, under both fix-encoder learning and full-model tuning settings.\n\nWe label the pretrained checkpoints as `PRETRAIN_CHECKPOINT`. For each PLM backbone, the corresponding configuration files are in `./config/downstream_task/.../*.yaml`. We give a demonstration for ProtST-enhanced ESM-1b. 
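As a minimal sketch (the local `./checkpoints` directory is an assumption; the checkpoint URL comes from the model zoo table above), you can download a pretrained checkpoint and point `PRETRAIN_CHECKPOINT` at it before running the downstream commands:

```shell
# Assumed layout: the ProtST-ESM-1b checkpoint from the model zoo table has been
# downloaded into ./checkpoints beforehand, e.g.:
#   wget -P ./checkpoints https://protsl.s3.us-east-2.amazonaws.com/checkpoints/protst_esm1b.pth
mkdir -p ./checkpoints
PRETRAIN_CHECKPOINT=./checkpoints/protst_esm1b.pth
export PRETRAIN_CHECKPOINT
```

The downstream commands in the following sections then read the checkpoint path from `$PRETRAIN_CHECKPOINT`.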
\n\n### Protein Localization Prediction\n\nFor binary localization prediction, run the commands below to perform fix-encoder learning and full-model tuning, respectively:\n\n```bash\n# fix-encoder learning\npython4proc ./script/run_downstream.py --config ./config/downstream_task/PretrainESM/localization_fix.yaml --checkpoint $PRETRAIN_CHECKPOINT --dataset BinaryLocalization --num_class 2\n\n# full-model tuning\npython4proc ./script/run_downstream.py --config ./config/downstream_task/PretrainESM/localization_tune.yaml --checkpoint $PRETRAIN_CHECKPOINT --dataset BinaryLocalization --num_class 2\n```\n\n**Note** that subcellular localization prediction can be performed in a similar way (please see `./config` for details).\n\n### Fitness Landscape Prediction\n\nFor Beta-Lactamase fitness prediction, run the commands below to perform fix-encoder learning and full-model tuning, respectively:\n\n```bash\n# fix-encoder learning\npython4proc ./script/run_downstream.py --config ./config/downstream_task/PretrainESM/fitness_fix.yaml --checkpoint $PRETRAIN_CHECKPOINT --dataset BetaLactamase --batch_size 32\n\n# full-model tuning\npython4proc ./script/run_downstream.py --config ./config/downstream_task/PretrainESM/fitness_tune.yaml --checkpoint $PRETRAIN_CHECKPOINT --dataset BetaLactamase --batch_size 6\n```\n\n**Note** that Fluorescence, Stability, AAV and Thermostability prediction can be performed in a similar way (please see `./config` for details).\n\n### Protein Function Annotation\n\nFor Enzyme Commission (EC) number prediction, run the command below to perform full-model tuning:\n\n```bash\npython4proc ./script/run_downstream.py --config ./config/downstream_task/PretrainESM/annotation_tune.yaml --checkpoint $PRETRAIN_CHECKPOINT --dataset td_datasets.EnzymeCommission --branch null\n```\n\n**Note** that Gene Ontology (GO) term prediction at the Molecular Function (MF), Biological Process (BP) and Cellular Component (CC) branches can be performed in a similar way (please see `./config` for 
details).\n\n## Downstream Tasks: Zero-shot Protein Classification\n\n### Zero-shot Predictors\n\nProtST supports zero-shot protein classification, which requires no labeled proteins. This is achieved by comparing representation similarities between a query protein and all labels, thanks to the aligned representation space of protein sequences and label descriptions in ProtST.\n\nWe demonstrate zero-shot subcellular localization prediction and zero-shot reaction classification with ProtST-enhanced ESM-1b. We have also explored different prompt templates and description fields, as listed in `./data/zero_shot_classification/`.\n\n```bash\n# Subcellular Localization Prediction\n\npython ./script/run_zero_shot.py --config ./config/zero_shot/PretrainESM/zero_shot.yaml --checkpoint $PRETRAIN_CHECKPOINT --prompt_label ./data/zero_shot_classification/subloc_template.tsv --dataset SubcellularLocalization --field \"['name']\"\n\n# Reaction Classification\n\npython ./script/run_zero_shot.py --config ./config/zero_shot/PretrainESM/zero_shot.yaml --checkpoint $PRETRAIN_CHECKPOINT --prompt_label ./data/zero_shot_classification/reaction_name.tsv --dataset Reaction --field \"['name']\"\n```\n\n### Few-shot and Non-parametric Baselines\n\nProtST-induced zero-shot classifiers are more data-efficient than various few-shot and non-parametric classifiers. 
You can run these baselines as below:\n\n```bash\n# few-shot classifiers\n\n## Subcellular Localization Prediction\n\npython ./script/run_few_shot.py --config ./config/few_shot/PretrainESM/few_shot.yaml --dataset SubcellularLocalization --num_class 10 --checkpoint $PRETRAIN_CHECKPOINT\n\n## Reaction Classification\n\npython ./script/run_few_shot.py --config ./config/few_shot/PretrainESM/few_shot.yaml --dataset Reaction --num_class 384 --checkpoint $PRETRAIN_CHECKPOINT\n\n# non-parametric few-shot classifiers\n\n## Subcellular Localization Prediction\n\npython ./script/run_few_shot_nonparam.py --config ./config/few_shot/PretrainESM/few_shot.yaml --dataset SubcellularLocalization --num_class 10 --checkpoint $PRETRAIN_CHECKPOINT\n\n## Reaction Classification\n\npython ./script/run_few_shot_nonparam.py --config ./config/few_shot/PretrainESM/few_shot.yaml --dataset Reaction --num_class 384 --checkpoint $PRETRAIN_CHECKPOINT\n```\n\n### Predictor Ensemble\n\nWe also show that ProtST-based zero-shot predictors can enhance the performance of supervised learning models via ensembling. 
We use the following scripts to perform ensembling, where `SUPERVISED_CHECKPOINT` refers to the checkpoints obtained by supervised learning on downstream tasks.\n\n```bash\n## Subcellular Localization Prediction\n\npython ./script/run_supervised_with_zero.py -sc ./config/downstream_task/PretrainESM/localization_fix.yaml -zc ./config/zero_shot/zero_shot.yaml --dataset SubcellularLocalization --num_class 10 --prompt_label ./data/zero_shot_classification/subloc_name.tsv --field \"['name']\" --checkpoint $PRETRAIN_CHECKPOINT --supervised_checkpoint $SUPERVISED_CHECKPOINT\n\n## Reaction Classification\n\npython ./script/run_supervised_with_zero.py -sc ./config/downstream_task/PretrainESM/reaction_tune.yaml -zc ./config/zero_shot/zero_shot.yaml --dataset Reaction --num_class 384 --prompt_label ./data/zero_shot_classification/reaction_name.tsv --field \"['name']\" --checkpoint $PRETRAIN_CHECKPOINT --supervised_checkpoint $SUPERVISED_CHECKPOINT\n```\n\n## Downstream Tasks: Text to Protein Retrieval\n\nWe illustrate the capability of ProtST-ESM-1b to retrieve functional proteins as below, where no function annotation is required:\n\n```bash\npython ./script/run_t2p_retrieval.py --config ./config/t2p_retrieval/go_mf.yaml --checkpoint $PRETRAIN_CHECKPOINT\n```\n\n# License\n\nThis codebase is released under the Apache License 2.0, as stated in the [LICENSE](LICENSE) file.\n\n# Citation\n\nIf you find this project helpful, please cite our paper:\n\n```bibtex\n@article{xu2023protst,\n  title={ProtST: Multi-Modality Learning of Protein Sequences and Biomedical Texts},\n  author={Xu, Minghao and Yuan, Xinyu and Miret, Santiago and Tang, Jian},\n  journal={arXiv preprint arXiv:2301.12040},\n  year={2023}\n}\n```\n\n# Contact\n\nFor any questions or issues, open an issue or contact Minghao Xu (minghao.xu@mila.quebec) and Xinyu Yuan 
(xinyu.yuan@mila.quebec).\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdeepgraphlearning%2Fprotst","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdeepgraphlearning%2Fprotst","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdeepgraphlearning%2Fprotst/lists"}