{"id":28700526,"url":"https://github.com/deepgraphlearning/esm-s","last_synced_at":"2025-08-09T20:24:31.296Z","repository":{"id":221480278,"uuid":"754009336","full_name":"DeepGraphLearning/esm-s","owner":"DeepGraphLearning","description":"Structure-Informed Protein Language Model","archived":false,"fork":false,"pushed_at":"2024-02-15T15:11:23.000Z","size":950,"stargazers_count":26,"open_issues_count":3,"forks_count":3,"subscribers_count":3,"default_branch":"main","last_synced_at":"2025-01-06T09:03:23.383Z","etag":null,"topics":["protein","protein-structure","torchdrug"],"latest_commit_sha":null,"homepage":"https://arxiv.org/abs/2402.05856","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/DeepGraphLearning.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null}},"created_at":"2024-02-07T08:13:01.000Z","updated_at":"2024-09-16T03:46:08.000Z","dependencies_parsed_at":"2024-02-15T16:41:52.599Z","dependency_job_id":null,"html_url":"https://github.com/DeepGraphLearning/esm-s","commit_stats":null,"previous_names":["deepgraphlearning/esm-s"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/DeepGraphLearning/esm-s","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DeepGraphLearning%2Fesm-s","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DeepGraphLearning%2Fesm-s/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DeepGraphLearning%2Fesm-s/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DeepGraphLearning%2Fesm-s/manifests","owner_url":"https
://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/DeepGraphLearning","download_url":"https://codeload.github.com/DeepGraphLearning/esm-s/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DeepGraphLearning%2Fesm-s/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":259804865,"owners_count":22913903,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["protein","protein-structure","torchdrug"],"created_at":"2025-06-14T11:08:11.180Z","updated_at":"2025-06-14T11:08:18.380Z","avatar_url":"https://github.com/DeepGraphLearning.png","language":"Python","readme":"# Structure-Informed Protein Language Model\n\nThis is the official codebase of the paper\n\n**Structure-Informed Protein Language Model** \n[[ArXiv](https://arxiv.org/abs/2402.05856)]\n\n[Zuobai Zhang](https://oxer11.github.io/), [Jiarui Lu](https://lujiarui.github.io/), [Vijil Chenthamarakshan](https://researcher.watson.ibm.com/researcher/view.php?person=us-ecvijil), [Aurelie Lozano](https://researcher.watson.ibm.com/researcher/view.php?person=us-aclozano), [Payel Das](https://researcher.watson.ibm.com/researcher/view.php?person=us-daspa), [Jian Tang](https://jian-tang.com/)\n\n\n## Overview\n\nProtein language models are a powerful tool for learning protein representations. However, traditional protein language models lack explicit structural supervision. 
Recent studies have developed models that combine large-scale pre-training on protein sequences with the integration of structural information as input, *e.g.*, [ESM-GearNet](https://arxiv.org/abs/2303.06275). However, their reliance on protein structures as input limits their application to proteins without known structures.\n\nTo address this issue, in this work we integrate remote homology detection to **distill structural information into protein language models\nwithout requiring explicit protein structures as input**.\n\n![Training](./asset/training.png)\n\nWe take the [ESM](https://github.com/facebookresearch/esm) models as examples and train them on the remote homology detection task, *a.k.a.* fold classification.\nThe model weights for structure-informed ESM, *i.e.*, ESM-S, can be found [here](https://huggingface.co/Oxer11/ESM-S).\n\n## Installation\n\nYou may install the dependencies via either conda or pip. Generally, ESM-S works\nwith Python 3.7/3.8 and PyTorch version \u003e= 1.12.0.\nPlease make sure the latest version of torchdrug is installed.\n\n### From Conda\n\n```bash\nconda install torchdrug pytorch=1.12.1 cudatoolkit=11.6 -c milagraph -c pytorch-lts -c pyg -c conda-forge\nconda install easydict pyyaml -c conda-forge\n```\n\n### From Pip\n\n```bash\npip install torch==1.12.1+cu116 -f https://download.pytorch.org/whl/lts/1.12/torch_lts.html\npip install torchdrug\npip install easydict pyyaml\n```\n\n## Reproduction\n\n### Download Datasets and Model Weights\n\nDefine the environment variables `DATADIR` and `MODELDIR`, then download the datasets and model weights into the corresponding directories.\nThe datasets and model weights can be downloaded from [Oxer11/ESM-S](https://huggingface.co/Oxer11/ESM-S) and [Oxer11/Protein-Function-Annotation](https://huggingface.co/datasets/Oxer11/Protein-Function-Annotation).\nAll other datasets besides EC, GO and Fold will be downloaded automatically by TorchDrug during first 
loading.\n\n```bash\nDATADIR=./data\nMODELDIR=./model\n\nmkdir $DATADIR\ncd $DATADIR\n# Download remote homology detection dataset\nwget https://huggingface.co/datasets/Oxer11/Protein-Function-Annotation/resolve/main/fold.tar.gz\ntar -xvf fold.tar.gz\n# Download Enzyme Commission dataset\nwget https://huggingface.co/datasets/Oxer11/Protein-Function-Annotation/resolve/main/ec.tar.gz\ntar -xvf ec.tar.gz\n# Download Gene Ontology dataset\nwget https://huggingface.co/datasets/Oxer11/Protein-Function-Annotation/resolve/main/go.tar.gz\ntar -xvf go.tar.gz\n\ncd ..\nmkdir $MODELDIR\ncd $MODELDIR\n# Download ESM-2-650M model weight\nwget https://dl.fbaipublicfiles.com/fair-esm/models/esm2_t33_650M_UR50D.pt\n# Download ESM-2-650M-S model weight\nwget https://huggingface.co/Oxer11/ESM-S/resolve/main/esm_650m_s.pth\n```\n\n### Load Trained Model Weight\nHere we show how to load the structure-informed PLM weights `esm_650m_s.pth` into the `torchdrug.models.EvolutionaryScaleModeling` module.\nBy default, the model weights are saved as a state dict.\n```python\nimport os\n\nimport torch\nfrom torchdrug import models\n\nmodel_dir = \"./model\"   # Set the path to your model dir\nesm = models.EvolutionaryScaleModeling(model_dir, model=\"ESM-2-650M\", readout=\"mean\")\n\n# Load ESM-2-650M-S\nmodel_dict = torch.load(os.path.join(model_dir, \"esm_650m_s.pth\"), map_location=torch.device(\"cpu\"))\nesm.load_state_dict(model_dict)\n```\n\n### Structure-Informed Training\n\nTo reproduce the training of structure-informed protein language models, we need to train a base protein language model on the remote homology detection task, *i.e.*, fold classification.\nYou may choose to run on 4 GPUs by resetting the `gpus` parameter in the config files. 
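Conceptually, this stage is ordinary supervised fine-tuning with a fold-classification head on top of the language model. Below is a minimal, self-contained sketch of that objective in plain PyTorch; the encoder, feature sizes, and data are hypothetical stand-ins, not the actual pipeline in `script/run.py`.

```python
# Minimal sketch of the structure-informed training objective:
# supervised fold classification on top of a sequence encoder.
# The encoder, dimensions, and data below are hypothetical stand-ins
# for the real ESM encoder and the fold dataset.
import torch
import torch.nn as nn

EMBED_DIM, NUM_FOLDS = 64, 10  # stand-ins for the PLM width / number of fold classes

encoder = nn.Sequential(nn.Linear(21, EMBED_DIM), nn.ReLU())  # placeholder for ESM
head = nn.Linear(EMBED_DIM, NUM_FOLDS)                        # fold-classification head
optimizer = torch.optim.Adam(
    list(encoder.parameters()) + list(head.parameters()), lr=1e-3
)

features = torch.randn(8, 21)               # dummy per-protein features
labels = torch.randint(0, NUM_FOLDS, (8,))  # dummy fold labels

for _ in range(5):
    optimizer.zero_grad()
    loss = nn.functional.cross_entropy(head(encoder(features)), labels)
    loss.backward()
    optimizer.step()
```

Note that the encoder itself receives gradients here, which mirrors the idea of updating the language model with structural supervision rather than training a head on frozen features.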
\n\n```bash\n# Train ESM-2-650M on the fold classification dataset\npython script/run.py -c config/esm_fold.yaml --datadir $DATADIR/fold --modeldir $MODELDIR --model ESM-2-650M\n\n# Train ESM-2-650M with 4 GPUs\n# Remember to change the `gpus` field in the config file to [0, 1, 2, 3]\npython -m torch.distributed.launch --nproc_per_node=4 script/run.py -c config/esm_fold.yaml --datadir $DATADIR/fold --modeldir $MODELDIR --model ESM-2-650M\n```\n\n### Predictor-Based Methods\n\nTo test the effect of structure-informed training, we compare the results of feeding ESM's and ESM-S's representations into a 2-layer MLP predictor.\nThe 2-layer MLP is fine-tuned on downstream function prediction datasets.\n\n```bash\n# Tune a 2-layer MLP on ESM's representations on EC\npython script/run.py -c config/predictor/esm_ec.yaml --datadir $DATADIR/ec --modeldir $MODELDIR --model ESM-2-650M --ckpt null\n\n# Tune a 2-layer MLP on ESM-S's representations on EC\npython script/run.py -c config/predictor/esm_ec.yaml --datadir $DATADIR/ec --modeldir $MODELDIR --model ESM-2-650M --ckpt $MODELDIR/esm_650m_s.pth\n\n# Tune a 2-layer MLP on ESM-S's representations on GO-BP\npython script/run.py -c config/predictor/esm_go.yaml --datadir $DATADIR/go --level bp --modeldir $MODELDIR --model ESM-2-650M --ckpt $MODELDIR/esm_650m_s.pth\n\n# Tune a 2-layer MLP on ESM-S's representations on Beta Lactamase\n# The dataset will be downloaded automatically.\npython script/run.py -c config/predictor/esm_beta.yaml --datadir $DATADIR/ --modeldir $MODELDIR --model ESM-2-650M --ckpt $MODELDIR/esm_650m_s.pth\n```\n\nYou can also change the model `ESM-2-650M` to other sizes of ESM models.\n```bash\n# Tune a 2-layer MLP on ESM-2-150M-S's representations on EC\n# Remember to download esm_150m_s.pth from the link above\npython script/run.py -c config/predictor/esm_ec.yaml --datadir $DATADIR/ec --modeldir $MODELDIR --model ESM-2-150M --ckpt $MODELDIR/esm_150m_s.pth\n```\n\nAfter fine-tuning, you are expected to obtain the 
following results.\n![Predictor](./asset/predictor.png)\n\n\n### Retriever-Based Methods\nBesides predictor-based methods, we also use ESM's and ESM-S's representations to measure protein similarity.\nBased on these similarities, we can annotate function labels for proteins in the test set.\n\n```bash\n# Run retriever with ESM's representations on EC\npython script/retrieve.py -c config/retriever/esm_ec.yaml --datadir $DATADIR/ec --modeldir $MODELDIR --model ESM-2-650M --ckpt null\n\n# Run retriever with ESM-S's representations on GO-BP\npython script/retrieve.py -c config/retriever/esm_go.yaml --datadir $DATADIR/go --level bp --modeldir $MODELDIR --model ESM-2-650M --ckpt $MODELDIR/esm_650m_s.pth\n\n# Run retriever with ESM-S's representations on GO-MF\npython script/retrieve.py -c config/retriever/esm_go.yaml --datadir $DATADIR/go --level mf --modeldir $MODELDIR --model ESM-2-650M --ckpt $MODELDIR/esm_650m_s.pth\n\n# Run retriever with ESM-S's representations on GO-CC\npython script/retrieve.py -c config/retriever/esm_go.yaml --datadir $DATADIR/go --level cc --modeldir $MODELDIR --model ESM-2-650M --ckpt $MODELDIR/esm_650m_s.pth\n```\n\nYou are expected to obtain the following results.\n![Retriever](./asset/retriever.png)\n\n\n## Citation\nIf you find this codebase useful in your research, please cite the following paper.\n\n```bibtex\n@article{zhang2024structureplm,\n  title={Structure-Informed Protein Language Model},\n  author={Zhang, Zuobai and Lu, Jiarui and Chenthamarakshan, Vijil and Lozano, Aurelie and Das, Payel and Tang, Jian},\n  journal={arXiv preprint arXiv:2402.05856},\n  
year={2024}\n}\n```\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdeepgraphlearning%2Fesm-s","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdeepgraphlearning%2Fesm-s","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdeepgraphlearning%2Fesm-s/lists"}