{"id":20216197,"url":"https://github.com/thudm/proteinlm","last_synced_at":"2025-04-10T15:12:18.987Z","repository":{"id":47172298,"uuid":"360751071","full_name":"THUDM/ProteinLM","owner":"THUDM","description":"Protein Language Model","archived":false,"fork":false,"pushed_at":"2024-01-15T23:35:06.000Z","size":711,"stargazers_count":116,"open_issues_count":2,"forks_count":21,"subscribers_count":10,"default_branch":"main","last_synced_at":"2025-03-24T13:11:16.414Z","etag":null,"topics":["deep-learning","pretrained-models","protein-language-model","transfer-learning"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/THUDM.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-04-23T03:24:08.000Z","updated_at":"2025-03-02T18:35:42.000Z","dependencies_parsed_at":"2024-11-14T12:30:43.595Z","dependency_job_id":null,"html_url":"https://github.com/THUDM/ProteinLM","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/THUDM%2FProteinLM","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/THUDM%2FProteinLM/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/THUDM%2FProteinLM/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/THUDM%2FProteinLM/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/THUDM","download_url":"https://codeload.github.com/THUDM/Protein
LM/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248243206,"owners_count":21071054,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["deep-learning","pretrained-models","protein-language-model","transfer-learning"],"created_at":"2024-11-14T06:26:47.069Z","updated_at":"2025-04-10T15:12:18.968Z","avatar_url":"https://github.com/THUDM.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# ProteinLM\n\n\n- [ProteinLM](#proteinlm)\n- [Overview](#overview)\n- [Guidance](#guidance)\n  - [Download ProteinLM](#download-proteinlm)\n    - [ProteinLM (200M)](#proteinlm-200m)\n    - [ProteinLM (3B)](#proteinlm-3b)\n- [Project Structure](#project-structure)\n- [Usage](#usage)\n- [Downstream Tasks Performance](#downstream-tasks-performance)\n- [Citation](#citation)\n- [Contact](#contact)\n- [Reference](#reference)\n\n\nWe pretrain a protein language model based on the Megatron-LM framework and then evaluate the pretrained model on TAPE (Tasks Assessing Protein Embeddings), a set of five biologically relevant semi-supervised learning tasks. Our pretrained models match or exceed the TAPE baseline on all five tasks (see [Downstream Tasks Performance](#downstream-tasks-performance)).\n\n\n\n# Overview\n\nThe introduction of pre-trained models such as BERT has greatly advanced natural language processing by improving the performance of language models. Inspired by the similarity between amino acid sequences and text sequences, we apply language model pre-training to biological data. 
\n\n\n# Guidance\nWe provide pretrain and finetune code in two separate folders. If you use the pretrained model we provide, you can simply download the checkpoint and follow the finetune guide. If you want to pretrain your own model, refer to the pretrain guide.\n- Pretrain [README](./pretrain/README.md)\n- Finetune [README](./tape/README.md)\n\n## Download ProteinLM\n### ProteinLM (200M) \nFor the pretrained model with 200 million parameters,\nyou can download the model checkpoint via [GoogleDrive](https://drive.google.com/file/d/1BkJn_7y7LNWyxntaAPa333jDGIVoTbrs/view?usp=sharing), or [TsinghuaCloud](https://cloud.tsinghua.edu.cn/f/f62bef666bc742ebb7c2/?dl=1).\n\n### ProteinLM (3B) \nFor the pretrained model with 3 billion parameters,\nyou can download the model checkpoint from [here](https://resource.wudaoai.cn/).\n\n\n# Project Structure\n```\n.\n├── pretrain                (protein language model pretrain)\n│   ├── megatron            (model folder)\n│   ├── pretrain_tools      (multi-node pretrain)\n│   ├── protein_tools       (data preprocess shells)\n└── tape\n    ├── conda_env           (conda env in yaml format)\n    ├── converter           (converter script and model config files)\n    ├── scripts             (model generator, finetune)\n    └── tape                (tape model)\n```\n\n# Usage\n\nAs the structure above shows, there are two stages:\n\n- Pretrain\n  - Prepare dataset (`PFAM`)\n  - Preprocess data\n  - Pretrain\n- Finetune\n  - Convert the pretrained protein model checkpoint\n  - Finetune on downstream tasks\n\nDetailed explanations are given in each folder's README.\n\n\n# Downstream Tasks Performance\n\n| Task | Metric | TAPE | ProteinLM (200M) | ProteinLM (3B) |  \n|:-:|:-:|:-:|:-:|:-:|\n| contact prediction  | P@L/5               | 0.36 | 0.52 | **0.75** |\n| remote homology     | Top 1 Accuracy      | 0.21 | 0.26 | **0.30** |\n| secondary structure | Accuracy (3-class)  | 0.73 | 0.75 | **0.79** |\n| fluorescence        | 
Spearman's rho      | 0.68 | 0.68 | 0.68 |\n| stability           | Spearman's rho      | 0.73 | 0.77 | **0.79** |\n\n\n# Citation\nPlease cite our paper if you find our work useful for your research. Our paper can be accessed [here](https://arxiv.org/abs/2108.07435).\n```\n@article{DBLP:journals/corr/abs-2108-07435,\n  author    = {Yijia Xiao and\n               Jiezhong Qiu and\n               Ziang Li and\n               Chang{-}Yu Hsieh and\n               Jie Tang},\n  title     = {Modeling Protein Using Large-scale Pretrain Language Model},\n  journal   = {CoRR},\n  volume    = {abs/2108.07435},\n  year      = {2021},\n  url       = {https://arxiv.org/abs/2108.07435},\n  eprinttype = {arXiv},\n  eprint    = {2108.07435},\n  timestamp = {Fri, 20 Aug 2021 13:55:54 +0200},\n  biburl    = {https://dblp.org/rec/journals/corr/abs-2108-07435.bib},\n  bibsource = {dblp computer science bibliography, https://dblp.org}\n}\n```\n\n\n# Contact\nIf you have any problems using ProteinLM, feel free to contact us via [mr.yijia.xiao@gmail.com](mailto:mr.yijia.xiao@gmail.com).\n\n\n# Reference\n\nOur work is based on the following papers. 
Part of the code is based on [Megatron-LM](https://github.com/NVIDIA/Megatron-LM) and [TAPE](https://github.com/songlab-cal/tape).\n\n\n[__Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism__](https://arxiv.org/abs/1909.08053v4)\n```\n@article{DBLP:journals/corr/abs-1909-08053,\n  author    = {Mohammad Shoeybi and\n               Mostofa Patwary and\n               Raul Puri and\n               Patrick LeGresley and\n               Jared Casper and\n               Bryan Catanzaro},\n  title     = {Megatron-LM: Training Multi-Billion Parameter Language Models Using\n               Model Parallelism},\n  journal   = {CoRR},\n  volume    = {abs/1909.08053},\n  year      = {2019},\n  url       = {http://arxiv.org/abs/1909.08053},\n  archivePrefix = {arXiv},\n  eprint    = {1909.08053},\n  timestamp = {Tue, 24 Sep 2019 11:33:51 +0200},\n  biburl    = {https://dblp.org/rec/journals/corr/abs-1909-08053.bib},\n  bibsource = {dblp computer science bibliography, https://dblp.org}\n}\n```\n\n\n[__Evaluating Protein Transfer Learning with TAPE__](https://arxiv.org/abs/1906.08230v1)\n```\n@article{DBLP:journals/corr/abs-1906-08230,\n  author    = {Roshan Rao and\n               Nicholas Bhattacharya and\n               Neil Thomas and\n               Yan Duan and\n               Xi Chen and\n               John F. Canny and\n               Pieter Abbeel and\n               Yun S. 
Song},\n  title     = {Evaluating Protein Transfer Learning with {TAPE}},\n  journal   = {CoRR},\n  volume    = {abs/1906.08230},\n  year      = {2019},\n  url       = {http://arxiv.org/abs/1906.08230},\n  archivePrefix = {arXiv},\n  eprint    = {1906.08230},\n  timestamp = {Sat, 23 Jan 2021 01:20:25 +0100},\n  biburl    = {https://dblp.org/rec/journals/corr/abs-1906-08230.bib},\n  bibsource = {dblp computer science bibliography, https://dblp.org}\n}\n```\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fthudm%2Fproteinlm","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fthudm%2Fproteinlm","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fthudm%2Fproteinlm/lists"}