{"id":28676515,"url":"https://github.com/zjunlp/ontoprotein","last_synced_at":"2025-06-13T23:04:59.508Z","repository":{"id":41081186,"uuid":"439525410","full_name":"zjunlp/OntoProtein","owner":"zjunlp","description":"[ICLR 2022] OntoProtein: Protein Pretraining With Gene Ontology Embedding","archived":false,"fork":false,"pushed_at":"2025-03-10T02:23:26.000Z","size":666,"stargazers_count":146,"open_issues_count":1,"forks_count":22,"subscribers_count":7,"default_branch":"main","last_synced_at":"2025-03-10T03:34:43.935Z","etag":null,"topics":["bert","gene-ontology","iclr","iclr2022","knowledge-graph","nlp","ontoprotein","pretrained-models","pretraining","protein","protein-function-prediction","protein-pretraining","protein-protein-interaction","protein-structure-prediction","pytorch"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/zjunlp.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-12-18T04:15:18.000Z","updated_at":"2025-03-10T02:23:30.000Z","dependencies_parsed_at":"2024-11-28T02:32:24.780Z","dependency_job_id":"d79ca04b-3814-45b5-98c5-6b1096553816","html_url":"https://github.com/zjunlp/OntoProtein","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/zjunlp/OntoProtein","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zjunlp%2FOntoProtein","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zjunlp%2FOntoProtein/tags","releases_url":"https://repos.ecosyste.m
s/api/v1/hosts/GitHub/repositories/zjunlp%2FOntoProtein/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zjunlp%2FOntoProtein/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/zjunlp","download_url":"https://codeload.github.com/zjunlp/OntoProtein/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zjunlp%2FOntoProtein/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":259732771,"owners_count":22903087,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bert","gene-ontology","iclr","iclr2022","knowledge-graph","nlp","ontoprotein","pretrained-models","pretraining","protein","protein-function-prediction","protein-pretraining","protein-protein-interaction","protein-structure-prediction","pytorch"],"created_at":"2025-06-13T23:04:58.537Z","updated_at":"2025-06-13T23:04:59.496Z","avatar_url":"https://github.com/zjunlp.png","language":"Python","readme":"# OntoProtein\n\nThis is the implementation of the ICLR 2022 paper \"[OntoProtein: Protein Pretraining With Ontology Embedding](https://arxiv.org/pdf/2201.11147.pdf)\". 
OntoProtein is an effective method that incorporates the structure of GO (Gene Ontology) into a text-enhanced protein pre-training model.\n- ❗NOTE: We provide an NLP-for-science paper list at [https://github.com/zjunlp/NLP4Science_Papers](https://github.com/zjunlp/NLP4Science_Papers).\n\n\u003cdiv align=center\u003e\u003cimg src=\"resources/img/model.png\" width=\"80%\" height=\"80%\" /\u003e\u003c/div\u003e\n\n## Quick links\n\n* [Overview](#overview)\n* [Requirements](#requirements)\n  * [Environment for pre-training data generation](#environment-for-pre-training-data-generation)\n  * [Environment for OntoProtein pre-training](#environment-for-ontoprotein-pre-training)\n  * [Environment for protein-related tasks](#environment-for-protein-related-tasks)\n* [Data preparation](#data-preparation)\n  * [Pre-training data](#pre-training-data)\n  * [Downstream task data](#downstream-task-data)\n* [Protein pre-training model](#protein-pre-training-model)\n* [Usage for protein-related tasks](#usage-for-protein-related-tasks)\n* [Citation](#citation)\n\n## Overview\n\u003cspan id=\"overview\"\u003e\u003c/span\u003e\n\nIn this work we present OntoProtein, a knowledge-enhanced protein language model that jointly optimizes the KE (knowledge embedding) and MLM (masked language modeling) objectives, bringing significant improvements to a wide range of protein tasks. We also introduce **ProteinKG25**, a new large-scale KG dataset that promotes research on protein language pre-training.\n\n\u003cdiv align=center\u003e\u003cimg src=\"resources/img/main.jpg\" width=\"60%\" height=\"60%\" /\u003e\u003c/div\u003e\n\n## Requirements\n\u003cspan id=\"requirements\"\u003e\u003c/span\u003e\nTo run our code, please install the dependency packages for the related steps.\n\n### Environment for pre-training data generation\n\u003cspan id=\"environment-for-pre-training-data-generation\"\u003e\u003c/span\u003e\npython3.8 / biopython 1.37 / goatools\n\nTo extract GO term definitions, we modified the code in the `goatools` library. 
The changes in `goatools.obo_parser` are as follows:\n\n```python\n# line 132\nelif line[:5] == \"def: \":\n    rec_curr.definition = line[5:]\n\n# line 169\nself.definition = \"\"\n```\n\n### Environment for OntoProtein pre-training\n\u003cspan id=\"environment-for-ontoprotein-pre-training\"\u003e\u003c/span\u003e\npython3.8 / pytorch 1.9 / transformers 4.5.1+ / deepspeed 0.5.1 / lmdb\n\n### Environment for protein-related tasks\n\u003cspan id=\"environment-for-protein-related-tasks\"\u003e\u003c/span\u003e\npython3.8 / pytorch 1.9 / transformers 4.5.1+ / lmdb / tape_proteins\n\nNote that the `tape_proteins` library only implements the `P@L` metric for the contact prediction task. To report P@K for different values of K, where P@K is the precision of the top K predicted contacts, we made some changes to the library. The detailed changes can be seen in [[issue #8]](https://github.com/zjunlp/OntoProtein/issues/8#issuecomment-1109975025)\n\n**Note:** for the environment configuration of some baseline models and methods in our experiments, e.g. BLAST and DeepGraphGO, we provide the following links:\n\n[BLAST](https://www.ncbi.nlm.nih.gov/books/NBK569861/) / [Interproscan](https://github.com/ebi-pf-team/interproscan) / [DeepGraphGO](https://github.com/yourh/DeepGraphGO) / [GNN-PPI](https://github.com/lvguofeng/GNN_PPI)\n\n## Data preparation\n\u003cspan id=\"data-preparation\"\u003e\u003c/span\u003e\nBelow we describe how to acquire the data needed for pre-training OntoProtein, fine-tuning on protein-related tasks, and inference.\n\n### Pre-training data\n\u003cspan id=\"pre-training-data\"\u003e\u003c/span\u003e\nTo incorporate Gene Ontology knowledge into language models and train OntoProtein, we construct [ProteinKG25](https://zjunlp.github.io/project/ProteinKG25/), a large-scale KG dataset in which GO terms and protein entities are aligned with textual descriptions and protein sequences, respectively. 
There are two ways to acquire the pre-training data: 1) download our prepared **ProteinKG25** data, or 2) generate your own pre-training data.\n\n\u003cdiv align=center\u003e\u003cimg src=\"resources/img/times.png\" width=\"50%\" height=\"50%\" /\u003e\u003c/div\u003e\n\n#### Download released data\n\nWe have released our prepared data **ProteinKG25** on [Google Drive](https://drive.google.com/file/d/1iTC2-zbvYZCDhWM_wxRufCvV6vvPk8HR/view).\n\nThe compressed package includes the following files:\n\n- `go_def.txt`: GO term definitions, as text data. We concatenate each GO term name and its definition with a colon.\n- `go_type.txt`: The ontology type to which each GO term belongs. The index corresponds to the GO ID in the `go2id.txt` file.\n- `go2id.txt`: The ID mapping of GO terms.\n- `go_go_triplet.txt`: GO-GO triplet data. These triplets constitute the interior structure of Gene Ontology. The data format is \u003c`h r t`\u003e, where `h` and `t` are the head and tail entities, both GO term nodes, and `r` is the relation between the two GO terms, e.g. `is_a` and `part_of`.\n- `protein_seq.txt`: Protein sequence data. The protein sequences are used as inputs to the MLM module and as protein representations in the KE module.\n- `protein2id.txt`: The ID mapping of proteins.\n- `protein_go_train_triplet.txt`: Protein-GO triplet data. These triplets constitute the exterior structure of Gene Ontology, i.e. gene annotations. The data format is \u003c`h r t`\u003e. Unlike a GO-GO triplet, a Protein-GO triplet represents a specific gene annotation: the head entity `h` is a specific protein, the tail entity `t` is the corresponding GO term, e.g. the protein binding function, and `r` is the relation between the protein and the GO term.\n- `relation2id.txt`: The ID mapping of relations. 
The relations from both triplet files are mixed in this single mapping.\n\n#### Generate your own pre-training data\n\nTo generate your own pre-training data, you need to download the following raw data:\n\n- `go.obo`: The structure data of Gene Ontology. See [Gene Ontology](http://geneontology.org/docs/download-ontology/) for the download link and detailed format.\n- `uniprot_sprot.dat`: The Swiss-Prot protein database. [[link]](https://www.uniprot.org/downloads)\n- `goa_uniprot_all.gaf`: Gene annotation data. [[link]](https://ftp.ebi.ac.uk/pub/databases/GO/goa/old/UNIPROT/)\n\nAfter downloading these raw data, you can execute the following script to generate the pre-training data:\n\n```bash\npython tools/gen_onto_protein_data.py\n```\n\n### Downstream task data\n\u003cspan id=\"downstream-task-data\"\u003e\u003c/span\u003e\nOur experiments involve several protein-related downstream tasks. [[Download datasets]](https://drive.google.com/file/d/12d5wzNcuPxPyW8KIzwmvGg2dOKo0K0ag/view?usp=sharing)\n\n## Protein pre-training model\n\u003cspan id=\"protein-pre-training-model\"\u003e\u003c/span\u003e\nYou can pre-train your own OntoProtein model on the pre-training dataset above. Before pre-training OntoProtein, you need to download two pre-trained models, [ProtBERT](https://huggingface.co/Rostlab/prot_bert) and [PubMedBERT](https://huggingface.co/microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext), and save them in `data/model_data/ProtBERT` and `data/model_data/PubMedBERT` respectively. We provide the script `bash script/run_pretrain.sh` to run pre-training. The detailed arguments are all listed in `src/training_args.py`; you can set the pre-training hyperparameters to your need.\n\n## Usage for protein-related tasks\n\u003cspan id=\"usage-for-protein-related-tasks\"\u003e\u003c/span\u003e\n\nWe have released the checkpoint of the pre-trained model on the `Hugging Face` model hub. 
[[Download model]](https://huggingface.co/zjunlp/OntoProtein).\n\n### Running examples\n\nThe shell files for training and evaluation on each task are provided in `script/` and can be run directly. Alternatively, you can use the training code `run_downstream.py` and write your own shell files as needed:\n\n- `run_downstream.py`: supports the `{ss3, ss8, contact, remote_homology, fluorescence, stability}` tasks;\n\n#### Training models\n\nRun the shell files with `bash script/run_{task}.sh`; the contents of the shell files are as follows:\n\n```shell\nbash run_main.sh \\\n    --model model_data/ProtBertModel \\\n    --output_file ss3-ProtBert \\\n    --task_name ss3 \\\n    --do_train True \\\n    --epoch 5 \\\n    --optimizer AdamW \\\n    --per_device_batch_size 2 \\\n    --gradient_accumulation_steps 8 \\\n    --eval_step 100 \\\n    --eval_batchsize 4 \\\n    --warmup_ratio 0.08 \\\n    --frozen_bert False\n```\n\nArguments for the training and evaluation script are as follows:\n\n- `--task_name`: Specify which task to evaluate on; the script currently supports the `{ss3, ss8, contact, remote_homology, fluorescence, stability}` tasks;\n- `--model`: The name or path of a protein pre-trained checkpoint.\n- `--output_file`: The path where the fine-tuned checkpoint is saved.\n- `--do_train`: Specify if you want to fine-tune the pre-trained model on downstream tasks.\n- `--epoch`: The number of training epochs.\n- `--optimizer`: The optimizer to use, e.g., `AdamW`.\n- `--per_device_batch_size`: Batch size per GPU.\n- `--gradient_accumulation_steps`: The number of gradient accumulation steps.\n- `--warmup_ratio`: Ratio of total training steps used for a linear warmup from 0 to `learning_rate`.\n- `--frozen_bert`: Specify if you want to freeze the encoder in the pre-trained model.\n\nAdditionally, you can set more detailed parameters in `run_main.sh`.\n\n**Notice: the best checkpoint is saved in** `OUTPUT_DIR/`.\n\n## How to 
Cite\n```\n@inproceedings{\nzhang2022ontoprotein,\ntitle={OntoProtein: Protein Pretraining With Gene Ontology Embedding},\nauthor={Ningyu Zhang and Zhen Bi and Xiaozhuan Liang and Siyuan Cheng and Haosen Hong and Shumin Deng and Qiang Zhang and Jiazhang Lian and Huajun Chen},\nbooktitle={International Conference on Learning Representations},\nyear={2022},\nurl={https://openreview.net/forum?id=yfe1VMYAXa4}\n}\n```\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fzjunlp%2Fontoprotein","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fzjunlp%2Fontoprotein","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fzjunlp%2Fontoprotein/lists"}