{"id":16130116,"url":"https://github.com/chao1224/proteindt","last_synced_at":"2025-03-16T09:32:18.875Z","repository":{"id":158770195,"uuid":"597606944","full_name":"chao1224/ProteinDT","owner":"chao1224","description":null,"archived":false,"fork":false,"pushed_at":"2024-07-20T12:35:48.000Z","size":20295,"stargazers_count":41,"open_issues_count":1,"forks_count":4,"subscribers_count":6,"default_branch":"main","last_synced_at":"2024-10-10T22:14:34.722Z","etag":null,"topics":["ai4science","drug-design","foundation-model","large-language-model","llm","protein","protein-design","protein-editing","protein-sequence","protein-structure"],"latest_commit_sha":null,"homepage":"https://chao1224.github.io/ProteinDT","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/chao1224.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-02-05T03:38:10.000Z","updated_at":"2024-09-27T05:23:39.000Z","dependencies_parsed_at":null,"dependency_job_id":"5f276d16-46ba-4ccb-948a-b9aa62b1edc6","html_url":"https://github.com/chao1224/ProteinDT","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chao1224%2FProteinDT","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chao1224%2FProteinDT/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chao1224%2FProteinDT/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chao1224%2FProteinDT/manifests","own
er_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/chao1224","download_url":"https://codeload.github.com/chao1224/ProteinDT/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":221661470,"owners_count":16859531,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai4science","drug-design","foundation-model","large-language-model","llm","protein","protein-design","protein-editing","protein-sequence","protein-structure"],"created_at":"2024-10-09T22:14:33.493Z","updated_at":"2025-03-16T09:32:18.859Z","avatar_url":"https://github.com/chao1224.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# ProteinDT: A Text-guided Protein Design Framework\n\nAuthors: Shengchao Liu, Yanjing Li, Zhuoxinran Li, Anthony Gitter, Yutao Zhu, Jiarui Lu, Zhao Xu, Weili Nie, Arvind Ramanathan, Chaowei Xiao\u003csup\u003e\\*\u003c/sup\u003e, Jian Tang\u003csup\u003e\\*\u003c/sup\u003e, Hongyu Guo\u003csup\u003e\\*\u003c/sup\u003e, Anima Anandkumar\u003csup\u003e\\*\u003c/sup\u003e\n\n\u003csup\u003e\\*\u003c/sup\u003e jointly supervised\n\n[[Project Page](https://chao1224.github.io/ProteinDT)] [[ArXiv](https://arxiv.org/abs/2302.04611)]\n[[Datasets on HuggingFace](https://huggingface.co/datasets/chao1224/ProteinDT/tree/main)] [[Checkpoints on HuggingFace](https://huggingface.co/chao1224/ProteinDT/tree/main)]\n\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"figures/pipeline.png\" /\u003e \n\u003c/p\u003e\n\n\u003cp align=\"left\"\u003e\n  \u003cimg src=\"figures/final.gif\" width=\"100%\" 
/\u003e \n\u003c/p\u003e\n\n## 1 Environment\n\n```\nconda create -n ProteinDT python=3.7\nconda activate ProteinDT\n\nconda install -y numpy networkx scikit-learn\n\npip install torch==1.10.*\n\npip install transformers\npip install lxml\n\n# for TAPE\npip install lmdb\npip install seqeval\n\n# for baseline ChatGPT\npip install openai==0.28.0\n\n# for baseline Galactica\npip install accelerate\n\n# for visualization\npip install matplotlib\n\n# for binding editing\npip install h5py\npip install torch_geometric==2.0 torch_scatter torch_sparse torch_cluster\npip install biopython\n\n# for ESM folding\npip install \"fair-esm[esmfold]\"\npip install dm-tree omegaconf ml-collections einops\npip install fair-esm[esmfold]==2.0.0  --no-dependencies # Override deepspeed==0.5 \npip install 'dllogger @ git+https://github.com/NVIDIA/dllogger.git'\npip install 'openfold @ git+https://github.com/aqlaboratory/openfold.git@4b41059694619831a7db195b7e0988fc4ff3a307'\n\nconda install -c conda-forge -yq mdtraj\n\n# for ProteinDT\npip install .\n```\n\n\n## 2 Pretraining Datasets (SwissProtCLAP) Preparation\n\nPlease check folder `preprocess/SwissProtCLAP` for SwissProtCLAP construction from UniProt.\n\nWe also provide a copy of SwissProtCLAP at [this HuggingFace link](https://huggingface.co/datasets/chao1224/ProteinDT/tree/main). Or you can use the following script:\n```\nfrom huggingface_hub import HfApi, snapshot_download\napi = HfApi()\nsnapshot_download(repo_id=\"chao1224/ProteinDT\", repo_type=\"dataset\", cache_dir='./')\n```\n\nThen move the data under `./data` folder. The data structure is\n```\n./data/\n└── SwissProtCLAP\n    ├── protein_sequence.txt\n    └── text_sequence.txt\n```\n\n## 3 Pretraining\n\nGo to folder `examples`, and do the pretraining in 5 steps. 
We summarize the logic of these 5 steps below:\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"figures/pretraining_roadmap.png\" width=\"75%\" /\u003e \n\u003c/p\u003e\n\nThe pretrained checkpoints can be found at [this HuggingFace link](https://huggingface.co/chao1224/ProteinDT/tree/main).\nBefore getting started, we first need to define the output home folder, e.g., `export OUTPUT_DIR=../output/ProteinDT/hyper_01`.\n\n- Step 1. Conduct CLAP pretraining\n    - On a single GPU card:\n        ```\n        python pretrain_step_01_CLAP.py \\\n        --protein_lr=1e-5 --protein_lr_scale=1 \\\n        --text_lr=1e-5 --text_lr_scale=1 \\\n        --protein_backbone_model=ProtBERT_BFD \\\n        --epochs=10 --batch_size=9 --num_workers=0 \\\n        --output_model_dir=\"$OUTPUT_DIR\"\n        ```\n\n    - We also support distributed training with DDP. Example using a server with 8 GPU cards:\n        ```\n        CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \\\n        python pretrain_step_01_CLAP.py \\\n        --protein_lr=1e-5 --protein_lr_scale=1 \\\n        --text_lr=1e-5 --text_lr_scale=1 \\\n        --protein_backbone_model=ProtBERT_BFD \\\n        --epochs=10 --batch_size=9 --num_workers=0 \\\n        --output_model_dir=\"$OUTPUT_DIR\"\n        ```\n\n- Step 2. Obtain frozen representation:\n    ```\n    python pretrain_step_02_empty_sequence.py \\\n    --protein_backbone_model=ProtBERT_BFD \\\n    --batch_size=16 --num_workers=0 \\\n    --pretrained_folder=\"$OUTPUT_DIR\"\n\n    python pretrain_step_02_pairwise_representation.py \\\n    --protein_backbone_model=ProtBERT_BFD \\\n    --batch_size=16 --num_workers=0 \\\n    --pretrained_folder=\"$OUTPUT_DIR\"\n    ```\n\n- Step 3. 
Learn the facilitator distribution:\n    ```\n    python pretrain_step_03_facilitator.py \\\n    --protein_lr=1e-5 --protein_lr_scale=1 \\\n    --text_lr=1e-5 --text_lr_scale=1 \\\n    --protein_backbone_model=ProtBERT_BFD \\\n    --epochs=10 --batch_size=9 --num_workers=0 \\\n    --pretrained_folder=\"$OUTPUT_DIR\" \\\n    --output_model_folder=\"$OUTPUT_DIR\"/step_03_Gaussian_10\n    ```\n\n- Step 4. Learn the decoder distribution. Note that we have three types of decoder distribution models:\n    - A Transformer-based auto-regressive decoder. Here we adopt the T5 architecture.\n        ```\n        python pretrain_step_04_decoder.py \\\n        --num_workers=0 --lr=1e-4 --epochs=50 \\\n        --decoder_distribution=T5Decoder \\\n        --score_network_type=T5Base \\\n        --hidden_dim=16 \\\n        --pretrained_folder=\"$OUTPUT_DIR\" \\\n        --output_folder=\"$OUTPUT_DIR\"/step_04_T5\n        ```\n    - A discrete denoising diffusion model (multinomial diffusion).\n        - Using RNN as score network:\n            ```\n            python pretrain_step_04_decoder.py \\\n            --num_workers=0 --lr=1e-4 --epochs=50 \\\n            --decoder_distribution=MultinomialDiffusion \\\n            --score_network_type=RNN \\\n            --hidden_dim=16 \\\n            --pretrained_folder=\"$OUTPUT_DIR\" \\\n            --output_folder=\"$OUTPUT_DIR\"/step_04_MultiDiffusion_RNN\n            ```\n\n        - Using BERT as score network:\n            ```\n            python pretrain_step_04_decoder.py \\\n            --num_workers=0 --lr=1e-4 --epochs=50 \\\n            --decoder_distribution=MultinomialDiffusion \\\n            --score_network_type=BertBase \\\n            --hidden_dim=16 \\\n            --pretrained_folder=\"$OUTPUT_DIR\" \\\n            --output_folder=\"$OUTPUT_DIR\"/step_04_MultiDiffusion_BERT\n            ```\n\n- Step 5. Learn an auto-encoder that is specifically designed for the text-guided editing task. 
You can also treat this as a downstream task.\n    ```\n    python pretrain_step_05_AE.py \\\n    --num_workers=0 --lr=1e-4 --epochs=50 \\\n    --pretrained_folder=\"$OUTPUT_DIR\" \\\n    --output_folder=\"$OUTPUT_DIR\"/step_05\n    ```\n\n## 4 Downstream Tasks\n\nWe include three types of downstream tasks, introduced below. You can find the scripts for the first two downstream tasks under the folder `scripts`.\n\n### 4.1 Text-to-Protein Generation\n\nFirst let's go to the folder `examples/downstream_Text2Protein`.\n\nThen we sample text sequences for text-to-protein generation:\n```\npython step_01_text_retrieval.py\n```\nWe also provide the sampled text data in `step_01_text_retrieval.txt`. You can replace it with the text sequences you want to use.\n\nNow we can do the text-to-sequence generation, e.g., if we use T5 as the decoder:\n```\nexport OUTPUT_DIR=../../output/ProteinDT/hyper_01\n\npython step_02_inference_ProteinDT.py \\\n--decoder_distribution=T5Decoder --score_network_type=T5Base \\\n--num_workers=0 --hidden_dim=16 --batch_size=8 \\\n--pretrained_folder=\"$OUTPUT_DIR\" \\\n--step_04_folder=\"$OUTPUT_DIR\"/step_04_T5 \\\n--num_repeat=16 --use_facilitator --AR_generation_mode=01 \\\n--output_text_file_path=\"$OUTPUT_DIR\"/step_04_T5/downstream_Text2Protein/step_02_inference.txt\n```\n\n\n### 4.2 Zero-shot Text-guided Protein Editing\n\nFirst let's go to the folder `examples/downstream_Editing`.\n\nThe dataset preparation can be found at `examples/downstream_Editing/README.md`. You can also find it at [this HuggingFace link](https://huggingface.co/datasets/chao1224/ProteinDT/tree/main/downstream_Editing/datasets_and_checkpoints). We include three types of editing tasks: stability, structure, and peptide binding. In terms of the methods, we have two types: latent optimization and latent interpolation. 
The demo scripts are explained below.\n\n#### 4.2.1 Latent Optimization\n- Structure / Stability: `editing_task: alpha, beta, Villin, Pin1`.\n    ```\n    export OUTPUT_DIR=../../output/ProteinDT/hyper_01\n\n    python step_01_editing_latent_optimization.py \\\n    --num_workers=0 --batch_size=8 \\\n    --lambda_value=0.9 --num_repeat=16 --oracle_mode=text --temperature=2 \\\n    --editing_task=alpha --text_prompt_id=101 \\\n    --pretrained_folder=\"$OUTPUT_DIR\" \\\n    --step_05_folder=\"$OUTPUT_DIR\"/step_05_AE \\\n    --output_folder=\"$OUTPUT_DIR\"/step_05_AE/downstream_Editing_latent_optimization/alpha_prompt_101_lambda_0.9_num_repeat_16_oracle_text_T_2 \\\n    --output_text_file_path=\"$OUTPUT_DIR\"/step_05_AE/downstream_Editing_latent_optimization/alpha_prompt_101_lambda_0.9_num_repeat_16_oracle_text_T_2/step_01_editing.txt\n\n    python step_01_evaluate_structure.py \\\n    --num_workers=0 --batch_size=8 --editing_task=alpha --text_prompt_id=101 \\\n    --output_folder=\"$OUTPUT_DIR\"/step_05_AE/downstream_Editing_latent_optimization/alpha_prompt_101_lambda_0.9_num_repeat_16_oracle_text_T_2 \\\n    --output_text_file_path=\"$OUTPUT_DIR\"/step_05_AE/downstream_Editing_latent_optimization/alpha_prompt_101_lambda_0.9_num_repeat_16_oracle_text_T_2/step_01_editing.txt\n    ```\n- Peptide binding\n    ```\n    export OUTPUT_DIR=../../output/ProteinDT/hyper_01\n\n    python step_01_editing_latent_optimization.py \\\n    --num_workers=0 --batch_size=4 \\\n    --lambda_value=0.9 --num_repeat=16 --oracle_mode=text --temperature=2 \\\n    --editing_task=peptide_binding --text_prompt_id=101 \\\n    --pretrained_folder=\"$OUTPUT_DIR\" \\\n    --step_05_folder=\"$OUTPUT_DIR\"/step_05_AE \\\n    --output_folder=\"$OUTPUT_DIR\"/step_05_AE/downstream_Editing_latent_optimization/peptide_binding_prompt_101_lambda_0.9_num_repeat_16_oracle_text_T_2 \\\n    
--output_text_file_path=\"$OUTPUT_DIR\"/step_05_AE/downstream_Editing_latent_optimization/peptide_binding_prompt_101_lambda_0.9_num_repeat_16_oracle_text_T_2/step_02_editing.txt\n    ```\n\n\n#### 4.2.2 Latent Interpolation\nNote that for latent interpolation, we have three models: an auto-regressive model (T5) and two denoising diffusion models (RNN and BERT). We provide demo scripts using T5.\n- Structure / Stability: `editing_task: alpha, beta, Villin, Pin1`.\n    ```\n    export OUTPUT_DIR=../../output/ProteinDT/hyper_01\n\n    python step_01_editing_latent_interpolation.py \\\n    --editing_task=alpha --text_prompt_id=101 \\\n    --decoder_distribution=T5Decoder --score_network_type=T5Base \\\n    --num_workers=0 --hidden_dim=16 --batch_size=2 \\\n    --theta=0.9 --num_repeat=16 --oracle_mode=text --AR_generation_mode=01 --AR_condition_mode=expanded \\\n    --pretrained_folder=\"$OUTPUT_DIR\" --step_04_folder=\"$OUTPUT_DIR\"/step_04_T5 \\\n    --output_folder=\"$OUTPUT_DIR\"/step_04_T5/downstream_Editing_latent_interpolation_alpha/prompt_101_theta_0.9_num_repeat_16_oracle_text_inference_01_expanded \\\n    --output_text_file_path=\"$OUTPUT_DIR\"/step_04_T5/downstream_Editing_latent_interpolation_alpha/prompt_101_theta_0.9_num_repeat_16_oracle_text_inference_01_expanded/step_01_editing.txt\n\n    python step_01_evaluate_structure.py \\\n    --num_workers=0 --batch_size=1 \\\n    --editing_task=alpha --text_prompt_id=101 \\\n    --output_folder=\"$OUTPUT_DIR\"/step_04_T5/downstream_Editing_latent_interpolation_alpha/prompt_101_theta_0.9_num_repeat_16_oracle_text_inference_01_expanded \\\n    --output_text_file_path=\"$OUTPUT_DIR\"/step_04_T5/downstream_Editing_latent_interpolation_alpha/prompt_101_theta_0.9_num_repeat_16_oracle_text_inference_01_expanded/step_01_editing.txt\n    ```\n- Peptide binding\n    ```\n    export OUTPUT_DIR=../../output/ProteinDT/hyper_01\n\n    python step_02_binding_editing_latent_interpolation.py \\\n    --editing_task=peptide_binding 
--text_prompt_id=101 \\\n    --decoder_distribution=T5Decoder --score_network_type=T5Base \\\n    --num_workers=0 --hidden_dim=16 --batch_size=1 \\\n    --theta=0.9 --num_repeat=16 --oracle_mode=text --AR_generation_mode=01 --AR_condition_mode=expanded \\\n    --pretrained_folder=\"$OUTPUT_DIR\" --step_04_folder=\"$OUTPUT_DIR\"/step_04_T5 \\\n    --output_folder=\"$OUTPUT_DIR\"/step_04_T5/downstream_Editing_latent_interpolation_peptide_binding/prompt_101_theta_0.9_num_repeat_16_oracle_text_inference_01_expanded \\\n    --output_text_file_path=\"$OUTPUT_DIR\"/step_04_T5/downstream_Editing_latent_interpolation_peptide_binding/prompt_101_theta_0.9_num_repeat_16_oracle_text_inference_01_expanded/step_02_editing.txt\n    ```\n\n\n\n### 4.3 Protein Property Prediction\n\nFirst, please download the TAPE data following the instructions [here](https://github.com/songlab-cal/tape?tab=readme-ov-file#lmdb-data). We also provide it at [this HuggingFace link](https://huggingface.co/datasets/chao1224/ProteinDT/tree/main).\n\nThe script is `downstream_TAPE.py`, under `examples`. 
We follow exactly the same hyper-parameters as [OntoProtein](https://github.com/zjunlp/OntoProtein).\n\n```\npython downstream_TAPE.py \\\n--task_name=ss3 \\\n--seed=3 \\\n--learning_rate=3e-5 \\\n--num_train_epochs=5 \\\n--per_device_train_batch_size=2 \\\n--gradient_accumulation_steps=8 \\\n--warmup_ratio=0.08 \\\n--pretrained_model=ProteinDT \\\n--pretrained_folder=\"$OUTPUT_DIR\" \\\n--output_dir=\"$OUTPUT_DIR\"/downstream_TAPE\n```\n\n\n## Cite Us\nFeel free to cite this work if you find it useful!\n```\n@article{liu2023text,\n    title={A Text-guided Protein Design Framework},\n    author={Shengchao Liu and Yanjing Li and Zhuoxinran Li and Anthony Gitter and Yutao Zhu and Jiarui Lu and Zhao Xu and Weili Nie and Arvind Ramanathan and Chaowei Xiao and Jian Tang and Hongyu Guo and Anima Anandkumar},\n    journal={arXiv preprint arXiv:2302.04611},\n    year={2023}\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fchao1224%2Fproteindt","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fchao1224%2Fproteindt","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fchao1224%2Fproteindt/lists"}