https://github.com/tiger-ai-lab/disprotedit
Official Repo for "DisProtEdit: Exploring Disentangled Representations for Multi-Attribute Protein Editing" [ICMLW 2025]
- Host: GitHub
- URL: https://github.com/tiger-ai-lab/disprotedit
- Owner: TIGER-AI-Lab
- License: mit
- Created: 2025-06-17T04:40:49.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2025-06-20T20:36:32.000Z (about 1 year ago)
- Last Synced: 2025-07-27T06:35:16.650Z (11 months ago)
- Topics: protein, protein-editing
- Language: Python
- Homepage: https://tiger-ai-lab.github.io/DisProtEdit/
- Size: 10.4 MB
- Stars: 2
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# 🧬 DisProtEdit
[**Homepage**](https://tiger-ai-lab.github.io/DisProtEdit/) | [**arXiv**](https://arxiv.org/abs/2506.14853)
This repo contains the codebase for our paper:
**DisProtEdit: Exploring Disentangled Representations for Multi-Attribute Protein Editing**
**ICML 2025 Workshops (GenBio, FM4LS)**
---
## Introduction
DisProtEdit is a protein editing framework that disentangles structural and functional properties using dual-channel natural language supervision. It learns modular latent representations aligned with protein sequences through a combination of alignment, uniformity, and angular MMD losses. Editing is performed via text modification, enabling interpretable and controllable edits to protein structure or function.

See https://tiger-ai-lab.github.io/DisProtEdit/ for more info.
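To make the alignment and uniformity terms mentioned above concrete, here is a minimal, framework-free sketch. It assumes the commonly used formulation on unit-normalized embeddings (alignment as the mean squared distance between paired embeddings, uniformity as the log of the mean Gaussian potential over all pairs). The function names and the temperature `t` are illustrative, not taken from this repository, and the angular MMD term is not shown; the actual training code may differ.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def alignment_loss(protein_embs, text_embs):
    """Mean squared distance between paired, unit-normalized embeddings.

    Lower values mean each protein embedding sits closer to its paired
    text embedding.
    """
    dists = [
        sq_dist(l2_normalize(p), l2_normalize(t))
        for p, t in zip(protein_embs, text_embs)
    ]
    return sum(dists) / len(dists)


def uniformity_loss(embs, t=2.0):
    """Log of the mean Gaussian potential over all distinct pairs.

    Lower values mean the embeddings are spread more evenly over the
    unit hypersphere. `t` is an assumed temperature, not a repo setting.
    """
    z = [l2_normalize(e) for e in embs]
    potentials = [
        math.exp(-t * sq_dist(z[i], z[j]))
        for i in range(len(z))
        for j in range(i + 1, len(z))
    ]
    return math.log(sum(potentials) / len(potentials))
```

In this sketch, perfectly matched pairs give an alignment loss of 0, and the uniformity loss becomes more negative as the embeddings spread apart.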
---
## 📰 News
- **2025 Jun 20**: Released the SwissProtDis dataset, the editing benchmark, and the full training code.
- **2025 Jun 18**: Paper available on arXiv.
- **2025 Jun 17**: Website created!
- **2025 Jun 11**: DisProtEdit accepted to ICMLW GenBio.
- **2025 Jun 10**: DisProtEdit accepted to ICMLW FM4LS.
---
## 📦 SwissProtDis Dataset
We introduce **SwissProtDis**, a large-scale multimodal dataset containing:
- ~540,000 protein sequences
- Automatically decomposed structural and functional text descriptions from UniProt using GPT-4o
[https://huggingface.co/datasets/TIGER-Lab/SwissProtDis_500k](https://huggingface.co/datasets/TIGER-Lab/SwissProtDis_500k)
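The dataset can presumably be pulled from the Hugging Face Hub with the `datasets` library. The sketch below assumes `datasets` is installed (`pip install datasets`) and that the repository ID in the link above loads with default settings. The helper name `load_swissprotdis` is hypothetical, and no split or column names are assumed.

```python
# Repository ID taken from the dataset link above.
DATASET_ID = "TIGER-Lab/SwissProtDis_500k"


def load_swissprotdis(dataset_id=DATASET_ID):
    """Download the dataset from the Hugging Face Hub, or reuse the local cache."""
    # Imported lazily so the module can be imported without `datasets` installed.
    from datasets import load_dataset

    # Without a `split` argument this returns every available split.
    return load_dataset(dataset_id)
```

Calling `print(load_swissprotdis())` lists the available splits and column names, which is a quick way to inspect the schema before wiring the data into training.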
---
## ⚙️ Environment Setup
```shell
conda create -n disprot python=3.10
conda activate disprot
pip install -r requirements.txt
```
## Training
Training multimodal embeddings:
```shell
./1_SFs.sh
./2_empty_SFs.sh
./2_sample_SFs.sh
```
Alternatively, you can run:
```shell
# Where step 01 writes its output, and the pretrained folder that step 02 reads.
export OUTPUT_DIR="output/"
export PRETRAINED_DIR="output/SFs_AU_b24_gpt4o_500k_DisAngle10"

# Step 01: pretraining.
# Note the space before each trailing backslash; without it the shell
# glues the next line onto the previous argument.
CUDA_VISIBLE_DEVICES=0,1,2,3 \
python3 pretrain_step_01_SFs.py \
--protein_lr=1e-5 --protein_lr_scale=1 \
--text_lr=1e-5 --text_lr_scale=1 --CL_loss="EBM_NCE" \
--protein_backbone_model=ProtBERT_BFD --wandb_name="SFs_AU05_b24_gpt4o_500k" \
--epochs=10 --batch_size=24 --num_workers=0 --verbose \
--output_model_dir="$OUTPUT_DIR" --CL=0.0 --D=0.0 --U=0.5 --A=1.0 --dis_angle --ds_llm="gpt4o" --ds_name="500k"

# Step 02: empty-sequence pass.
python3 pretrain_step_02_empty_sequence_SFs.py \
--protein_backbone_model=ProtBERT_BFD \
--batch_size=16 --num_workers=4 \
--pretrained_folder="$PRETRAINED_DIR" \
--target_subfolder="pairwise_all"

# Step 02: pairwise representations.
python3 pretrain_step_02_pairwise_representation_SFs.py \
--protein_backbone_model=ProtBERT_BFD \
--batch_size=16 --num_workers=4 \
--pretrained_folder="$PRETRAINED_DIR" \
--target_subfolder="pairwise_all" \
--ds_llm="gpt4o"
```
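If you prefer to drive the three embedding-training scripts from Python and stop at the first failure, a small wrapper along these lines may help. It is only a sketch: the script names come from this README, but running them strictly in this order via `bash` from the repository root is an assumption, and `build_commands` and `run_pipeline` are hypothetical helper names.

```python
import subprocess

# Script names from the README; the sequential order is an assumption.
TRAINING_SCRIPTS = ["1_SFs.sh", "2_empty_SFs.sh", "2_sample_SFs.sh"]


def build_commands(scripts=TRAINING_SCRIPTS):
    """Turn each script name into an argument list for subprocess."""
    return [["bash", script] for script in scripts]


def run_pipeline(scripts=TRAINING_SCRIPTS, dry_run=False):
    """Run the scripts one after another, aborting on the first failure.

    With dry_run=True the commands are only printed, not executed.
    """
    commands = build_commands(scripts)
    for cmd in commands:
        if dry_run:
            print("would run:", " ".join(cmd))
        else:
            # check=True raises CalledProcessError on a non-zero exit code.
            subprocess.run(cmd, check=True)
    return commands
```

Calling `run_pipeline(dry_run=True)` first is a cheap way to confirm the command list before committing GPU time.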
Training the decoder for the editing task:
```shell
./4_SFs.sh
```
Alternatively, you can run:
```shell
export PRETRAINED_DIR="output/SFs_AU_b24_gpt4o_500k_DisAngle10"
CUDA_VISIBLE_DEVICES=0 \
python pretrain_step_04_decoder_SFs.py \
--batch_size=8 --lr=1e-4 --epochs=10 \
--decoder_distribution=T5Decoder \
--score_network_type=T5Base --wandb_name="SFs_AU_b24_gpt4o_500k_DisAngle10" \
--hidden_dim=16 --verbose \
--pretrained_folder="$PRETRAINED_DIR" \
--output_model_dir="$PRETRAINED_DIR"/step_04_T5 \
--target_subfolder="pairwise_all"
```
## 🧪 Editing Benchmark
Please see [_datasets_and_checkpoints](https://github.com/TIGER-AI-Lab/DisProtEdit/blob/main/_datasets_and_checkpoints).
* The benchmark contains 196 protein inputs, suitable for both structural and functional editing.
* Please refer to `editing_dis_interpolation.py`.
## Downstream Tasks
### Editing
Multi-Attribute Protein Editing
```shell
./5_medit.sh
```
### TAPE
Protein Properties Prediction
```shell
./5_TAPE.sh
```
The code is built upon [TAPE in ProteinDT](https://github.com/chao1224/ProteinDT/blob/main/examples/downstream_TAPE.py).
---
## Citation
```bibtex
@misc{ku2025disproteditexploringdisentangledrepresentations,
title={DisProtEdit: Exploring Disentangled Representations for Multi-Attribute Protein Editing},
author={Max Ku and Sun Sun and Hongyu Guo and Wenhu Chen},
year={2025},
booktitle={ICML Workshop on Generative AI and Biology},
eprint={2506.14853},
archivePrefix={arXiv},
primaryClass={q-bio.QM},
url={https://arxiv.org/abs/2506.14853},
}
```
## Acknowledgements
This code is heavily built upon [ProteinDT](https://github.com/chao1224/ProteinDT). We thank all the contributors for open-sourcing their work.