https://github.com/pku-yuangroup/taxdiff
The official code for "TaxDiff: Taxonomic-Guided Diffusion Model for Protein Sequence Generation"
ai4science generate-model meachine-learning protein-sequences
- Host: GitHub
- URL: https://github.com/pku-yuangroup/taxdiff
- Owner: PKU-YuanGroup
- License: mit
- Created: 2024-02-26T08:11:46.000Z (10 months ago)
- Default Branch: main
- Last Pushed: 2024-08-23T06:27:13.000Z (5 months ago)
- Last Synced: 2024-12-27T11:13:03.890Z (9 days ago)
- Topics: ai4science, generate-model, meachine-learning, protein-sequences
- Language: Python
- Homepage:
- Size: 3.53 MB
- Stars: 58
- Watchers: 5
- Forks: 8
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# TaxDiff: Taxonomic-Guided Diffusion Model for Protein Sequence Generation
[![arXiv](https://img.shields.io/badge/Arxiv-2402.17156-b31b1b.svg?logo=arXiv)](https://arxiv.org/abs/2402.17156)
[![License](https://img.shields.io/badge/Code%20License-MIT-yellow)](https://github.com/HowardLi1984/ECDFormer/blob/main/LICENSE)
[![HuggingFace](https://img.shields.io/badge/Hugging%20Face-TaxDiff%20-blue)](https://huggingface.co/linzy19/TaxDiff/tree/main)
[![Data License](https://img.shields.io/badge/Dataset%20license-CC--BY--NC%204.0-orange)](https://github.com/HowardLi1984/ECDFormer/blob/main/DATASET_LICENSE)
If you like our project, please give us a star ⭐ on GitHub to follow the latest updates.

The official code for "TaxDiff: Taxonomic-Guided Diffusion Model for Protein Sequence Generation". This repository currently provides the inference code of TaxDiff. The training code and the dataset of protein sequences with taxonomic labels will be released after our paper is accepted.
💡 I also have other AI for Science projects that may interest you ✨.
> [**ProLLaMA: A Protein Large Language Model for Multi-Task Protein Language Processing**](https://github.com/Lyu6PosHao/ProLLaMA)
>Liuzhenghao Lv, Zongying Lin, Li Hao, Yuyang Liu, Jiaxi Cui, Calvin Yu-Chian Chen, Li Yuan, Yonghong Tian
[![github](https://img.shields.io/badge/-Github-black?logo=github)](https://github.com/Lyu6PosHao/ProLLaMA)
[![arXiv](https://img.shields.io/badge/Arxiv-2402.16445-b31b1b.svg?logo=arXiv)](https://arxiv.org/abs/2402.16445)

## 😮 Highlights
### 💡 Protein Sequence Generation Model
- To the best of our knowledge, TaxDiff is **the first controllable protein generation model** that uses guidance from taxonomies.
### 🔥 Diffusion-based Framework
- TaxDiff proposes a **taxonomic-guided framework** that can be applied to any diffusion-based protein design model. We also propose a patchify attention mechanism for better protein design.
### ⭐ Excellent performance
- Experiments demonstrate that TaxDiff achieves **state-of-the-art results** in both taxonomic-guided controllable and unconditional protein sequence generation, excelling in structural modeling scores and sequence consistency.

## 🚀 Main Results
More detailed results can be found in our paper.
### Unconditional Generation
### Controllable Generation
## 📖 Data Preparation
For inference, download the checkpoint archive from [HuggingFace](https://huggingface.co/linzy19/TaxDiff/tree/main), unzip it, and place the [ckpt](https://huggingface.co/linzy19/TaxDiff/tree/main) file in the `ckpt/` folder:
```bash
ckpt/0012802_eval.ckpt
```
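Before running inference, it can help to confirm that the checkpoint landed where the README expects it. The following is a minimal sketch, not part of the TaxDiff code base: the helper name `find_checkpoint` is hypothetical, and the default path is simply the one listed above.

```python
from pathlib import Path

# Path taken from this README; adjust it if you keep the checkpoint elsewhere.
DEFAULT_CKPT = Path("ckpt") / "0012802_eval.ckpt"


def find_checkpoint(path: Path = DEFAULT_CKPT) -> Path:
    """Return `path` if it points to a non-empty file, otherwise raise.

    Hypothetical helper for illustration; TaxDiff itself does not ship it.
    """
    if not path.is_file():
        raise FileNotFoundError(
            f"Checkpoint not found at {path}. Download and unzip it first."
        )
    if path.stat().st_size == 0:
        # An empty file usually means an interrupted download.
        raise ValueError(f"Checkpoint at {path} is empty.")
    return path
```

Calling `find_checkpoint()` from the repository root either returns the path or fails early with a readable message, which is cheaper than discovering a missing file after the model has been built.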
Our dataset can be downloaded from [HuggingFace](https://huggingface.co/linzy19/TaxDiff/tree/main):
```bash
uniref50_200_256_clean_taxnomic_family_tid__filter_layer6.fasta
```

We will release the protein sequences with taxonomic labels used for training once our paper is accepted.
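To inspect the downloaded FASTA file, a small reader is enough. The sketch below assumes only the standard FASTA layout (a `>` header line followed by one or more sequence lines). It treats the header as an opaque string, because the README does not document how taxonomic information is encoded there. The function name `read_fasta` is hypothetical.

```python
from typing import Iterator, TextIO, Tuple


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from an open FASTA file.

    The header is returned without the leading '>' and is not parsed further.
    """
    header = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            # Sequence lines may be wrapped; join them back together.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Example usage (file name as listed in this README):
# with open("uniref50_200_256_clean_taxnomic_family_tid__filter_layer6.fasta") as fh:
#     for header, sequence in read_fasta(fh):
#         print(header, len(sequence))
```

Because the reader is a generator, it processes the file one record at a time and never loads the whole file into memory.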
If you want to generate proteins for a specific taxonomic group, first find its tax-id in [data_reader/Taxonnmic_classfication.xlsx](https://github.com/Linzy19/TaxDiff/blob/main/data_reader/Taxonnmic_classfication.xlsx), and then modify the protein class labels in [sample_protein.py](https://github.com/Linzy19/TaxDiff/blob/main/sample_protein.py).
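One way to pin every generated sample to a single taxonomic group is to replace the random labels with a constant. The sketch below is an illustration rather than the authors' code. It assumes the tax-id from the spreadsheet is used directly as the class label and that valid labels lie in the range sampled by the default call, 1 to 23426 inclusive (the `high` argument of `torch.randint` is exclusive). The helper `fixed_class_labels` is hypothetical.

```python
from typing import List

# Exclusive upper bound copied from the default torch.randint call in this README.
NUM_CLASSES = 23427


def fixed_class_labels(
    tax_id: int, num: int, num_classes: int = NUM_CLASSES
) -> List[List[int]]:
    """Build a 1 x num nested list in which every entry is `tax_id`.

    Assumption: the tax-id is used as-is as the class label.
    """
    if not 1 <= tax_id < num_classes:
        raise ValueError(f"tax_id must be in [1, {num_classes - 1}], got {tax_id}")
    if num < 1:
        raise ValueError("num must be a positive integer")
    return [[tax_id] * num]


# In sample_protein.py the result could replace the random labels, for example:
# class_lables = torch.tensor(fixed_class_labels(tax_id=42, num=num))
```

`torch.tensor` turns the nested list into an integer tensor of shape `(1, num)`, which matches the shape of the default labels. For reference, the default line in `sample_protein.py`, which draws random labels, is: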
```python
class_lables = torch.randint(low=1, high=int(23427), size=(1,num))
```

## 🛠️ Requirements and Installation
* Python == 3.10
* PyTorch == 2.2.0
* Torchvision == 0.17.0
* CUDA Version == 12.0
* Install required packages:
```bash
git clone https://github.com/Linzy19/TaxDiff.git
cd TaxDiff
pip install -r requirements.txt
```

## 🗝️ Inference
Inference is run through [sample_protein.py](sample_protein.py):
```bash
python sample_protein.py --model DiT-pro-12-h6-L16 --cuda-num cuda:0 --num 500
```

## ✏️ Citation
If you find our paper and code useful in your research, please consider giving us a star :star: and a citation :pencil:.

```BibTeX
@article{zongying2024taxdiff,
title={TaxDiff: Taxonomic-Guided Diffusion Model for Protein Sequence Generation},
author={Zongying, Lin and Hao, Li and Liuzhenghao, Lv and Bin, Lin and Junwu, Zhang and Yu-Chian, Chen Calvin and Li, Yuan and Yonghong, Tian},
journal={arXiv preprint arXiv:2402.17156},
year={2024}
}
```