{"id":13511044,"url":"https://github.com/microsoft/foldingdiff","last_synced_at":"2025-03-30T19:30:48.167Z","repository":{"id":60733797,"uuid":"509133407","full_name":"microsoft/foldingdiff","owner":"microsoft","description":"Diffusion models of protein structure; trigonometry and attention are all you need!","archived":false,"fork":false,"pushed_at":"2023-12-12T17:55:03.000Z","size":239473,"stargazers_count":538,"open_issues_count":17,"forks_count":63,"subscribers_count":18,"default_branch":"main","last_synced_at":"2025-03-21T23:42:05.502Z","etag":null,"topics":["diffusion","diffusion-models","protein","protein-structure-generation","proteins","transformer"],"latest_commit_sha":null,"homepage":"https://www.nature.com/articles/s41467-024-45051-2","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/microsoft.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":"SECURITY.md","support":"SUPPORT.md","governance":null,"roadmap":null,"authors":null}},"created_at":"2022-06-30T15:24:37.000Z","updated_at":"2025-03-21T06:57:31.000Z","dependencies_parsed_at":"2024-01-13T19:32:39.090Z","dependency_job_id":null,"html_url":"https://github.com/microsoft/foldingdiff","commit_stats":null,"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/microsoft%2Ffoldingdiff","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/microsoft%2Ffoldingdiff/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/microsoft%2Ffoldingdiff/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/
GitHub/repositories/microsoft%2Ffoldingdiff/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/microsoft","download_url":"https://codeload.github.com/microsoft/foldingdiff/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":246368633,"owners_count":20766054,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["diffusion","diffusion-models","protein","protein-structure-generation","proteins","transformer"],"created_at":"2024-08-01T03:00:32.154Z","updated_at":"2025-03-30T19:30:43.154Z","avatar_url":"https://github.com/microsoft.png","language":"Jupyter Notebook","funding_links":[],"categories":["PPI"],"sub_categories":["Year 2023"],"readme":"# foldingdiff - Diffusion model for protein backbone generation\n\n[![DOI](https://zenodo.org/badge/509133407.svg)](https://zenodo.org/doi/10.5281/zenodo.10365889) [![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black) ![PyTorch Lightning](https://img.shields.io/badge/pytorch-lightning-blue.svg?logo=PyTorch%20Lightning)\n\nWe present a diffusion model for generating novel protein backbone structures. For more details, see our preprint on [arXiv](https://arxiv.org/abs/2209.15611). 
We also host a trained version of our model on [HuggingFace spaces](https://huggingface.co/spaces/wukevin/foldingdiff) and at [SuperBio](https://app.superbio.ai/apps/240?id=63dd1aecbd2a3db57fdf1e42) so you can get started with generating protein structures with just your browser!\n\n![Animation of diffusion model protein folds over timesteps](plots/generated_0.gif)\n\n## Installation\n\nTo install, clone this repository using `git clone`. This software is written in Python, notably using PyTorch, PyTorch Lightning, and the HuggingFace transformers library. The required conda environment is defined within the `environment.yml` file. To set this up, make sure you have conda (or [mamba](https://mamba.readthedocs.io/en/latest/index.html)) installed, clone this repository, and run:\n\n```bash\nconda env create -f environment.yml\nconda activate foldingdiff\npip install -e ./  # make sure ./ is the dir including setup.py\n```\n\n### Downloading data\n\nWe require some data files not packaged on Git due to their large size. These are not required for sampling (as long as you are not using the `--testcomparison` option, see below); they are required only for training your own model. We provide a script in the `data` dir to download requisite CATH data.\n\n```bash\n# Download the CATH dataset\ncd data  # Ensure that you are in the data subdirectory within the codebase\nchmod +x download_cath.sh\n./download_cath.sh\n```\n\nIf the download link in the `.sh` file is not working, the tarball is also mirrored at the following [Dropbox link](https://www.dropbox.com/s/ka5m5lx58477qu6/cath-dataset-nonredundant-S40.pdb.tgz?dl=0).\n\n## Training models\n\nTo train your own model on the CATH dataset, use the script at `bin/train.py` in combination with one of the\njson config files under `config_jsons` (or write your own). 
An example usage of this is as follows:\n\n```bash\npython bin/train.py config_jsons/cath_full_angles_cosine.json --dryrun\n```\n\nBy default, the training script will calculate the KL divergence at each timestep before starting training, which can be quite computationally expensive with more timesteps. To skip this, append the `--dryrun` flag. The output of the model will be in the `results` folder with the following major files present:\n\n```\nresults/\n    - config.json           # Contains the config file for the huggingface BERT model itself\n    - logs/                 # Contains the logs from training\n    - models/               # Contains model checkpoints. By default we store the best 5 models by validation loss and the best 5 by training loss\n    - training_args.json    # Full set of arguments, can be used to reproduce run\n```\n\n## Pre-trained models\n\nWe provide weights for a model trained on the CATH dataset. These weights are stored on HuggingFace model hub at [wukevin/foldingdiff_cath](https://huggingface.co/wukevin/foldingdiff_cath). The following code snippet shows how to load this model, load data (assuming it's been downloaded), and perform a forward pass:\n\n```python\nfrom huggingface_hub import snapshot_download\nfrom torch.utils.data.dataloader import DataLoader\nfrom foldingdiff import modelling\nfrom foldingdiff import datasets as dsets\n\n# Load the model (files will be cached for future calls)\nm = modelling.BertForDiffusion.from_dir(snapshot_download(\"wukevin/foldingdiff_cath\"))\n\n# Load dataset\n# As part of loading, we try to compute internal angles in parallel. 
This may\n# throw warnings like the following; this is normal.\n# WARNING:root:Illegal values for omega in /home/*/projects/foldingdiff-main/data/cath/dompdb/2ebqA00 -- skipping\n# After computing these once, the results are saved in a .pkl file under the\n# foldingdiff source directory for faster loading in future calls.\nclean_dset = dsets.CathCanonicalAnglesOnlyDataset(pad=128, trim_strategy='randomcrop')\nnoised_dset = dsets.NoisedAnglesDataset(clean_dset, timesteps=1000, beta_schedule='cosine')\ndl = DataLoader(noised_dset, batch_size=32, shuffle=False)\nx = next(iter(dl))  # Python 3: iterators have no .next() method\n\n# Forward pass\npredicted_noise = m(x['corrupted'], x['t'], x['attn_mask'])\n```\n\n## Sampling protein backbones\n\nTo sample protein backbones, use the script `bin/sample.py`. Example commands to do this using the pretrained weights described above are as follows.\n\n```bash\n# To sample 10 backbones per length ranging from [50, 128) with a batch size of 512 - reproduces results in our manuscript\npython ~/projects/foldingdiff/bin/sample.py -l 50 128 -n 10 -b 512 --device cuda:0\n```\n\nThis will run the trained model hosted at [wukevin/foldingdiff_cath](https://huggingface.co/wukevin/foldingdiff_cath) and generate sequences of varying lengths. If you wish to load the test dataset and include test chains in the generated plots, use the option `--testcomparison`; note that this requires downloading the CATH dataset, see above. 
Running `sample.py` will create the following directory structure in the directory where it is run:\n\n```\nsome_dir/\n    - plots/            # Contains plots comparing the distribution of training/generated angles\n    - sampled_angles/   # Contains .csv.gz files with the sampled angles\n    - sampled_pdb/      # Contains .pdb files from converting the sampled angles to cartesian coordinates\n    - model_snapshot/   # Contains a copy of the model used to produce results\n```\n\nNot specifying a `--device` will default to the first device `cuda:0`; use `--device cpu` to run on CPU (though this will be very slow). See the following table for runtimes from our machines.\n\n| Device | Runtime estimates sampling 512 structures |\n| --- | --- |\n| Nvidia RTX 2080Ti | 7 minutes |\n| i9-9960X (16 physical cores) | 2 hours |\n\n### Maximum training similarity TM scores\n\nAfter generating sequences, we can calculate TM-scores to evaluate the similarity between the generated sequences and the original sequences. This is done using the script under `bin/tmscore_training.py` and requires the data to have been downloaded beforehand (see above).\n\n### Visualizing diffusion \"folding\" process\n\nThe above sampling code can also be run with the ``--fullhistory`` flag to write an additional subdirectory `sample_history` under each of the `sampled_angles` and `sampled_pdb` folders that contains pdb/csv files corresponding to each timestep in the sampling process. The pdb files, for example, can then be passed into the script under `foldingdiff/pymol_vis.py` to generate a gif of the folding process (as shown above). 
An example command to do this is:\n\n```bash\npython ~/projects/foldingdiff/foldingdiff/pymol_vis.py pdb2gif -i sampled_pdb/sample_history/generated_0/*.pdb -o generated_0.gif\n```\n\n**Note:** this script lives separately from other plotting code because it depends on PyMOL; feel free to install/activate your own installation of PyMOL for this, or set up an environment using [PyMOL open source](https://github.com/schrodinger/pymol-open-source).\n\n## Evaluating designability of generated backbones\n\nOne way to evaluate the quality of generated backbones is via their \"designability\". This refers to whether or not we can design an amino acid chain that will fold into the designed backbone. To evaluate this, we use an inverse folding model to generate amino acid sequences that are predicted to fold into our generated backbone, and check whether those generated sequences actually fold into a structure comparable to our backbone.\n\n### Inverse folding\n\nInverse folding is the task of predicting a sequence of amino acids that will produce a given protein backbone structure. We evaluated two different methods for this step, ProteinMPNN and ESM-IF1; we find ProteinMPNN to be significantly more performant. In our analyses, we generate 8 different amino acid sequences for each of FoldingDiff's generated structures.\n\n#### ESM-IF1\n\nWe use a different conda environment for [ESM-IF1](https://proceedings.mlr.press/v162/hsu22a.html); see this [Jupyter notebook](https://colab.research.google.com/github/facebookresearch/esm/blob/main/examples/inverse_folding/notebook.ipynb) for setup details. 
We found that the following series of commands works on our machines:\n\n```bash\nmamba create -n inverse python=3.9 pytorch cudatoolkit pyg -c pytorch -c conda-forge -c pyg\nconda activate inverse\nmamba install -c conda-forge biotite\npip install git+https://github.com/facebookresearch/esm.git\n```\n\nAfter this, we `cd` into the folder that contains the `sampled_pdb` directory created by the prior step, and run:\n\n```bash\npython ~/projects/foldingdiff/bin/pdb_to_residues_esm.py sampled_pdb -o esm_residues\n```\n\nThis creates a new folder, `esm_residues`, that contains 10 candidate amino acid sequences for each of the pdb files contained in `sampled_pdb`.\n\n#### ProteinMPNN\n\nTo set up [ProteinMPNN](https://www.science.org/doi/10.1126/science.add2187), see the authors' guide on their [GitHub](https://github.com/dauparas/ProteinMPNN).\n\nAfter this, we follow a procedure similar to that for ESM-IF1 (above): we `cd` into the directory containing the `sampled_pdb` folder and run:\n\n```bash\npython ~/projects/foldingdiff/bin/pdb_to_residue_proteinmpnn.py sampled_pdb\n```\n\nThis will create a new directory called `proteinmpnn_residues` containing 8 amino acid chains per sampled PDB structure.\n\n### Structural prediction\n\nAfter generating amino acid sequences, we check that these recapitulate our original sampled structures by passing them through either OmegaFold or AlphaFold. After running one of these structure predictors, we use the following command to assess self-consistency TM scores:\n\n```bash\npython ~/projects/foldingdiff/bin/sctm.py -f alphafold_predictions_proteinmpnn\n```\n\nHere, `alphafold_predictions_proteinmpnn` is a folder containing the folded structures corresponding to inverse folded amino acid sequences. 
This produces a json file of all scTM scores, as well as various pdf files containing plots and correlations of the scTM score distribution.\n\n#### OmegaFold\n\nWe primarily use [OmegaFold](https://github.com/HeliXonProtein/OmegaFold) to fold the amino acid sequences produced by either ESM-IF1 or ProteinMPNN. This is due to OmegaFold's relatively fast runtime compared to AlphaFold2, and due to the fact that OmegaFold is natively designed to be run without MSA information - making it more suitable for our protein design task.\n\nAfter creating and activating a separate conda environment and following the authors' instructions for installing OmegaFold, we use the following script to split our input amino acid fasta files across GPUs for inference, and subsequently calculate the self-consistency TM (scTM) scores.\n\n```bash\n# Fold each fasta, spreading the work over GPUs 0 and 1, outputs to omegafold_predictions folder\npython ~/projects/foldingdiff/bin/omegafold_across_gpus.py esm_residues/*.fasta -g 0 1\n```\n\n#### AlphaFold2\n\nWe run [AlphaFold2](https://github.com/deepmind/alphafold) via the `localcolabfold` installation method (see [GitHub](https://github.com/YoshitakaMo/localcolabfold)). Due to AlphaFold's runtime requirements, we provide scripts to split the set of fasta files into subdirectories that can then be separately folded; see SLURM script under `scripts/slurm/alphafold.sbatch` for an example.\n\n## Tests\n\nTests are implemented through a mixture of doctests and unittests. 
To run unittests, run:\n\n```bash\npython -m unittest -v\n```\n\nYou may see warnings like the following; these are expected.\n\n```bash\nWARNING:root:Illegal values for omega in protdiff-main/data/cath/dompdb/5a2qw00 -- skipping\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmicrosoft%2Ffoldingdiff","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmicrosoft%2Ffoldingdiff","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmicrosoft%2Ffoldingdiff/lists"}