{"id":13685272,"url":"https://github.com/microsoft/evodiff","last_synced_at":"2025-05-14T16:01:47.131Z","repository":{"id":194424090,"uuid":"500969679","full_name":"microsoft/evodiff","owner":"microsoft","description":"Generation of protein sequences and evolutionary alignments via discrete diffusion models","archived":false,"fork":false,"pushed_at":"2025-02-07T16:39:30.000Z","size":20685,"stargazers_count":577,"open_issues_count":13,"forks_count":89,"subscribers_count":14,"default_branch":"main","last_synced_at":"2025-04-12T00:55:23.543Z","etag":null,"topics":["discrete-diffusion","generative-model","multiple-sequence-alignment","protein-sequences"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/microsoft.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":"SECURITY.md","support":"SUPPORT.md","governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-06-07T19:02:58.000Z","updated_at":"2025-04-11T10:51:43.000Z","dependencies_parsed_at":"2023-10-10T16:23:41.156Z","dependency_job_id":"a8d3565c-4a4f-45f7-9934-498a618d13f1","html_url":"https://github.com/microsoft/evodiff","commit_stats":{"total_commits":419,"total_committers":14,"mean_commits":"29.928571428571427","dds":0.2577565632458234,"last_synced_commit":"f696cfc0e58dcb17b31bf4110aaf11a8a612b07b"},"previous_names":["microsoft/evodiff"],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/microsoft%2Fevodiff","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/microsoft%2Fevodiff/tag
s","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/microsoft%2Fevodiff/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/microsoft%2Fevodiff/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/microsoft","download_url":"https://codeload.github.com/microsoft/evodiff/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248501879,"owners_count":21114683,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["discrete-diffusion","generative-model","multiple-sequence-alignment","protein-sequences"],"created_at":"2024-08-02T14:00:47.967Z","updated_at":"2025-05-14T16:01:47.046Z","avatar_url":"https://github.com/microsoft.png","language":"Python","funding_links":[],"categories":["Sequence generation","Machine Learning Tasks and Models","🔬 Domain-Specific Applications"],"sub_categories":["Foundation Models","🧬 Biology \u0026 Medicine"],"readme":"# EvoDiff\n\n### Description\nIn this work, we introduce a general-purpose diffusion framework, EvoDiff, that combines evolutionary-scale data with \nthe distinct conditioning capabilities of diffusion models for controllable protein generation in sequence space. \nEvoDiff generates high-fidelity, diverse, and structurally-plausible proteins that cover natural sequence and functional\nspace. 
Critically, EvoDiff can generate proteins inaccessible to structure-based models, such as those with disordered \nregions, while maintaining the ability to design scaffolds for functional structural motifs, demonstrating the \nuniversality of our sequence-based formulation. We envision that EvoDiff will expand capabilities in protein engineering\nbeyond the structure-function paradigm toward programmable, sequence-first design.\n\nWe evaluate our sequence and MSA models – EvoDiff-Seq and EvoDiff-MSA, respectively – across a range of generation tasks \nto demonstrate their power for controllable protein design. Below, we provide documentation for running our models.\n\nEvoDiff is described in this [preprint](https://www.biorxiv.org/content/10.1101/2023.09.11.556673v1); if you use the code from this repository or the results, please cite the preprint.\n\n\u003cp align=\"center\"\u003e\n\u003cimg src=\"img/combined-0.gif\" /\u003e\n\u003c/p\u003e\n\n----\n\n## Table of contents\n\n- [Evodiff](#EvoDiff)\n- [Table of contents](#table-of-contents)\n- [Installation](#installation)\n    - [Datasets](#datasets)\n    - [Loading pretrained models](#loading-pretrained-models)\n- [Available models](#available-models)\n- [Unconditional generation](#unconditional-sequence-generation) \n  - [Unconditional sequence generation](#unconditional-generation-with-evodiff-seq)\n  - [Unconditional MSA generation](#unconditional-generation-with-evodiff-msa)\n- [Conditional sequence generation](#conditional-sequence-generation)\n    - [Evolution-guided protein generation with EvoDiff-MSA](#evolution-guided-protein-generation-with-evodiff-msa)\n    - [Generating intrinsically disordered regions](#generating-intrinsically-disordered-regions)\n    - [Scaffolding functional motifs](#scaffolding-functional-motifs)\n- [Analysis](#analysis-of-generations)\n- [Downloading generated sequences](#downloading-generated-sequences)\n- [Docker](#docker)\n\n----\n\n## Installation\nTo download our code, we 
recommend creating a clean conda environment with Python ```v3.9.0```, and installing PyTorch (we have tested up to ```v2.7.0```):\n```\nconda create --name evodiff python=3.9\nconda activate evodiff\npip3 install torch\n```\nIn that new environment, install EvoDiff (torch-scatter may take a while): \n```\npip install evodiff\n```\nFor the bleeding edge version use: \n```\npip install git+https://github.com/microsoft/evodiff.git # bleeding edge, current repo main branch\n```\n\nWe provide a notebook with installation guidance that can be found in [examples/evodiff.ipynb](https://github.com/microsoft/evodiff/tree/main/examples/evodiff.ipynb). It also includes examples of how to generate a small number of sequences and MSAs using our models. We recommend following this notebook if you would like to use our models to generate proteins.\n\nThanks to Colby Ford, EvoDiff is available as a Space on [Hugging Face](https://huggingface.co/spaces/colbyford/evodiff).\n\nOur downstream analysis scripts make use of a variety of tools we do not include in our package installation. To run the\nscripts, please download the following packages in addition to EvoDiff:\n* [TM score](https://zhanggroup.org/TM-score/)\n* [Omegafold](https://github.com/HeliXonProtein/OmegaFold)\n* [ProteinMPNN](https://github.com/dauparas/ProteinMPNN)\n* [ESM-IF1](https://github.com/facebookresearch/esm/tree/main/esm/inverse_folding); see this [Jupyter notebook](https://colab.research.google.com/github/facebookresearch/esm/blob/main/examples/inverse_folding/notebook.ipynb) for setup details.\n* [PGP](https://github.com/hefeda/PGP)\n* [DR-BERT](https://github.com/maslov-group/DR-BERT)\n\nWe refer to the setup instructions outlined by the authors of those tools.\n\n### Datasets\nWe obtain sequences from the [UniRef50 dataset](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4375400/), which contains \napproximately 42 million protein sequences. 
\nThe Multiple Sequence Alignments (MSAs) are from the [OpenFold dataset](https://www.biorxiv.org/content/10.1101/2022.11.20.517210v2), \nwhich contains 401,381 MSAs for 140,000 unique Protein Data Bank (PDB) chains and 16,000,000 UniClust30 clusters.\nThe intrinsically disordered regions (IDR) data was obtained from the [Reverse Homology GitHub](https://github.com/alexxijielu/reverse_homology/).\n\nFor the scaffolding structural motifs task, we use the baselines compiled in RFDiffusion. We provide pdb and fasta files used for conditionally generating sequences in the [examples/scaffolding-pdbs](https://github.com/microsoft/evodiff/tree/main/examples/scaffolding-pdbs) folder. We also provide pdb files used for conditionally generating MSAs in the [examples/scaffolding-msas](https://github.com/microsoft/evodiff/tree/main/examples/scaffolding-msas) folder.\n\nTo access the UniRef50 test sequences, use the following code:\n```\ntest_data = UniRefDataset('data/uniref50/', 'rtest', structure=False) # To access the test sequences\n```\n\nThe filenames for train and validation Openfold splits are saved in `data/valid_msas.csv` and `data/train_msas.csv`\n\n### Loading pretrained models\nTo load a model:\n```\nfrom evodiff.pretrained import OA_DM_38M\n\nmodel, collater, tokenizer, scheme = OA_DM_38M()\n```\nAvailable evodiff models are:\n* ``` D3PM_BLOSUM_640M() ```\n* ``` D3PM_BLOSUM_38M() ```\n* ``` D3PM_UNIFORM_640M() ```\n* ``` D3PM_UNIFORM_38M() ```\n* ``` OA_DM_640M() ```\n* ``` OA_DM_38M() ```\n* ``` MSA_D3PM_BLOSUM_RANDSUB() ```\n* ``` MSA_D3PM_BLOSUM_MAXSUB() ```\n* ``` MSA_D3PM_UNIFORM_RANDSUB() ```\n* ``` MSA_D3PM_UNIFORM_MAXSUB() ```\n* ``` MSA_OA_DM_RANDSUB() ```\n* ``` MSA_OA_DM_MAXSUB() ```\n\nIt is also possible to load our LRAR baseline models: \n* ``` LR_AR_640M() ```\n* ``` LR_AR_38M() ```\n\nNote: if you want to download a `BLOSUM` model, you will first need to download 
[data/blosum62-special-MSA.mat](https://github.com/microsoft/evodiff/blob/main/data/blosum62-special-MSA.mat).\n\n## Available models\n\nWe investigated two types of forward processes for diffusion over discrete data modalities to determine which would be most effective. \nIn order-agnostic autoregressive diffusion [OADM](https://arxiv.org/abs/2110.02037), one amino acid is converted to a special mask token at each step in the forward process. \nAfter $T=L$ steps, where $L$ is the length of the sequence, the entire sequence is masked. \nWe additionally designed discrete denoising diffusion probabilistic models [D3PM](https://arxiv.org/abs/2107.03006) for protein sequences.\nIn EvoDiff-D3PM, the forward process corrupts sequences by sampling mutations according to a transition matrix, such that after $T$ steps the sequence is indistinguishable from a uniform sample over the amino acids.\nIn the reverse process for both, a neural network model is trained to undo the previous corruption. \nThe trained model can then generate new sequences starting from sequences of masked tokens or of uniformly-sampled amino acids for EvoDiff-OADM or EvoDiff-D3PM, respectively. \nWe trained all EvoDiff sequence models on 42M sequences from UniRef50 using a dilated convolutional neural network architecture introduced in the [CARP](https://doi.org/10.1101/2022.05.19.492714) protein masked language model.\nWe trained 38M-parameter and 640M-parameter versions for each forward corruption scheme and for left-to-right autoregressive (LRAR) decoding. \n\nTo explicitly leverage evolutionary information, we designed and trained EvoDiff MSA models using the [MSA Transformer](https://proceedings.mlr.press/v139/rao21a.html) architecture on the [OpenFold](https://github.com/aqlaboratory/openfold) dataset. 
\nTo do so, we subsampled MSAs to a length of 512 residues per sequence and a depth of 64 sequences, either by randomly sampling the sequences (\"Random\") or by greedily maximizing for sequence diversity (\"Max\"). Within each subsampling strategy, we then trained EvoDiff MSA models with the OADM and D3PM corruption schemes. \n\n## Unconditional sequence generation\n\n### Unconditional generation with EvoDiff-Seq\n\nEvoDiff can generate new sequences starting from sequences of masked tokens or of uniformly-sampled amino acids. All available models \ncan be used to unconditionally generate new sequences, without needing to download the training datasets. \n\nTo unconditionally generate 100 sequences from EvoDiff-Seq, run the following script:\n\n``` \npython evodiff/generate.py --model-type oa_dm_38M --num-seqs 100 \n```\n\nThe default model type is `oa_dm_640M`, other evodiff models available are:\n* ` oa_dm_38M `\n* ` d3pm_blosum_38M `\n* ` d3pm_blosum_640M `\n* ` d3pm_uniform_38M `\n* ` d3pm_uniform_640M `\n\nOur LRAR baseline models are also available:\n* ` lr_ar_38M `\n* ` lr_ar_640M `\n\n\nAn example of unconditionally generating a sequence of a specified length can be found in\n[this notebook](https://github.com/microsoft/evodiff/tree/main/examples/evodiff.ipynb).\n\nTo evaluate the generated sequences, we implement our self-consistency Omegafold ESM-IF pipeline, as shown in\n[analysis/self_consistency_analysis.py](https://github.com/microsoft/evodiff/blob/main/analysis/self_consistency_analysis.py). \nTo use this evaluation script, you must have the dependencies listed under the [Installation](#installation) section installed.\n\n### Unconditional generation with EvoDiff-MSA\n\nTo explicitly leverage evolutionary information, we design and train EvoDiff-MSA models using the MSA Transformer architecture \non the OpenFold dataset. 
To do so, we subsample MSAs to a length of 512 residues per sequence and a depth of 64 sequences, \neither by randomly sampling the sequences (“Random”) or by greedily maximizing for sequence diversity (“Max”). \n\nIt is possible to unconditionally generate an entire MSA, using the following script:\n``` \npython evodiff/generate-msa.py --model-type msa_oa_dm_maxsub --batch-size 1 --n-sequences 64 --n-sequences 256 --subsampling MaxHamming\n```\n\nThe default model type is `msa_oa_dm_maxsub`, which is EvoDiff-MSA-OADM trained on Max subsampled sequences, and the other available \nevodiff models are: \n* EvoDiff-MSA OADM trained on random subsampled sequences: ` msa_oa_dm_randsub `\n* EvoDiff-MSA D3PM-BLOSUM trained on Max subsampled sequences:` msa_d3pm_blosum_maxsub `\n* EvoDiff-MSA D3PM-BLOSUM trained on random subsampled sequences: ` msa_d3pm_blosum_randsub `\n* EvoDiff-MSA D3PM-Uniform trained on Max subsampled sequences: ` msa_d3pm_uniform_maxsub `\n* EvoDiff-MSA D3PM-Uniform trained on random subsampled sequences: ` msa_d3pm_uniform_randsub `\n\nYou can also specify a desired number of sequences per MSA, sequence length, batch size, and more.\n\n## Conditional sequence generation\nEvoDiff’s OADM diffusion framework induces a natural method for conditional sequence generation by fixing some subsequences and \npredicting the remainder. Because the model is trained to generate proteins with an arbitrary decoding order, this is easily \naccomplished by simply masking and decoding the desired portions. 
We apply EvoDiff’s power for controllable protein design \nacross three scenarios: conditioning on evolutionary information encoded in MSAs, inpainting functional domains, and scaffolding\nstructural motifs.\n\n### Evolution-guided protein generation with EvoDiff-MSA\nFirst, we test the ability of EvoDiff-MSA (`msa_oa_dm_maxsub`) to generate new query sequences conditioned on the remainder of an MSA, \nthus generating new members of a protein family without needing to train family-specific generative models.\n\nTo generate a new query sequence given an alignment, run the following command with the `--start-msa` flag. This starts conditional \ngeneration by sampling from a validation MSA. To run this script you must have the OpenFold dataset and splits downloaded.   \n``` \npython evodiff/generate-msa.py --model-type msa_oa_dm_maxsub --batch-size 1 --n-sequences 64 --n-sequences 256 --subsampling MaxHamming --start-msa\n```\nTo generate from a custom MSA, you can adapt the existing code. \n\nThe code can also generate an alignment given a query sequence; to do so, use the `--start-query` flag. \nThis starts with the query and generates the alignment. \n```\npython evodiff/generate-msa.py --model-type msa_oa_dm_maxsub --batch-size 1 --n-sequences 64 --n-sequences 256 --subsampling MaxHamming --start-query\n ```\nNOTE: `--start-query` and `--start-msa` are mutually exclusive; specify only one of them at a time. \nPlease look at `generate.py` for more information.\n\n### Generating intrinsically disordered regions\n\nBecause EvoDiff generates directly in sequence space, we hypothesized that it could natively generate intrinsically disordered regions \n(IDRs). IDRs are regions within a protein sequence that lack secondary or tertiary structure, and they carry out important and diverse\nfunctional roles in the cell directly facilitated by their lack of structure. 
Despite their prevalence and critical roles in function\nand disease, IDRs do not fit neatly in the structure-function paradigm and remain outside the capabilities of structure-based protein\ndesign methods. \n\nWe used inpainting with EvoDiff-Seq and EvoDiff-MSA to intentionally generate disordered regions conditioned on their surrounding\nstructured regions, and then used DR-BERT to predict disorder scores for each residue in the generated and natural sequences. Note: to\ngenerate with our scripts here, you must have the IDR dataset downloaded. Different pre-processing steps may apply to other datasets. \n\nTo run our code and generate IDRs from EvoDiff-Seq, run the following: \n```\npython evodiff/conditional_generation_msa.py --model-type msa_oa_ar_maxsub --cond-task idr --num-seqs 1 \n```\nor equivalently, from EvoDiff-MSA: \n```\npython evodiff/conditional_generation_msa.py --model-type msa_oa_ar_maxsub --cond-task idr --query-only --max-seq-len 150 --num-seqs 1 \n```\n\nThis will sample IDRs from the IDR dataset and generate new ones.\n\n### Scaffolding functional motifs\n\nGiven that the fixed functional motif includes the residue identities for the motif, we show that a sequence-only model \ncan be used for a motif scaffolding task. We used EvoDiff to generate scaffolds for a set of 17 motif-scaffolding problems \nby fixing the functional motif, supplying only the motif's amino-acid sequence as conditioning information, and then decoding \nthe remainder of the sequence.\n\nFor the scaffolding structural motifs task, we provide pdb and fasta files used for conditionally generating sequences in the [examples/scaffolding-pdbs](https://github.com/microsoft/evodiff/tree/main/examples/scaffolding-pdbs) folder. We also provide a3m files used for conditionally generating MSAs in the [examples/scaffolding-msas](https://github.com/microsoft/evodiff/tree/main/examples/scaffolding-msas) folder. Please view the PDB codes available and select an appropriate code. 
In this example, we use PDB code 1prw with domains 16-35 (FSLFDKDGDGTITTKELGTV) and 52-71 (INEVDADGNGTIDFPEFLTM).\nAn example of generating 1 MSA scaffold of a structural motif can be found in [this notebook](https://github.com/microsoft/evodiff/tree/main/examples/evodiff.ipynb).\n\nTo generate from EvoDiff-Seq:\n```\npython evodiff/conditional_generation.py --model-type oa_dm_640M --cond-task scaffold --pdb 1prw --start-idxs 15 --end-idxs 34 --start-idxs 51 --end-idxs 70 --num-seqs 100 --scaffold-min 50 --scaffold-max 100\n```\n\nThe `--start-idxs` and `--end-idxs` indicate the start \u0026 end indices for the motif being scaffolded. If defining multiple motifs, you can supply the start and end index motifs as new arguments, such as in the example we provide above.\n\nEquivalent code for generating a new scaffold sequence from an EvoDiff-MSA:\n```\npython evodiff/conditional_generation_msa.py --model-type msa_oa_dm_maxsub --cond-task scaffold --pdb 1prw --start-idxs 15 --end-idxs 34 --start-idxs 51 --end-idxs 70 --num-seqs 1 --query-only\n```\n\nTo generate a custom scaffold for a given motif, one simply needs to supply the PDB ID, and the residue indices of the motif. The code will download the PDB for you.\nIn some cases PDB files downloaded from [RCSB](https://www.rcsb.org/) will be incomplete, or contain additional residues. We have implemented code to circumvent PDB-reading issues, but we recommend care when\ngenerating files for this task. 
\n\n## Analysis of generations\n\nTo analyze the quality of the generations, we look at:\n* amino acid KL divergence ([aa_reconstruction_parity_plot](https://github.com/microsoft/evodiff/blob/main/evodiff/plot.py))\n* secondary structure KL divergence ([evodiff/analysis/calc_kl_ss.py](https://github.com/microsoft/evodiff/blob/main/analysis/calc_kl_ss.py))\n* model perplexity for sequences ([evodiff/analysis/sequence_perp.py](https://github.com/microsoft/evodiff/blob/main/analysis/sequence_perp.py))\n* model perplexity for MSAs ([evodiff/analysis/msa_perp.py](https://github.com/microsoft/evodiff/blob/main/analysis/msa_perp.py))\n* Fréchet inception distance ([evodiff/analysis/calc_fid.py](https://github.com/microsoft/evodiff/blob/main/analysis/calc_fid.py))\n* Hamming distance ([evodiff/analysis/calc_nearestseq_hamming.py](https://github.com/microsoft/evodiff/blob/main/analysis/calc_nearestseq_hamming.py))\n* RMSD score ([analysis/rmsd_analysis.py](https://github.com/microsoft/evodiff/blob/main/analysis/rmsd_analysis.py))\n\nWe also compute the self-consistency perplexity to evaluate the foldability of generated sequences. To do so, we make use of various tools:\n* [TM score](https://zhanggroup.org/TM-score/)\n* [Omegafold](https://github.com/HeliXonProtein/OmegaFold)\n* [ProteinMPNN](https://github.com/dauparas/ProteinMPNN)\n* [ESM-IF1](https://github.com/facebookresearch/esm/tree/main/esm/inverse_folding); see this [Jupyter notebook](https://colab.research.google.com/github/facebookresearch/esm/blob/main/examples/inverse_folding/notebook.ipynb) for setup details.\n* [PGP](https://github.com/hefeda/PGP)\n* [DR-BERT](https://github.com/maslov-group/DR-BERT)\n\nWe refer to the setup instructions outlined by the authors of those tools.\n\nOur analysis scripts for iterating over these tools are in the [evodiff/analysis/downstream_bash_scripts](https://github.com/microsoft/evodiff/tree/main/analysis/downstream_bash_scripts) folder. 
After running the scripts in this folder, we analyze the results in [self_consistency_analysis.py](https://github.com/microsoft/evodiff/blob/main/analysis/self_consistency_analysis.py).\n\n## Downloading generated sequences\n\nWe provide all generated sequences on the [EvoDiff Zenodo](https://zenodo.org/record/8332830).\n\nTo download our unconditionally generated sequences (the `unconditional_generations.csv` file):\n\n```\ncurl -L -o unconditional_generations.csv \"https://zenodo.org/record/8332830/files/unconditional_generations.csv?download=1\"\n```\n\nTo extract all unconditionally generated sequences created using the EvoDiff-Seq `oa_dm_640M` model, run the following code:\n```\nimport pandas as pd\ndf = pd.read_csv('unconditional_generations.csv', index_col=0)\nsubset = df.loc[df['model'] == 'evodiff_oa_dm_640M']\n```\n\nThe CSV files containing generated data are organized as follows:\n* Unconditional generations from sequence-based models: ` unconditional_generations.csv`\n  * `sequence`: generated sequence\n  * `min hamming dist`: minimum Hamming distance between generated sequence and all training sequences\n  * `seq len`: length of generated sequence\n  * `model`: model type used for generations, models: `evodiff_oa_dm_38M`, `evodiff_oa_dm_640M`, `evodiff_d3pm_uniform_38M`, \\\n  `evodiff_d3pm_uniform_640M`, `evodiff_d3pm_blosum_38M`, `evodiff_d3pm_blosum_640M`, `carp_38M`, `carp_640M`, \\\n  `lr_ar_38M`, `lr_ar_640M`, `esm_1b`, or `esm_2`\n* Sequence predictions for unconditional structure generation baselines: ` esmif_predictions_unconditional_structure_generations.csv`\n  * `sequence`: predicted protein sequence from protein structure (using ESM-IF1 model)\n  * `seq len`: length of generated sequence\n  * `model`: 'foldingdiff' or 'rfdiffusion'\n* Sequence generation via evolutionary alignments: ` msa_evolution_conditional_generations.csv`\n  * `sequence`: generated query sequences\n  * `seq len`: length of generated sequence\n  * `model`: model type used for generations: 
`evodiff_msa_oa_dm_maxsub`, `evodiff_msa_oa_dm_randsub`, `esm_msa_1b`, or `potts`\n* Generated IDRs: ` idr_conditional_generations.csv`\n  * `sequence`: subsampled sequence that contains IDR\n  * `seq len`: length of generated sequence\n  * `gen_idrs`: the generated IDR sequence\n  * `original_idrs`: the original IDR sequence\n  * `start_idxs`: indices corresponding to start of IDR in sequence\n  * `end_idxs`: indices corresponding to end of IDR in sequence (inclusive)\n  * `model`: model type used for generations `evodiff_seq_oa_dm_640M` or `evodiff_msa_oa_dm_maxsub`\n* Successfully generated scaffolds ` msa_scaffold.csv` (EvoDiff-MSA generations) or `seq_scaffold.csv` (Evodiff-Seq generations) \n  * `pdb`: pdb code corresponding to scaffold task\n  * `seqs`: generated scaffold and motif\n  * `start_idxs`: indices corresponding to start of motif\n  * `end_idxs`: indices corresponding to end of motif\n  * `seq len`: length of generated sequence\n  * `scores`: average predicted local distance difference test (pLDDT) of sequence\n  * `rmsd`: motifRMSD between predicted motif coordinates and crystal motif coordinates\n  * `model`: model type used for generations\n\n\n## Docker\n\nThe Docker image for EvoDiff is hosted on DockerHub at [https://hub.docker.com/r/cford38/evodiff](https://hub.docker.com/r/cford38/evodiff).\n\n```sh\ndocker pull cford38/evodiff:latest\n```\n\nAlternatively, you can build the Docker image locally.\n\n```sh\n## Build Docker Image\ndocker build -t evodiff .\n```\n\nThen, run the Docker image locally with the following command.\n\n```sh\n## Run Docker Image (Bash Console)\ndocker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --name evodiff --rm -it evodiff /bin/bash\n```\n\n__Note:__ You may need to set your default Torch device to `cuda` in the Docker container so that EvoDiff executes on the GPU.\n\n```py\nimport torch\ntorch.set_default_device('cuda:0')\n```\n\n## Contributing\n\nThis project welcomes contributions 
and suggestions.  Most contributions require you to agree to a\nContributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us\nthe rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.\n\nWhen you submit a pull request, a CLA bot will automatically determine whether you need to provide\na CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions\nprovided by the bot. You will only need to do this once across all repos using our CLA.\n\nThis project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/).\nFor more information see the [Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/) or\ncontact [opencode@microsoft.com](mailto:opencode@microsoft.com) with any additional questions or comments.\n\n## Trademarks\n\nThis project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft \ntrademarks or logos are subject to and must follow \n[Microsoft's Trademark \u0026 Brand Guidelines](https://www.microsoft.com/en-us/legal/intellectualproperty/trademarks/usage/general).\nUse of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship.\nAny use of third party trademarks or logos is subject to those third-party's policies.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmicrosoft%2Fevodiff","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmicrosoft%2Fevodiff","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmicrosoft%2Fevodiff/lists"}