{"id":29690733,"url":"https://github.com/mjendrusch/salad","last_synced_at":"2025-07-23T06:37:43.275Z","repository":{"id":276757548,"uuid":"924691583","full_name":"mjendrusch/salad","owner":"mjendrusch","description":"protein structure generation with sparse all-atom denoising models","archived":false,"fork":false,"pushed_at":"2025-06-01T11:19:00.000Z","size":30537,"stargazers_count":32,"open_issues_count":2,"forks_count":7,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-06-01T20:14:53.381Z","etag":null,"topics":["bioinformatics","jax","machine-learning","protein-design","protein-structure"],"latest_commit_sha":null,"homepage":"https://github.com/mjendrusch/salad","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/mjendrusch.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-01-30T13:31:57.000Z","updated_at":"2025-06-01T11:19:04.000Z","dependencies_parsed_at":"2025-05-20T12:51:13.700Z","dependency_job_id":null,"html_url":"https://github.com/mjendrusch/salad","commit_stats":null,"previous_names":["mjendrusch/salad"],"tags_count":2,"template":false,"template_full_name":null,"purl":"pkg:github/mjendrusch/salad","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mjendrusch%2Fsalad","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mjendrusch%2Fsalad/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mjendrusch%2Fsalad/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHu
b/repositories/mjendrusch%2Fsalad/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/mjendrusch","download_url":"https://codeload.github.com/mjendrusch/salad/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mjendrusch%2Fsalad/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":266631707,"owners_count":23959422,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-07-23T02:00:09.312Z","response_time":66,"last_error":null,"robots_txt_status":null,"robots_txt_updated_at":null,"robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bioinformatics","jax","machine-learning","protein-design","protein-structure"],"created_at":"2025-07-23T06:37:40.887Z","updated_at":"2025-07-23T06:37:43.257Z","avatar_url":"https://github.com/mjendrusch.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# salad - sparse all-atom denoising\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"./assets/LogoDarkBG.png\" alt=\"Proteins designed to spell out 'salad' – the name of this software – overlaid with their ESMfold predicted structures, listing the root mean square deviation between design and prediction (scRMSD).\" width=\"1100px\" align=\"middle\"/\u003e\n\u003c/p\u003e\n\n*Proteins designed to spell out \"salad\" – the name of this software – overlaid with their ESMfold predicted structures, listing the root mean square deviation between design and prediction (scRMSD).*\n\n## 
Colab quickstart\nIf you just want to generate a couple of proteins, without installing salad, you can use these Colab notebooks:\n\n- [SALAD_example](https://colab.research.google.com/github/mjendrusch/salad/blob/master/notebooks/SALAD_example.ipynb)\n  - **experimental:** currently this notebook only does unconditional generation. More advanced use cases are coming soon.\n\n## NOTICE: jax compatibility issue\njax versions 0.5.1 to 0.5.3 seem to cause issues for symmetric protein generation using salad.\nsalad works as expected for versions 0.5.0 and 0.6.0.\nI am therefore pinning jax to version 0.5.0 for now while I investigate the issue.\nIf you have installed salad recently and your structures look bad, please check your jax version and reinstall.\n\n## What's salad?\n\nsalad (**s**parse **al**l-**a**tom **d**enoising) is the name of a family of machine learning models for controlled protein structure generation.\nLike many previous approaches to protein structure generation (e.g. [RFdiffusion](https://github.com/RosettaCommons/RFdiffusion), [Genie2](https://github.com/aqlaboratory/genie2) and [Chroma](https://github.com/generatebio/chroma)), salad is based on denoising diffusion models and is trained to gradually transform random noise into realistic protein structures.\n\nUnlike previous approaches, salad was developed from the ground up with efficiency in mind. salad outperforms its predecessors by a large margin and thus allows for higher-throughput structure generation. To give an example of how fast salad really is in comparison to previous methods, imagine you want to generate a set of 1,000 amino acid proteins on your previous generation RTX 3090. You start generation and go on a 10 minute coffee break. If you used salad with default settings, you come back to ~45 generated structures. 
If you used RFdiffusion, you might come back to find one or two.\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"./assets/SALADRuntimeDark.png\" alt=\"alt text\" width=\"400px\" align=\"middle\"/\u003e\n\u003c/p\u003e\n\n*Comparison of per-structure runtime for salad models compared to Chroma, Genie2 and RFdiffusion.*\n\nLike previous models, salad models can generate protein structures for a variety of protein design tasks:\n* [unconditional protein generation](#de-novo-protein-design-script)\n   up to [~1,000 amino acids](#large-protein-generation-with-domain-shaped-noise)\n* [symmetric and repeat protein generation](#symmetric--repeat-protein-generation)\n  (currently cyclic and screw-axis symmetry)\n* [protein motif scaffolding](#multi-motif-scaffolding)\n  (with one or more independent motifs)\n* [shape-conditioned protein generation](#shape-conditioned-protein-generation)\n* [multi-state protein design](#multi-state-design)\n\nWhile unconditional protein generation, as well as protein motif scaffolding are fairly standard for current protein structure generators, salad reaches larger proteins up to 1,000 amino acids and can successfully perform on less standard tasks, such as multi-state protein design, shape-conditioned protein generation and repeat protein design. 
The full extent of what salad can do is described in our [manuscript](https://www.biorxiv.org/content/10.1101/2025.01.31.635780v1.abstract).\n\n## What's in this repository?\n\nThis repository contains the code for training and evaluating the\nmodels described in \"Efficient protein structure generation with\nsparse denoising models\":\n- autoencoders are implemented in `salad.modules.structure_autoencoder`.\n- denoising models are implemented in `salad.modules.noise_schedule_benchmark`.\n  - noise schedules in `salad.modules.utils.diffusion`\n- features and attention modules for geometric data are implemented in `salad.modules.geometric`.\n- utilities for working with sparse geometric data are implemented in `salad.modules.utils.geometry`.\n\nIn addition, it provides modules for loading PDB data and implementing\nsparse protein models.\n\n**NOTE: This package has been developed and tested on Linux. While many**\n**things should work as expected on other operating systems, we may not**\n**be able to offer adequate support outside of Linux at this point.**\n\n## License\nCopyright (c) 2025 European Molecular Biology Laboratory (EMBL)\n\n### Code license\nLicensed under the Apache License, Version 2.0 (the \"License\");\n   you may not use this file except in compliance with the License.\n   You may obtain a copy of the License at\n\n  \u003e \u003e [http://www.apache.org/licenses/LICENSE-2.0](http://www.apache.org/licenses/LICENSE-2.0)\n\n   Unless required by applicable law or agreed to in writing, software\n   distributed under the License is distributed on an \"AS IS\" BASIS,\n   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n   See the License for the specific language governing permissions and\n   limitations under the License.\n\n### Model parameter license\n \u003cp xmlns:cc=\"http://creativecommons.org/ns#\" xmlns:dct=\"http://purl.org/dc/terms/\"\u003e\u003cspan property=\"dct:title\"\u003eThe model parameters\u003c/span\u003e 
are licensed under \u003ca href=\"https://creativecommons.org/licenses/by/4.0/?ref=chooser-v1\" target=\"_blank\" rel=\"license noopener noreferrer\" style=\"display:inline-block;\"\u003eCreative Commons Attribution 4.0 International\u003cimg style=\"height:22px!important;margin-left:3px;vertical-align:text-bottom;\" src=\"https://mirrors.creativecommons.org/presskit/icons/cc.svg?ref=chooser-v1\" alt=\"\"\u003e\u003cimg style=\"height:22px!important;margin-left:3px;vertical-align:text-bottom;\" src=\"https://mirrors.creativecommons.org/presskit/icons/by.svg?ref=chooser-v1\" alt=\"\"\u003e\u003c/a\u003e\u003c/p\u003e \n\n# Table of contents\n- [salad - sparse all-atom denoising](#salad---sparse-all-atom-denoising)\n  - [Colab quickstart](#colab-quickstart)\n  - [NOTICE: jax compatibility issue](#notice-jax-compatibility-issue)\n  - [What's salad?](#whats-salad)\n  - [What's in this repository?](#whats-in-this-repository)\n  - [License](#license)\n    - [Code license](#code-license)\n    - [Model parameter license](#model-parameter-license)\n- [Table of contents](#table-of-contents)\n- [Roadmap](#roadmap)\n  - [quality of life improvements](#quality-of-life-improvements)\n  - [documentation / code cleanup](#documentation--code-cleanup)\n  - [features](#features)\n- [Getting started](#getting-started)\n  - [Installing salad](#installing-salad)\n  - [Generating your first proteins](#generating-your-first-proteins)\n  - [Sequence redesign and benchmarking](#sequence-redesign-and-benchmarking)\n    - [Running ProteinMPNN](#running-proteinmpnn)\n    - [Running novobench](#running-novobench)\n  - [Using the salad protein autoencoders](#using-the-salad-protein-autoencoders)\n- [salad scripts documentation](#salad-scripts-documentation)\n  - [_De novo_ protein design script](#de-novo-protein-design-script)\n    - [--config](#--config)\n    - [--params](#--params)\n    - [--out\\_path](#--out_path)\n    - [--num\\_aa](#--num_aa)\n    - [--merge\\_chains (default: 
False)](#--merge_chains-default-false)\n    - [--num\\_steps (default: 500)](#--num_steps-default-500)\n    - [--out\\_steps (default: 400)](#--out_steps-default-400)\n    - [--num\\_designs (default: 10)](#--num_designs-default-10)\n    - [--prev\\_threshold (default: 0.8)](#--prev_threshold-default-08)\n    - [--timescale\\_pos (default: \"cosine(t)\")](#--timescale_pos-default-cosinet)\n    - [--cloud\\_std (default: \"none\")](#--cloud_std-default-none)\n    - [--dssp (default: \"none\")](#--dssp-default-none)\n    - [--template (default: \"none\")](#--template-default-none)\n  - [Large protein generation with domain-shaped noise](#large-protein-generation-with-domain-shaped-noise)\n  - [Shape-conditioned protein generation](#shape-conditioned-protein-generation)\n  - [Multi-motif scaffolding](#multi-motif-scaffolding)\n  - [Symmetric / repeat protein generation](#symmetric--repeat-protein-generation)\n    - [--num\\_aa](#--num_aa-1)\n    - [--replicate\\_count](#--replicate_count)\n    - [--screw\\_translation](#--screw_translation)\n    - [--screw\\_radius](#--screw_radius)\n    - [--screw\\_angle](#--screw_angle)\n    - [--fixcenter\\_threshold](#--fixcenter_threshold)\n    - [--compact\\_lr](#--compact_lr)\n    - [--clash\\_lr](#--clash_lr)\n    - [--f\\_small](#--f_small)\n    - [--f\\_strand](#--f_strand)\n    - [--mix\\_output (couple or replicate)](#--mix_output-couple-or-replicate)\n    - [--mode (screw or rotation)](#--mode-screw-or-rotation)\n    - [--sym\\_noise (default: True)](#--sym_noise-default-true)\n  - [Multi-state design](#multi-state-design)\n- [salad datasets and training](#salad-datasets-and-training)\n  - [PDB data setup](#pdb-data-setup)\n  - [training salad models](#training-salad-models)\n\n# Roadmap\nWhile the models and code provided here are functional and reflect the state of salad we used for our [manuscript](https://www.biorxiv.org/content/10.1101/2025.01.31.635780v1.abstract), we realize that there is a lot of room for 
improvement. As it stands, salad is not as user-friendly as it probably could be and the set of its features is neither a strict superset nor a strict subset of the features provided by other methods for protein structure generation. With this in mind, we provide a roadmap of improvements and features we would like to implement over the following months to make salad as good of a protein design tool as we can.\n\n## quality of life improvements\n- [ ] provide Colab notebooks\n  - minimal salad Colab notebook: [SALAD_example.ipynb](notebooks/SALAD_example.ipynb)\n  - more will come once I figure out the API\n- [ ] provide docker / apptainer containers\n- [ ] provide scripts to run the entire `salad` pipeline in one go\n- [ ] implement an improved CLI\n- [ ] implement \u0026 document a better API for structure-editing\n\n## documentation / code cleanup\n- [ ] improve coverage of documentation\n- [ ] clean up / structure the `scripts/` directory\n- [ ] provide more code examples\n- [ ] provide more usage examples of `salad` scripts\n\n## features\n- [ ] add partial diffusion option to design scripts\n- [ ] implement, train \u0026 benchmark small molecule-aware models\n- [ ] implement `salad` scripts for protein design tasks\n    from the literature\n- [ ] implement and document additional hk.Modules\n    for implementing protein models\n\n# Getting started\n## Installing salad\nYou can install `salad` the following way:\n1. Clone this git repository: `git clone https://github.com/mjendrusch/salad.git`\n2. Move to the created directory: `cd salad`\n3. If not already created, create and activate a fresh python environment,\n   e.g. using `conda`: `conda create -n salad python=3.10`\n4. 
Install `salad` and its prerequisites using `pip`: `pip install -e .`\n\nTo use pre-trained `salad` models, download and extract the set of available parameters:\n```\nwget https://zenodo.org/records/14711580/files/salad_params.tar.gz\ntar -xzf salad_params.tar.gz\n```\n\nTo use the ProteinMPNN sequence design scripts in this repository,\ninstall ProteinMPNN following the instructions [here](https://github.com/dauparas/ProteinMPNN).\n\nTo set up design benchmarking using the scripts in this repository,\ninstall `novobench` following the instructions [here](https://github.com/mjendrusch/novobench).\n\n**Note:** salad will currently not install with GPU support on ARM devices.\nI will look into installing with jax-metal for macOS devices further down the line.\n\n## Generating your first proteins\nAfter installing `salad` and unpacking the parameters, you are ready to design your first proteins. \n\nActivate your conda environment and run the following command to generate proteins with a length of 100 amino acids ([full options](#de-novo-protein-design-script)):\n\n```bash\npython salad/training/eval_noise_schedule_benchmark.py \\\n    --config default_vp \\\n    --params params/default_vp-200k.jax \\\n    --out_path output/my_first_protein/ \\\n    --num_aa 100 \\\n    --num_designs 10\n```\n\nThis will create a directory `output/my_first_protein/` containing PDB files for each generated protein structure. Let's go through this command line by line:\n\n`--config default_vp` and `--params params/default_vp-200k.jax` choose a model configuration and a corresponding set of parameters. You will notice that the names of parameter files start with the name of the configuration they are supposed to be used with.\n\n`--out_path` specifies the path to the directory where the salad outputs will be saved.\n\n`--num_aa` specifies how many amino acids you want to design. 
With this script (eval_noise_schedule_benchmark.py) and config (default_vp), generation will work well up to a total of around 400-500 amino acids. You can also specify multiple chains by separating numbers of amino acids with a colon, e.g. `--num_aa 100:50`.\n\nFinally, `--num_designs` says how many designs you want to generate.\n\nTo generate large proteins up to 1,000 amino acids, you should use the following script instead ([full options](#large-protein-generation-with-domain-shaped-noise)):\n\n```bash\npython salad/training/eval_ve_large.py \\\n    --config default_ve_scaled \\\n    --params params/default_ve_scaled-200k.jax \\\n    --out_path output/my_large_protein/ \\\n    --num_aa 1000 \\\n    --num_designs 10\n```\n\nThis uses a different configuration and set of parameters better suited for large protein generation, as well as different starting noise to increase designability of large generated structures.\n\nSee [below](#salad-scripts-documentation) for a detailed description of all available scripts and their options.\n\n## Sequence redesign and benchmarking\n### Running ProteinMPNN\nFirst, install ProteinMPNN in a separate conda environment following the instructions [here](https://github.com/dauparas/ProteinMPNN) and activate the environment.\n\nThen, set an environment variable to point to that directory and activate\nthe corresponding conda environment you created for ProteinMPNN:\n```bash\nexport PROTEINMPNN=/path/to/proteinmpnn/\nconda activate pmpnn\n```\n\nYou will then be able to run ProteinMPNN for unconditional designs\nusing:\n```bash\nbash scripts/pmpnn.sh path/to/pdbs/ path/to/outputs/\n```\nThis will put the generated fasta files in `path/to/outputs/seqs/`.\n\nFor motif scaffolding, you can run\n```bash\nbash scripts/pmpnn_fixed.sh path/to/pdbs/ path/to/motif.pdb path/to/outputs/\n```\nwith a motif PDB file in the [Genie2 motif PDB 
format](https://github.com/aqlaboratory/genie2?tab=readme-ov-file#format-of-motif-scaffolding-problem-definition-file).\n\nFor repeat protein and homooligomer design, you can use\n```bash\nbash scripts/pmpnn_tied.sh path/to/pdbs/ path/to/outputs/ \u003crepeat length\u003e \u003crepeat count\u003e\n```\nfor example for three repeats of a 50 amino acid repeat unit you could call\n```bash\nbash scripts/pmpnn_tied.sh path/to/pdbs/ path/to/outputs/ 50 3\n```\n\nFinally, to design sequences for multi-state proteins like the ones described in \"Efficient protein structure generation with sparse denoising models\", you can use\n```bash\nbash scripts/pmpnn_confchange.sh path/to/pdbs/ path/to/outputs/ 0.5\n```\n\n### Running novobench\nFirst, install `novobench` following the instructions [here](https://github.com/mjendrusch/novobench) and activate the corresponding conda environment.\n\nThen, you can point `novobench` at a directory of PDB files and a corresponding directory of ProteinMPNN fasta files to compute ESMfold pLDDTs and scRMSDs:\n```bash\npython -m novobench.run \\\n    --pdb-path path/to/pdbs/ \\\n    --fasta-path path/to/fastas/seqs/ \\\n    --out-path path/to/outputs/esm/ \\\n    --model esm\n```\nThe first time you run this, it might take a while, as ESMfold needs to download and cache its parameters.\n\nRunning `novobench` will create a directory `predictions/` and a CSV file `scores.csv` in the output directory. `predictions/` contains one subdirectory per PDB file in `path/to/pdbs/`, which will contain one PDB file named `design_N.pdb` for each sequence in the corresponding fasta file, in order (`design_0.pdb`, `design_1.pdb`, etc.) 
which contain the ESMfold predicted structure for that sequence.\n\n`scores.csv` contains one row per PDB file and fasta sequence, with the `name` of the PDB file, the `index` of the sequence in the fasta file, the `sequence` in one-letter code, and a set of ESMfold-derived metrics:\n* `sc_rmsd`, `sc_tm`: RMSD / TMscore between the predicted structure for that sequence and the salad-generated structure.\n* `plddt`, `ptm`: ESMfold pLDDT and pTM scores\n* `ipae`, `mpae`: mean and minimum interface pAE for complexes\n\nRunning `novobench` for multi-state design might require running separate instances of `novobench` for each state. As `salad.train.eval_split_change.py` writes files suffixed `_s0.pdb`, `_s1.pdb`, `_s2.pdb` for the three designed states, you can do this by telling `novobench` to use only designs which contain one of these suffixes:\n```bash\npython -m novobench.run \\\n    --pdb-path path/to/pdbs/ \\\n    --fasta-path path/to/fastas/seqs/ \\\n    --out-path path/to/outputs/esm/ \\\n    --model esm \\\n    --filter-name _s0\n``` \n\n## Using the salad protein autoencoders\nWhile the sparse protein structure autoencoders described in \"Efficient protein structure generation with sparse denoising models\" are more of a side-note, we still provide scripts and parameters for applying these models to your protein structures.\n\nTo use these, please download and unpack the autoencoder parameters first:\n```bash\nwget https://zenodo.org/records/14711580/files/ae_params.tar.gz\ntar -xzf ae_params.tar.gz\n```\nThis will create a directory `ae_params`, which contains the following autoencoder checkpoints:\n* `small_inner-200k.jax`: sparse autoencoder with inner product-based distance prediction for neighbour selection.\n* `small_none-200k.jax`: sparse autoencoder with only structure-based neighbour selection.\n* `small_semiequivariant-200k.jax`: same as `small_inner`, but using additional non-equivariant features.\n\nIn addition to these checkpoints, there are 
also corresponding VQ-VAE checkpoints (named e.g. `small_inner_vq-200k.jax`).\n\nAs with the salad models above, checkpoint names have the format \"\u003c config \u003e-\u003c training step \u003e.jax\".\n\nYou can then autoencode a protein structure in a PDB file by running:\n```bash\npython salad/training/eval_structure_autoencoder.py \\\n    --config small_inner \\\n    --params ae_params/small_inner-200k.jax \\\n    --diagnostics True \\\n    --path path-to-input-pdbs/ \\\n    --num_recycle 10 \\\n    --out_path path-to-ae-outputs/\n```\n\nRemember to match the config and corresponding parameter names.\n`--diagnostics True` writes a numpy archive containing the encoded latents for each structure in the input directory.\n\n# salad scripts documentation\n## _De novo_ protein design script\nTo generate protein structures _de novo_, without structure-editing\nor multi-motif scaffolding, you can use the script `salad/training/eval_noise_schedule_benchmark.py`. It supports different\nmodel configurations and checkpoints and lets you apply simple\nconstraints such as secondary structure, motifs, length and number of chains, etc.\n\nFor example, to use a variance-preserving (VP) salad model to generate\na set of 100 amino acid proteins, you could run the following:\n\n```bash\npython salad/training/eval_noise_schedule_benchmark.py \\\n    --config default_vp \\\n    --params params/default_vp-200k.jax \\\n    --out_path path_to_outputs/ \\\n    --num_aa 100 \\\n    --num_steps 500 \\\n    --out_steps 400 \\\n    --num_designs 10 \\\n    --prev_threshold 0.8 \\\n    --timescale_pos \"cosine(t)\"\n```\n\nIf instead you wish to generate a complex of a 100 amino acid protein\nand an 8 amino acid cyclic peptide ligand, you merely need to set\n`--num_aa 100:c8`.\n\nTo change the type of model you want to use, choose a different\n`--config` and adjust the selected `--params` and other options\naccordingly. 
For example, to instead run a variance-expanding (VE)\nmodel, you could run:\n```bash\npython salad/training/eval_noise_schedule_benchmark.py \\\n    --config default_ve_scaled \\\n    --params params/default_ve_scaled-200k.jax \\\n    --out_path path_to_outputs/ \\\n    --num_aa 100 \\\n    --num_steps 500 \\\n    --out_steps 400 \\\n    --num_designs 10 \\\n    --prev_threshold 0.99 \\\n    --timescale_pos \"ve(t)\"\n```\nHere, you need to change the noise schedule (`--timescale_pos`)\nto a VE noise schedule and adjust `--prev_threshold` to 0.99 to\nprevent over-use of self-conditioning, which results in decreased\ndesignability of generated structures. Recommended values for each\nmodel and each setting are described below.\n\nFor a VP model with input-dependent variance, you could use:\n\n```bash\npython salad/training/eval_noise_schedule_benchmark.py \\\n    --config default_vp_scaled \\\n    --params params/default_vp_scaled-200k.jax \\\n    --out_path path_to_outputs/ \\\n    --num_aa 100 \\\n    --num_steps 500 \\\n    --out_steps 400 \\\n    --num_designs 10 \\\n    --prev_threshold 0.8 \\\n    --timescale_pos \"cosine(t)\" \\\n    --cloud_std \"default(num_aa)\"\n```\nHere, you need to specify the standard deviation of the\nnoise distribution using `--cloud_std`. The resulting protein\nstructures will have a standard deviation of atom positions\nclose to this value. 
In this example we set `--cloud_std` to\nthe `default` function, which was implemented to approximate\nthe distribution of atom position standard deviations for proteins\nof a given length `num_aa`.\n\nFurther usage examples can be found in the `scripts/` directory,\nwhich contains all generation and benchmarking scripts that were\nused for evaluating salad in the corresponding preprint.\n\nBelow you can find an in-depth explanation of all parameters that\ncan be passed to `eval_noise_schedule_benchmark.py`, as well as their\ndefault and recommended values.\n\n### --config\nModel configuration in (default_vp, default_vp_scaled, default_ve_scaled).\n* `default_vp` is the configuration for the fixed-variance\n  variance-preserving (VP) model with variance 10 Å.\n* `multimotif_vp` is a version of `default_vp` with multi-motif\n  conditioning for motif scaffolding.\n* `default_vp_scaled` is the configuration for input-dependent\n  variance VP model. The standard deviation of the noise distribution\n  has to be specified by setting `--cloud_std \u003cvalue in angstroms\u003e`\n  (see below).\n* `default_ve_scaled` is the configuration for the variance-expanding\n  (VE) model. Using this configuration requires setting\n  `--timescale_pos \"ve(t)\"`,\n  or another diffusion time scale compatible with VE diffusion.\n  In addition, it is highly recommended to set\n  `--prev_threshold` to a value between 0.95 and 0.99.\n\nThe full list of implemented configurations can be found in salad/modules/config/noise_schedule_benchmark.py.\n\n### --params\nPath to a set of parameters compatible with the selected `--config`.\n\nParameters can be downloaded from Zenodo ([10.5281/zenodo.14711580](https://doi.org/10.5281/zenodo.14711580)), or obtained from the most recent `salad` code release on GitHub. Parameter sets are named after the underlying configuration and the conditions the parameter checkpoint was saved at. E.g. 
`default_vp-200k.jax` is a checkpoint\nfor the `default_vp` configuration taken at 200,000 training steps.\n\nCurrently, the following checkpoints are available:\n| name | config | notes |\n|---|---|---|\n|default_vp-200k.jax|default_vp| ~|\n|default_vp_scaled-200k.jax|default_vp_scaled| ~|\n|default_ve_scaled-200k.jax|default_ve_scaled| ~|\n|multimotif_vp-200k.jax|multimotif_vp| VP multi-motif scaffolding model|\n|default_vp-pdb256-200k.jax|default_vp| trained on proteins in PDB with \u003c= 256 amino acids |\n|default_vp-synthete256-200k.jax|default_vp| trained on synthetic data with \u003c= 256 amino acids |\n\n### --out_path\nPath to the output directory where the generated PDB files\nwill be saved. The directory will be created if it does not exist.\nRunning `salad` multiple times with the same `--out_path` will \nreplace the outputs of previous runs.\n\n### --num_aa\nSpecifies the number of amino acids in each chain that will be\ngenerated by `salad`.\n\n`--num_aa` can be:\n* an integer like `--num_aa 100`, to specify a single chain\n  of amino acids.\n* an integer prefixed with \"c\", like `--num_aa c100`, to specify\n  a cyclic chain of amino acids (cyclised with a peptide bond).\n* a list of (prefixed) integers, separated by \":\" like\n  `--num_aa 100:c50`, to generate a complex with two or\n  more chains of amino acids (linear or cyclic).\n\nFor instance, generating a 100 amino acid protein with an 8\namino acid cyclic ligand would be specified as:\n`--num_aa 100:c8`\n\n### --merge_chains (default: False)\nSpecifies if different chains should use the same chain_index\nduring generation and instead be separated by a gap in the\nresidue_index. This is not relevant for complex generation\nunless using a model that has not been exposed to protein\ncomplexes during training (e.g. the pdb256 and synthete256\ncheckpoints).\n\n### --num_steps (default: 500)\nThe total number of diffusion steps to subdivide the interval\n$[0, 1]$ of diffusion times. E.g. 
to set up diffusion over\n200 total steps, specify `--num_steps 200`.\n\n### --out_steps (default: 400)\nThe number of diffusion steps after which to write an output\nPDB file. Similar to RFdiffusion, we return structures early\nby default (at 400 steps of 500 total steps), as the resulting\nstructure shows very little change after that point.\n\n`--out_steps` can be:\n* a single integer step-count, like `--out_steps 400`.\n* a comma-separated list of step-counts, like `--out_steps 0,100,200`.\n  This writes a PDB file at _each_ of the specified step counts.\n\n### --num_designs (default: 10)\nThe number of generated PDB files.\n\n### --prev_threshold (default: 0.8)\nDiffusion time threshold for using self-conditioning.\nA `--prev_threshold` of 0.8 signifies that self-conditioning should\nbe used only for diffusion times between 0.8 and 1.0, early in the\ndenoising process.\n\n`--prev_threshold` should be set to:\n* **0.8** for `vp` and `vp_scaled` models.\n  Models still produce reasonable results for values \u003e 0.6 and\n  start losing quality for lower values.\n* **0.99** for `ve_scaled` models. Values \u003e 0.95 still produce \n  reasonable results, but lower values result in an overabundance\n  of beta sheet structures.\n\n### --timescale_pos (default: \"cosine(t)\")\nFunction mapping diffusion time to a noise schedule. By default,\nthis is set to a cosine noise schedule \"cosine(t)\" for VP models.\nFor VE models, it should be set to \"ve(t)\".\n\n`--timescale_pos` can be:\n* any expression depending on diffusion time \"t\", that can be \n  expressed in raw python + numpy (as np).\n* one of the following pre-defined noise schedules:\n  * cosine(t): a cosine noise schedule\n  * ve(t, sigma_max=80.0, rho=7.0): the EDM noise schedule\n    where sigma_max is the starting standard deviation and\n    rho is the exponent. 
Results in the manuscript were\n    generated with sigma_max = 80 Å and sigma_max = 100 Å\n\n### --cloud_std (default: \"none\")\nExpression defining the standard deviation of the noise distribution\nfor input-dependent variance VP models (`default_vp_scaled`).\n\n`--cloud_std` can be:\n* \"none\", when not used with a VP-scaled model.\n* any expression dependent on `num_aa`, the number of amino acids,\n  and evaluating to a positive number. E.g. `\"num_aa ** 0.4\"`\n* the default standard deviation scale `\"default(num_aa)\"`. \n  **This is a reasonable default for generating protein structures**\n  **with VP-scaled diffusion and was used throughout the preprint.**\n\n### --dssp (default: \"none\")\nSecondary structure specification. Overrides `--num_aa`.\nThis is a string of letters \"L\" (for Loops), \"H\" (for Helices),\n\"E\" (for shEets) and \"_\" (for unspecified secondary structure).\nE.g.:\n```\n___HHHHHHHH___EEEEEE___EEEEEE___EEEEE___HHHHHHHH____\n```\nAlternatively, you may set `--dssp random`. This randomly samples\na secondary structure string of the shape specified in `--num_aa`\nand can be used to increase the diversity of generated structures.\n\nBy default, no secondary structure constraints are applied.\n\n**NOTE: currently non-random dssp cannot be mixed with other**\n**options, such as cyclicity constraints or symmetry constraints.**\n**We plan to address this in a future release.**\n\n### --template (default: \"none\")\nSpecifies an ad-hoc template PDB file for single-motif scaffolding.\nThe template is a comma-separated tuple of a path to a motif PDB file\nand a constraint string:\n```\npath_to/template.pdb,XXXXXXXXXXFFFFFFXXXXFFFFXXXXXFFFFXXXXXXXXX\n```\nThe constraint string specifies which amino acids have their\nstructure specified in the template PDB file. The Nth position marked with\n\"F\" in the constraint string corresponds to the Nth amino acid in the\ntemplate PDB file. 
The constraint string **must** contain exactly as\nmany \"F\"s as the number of amino acids in `template.pdb`. \"X\"s specify\namino acids that can be freely designed.\n\nPlease make sure to insert \"X\"s only at positions where the template\nPDB file has gaps in the amino acid chain; otherwise you will generate\nnonsensical structures.\n\n**NOTE: as the constraint string specifies the number and sequence**\n**positions of freely designable amino acids, this overrides**\n**--num_aa and is currently incompatible with --dssp specifications.**\n**We plan to address this in a future release.**\n\n**For multi-motif scaffolding, please use the `eval_motif_benchmark*.py` scripts.**\n\n## Large protein generation with domain-shaped noise\nTo generate proteins of size up to 1,000 amino acids, you can use the\nscript `salad/training/eval_ve_large.py`. It accepts the same \narguments as `salad/training/eval_noise_schedule_benchmark.py`, but it should be used\nonly with `--config default_ve_scaled` and the corresponding set of\nparameters `--params params/default_ve_scaled-200k.jax`.\n\nGenerating a set of 1,000 amino acid proteins is as simple as running:\n```bash\npython salad/training/eval_ve_large.py \\\n    --config default_ve_scaled \\\n    --params params/default_ve_scaled-200k.jax \\\n    --out_path path_to_outputs/ \\\n    --num_aa 1000 \\\n    --num_steps 500 \\\n    --out_steps 400 \\\n    --num_designs 10 \\\n    --prev_threshold 0.99 \\\n    --timescale_pos \"ve(t, sigma_max=80.0)\"\n```\n\n**NOTE: while this script technically accepts secondary structure**\n**and template constraints, as well as multi-chain and cyclic**\n**--num_aa, it has not been tested with that in mind, so we cannot**\n**give any guarantees that it will produce meaningful structures**\n**if used this way. We plan to address this in a future release.**\n\n## Shape-conditioned protein generation\nTo generate proteins with a fixed shape, you can use the script\n`salad/training/eval_ve_shape.py`. 
This script is compatible with\nVE models (`default_ve_scaled`, `params/default_ve_scaled-200k.jax`).\nInstead of taking a `--num_aa` input, it takes a list of coordinates\nand amino acid counts in CSV format:\n```csv\nx,y,z,num_aa\n0.0,-11.79,-16.45,100\n0.0,-12.09,-3.08,100\n0.0,-0.87,-2.98,100\n0.0,-11.59,10.40,100\n0.0,0.46,21.17,100\n0.0,12.01,10.65,100\n0.0,11.56,-3.24,100\n0.0,12.31,-16.45,100\n```\nEach row specifies a number of consecutive amino acids to place\nat the specified coordinates. VE noise will then arrange amino acids\nin a normal distribution around each of the specified locations.\n\nThe CSV-file is passed to the script using `--blobs blobs.csv`.\nThe coordinates can be optionally scaled by a factor using `--blob_scale \u003cfactor\u003e`.\n\nTo generate letter shapes as described in the manuscript, you can\nrun:\n```\npython salad/training/eval_ve_shape.py \\\n    --params params/default_ve_scaled-200k.jax \\\n    --out_path path_to_outputs/ \\\n    --blobs letters/s_shape.csv \\\n    --blob_scale 1.8 \\\n    --num_designs 50 \\\n    --num_steps 500 \\\n    --out_steps 400\n```\n\nThe `letters/` directory contains example shape specifications for\na number of letters. 
Those letter shapes were manually generated and\n`--blob_scale` was gradually adjusted until generated protein \nstructures had legible letter-shapes.\n\nThis script takes the same denoising-process arguments as\n`eval_noise_schedule_benchmark.py`: --num_designs, --num_steps,\n--out_steps, --prev_threshold, --config, --params, --timescale_pos.\nAll values are set to reasonable defaults and changes are unlikely\nto be necessary.\n\n## Multi-motif scaffolding\nTo scaffold one or more protein motifs, you can use the scripts\n`salad/training/eval_motif_benchmark.py` and `salad/training/eval_motif_benchmark_nocond.py`.\nThe former implements (multi-)motif scaffolding for models which\nwere trained with multi-motif conditioning, while the latter uses\nstructure-editing to scaffold motifs without explicit motif conditioning.\n\nBoth scripts take the same denoising-process arguments as above\n(--num_designs, --num_steps,\n--out_steps, --prev_threshold, --config, --params, --timescale_pos).\nInstead of providing `--num_aa`, both scripts accept a `--template`\nargument, e.g. `--template motif_structure.pdb`.\n\nHere, `motif_structure.pdb` must be a motif-annotated PDB file as\ndescribed in [the Genie2 repository](https://github.com/aqlaboratory/genie2?tab=readme-ov-file#format-of-motif-scaffolding-problem-definition-file).\nThe scripts then generate designs according to the specification in that `motif_structure.pdb` file. For example:\n```\npython salad/training/eval_motif_benchmark.py \\\n    --config multimotif_vp \\\n    --params params/multimotif_vp-200k.jax \\\n    --out_path path_to_outputs/ \\\n    --num_steps 500 --out_steps 400 --prev_threshold 0.8 \\\n    --num_designs 1000 --timescale_pos \"cosine(t)\" \\\n    --template motif_structure.pdb\n```\nUsing `eval_motif_benchmark.py` requires a multi-motif conditioned\ncheckpoint, such as `multimotif_vp-200k.jax`. 
To use _any_ salad\nmodel, you can instead use `eval_motif_benchmark_nocond.py`:\n```bash\nconfig=default_vp\npython salad/training/eval_motif_benchmark_nocond.py \\\n    --config $config \\\n    --params params/${config}-200k.jax \\\n    --out_path path_to_outputs/ \\\n    --num_steps 500 --out_steps 400 --prev_threshold 0.8 \\\n    --num_designs 1000 --timescale_pos \"cosine(t)\" \\\n    --template motif_structure.pdb\n```\nUsage of both scripts is the same.\n\n## Symmetric / repeat protein generation\nTo generate rotation or superhelical (screw) symmetric repeat proteins,\nyou can use the script `salad/training/eval_sym.py`.\nCurrently, this script only supports cyclic or screw-axis symmetries.\nBy specifying an angle of rotation, distance from center and translation\nalong the symmetry axis, this allows generating superhelical repeat\nproteins with a fully specified geometry.\n\nThis script takes the usual denoising process settings (--num_designs, --num_steps, --out_steps, --prev_threshold, --config, --params, --timescale_pos),\nas well as `--num_aa` compatible with `eval_noise_schedule_benchmark.py`.\nIn addition, it requires inputs for `--screw_radius`, which specifies\nthe repeat unit distance from the symmetry axis in angstroms (center-of-mass radius\nof the resulting superhelix); `--screw_angle`, which specifies the\nrotation angle around the symmetry axis for successive repeat units;\nand `--screw_translation`, which specifies the translation along the symmetry\naxis in angstroms between two successive repeat units.\n\nFor example, generating C3-symmetric repeat proteins works as follows:\n```bash\nmonomer_size=50\nrepeat_count=3\ntotal_count=$(($monomer_size * $repeat_count))\npython salad/training/eval_sym.py \\\n    --config default_ve_scaled \\\n    --params params/default_ve_scaled-200k.jax \\\n    --num_aa $total_count \\\n    --out_path path_to_output/ \\\n    --num_steps 500 --out_steps \"400\" \\\n    --prev_threshold 0.98 --num_designs 20 --timescale_pos \"ve(t)\" \\\n    --screw_radius 10.0 --screw_angle 120.0 \\\n    --screw_translation 0.0 --replicate_count 3 \\\n    --mode \"screw\" --mix_output couple \\\n    --sym_noise True\n```\nTo generate a screw-symmetric structure instead, it suffices to set\n`--screw_translation` to a non-zero value.\n\nWhile this script can be used with any model configuration, generating\nscrew-symmetric proteins works best with VE models.\n\nA detailed explanation of all arguments to `eval_sym.py` is given below.\n\n### --num_aa\nThis works as in `eval_noise_schedule_benchmark.py`. However, care needs\nto be taken that the chain lengths specified are compatible with the\napplied symmetry operation. E.g. for --replicate_count 3, --num_aa needs\nto be divisible by 3, or consist of 3 chains of equal length, e.g.:\n```\n--num_aa 150\n--num_aa 50:50:50\n```\n\n### --replicate_count\nNumber of copies of a repeat. E.g. for a C3-symmetric repeat protein,\none should use --replicate_count 3.\n\n### --screw_translation\nTranslation along the central axis of rotation between two successive\nrepeat units. If this is 0.0, this will result in a repeat protein\nor assembly with cyclic symmetry. Otherwise it will result in \nsuperhelical (screw) symmetry.\n\n### --screw_radius\nRadius in angstroms at which repeats should be placed from the central\naxis of rotation. This controls the radius of a cyclic or \nscrew-symmetric repeat protein or assembly.\n\n### --screw_angle\nAngle of rotation between two successive repeat units around the central\naxis of rotation. For designs with --screw_translation 0.0 this needs to\nbe less than or equal to 360 / --replicate_count, as it will result in\nclashing structures otherwise.\n\n### --fixcenter_threshold\nNoise level above which the center of mass of each repeat unit is held\nfixed exactly at the specified --screw_radius. 
Default: 0.0001.\n\n### --compact_lr\nLearning rate for compactness loss, which improves repeat unit globularity.\nDefault: 0.0 (disabled). Reasonable values lie in the range 1e-4 to 1e-3.\n\n### --clash_lr\nLearning rate for anti-clash loss, which counteracts clashes in generated structures.\nDefault: 0.0 (disabled). Reasonable values lie in the range 5e-3 to 1e-1.\n\n### --f_small\nMaximum fraction of small amino acids (ALA / GLY) in generated structures.\nThis filters high-ALA/GLY structures which are unlikely to be designable\nwithout the need for structure prediction.\nSmall amino acids can also be reduced by increasing --clash_lr.\nDefault: 1.0 (disabled). Reasonable values are 0.10 (strict), 0.20 (lax)\n\n### --f_strand\nMaximum fraction of beta strand residues in generated structures.\nThis filters high-strand structures which are unlikely to be designable\nwithout the need for structure prediction.\nDefault: 1.0 (disabled). An example reasonable value is 0.3, if you want no all-beta folds.\n\n### --mix_output (couple or replicate)\nSpecifies if repeat units should be averaged for symmetrization (couple)\nor if one repeat unit should be picked and replicated, discarding all \nother repeat units (replicate).\n\n### --mode (screw or rotation)\nIf --mode is \"screw\", aligns repeat units by moving them all to the\norigin and inverting the applied rotation. Then, the repeat units (optionally averaged) are moved to a fixed radius from the center\nand rotation and translation are applied.\n\nIf --mode is \"rotation\", aligns repeat units by inverting rotation only.\nThis disables `--screw_radius`, but is closer to the way structures are \nsymmetrized in [RFdiffusion](https://github.com/RosettaCommons/RFdiffusion).\n\n### --sym_noise (default: True)\nIf True, applies symmetry operation (screw or rotation) to the input\nnoise. 
Otherwise, only applies symmetry operations to the _output_ of\nthe denoising model.\n\n## Multi-state design\nThe script `salad/training/eval_split_change.py` allows the user to\nproduce backbones for the multi-state protein design task described by\n[Lisanza _et al._, 2024](https://www.nature.com/articles/s41587-024-02395-w): Here, the goal is to design a _parent_ protein with\na specified secondary structure, which results in two _child_ proteins\nwhen split into an N-terminal and C-terminal part. These child proteins\nshould adopt a different secondary structure from the parent.\n\nIn addition to the usual denoising process arguments (--num_designs, --num_steps, --out_steps, --prev_threshold, --config, --params, --timescale_pos),\nit takes an argument `--dssp`, which specifies the secondary structures\nfor all parent and child structures, separated by \"/\".\nThe secondary structure is specified as a string of\n\"l/L\", \"h/H\", \"e/E\" and \"_\" for loops, helices, sheets and unspecified\nsecondary structure.\nResidues with capitalized secondary structure in the parent sequence\nare constrained to have the same _tertiary_ structure in the corresponding\nchild structure.\n\nFor the example described by [Lisanza _et al._, 2024](https://www.nature.com/articles/s41587-024-02395-w) this looks as follows:\n```bash\n  --dssp \"_HHHHHHHHHHHHHlllHHHHHHHHHHHHHllleeeeeellleeeee_______eeeeellleeeeeelllHHHHHHHHHHHHHlllHHHHHHHHHHHHH_/_HHHHHHHHHHHHHLLLHHHHHHHHHHHHHLLLHHHHHHHHHHHHHHHHHHH_/_HHHHHHHHHHHHHHLLLHHHHHHHHHHHHHLLLHHHHHHHHHHHHH_\"\n```\nAs the first two and last two helices of the parent are shared with the\nrespective child, we can fix their tertiary structure during design.\n\nAll in all, running multi-state design for this example works as 
follows:\n```bash\nconfig=default_vp\nconstraint=\"_HHHHHHHHHHHHHlllHHHHHHHHHHHHHllleeeeeellleeeee_______eeeeellleeeeeelllHHHHHHHHHHHHHlllHHHHHHHHHHHHH_/_HHHHHHHHHHHHHLLLHHHHHHHHHHHHHLLLHHHHHHHHHHHHHHHHHHH_/_HHHHHHHHHHHHHHLLLHHHHHHHHHHHHHLLLHHHHHHHHHHHHH_\"\npython salad/training/eval_split_change.py \\\n    --config $config \\\n    --params params/${config}-200k.jax \\\n    --out_path path_to_outputs/ \\\n    --num_steps 500 --out_steps 400 \\\n    --prev_threshold 0.8 --num_designs 1000 \\\n    --timescale_pos \"cosine(t)\" \\\n    --dssp \"$constraint\"\n```\nIn principle, any config and set of parameters can be used for this\nscript, adjusting the denoising settings as described above.\n\n# salad datasets and training\nWhat follows is some minimal documentation on setting up PDB for training. We will expand this section with all the details of how to customize training datasets and scripts soon.\n\n## PDB data setup\nThe directory `data/allpdb/` contains all scripts required to download and set up data files for training `salad` models.\nTo set up the dataset, copy the contents of that directory to the location where you want to store the data. 
Then activate the `salad` conda environment you created during installation.\n\nYou can now download the current version of PDB:\n```bash\ncd data/allpdb/\n# get PDB structures\nbash download_allpdb.sh\n# get PDB sequences\nbash download_seqres.sh\n# get chemical component dictionary\nwget https://files.wwpdb.org/pub/pdb/data/monomers/components.cif.gz\n```\n\nThen, you can convert all PDB files to the npz format used by the `salad` data loaders:\n```bash\nbash cifparser.sh\n```\nThis could take a couple of hours to finish.\n\nFinally, generate a list of successfully converted biological assemblies:\n```bash\npython get_chain_assemblies.py\n```\n\nTogether with the files provided in this repository this should be all the data needed to use the `salad.data.allpdb.AllPDB` dataloaders.\n\n## training salad models\nHaving set up PDB data, you can then train `salad` models using:\n```bash\npython -m salad.training.train_noise_schedule_benchmark \\\n    --data_path /path/to/data/allpdb/ \\\n    --config default_vp \\\n    --path /path/to/output/ \\\n    --suffix \"specific-model-name\" \\\n    --num_aa 1024 \\\n    --p_complex 0.5 \\\n    --multigpu True \\\n    --rebatch 4\n```\n\nThis will run training on PDB for the `default_vp` config, showing the model up to 1024 amino acids per micro-batch, 4 micro-batches per GPU. This requires an NVIDIA GPU with at least 24 GB VRAM. We have found training to work well even on RTX 3090 GPUs. \n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmjendrusch%2Fsalad","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmjendrusch%2Fsalad","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmjendrusch%2Fsalad/lists"}