{"id":18887142,"url":"https://github.com/thunlp-mt/dymean","last_synced_at":"2025-04-06T15:13:44.226Z","repository":{"id":172415217,"uuid":"644282437","full_name":"THUNLP-MT/dyMEAN","owner":"THUNLP-MT","description":"This repo contains the codes for our paper \"End-to-End Full-Atom Antibody Design\"","archived":false,"fork":false,"pushed_at":"2025-02-26T04:19:48.000Z","size":12258,"stargazers_count":103,"open_issues_count":8,"forks_count":13,"subscribers_count":7,"default_branch":"main","last_synced_at":"2025-03-28T10:54:15.147Z","etag":null,"topics":["antibody-design","drug-discovery","generative-ai"],"latest_commit_sha":null,"homepage":"https://arxiv.org/abs/2302.00203","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/THUNLP-MT.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null}},"created_at":"2023-05-23T07:45:14.000Z","updated_at":"2025-03-24T00:34:07.000Z","dependencies_parsed_at":"2024-03-16T15:20:48.846Z","dependency_job_id":"46236e76-39b1-464e-88be-c0cb97f4f845","html_url":"https://github.com/THUNLP-MT/dyMEAN","commit_stats":null,"previous_names":["thunlp-mt/dymean"],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/THUNLP-MT%2FdyMEAN","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/THUNLP-MT%2FdyMEAN/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/THUNLP-MT%2FdyMEAN/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/THUNLP-MT%2FdyMEAN/manifests","owner_url":"https://repos.ecosyste.ms/
api/v1/hosts/GitHub/owners/THUNLP-MT","download_url":"https://codeload.github.com/THUNLP-MT/dyMEAN/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247500470,"owners_count":20948880,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["antibody-design","drug-discovery","generative-ai"],"created_at":"2024-11-08T07:34:44.904Z","updated_at":"2025-04-06T15:13:44.201Z","avatar_url":"https://github.com/THUNLP-MT.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# dyMEAN: End-to-End Full-Atom Antibody Design\n\nThis repo contains the codes for our paper [End-to-End Full-Atom Antibody Design](https://arxiv.org/abs/2302.00203).\n\n\n## Quick Links\n\n- [Setup](#setup)\n- [Experiments](#experiments)\n    - [Data Preprocessing](#data-preprocessing)\n    - [CDR-H3 Design](#cdr-h3-design)\n    - [Complex Structure Prediction](#complex-structure-prediction)\n    - [Affinity Optimization](#affinity-optimization)\n- [Proof-of-Concept Applications](#proof-of-concept-applications)\n    - [Inference API](#inference-api)\n    - [*In Silico* \"Display\"](#in-silico-display)\n- [Contact](#contact)\n- [Others](#others)\n\n\n## Setup\n\nThere are 3 necessary and 1 optional prerequisites: setting up conda environment (necessary), obtaining scorers (necessary), preparing antibody pdb data (necessary), and downloading baselines (optional).\n\n**1. 
Environment**\n\nWe provide `env.yml` for creating the runtime conda environment. Just run:\n\n```bash\nconda env create -f env.yml\n```\n\n**2. Scorers**\n\nPlease first prepare the scorers for TMscore and DockQ as follows:\n\nThe source code for assessing TMscore is at `evaluation/TMscore.cpp`. Please compile it by:\n```bash\ng++ -static -O3 -ffast-math -lm -o evaluation/TMscore evaluation/TMscore.cpp\n```\n\nTo prepare the DockQ scorer, please clone its [official GitHub repository](https://github.com/bjornwallner/DockQ) and compile the prerequisites according to its instructions. After that, please revise the `DOCKQ_DIR` variable in `configs.py` to point to the directory containing the DockQ project (e.g. ./DockQ).\n\nThe lDDT scorer is in the conda environment, and the $\\Delta\\Delta G$ scorer is integrated into our code, so they need no additional preparation.\n\n**3. PDB data**\n\nPlease download all the structure data of antibodies from the [download page of SAbDab](http://opig.stats.ox.ac.uk/webapps/newsabdab/sabdab/search/?all=true). Please enter the *Downloads* tab on the left of the web page and download the archived zip file for the structures, then decompress it:\n\n```bash\nwget https://opig.stats.ox.ac.uk/webapps/newsabdab/sabdab/archive/all/ -O all_structures.zip\nunzip all_structures.zip\n```\n\nYou should get a folder named *all_structures* with the following hierarchy:\n\n```\n├── all_structures\n│   ├── chothia\n│   ├── imgt\n│   ├── raw\n```\n\nEach subfolder contains the pdb files renumbered with the corresponding scheme. We use IMGT in the paper, so the imgt subfolder is the one we need.\n\nSince pdb files are heavy to process, it is common to generate a summary file for the structural database which records the basic information about each structure for fast access. We have provided the summary of the dataset retrieved on November 12, 2022 (`summaries/sabdab_summary.tsv`). 
Since the dataset is updated on a weekly basis, if you want to use the latest version, please download it from the [official website](http://opig.stats.ox.ac.uk/webapps/newsabdab/sabdab/about/).\n\n\n**(Optional) 4. Baselines**\n\nIf you are interested in the pipeline baselines, please include the following projects and integrate their dependencies according to your needs:\n\n- framework structure prediction:\n    - [IgFold](https://github.com/Graylab/IgFold/tree/main/igfold)\n- docking:\n    - [HDock](http://huanglab.phys.hust.edu.cn/software/hdocklite/)\n- CDR design:\n    - [MEAN](https://github.com/THUNLP-MT/MEAN)\n    - [Diffab](https://github.com/luost26/diffab)\n    - [Rosetta](https://new.rosettacommons.org/demos/latest/tutorials/install_build/install_build)\n- side-chain packing:\n    - [Rosetta](https://new.rosettacommons.org/demos/latest/tutorials/install_build/install_build)\n\nAfter adding these projects, please also remember to revise the corresponding paths in `./configs.py`. We have also provided the script for cascading the modules in `./scripts/pipeline_inference.sh`.\n\n\n## Experiments\n\nThe trained checkpoints for each task are provided at the [GitHub release page](https://github.com/THUNLP-MT/dyMEAN/releases/tag/v1.0.0). To use them, please download the ones you are interested in and save them into the folder `./checkpoints`. 
We provide the names, training configurations (under `./scripts/train/configs`), and descriptions of the checkpoints as follows:\n\n| checkpoint(s)                                | configure              | description                                    |\n| -------------------------------------------- | ---------------------- | ---------------------------------------------- |\n| cdrh3_design.ckpt                            | single_cdr_design.json | Epitope-binding CDR-H3 design                  |\n| struct_prediction.ckpt                       | struct_prediction.json | Complex structure prediction                   |\n| affinity_opt.ckpt \u0026 ddg_predictor.ckpt       | single_cdr_opt.json    | Affinity optimization on CDR-H3                |\n| multi_cdr_design.ckpt                        | multi_cdr_design.json  | Design all 6 CDRs simultaneously               |\n| multi_cdr_opt.ckpt \u0026 multi_cdr_ddg_predictor | multi_cdr_opt.json     | Optimize affinity on all 6 CDRs simultaneously |\n| full_design.ckpt                             | full_design.json       | Design the entire variable domain, including the framework region |\n\n### Data Preprocessing\n\n**Data**\n\nTo preprocess the raw data, we need to first generate summaries for each benchmark in JSON format, then split the datasets into train/validation/test sets, and finally transform the pdb data to Python objects. We have provided the script for all these procedures in `scripts/data_preprocess.sh`. Assuming the IMGT-renumbered pdb data are located at `all_structures/imgt/` and you want to store the processed data (~5G) at `all_data`, you can simply run:\n\n```bash\nbash scripts/data_preprocess.sh all_structures/imgt all_data\n```\nwhich takes about 1 hour to process SAbDab, RAbD, the IgFold test set, and SKEMPI V2.0. 
It is normal to see errors reported during this process because some antibody structures are wrongly annotated or wrongly formatted; these are dropped in the data cleaning phase.\n\n**(Optional) Conserved Template**\n\nWe have provided the conserved template from SAbDab in `./data/template.json`. If you are interested in the extraction process, it is also possible to extract a conserved template from a specified dataset (e.g. the training set for the CDR-H3 design task) by running the following command:\n\n```bash\npython -m data.framework_templates \\\n    --dataset ./all_data/RAbD/train.json \\\n    --out ./data/new_template.json\n```\n\n\n### CDR-H3 Design\nWe use SAbDab for training and RAbD for testing. Please first revise the settings in `scripts/train/configs/single_cdr_design.json` (path to datasets and other hyperparameters) and then run the command below for training:\n```bash\nGPU=0,1 bash scripts/train/train.sh scripts/train/configs/single_cdr_design.json\n```\nNormally the training procedure takes about 7 hours on 2 GeForce RTX 2080 Ti GPUs. We have also provided the trained checkpoint at `checkpoints/cdrh3_design.ckpt`. Then please revise the path to the test set in `scripts/test/test.sh` and run the following command for testing:\n```bash\nGPU=0 bash scripts/test/test.sh ./checkpoints/cdrh3_design.ckpt ./all_data/RAbD/test.json ./results\n```\nwhich will save the generated results to `./results`.\n\n### Complex Structure Prediction\nWe use SAbDab for training and IgFold for testing. The training and testing procedures are similar to those of CDR-H3 design. After revising the settings in `scripts/train/configs/struct_prediction.json` and `scripts/test/test.sh` as mentioned before, please run the following command for training:\n\n```bash\nGPU=0,1 bash scripts/train/train.sh scripts/train/configs/struct_prediction.json\n```\nNormally the training procedure takes about 8 hours on 2 GeForce RTX 2080 Ti GPUs. 
We have also provided the trained checkpoint at `checkpoints/struct_prediction.ckpt`. Then please run the following command for testing:\n```bash\nGPU=0 bash scripts/test/test.sh ./checkpoints/struct_prediction.ckpt ./all_data/IgFold/test.json ./results\n```\n\n### Affinity Optimization\nWe use SAbDab for training and the antibodies in SKEMPI V2.0 for testing. Similarly, please first revise the settings in `scripts/train/configs/single_cdr_opt.json`, `scripts/test/optimize_test.sh`, and additionally `scripts/train/train_predictor.sh`. Then please train dyMEANOpt (~5 h):\n```bash\nGPU=0,1 bash scripts/train/train.sh scripts/train/configs/single_cdr_opt.json\n```\nThen we need to train a ddG predictor on the representations of the generated complexes (~40 min):\n```bash\nGPU=0 bash scripts/train/train_predictor.sh checkpoints/cdrh3_opt.ckpt\n```\nWe have provided the trained checkpoints at `checkpoints/cdrh3_opt.ckpt` and `checkpoints/cdrh3_ddg_predictor.ckpt`. The optimization test can be run with:\n```bash\nGPU=0 bash scripts/test/optimize_test.sh checkpoints/cdrh3_opt.ckpt checkpoints/cdrh3_ddg_predictor.ckpt ./all_data/SKEMPI/test.json 0 50\n```\nwhich will do 50 steps of gradient search without restrictions on the maximum number of changed residues (change 0 to any number to restrict the upper bound of $\\Delta L$).\n\n\n## Proof-of-Concept Applications\n\nWe also provide an inference API and *in silico* demos for common real-world applications, which are located in `./api` and `./demos`.\n\n### Inference API\n\nWe provide the **design** API and the **optimize** API in `./api`, which can be easily integrated into Python code.\n\n#### Design\n\nThe **design** API (`./api/design.py`) can be used to generate CDRs given the sequences of the framework region, the PDB file of the antigen, and the epitope definitions. 
We will use an interesting scenario to illustrate the usage of the **design** API.\n\nWe want to design an antibody binding to the open state of the transient receptor potential cation channel subfamily V member 1 (TRPV1), which plays a critical role in acute and persistent pain. Instead of handcrafting the epitope on TRPV1, we try to mimic an existing binder, the double-knot toxin (DkTx). Therefore, we need to first extract the epitope definition by analyzing the binding pattern of the toxin, then design an antibody with given sequences of the framework regions.\n\n**1. Extract the Epitope Definition**\n\nWe provide the PDB file of the complex of the transient receptor potential cation channel subfamily V member 1 (TRPV1, chains ABCD) and the double-knot toxin (DkTx, chains EF) in `./demos/data/7l2m.pdb`. The original PDB has 4 symmetric units, so we manually split the two toxins (chains EF) in the middle to form 4 symmetric chains e, f, E, F. Each antibody only needs to focus on one unit. Here we choose chain E as an example.\n\nWe generate the epitope definition by analyzing the binding interface of chain E to TRPV1:\n\n```bash\npython -m api.binding_interface \\\n    --pdb ./demos/data/7l2m.pdb \\\n    --receptor A B C D \\\n    --ligand E \\\n    --out ./demos/data/E_epitope.json\n```\n\nNow the epitope definition (i.e. the residues of TRPV1 on the binding interface) is saved to `./demos/data/E_epitope.json`. By changing the value of the argument \"ligand\" to e, f, and F, we can obtain the epitope definitions for the other units (don't forget to revise the output path as well).\n\n**2. Obtain the Sequences of the Framework Regions**\n\nDepending on the final purpose of designing the antibody, framework regions with different physicochemical properties may be desired. 
Since here we are only providing a proof-of-concept case, we randomly pick one from the existing dataset:\n\n```bash\nheavy chain (H): 'QVQLKESGPGLLQPSQTLSLTCTVSGISLSDYGVHWVRQAPGKGLEWMGIIGHAGGTDYNSNLKSRVSISRDTSKSQVFLKLNSLQQEDTAMYFC----------WGQGIQVTVSSA'\nlight chain (L): 'YTLTQPPLVSVALGQKATITCSGDKLSDVYVHWYQQKAGQAPVLVIYEDNRRPSGIPDHFSGSNSGNMATLTISKAQAGDEADYYCQSWDGTNSAWVFGSGTKVTVLGQ'\n```\n\nThe original CDR-H3 is masked by \"-\". Designing multiple CDRs is also supported, which will be illustrated later.\n\n**3. Design the CDRs**\n\nThe last step is to design the CDRs with the **design** API:\n\n```python\nimport os\n\nfrom api.design import design\n\nckpt = './checkpoints/cdrh3_design.ckpt'\nroot_dir = './demos/data'\npdbs = [os.path.join(root_dir, '7l2m.pdb') for _ in range(4)]\ntoxin_chains = ['E', 'e', 'F', 'f']\nremove_chains = [toxin_chains for _ in range(4)]\nepitope_defs = [os.path.join(root_dir, c + '_epitope.json') for c in toxin_chains]\nidentifiers = [f'{c}_antibody' for c in toxin_chains]\n\n# use '-' for masking amino acids\nframeworks = [\n    (\n        ('H', 'QVQLKESGPGLLQPSQTLSLTCTVSGISLSDYGVHWVRQAPGKGLEWMGIIGHAGGTDYNSNLKSRVSISRDTSKSQVFLKLNSLQQEDTAMYFC----------WGQGIQVTVSSA'),\n        ('L', 'YTLTQPPLVSVALGQKATITCSGDKLSDVYVHWYQQKAGQAPVLVIYEDNRRPSGIPDHFSGSNSGNMATLTISKAQAGDEADYYCQSWDGTNSAWVFGSGTKVTVLGQ')\n    ) \\\n    for _ in pdbs\n]  # the first item of each tuple is heavy chain, the second is light chain\n\ndesign(ckpt=ckpt,  # path to the checkpoint of the trained model\n       gpu=0,      # the ID of the GPU to use\n       pdbs=pdbs,  # paths to the PDB file of each antigen (here every antigen is TRPV1)\n       epitope_defs=epitope_defs,  # paths to the epitope definitions\n       frameworks=frameworks,      # the given sequences of the framework regions\n       out_dir=root_dir,           # output directory\n       identifiers=identifiers,    # name of each output antibody\n       remove_chains=remove_chains,  # remove the original ligand\n       
enable_openmm_relax=True,   # use openmm to relax the generated structure\n       auto_detect_cdrs=False)  # manually use '-' to represent CDR residues\n```\n\nThis code is also included as an example in `./api/design.py`, so you can run it directly:\n\n```bash\npython -m api.design\n```\n\nHere we use \"-\" to mark the CDR-H3 manually, but you can also set `auto_detect_cdrs=True` to let the CDRs be determined automatically by the IMGT numbering system. The types of the CDRs to design will be automatically derived from the given checkpoint. Currently the API supports re-designing single or multiple CDRs, as well as designing the full antibody (by passing `\"-\" * length` as the input).\n\nEnabling OpenMM relaxation will slow down the generation process considerably, but will rectify the bond lengths and angles to conform to the physical constraints.\n\n#### Optimize\n\nThe **optimize** API (`./api/optimize.py`) is straightforward. We optimize `./demos/data/1nca.pdb` as an example:\n\n```python\n\nfrom api.optimize import optimize, ComplexSummary\n\nckpt = './checkpoints/cdrh3_opt.ckpt'\npredictor_ckpt = './checkpoints/cdrh3_ddg_predictor.ckpt'\nroot_dir = './demos/data/1nca_opt'\nsummary = ComplexSummary(\n    pdb='./demos/data/1nca.pdb',\n    heavy_chain='H',\n    light_chain='L',\n    antigen_chains=['N']\n)\noptimize(\n    ckpt=ckpt,  # path to the checkpoint of the trained model\n    predictor_ckpt=predictor_ckpt,  # path to the checkpoint of the trained ddG predictor\n    gpu=0,      # the ID of the GPU to use\n    cplx_summary=summary,   # summary of the complex as well as its PDB file\n    num_residue_changes=[1, 2, 3, 4, 5],  # generate 5 samples, changing at most 1, 2, 3, 4, and 5 residues, respectively\n    out_dir=root_dir,  # output directory\n    batch_size=16,     # batch size\n    num_workers=4,     # number of workers to use\n    optimize_steps=50  # number of steps for gradient descent\n)\n```\n\nThese statements are also included as an example in `./api/optimize.py`, so 
you can directly run them by:\n\n```bash\npython -m api.optimize\n```\n\nThen you will get the following results:\n```\n├── demos/data/1nca_opt\n│   ├── 1nca_0_1.pdb\n│   ├── 1nca_1_2.pdb\n│   ├── 1nca_2_3.pdb\n│   ├── 1nca_3_4.pdb\n│   ├── 1nca_4_5.pdb\n│   ├── 1nca_original.pdb\n```\nwhere `1nca_original.pdb` is the original complex, and `1nca_a_b.pdb` is the $a$-th candidate under the constraint of changing up to $b$ residues.\n\n#### Complex Structure Prediction\n\nThe **complex structure prediction** API (`./api/structure_prediction.py`) predicts the complex structure given the antigen, the sequences of the heavy chain and the light chain, and the definition of the epitope. Global docking is still very challenging, so we narrow the scope to the epitope of interest. We predict `./demos/data/1nca.pdb` as an example:\n\n```python\nimport os\n\nfrom api.structure_prediction import structure_prediction\n\nckpt = './checkpoints/struct_prediction.ckpt'\nroot_dir = './demos/data'\nn_sample = 10  # sample 10 conformations\npdbs = [os.path.join(root_dir, '1nca_antigen.pdb') for _ in range(n_sample)]\nepitope_defs = [os.path.join(root_dir, '1nca_epitope.json') for _ in range(n_sample)]\nidentifiers = [f'1nca_model_{i}' for i in range(n_sample)]\n\nseqs = [\n    (\n        ('H', 'QIQLVQSGPELKKPGETVKISCKASGYTFTNYGMNWVKQAPGKGLKWMGWINTNTGEPTYGEEFKGRFAFSLETSASTANLQINNLKNEDTATFFCARGEDNFGSLSDYWGQGTTVTVSS'),\n        ('L', 'DIVMTQSPKFMSTSVGDRVTITCKASQDVSTAVVWYQQKPGQSPKLLIYWASTRHIGVPDRFAGSGSGTDYTLTISSVQAEDLALYYCQQHYSPPWTFGGGTKLEIK')\n    ) \\\n    for _ in pdbs\n]  # the first item of each tuple is heavy chain, the second is light chain\n\nstructure_prediction(\n    ckpt=ckpt,  # path to the checkpoint of the trained model\n    gpu=0,      # the ID of the GPU to use\n    pdbs=pdbs,  # paths to the PDB file of each antigen (here the same antigen file for every sample)\n    epitope_defs=epitope_defs,  # paths to the epitope definitions\n    seqs=seqs,      # the given heavy-chain and light-chain sequences\n    
out_dir=root_dir,           # output directory\n    identifiers=identifiers,    # name of each output antibody\n    enable_openmm_relax=True)   # use openmm to relax the generated structure\n\n```\n\nThe code for this example is also included in `./api/structure_prediction.py`, so you can run it directly:\n\n```bash\npython -m api.structure_prediction\n```\n\nThen you will get the following results:\n```\n├── demos/data\n│   ├── 1nca_model_0.pdb\n│   ├── 1nca_model_1.pdb\n│   ├── 1nca_model_2.pdb\n│   ├── ...\n```\nwhere there should be a total of 10 sampled conformations. Note that the first or last few residues might be discarded in the results if they fall outside the variable domain according to the IMGT numbering system.\n\n\n### *In Silico* \"Display\"\n\n*In vitro* display is commonly used for selecting binding mutants from antibody libraries. Here we implement an *in silico* version with the **design** API by generating and filtering candidates from an existing dataset against the antigen with an epitope definition. Further, we need a metric to evaluate how well the generated antibody binds to the target. Here we use FoldX as the affinity predictor, so to run this demo, you may need to first download it from the [official website](https://foldxsuite.crg.eu/products#foldx) and revise the path in `./configs.py` correspondingly. 
We still use the TRPV1 example from the previous section, and use the RAbD benchmark as the antibody library providing the framework regions:\n\n```bash\npython -m demos.display \\\n    --ckpt checkpoints/multi_cdr_design.ckpt \\\n    --pdb demos/data/7l2m.pdb \\\n    --epitope_def demos/data/E_epitope.json \\\n    --library ./all_data/rabd_all.json \\\n    --n_sample 30 \\\n    --save_dir demos/display \\\n    --gpu 0\n```\n\nwhich will result in 30 candidates with their affinity predicted by FoldX.\n\n\n## Contact\n\nThank you for your interest in our work!\n\nPlease feel free to ask any questions about the algorithms and code, or about problems encountered in running them, so that we can make the project clearer and better. You can either create an issue in the GitHub repo or contact us at jackie_kxz@outlook.com.\n\n## Others\nThe files below are borrowed from existing repositories:\n\n- `evaluation/TMscore.cpp`: https://zhanggroup.org/TM-score/\n- `evaluation/ddg`: https://github.com/HeliXonProtein/binding-ddg-predictor\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fthunlp-mt%2Fdymean","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fthunlp-mt%2Fdymean","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fthunlp-mt%2Fdymean/lists"}