{"id":18975739,"url":"https://github.com/babaid/aaperturb","last_synced_at":"2025-06-10T14:06:13.502Z","repository":{"id":194094948,"uuid":"690097849","full_name":"babaid/AAPerturb","owner":"babaid","description":"A C++ library for the creation of a large dataset of amino acid sidechain perturbations, own PDB Parser code included and some other things related.","archived":false,"fork":false,"pushed_at":"2024-07-19T13:58:55.000Z","size":8890,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-06-02T14:08:34.716Z","etag":null,"topics":["big-data","gnn","molecular-dynamics","perturbation-methods","structural-bioinformatics"],"latest_commit_sha":null,"homepage":"","language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/babaid.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-09-11T14:23:09.000Z","updated_at":"2024-07-19T13:58:59.000Z","dependencies_parsed_at":"2023-09-11T20:30:48.981Z","dependency_job_id":"452d8920-70b9-4df8-933a-2af319b0c789","html_url":"https://github.com/babaid/AAPerturb","commit_stats":null,"previous_names":["babaid/aaperturb"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/babaid%2FAAPerturb","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/babaid%2FAAPerturb/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/babaid%2FAAPerturb/releases","manifests_url":"https://repos.ecosyste.ms/api/v1
/hosts/GitHub/repositories/babaid%2FAAPerturb/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/babaid","download_url":"https://codeload.github.com/babaid/AAPerturb/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/babaid%2FAAPerturb/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":259088563,"owners_count":22803657,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["big-data","gnn","molecular-dynamics","perturbation-methods","structural-bioinformatics"],"created_at":"2024-11-08T15:20:20.223Z","updated_at":"2025-06-10T14:06:13.457Z","avatar_url":"https://github.com/babaid.png","language":"C++","funding_links":[],"categories":[],"sub_categories":[],"readme":"# AAPerturb\nA C++ library for the creation of a large dataset of amino acid sidechain perturbations, own PDB Parser code included and some other things related as partly described in [Pre-training of Graph Neural Network for Modeling Effects of Mutations on Protein-Protein Binding Affinity](https://arxiv.org/abs/2008.12473).\n\nThe approach described in the paper consists of perturbing a sidechain and trying to reconstruct the original sidechain via an autoencoder.\n\n\n## PDB Parser\nIt has an own PDB parser which is mostly educational/for myself as there is probably something available that does this.\n\nThe proteins are basically represented as a map chainID -\u003e Vector of Residues, where the Residues contain the atoms.\nA particular reason I implemented this myself, is that most libraries that are 
available in Python overcomplicate the task of parsing PDBs,\nintroducing elaborate classes whose functionality is useful in many cases, but in my case I just want to do geometric transformations on simple ATOM records.\n\nFor some of the implementations refer to Biopython, rdkit, and the really useful Biopandas.\n\n\n## Geometrical operations\n\nThere are different ways to implement rotations, translations and calculations of distances for molecules.\nAlthough it would have been more efficient to work with matrices of atomic coordinates, this would mean keeping track of exactly how those matrices are ordered in terms of atoms and residues.\nSo, to keep it simple and for bookkeeping reasons, I decided to work with the coordinates of individual atoms, grouped into residues and chains.\n\nRotating around bonds is a well described [problem](https://sites.google.com/site/glennmurray/glenn-murray-ph-d/rotation-matrices-and-formulas/rotation-about-an-arbitrary-axis-in-3-dimensions), which can be solved with the rotation matrix given at that link.\n\n## Random Perturbations\n\nThe main goal of the package is to perform random perturbations of the sidechain of a random amino acid; this amino acid should reside\non the interface between two chains of the protein-protein complex.\nThe created data set can then be used for an autoencoder-like machine learning approach to capture PPIs and the effects of mutations on proteins.\n\nThe execution flow is as follows:\n1. Find interface residues on the PP complex, given a cutoff value\n2. Choose random interface residue\n3. 
Perturb chosen residue\n\nFor now this perturbation is just a torsion about the sidechain axes that could in theory rotate freely.\nIn my view, the only condition for accepting a conformation after a perturbation is that there are no clashes between atoms.\nIt is also possible to sample the perturbation \\Chi angles from physically relevant distributions as defined for example in [Dunbrack (2011)](http://dunbrack.fccc.edu/lab/bbdep2010).\n\n\nFinally, to be specific about the extra functionality I would like to implement in the near future:\nJust as these geometrical transformations are perturbations, we can look at mutations in the same way. While the concept proposed in the first-mentioned paper is to recreate atomic coordinates after a perturbation, a similar operation could be used in an alchemical way, recreating coordinates and adding atoms/graph nodes to a molecular graph, with the goal of reducing its strain/energy or recreating the original structure.\nIn principle this could be used to predict the probability of certain mutations, which is obviously something cool and useful.\n\nAn example of such a random perturbation can be seen in the next picture.\n\n![image](image/image2.png)\n\n## Requirements\n\nThe library was built using C++23, with g++ and gcc version 13.\nTo build it, clone this project and run the following inside the project directory:\n\n\n```\ngit submodule update --init --recursive\n\nmkdir build\ncd build\ncmake -DCMAKE_INSTALL_PREFIX=\"PATH TO YOUR PREFERRED INSTALL DIR\" ..\nmake -jN aaperturb install \n\n```\n\n## Cleaning of PDB files.\n\nTo produce output that is useful, a preprocessing step may be required, which deals with removal of waters, deprotonation, removal of alternate locations, removal of insertions and reindexing atoms and residues.\nThis is crucial for the main program to run without bugs and random errors; also, the fewer atoms there are, the less expensive the calculations get.\nThe residue numbering has to start in each 
chain at 1.\nCleaning the PDBs seemed easier to do in Python with the already available PandasPDB package. The script is located at scripts/pdbcleaner.py.\nIt needs the following packages: biopandas, pandas, numpy, alive_progress\nYou can run it as follows:\n```\npython pdbcleaner.py -i [INPUT_DIR/FILE] -o [OUTPUT_DIR/FILE]\n```\n\nIt is going to clean all the files and save them. Specifying the same directory for input and output is prohibited, to protect your dataset.\n\n## Running AAPerturb\n\n```\naaperturb -i [STR] -o [STR] --max-bbangle [FLOAT] --max-schangle [FLOAT] \n```\n\nYou have to provide an input directory (where the cleaned PDBs are) and an output directory. The last two arguments set how large a torsion we allow on the backbone and the sidechains. You should choose small angles, less than 20. This is because for small angles it is safe to assume that there are no clashes after perturbation; that is, we don't have to calculate distances between atoms, and everything runs a lot faster.\n\n\n#### Note 1\n\nOf course you may want the perturbations to be \"realistic\", as in the mentioned paper. It seems they took a shortcut and used FoldX (which uses the Dunbrack rotamers) to mutate/perturb residues. Given the torsion angle distributions from the Dunbrack library, you can straightforwardly use them instead of the uniform distribution that I used. 
Currently I use the appropriate experimental Chi1 and Chi2 angles, but for the other torsions I use an approximation of small perturbations.\n\nOf course this means that you have to be more careful about clashes, and also introduce some other stuff into the code.\n\n#### Note 2\n\nNo one has actually tested if small perturbations work as well as the rotamers, which I am working on currently using a structural autoencoder.\n\n\n\n\n\n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbabaid%2Faaperturb","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbabaid%2Faaperturb","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbabaid%2Faaperturb/lists"}