{"id":13751774,"url":"https://github.com/aqlaboratory/rgn","last_synced_at":"2025-04-07T17:08:46.320Z","repository":{"id":61819435,"uuid":"124916059","full_name":"aqlaboratory/rgn","owner":"aqlaboratory","description":"Recurrent Geometric Networks for end-to-end differentiable learning of protein structure","archived":false,"fork":false,"pushed_at":"2019-08-01T14:17:59.000Z","size":36414,"stargazers_count":327,"open_issues_count":18,"forks_count":87,"subscribers_count":39,"default_branch":"master","last_synced_at":"2025-03-31T14:14:12.275Z","etag":null,"topics":["deep-learning","deep-neural-networks","protein-structure","protein-structure-prediction"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/aqlaboratory.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2018-03-12T16:09:01.000Z","updated_at":"2025-01-15T09:18:45.000Z","dependencies_parsed_at":"2022-10-21T20:00:18.277Z","dependency_job_id":null,"html_url":"https://github.com/aqlaboratory/rgn","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aqlaboratory%2Frgn","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aqlaboratory%2Frgn/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aqlaboratory%2Frgn/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aqlaboratory%2Frgn/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/aqlaboratory","download_url":"https://codeload.github.com/aqlab
oratory/rgn/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247694876,"owners_count":20980733,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["deep-learning","deep-neural-networks","protein-structure","protein-structure-prediction"],"created_at":"2024-08-03T09:00:54.483Z","updated_at":"2025-04-07T17:08:46.294Z","avatar_url":"https://github.com/aqlaboratory.png","language":"Python","readme":"# Recurrent Geometric Networks\nThis is the reference (TensorFlow) implementation of recurrent geometric networks (RGNs), described in the paper [End-to-end differentiable learning of protein structure](https://www.cell.com/cell-systems/fulltext/S2405-4712(19)30076-6). \n\n## Installation and requirements\nExtract all files in the [model](https://github.com/aqlaboratory/rgn/tree/master/model) directory in a single location and use `protling.py`, described further below, to train new models and predict structures. Below are the language requirements and package dependencies:\n\n* Python 2.7\n* TensorFlow \u003e= 1.4 (tested up to 1.12)\n* setproctitle\n\n## Usage\nThe [`protling.py`](https://github.com/aqlaboratory/rgn/blob/master/model/protling.py) script facilitates training of and prediction using RGN models. Below are typical use cases. 
The script also accepts a number of command-line options whose functionality can be queried using the `--help` option.\n\n### Train a new model or continue training an existing model\nRGN models are described using a configuration file that controls hyperparameters and architectural choices. For a list of available options and their descriptions, see its [documentation](https://github.com/aqlaboratory/rgn/blob/master/CONFIG.md). Once a configuration file has been created, along with a suitable dataset (download a ready-made [ProteinNet](https://github.com/aqlaboratory/proteinnet) dataset or create a new one from scratch using the [`convert_to_tfrecord.py`](https://github.com/aqlaboratory/rgn/blob/master/model/convert_to_tfrecord.py) script), the following directory structure must be created:\n\n```\n\u003cbaseDirectory\u003e/runs/\u003crunName\u003e/\u003cdatasetName\u003e/\u003cconfigurationFile\u003e\n\u003cbaseDirectory\u003e/data/\u003cdatasetName\u003e/[training,validation,testing]\n```\n\nThe first path points to the configuration file and the second path to the directories containing the training, validation, and possibly test sets. Note that `\u003crunName\u003e` and `\u003cdatasetName\u003e` are user-defined variables specified in the configuration file that encode the name of the model and dataset, respectively.\n\nTraining of a new model can then be invoked by calling:\n\n```\npython protling.py \u003cconfigurationFilePath\u003e -d \u003cbaseDirectory\u003e\n```\n\nDownload a pre-trained model for an example of a correctly defined directory structure. 
Note that ProteinNet training sets come in multiple \"thinnings\" and only one should be used at a time by placing it in the main training directory.\n\nTo resume training an existing model, run the command above for a previously trained model with saved checkpoints.\n\n### Predict sequences in ProteinNet TFRecords format using a trained model\nTo predict the structures of proteins already in ProteinNet `TFRecord` format using an existing model with a saved checkpoint, call:\n\n```\npython protling.py \u003cconfigFilePath\u003e -d \u003cbaseDirectory\u003e -p -g0\n```\n\nThis predicts the structures of the dataset specified in the configuration file. By default only the validation set is predicted, but this can be changed using the `-e` option, e.g. `-e weighted_testing` to predict the test set. The `-g0` option sets the GPU to be used to the one with index 0. If a different GPU is available, change the setting appropriately.\n\n### Predict structure of a single new sequence using a trained model\nIf all you have is a single sequence for which you wish to make a prediction, several steps must be performed. First, a PSSM must be created by running JackHMMer (or a similar tool) against a sequence database. Next, the resulting PSSM must be combined with the sequence in a ProteinNet record. Finally, the file must be converted to the `TFRecord` format. Predictions can then be made as previously described.\n\nBelow is an example of how to do this using the supplied scripts (in [data_processing](https://github.com/aqlaboratory/rgn/upload/master/data_processing)) and one of the pre-trained models, assumed to be unzipped in `\u003cbaseDirectory\u003e`. HMMER must also be installed. The raw sequence databases (`\u003cfastaDatabase\u003e`) used in building PSSMs can be obtained from [here](https://github.com/aqlaboratory/proteinnet/blob/master/docs/raw_data.md). 
The script below assumes that `\u003csequenceFile\u003e` only contains a single sequence in the FASTA file format.\n\n```\njackhmmer.sh \u003csequenceFile\u003e \u003cfastaDatabase\u003e\npython convert_to_proteinnet.py \u003csequenceFile\u003e\npython convert_to_tfrecord.py \u003csequenceFile\u003e.proteinnet \u003csequenceFile\u003e.tfrecord 42\ncp \u003csequenceFile\u003e.tfrecord \u003cbaseDirectory\u003e/data/\u003cdatasetName\u003e/testing/\npython protling.py \u003cbaseDirectory\u003e/runs/\u003crunName\u003e/\u003cdatasetName\u003e/\u003cconfigurationFile\u003e -d \u003cbaseDirectory\u003e -p -e weighted_testing -g0\n```\n\nThe first line searches the supplied database for matches to the supplied sequence and extracts a PSSM out of the results. It will generate multiple new files. These are then used in the second line to construct a text-based ProteinNet file (with 42 entries per evolutionary profile, compatible with the pre-trained RGN models). The third line converts the file to `TFRecords` format, and the fourth line copies the file to the testing directory of a pre-trained model. Finally, the fifth line predicts the structure using the pre-trained RGN model. The outputs will be placed in `\u003cbaseDirectory\u003e/runs/\u003crunName\u003e/\u003cdatasetName\u003e/\u003clatestIterationNumber\u003e/outputsTesting/` and will consist of two files: a `.tertiary` file, which contains the atomic coordinates, and a `.recurrent_states` file, which contains the RGN latent representation of the sequence. The `-g0` option sets the GPU to be used to the one with index 0. If a different GPU is available, change the setting appropriately.\n\n## Pre-trained models\nBelow we make available pre-trained RGN models using the [ProteinNet](https://github.com/aqlaboratory/proteinnet) 7 - 12 datasets as checkpointed TF graphs. 
These models are identical to the ones used in reporting results in the [_Cell Systems_ paper](https://www.cell.com/cell-systems/fulltext/S2405-4712(19)30076-6), except for the CASP 11 model which is slightly different due to using a newer codebase.\n\n| [CASP7](https://sharehost.hms.harvard.edu/sysbio/alquraishi/rgn_models/RGN7.tar.gz) | [CASP8](https://sharehost.hms.harvard.edu/sysbio/alquraishi/rgn_models/RGN8.tar.gz) | [CASP9](https://sharehost.hms.harvard.edu/sysbio/alquraishi/rgn_models/RGN9.tar.gz) | [CASP10](https://sharehost.hms.harvard.edu/sysbio/alquraishi/rgn_models/RGN10.tar.gz) | [CASP11](https://sharehost.hms.harvard.edu/sysbio/alquraishi/rgn_models/RGN11.tar.gz) | [CASP12](https://sharehost.hms.harvard.edu/sysbio/alquraishi/rgn_models/RGN12.tar.gz) |\n| --- | --- | --- | --- | --- | --- |\n\nTo train new models from scratch using the same hyperparameter choices as the above models, use the appropriate configuration file from [here](https://github.com/aqlaboratory/rgn/tree/master/configurations).\n\n## PyTorch implementation\nThe reference RGN implementation is currently only available in TensorFlow. However, the [OpenProtein](https://github.com/OpenProtein/openprotein) project implements various aspects of the RGN model in PyTorch, and [PyTorch-RGN](https://github.com/conradry/pytorch-rgn) is a work-in-progress implementation of the RGN model.\n\n## Reference\n[End-to-end differentiable learning of protein structure, Cell Systems 2019](https://www.cell.com/cell-systems/fulltext/S2405-4712(19)30076-6)\n\n## Funding\nThis work was supported by NIGMS grant P50GM107618 and NCI grant U54-CA225088.\n","funding_links":[],"categories":["Ranked by starred repositories"],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Faqlaboratory%2Frgn","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Faqlaboratory%2Frgn","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Faqlaboratory%2Frgn/lists"}