{"id":13685237,"url":"https://github.com/luoyunan/ECNet","last_synced_at":"2025-05-01T01:31:06.852Z","repository":{"id":41306632,"uuid":"352772541","full_name":"luoyunan/ECNet","owner":"luoyunan","description":"An evolutionary context-integrated deep learning framework for protein engineering","archived":false,"fork":false,"pushed_at":"2022-06-01T22:04:42.000Z","size":817,"stargazers_count":63,"open_issues_count":9,"forks_count":16,"subscribers_count":3,"default_branch":"main","last_synced_at":"2024-11-12T06:34:27.942Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"bsd-3-clause","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/luoyunan.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2021-03-29T20:14:07.000Z","updated_at":"2024-10-26T03:17:52.000Z","dependencies_parsed_at":"2022-08-27T08:00:45.780Z","dependency_job_id":null,"html_url":"https://github.com/luoyunan/ECNet","commit_stats":null,"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/luoyunan%2FECNet","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/luoyunan%2FECNet/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/luoyunan%2FECNet/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/luoyunan%2FECNet/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/luoyunan","download_url":"https://codeload.github.com/luoyunan/ECNet/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_co
unt":251808418,"owners_count":21647287,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-02T14:00:47.170Z","updated_at":"2025-05-01T01:31:06.531Z","avatar_url":"https://github.com/luoyunan.png","language":"Python","readme":"# ECNet\nAn evolutionary context-integrated deep learning framework for protein engineering\n\n- [ECNet](#ecnet)\n  - [Overview](#overview)\n  - [Installation](#installation)\n  - [Dependencies](#dependencies)\n  - [Quick Example](#quick-example)\n  - [Running on your own data](#running-on-your-own-data)\n  - [Generate local features using HHblits and CCMPred](#generate-local-features-using-hhblits-and-ccmpred)\n  - [Train on dataset A and test on dataset B](#train-on-dataset-a-and-test-on-dataset-b)\n  - [Citation](#citation)\n  - [Contact](#contact)\n\n## Overview\nECNet (evolutionary context-integrated neural network) is a deep learning model that guides protein engineering by predicting protein fitness from the sequence. It integrates local evolutionary context from homologous sequences that explicitly model residue-residue epistasis for the protein of interest with the global evolutionary context that encodes rich semantic and structural features from the enormous protein sequence universe. 
Please see our *Nature Communications* [paper](https://doi.org/10.1038/s41467-021-25976-8) for details.\n![ECNet](doc/overview.png)\n## Installation\nClone the GitHub repository and add its directory to the Python path:\n```bash\ngit clone https://github.com/luoyunan/ECNet.git\ncd ECNet\nexport PYTHONPATH=$PWD:$PYTHONPATH\n```\n## Dependencies\nThis package is tested with `Python 3.7` and `CUDA 10.1` on `Ubuntu 18.04`, with access to an Nvidia GeForce TITAN X GPU (12GB RAM) and Intel Xeon E5-2650 v3 CPU (2.30 GHz, 512G RAM). Please see `requirements.txt` for the necessary Python dependencies, all of which can be installed with `pip` or `conda`. Because of an issue installing `pytorch 1.4.0` with `pip`, please install `pytorch` with `conda` first.\n```bash\nconda install pytorch==1.4.0 cudatoolkit=10.1 -c pytorch\npip install -r requirements.txt\n```\n\n## Quick Example\n1. Download example data (~5.4MB) from Dropbox.\n    ```\n    wget https://www.dropbox.com/s/nkgubuwfwiyy0ze/data.tar.gz\n    tar xf data.tar.gz\n    ```\n2. Run the example script. The following script trains an ECNet model using the fitness data of the second RRM domain of Pab1 ([source](https://rnajournal.cshlp.org/content/19/11/1537.long)). The script randomly splits the data into 70% training, 10% validation, and 20% test data.\n    ```bash\n    CUDA_VISIBLE_DEVICES=0 python scripts/run_example.py \\\n        --train data/RRM_single.tsv \\\n        --fasta data/RRM.fasta \\\n        --local_feature data/RRM.braw \\\n        --output_dir ./output/RRM_CV \\\n        --save_prediction \\\n        --n_ensembles 2 \\\n        --epochs 100\n    ```\n    It typically takes no more than 15 min to run this example in our tested environment. 
The output (printed to stdout) is the correlation between predicted and ground-truth fitness values.\n\n## Running on your own data\nECNet has two required input files: 1) a FASTA file of the wild-type sequence, and 2) a TSV file that describes the fitness values of variants. Optional input files include the output of CCMPred for extracting local features and a separate test TSV file.\n\n1. **Sequence FASTA file** (`--fasta`, required). A regular FASTA file of the wild-type sequence. This file should contain only one sequence.\n2. **Fitness TSV file** (`--train`, required). Each line has two tab-separated columns, `mutation` and `score`, describing the fitness value of a variant. The `mutation` column is a string in the format `[ref][pos][alt]`, e.g., `S100T`, meaning that the 100th amino acid (index starting from 1) is mutated from `S` to `T`. If a variant has multiple mutations, they are concatenated with `;`. The `score` column is a numerical value that quantifies the variant's fitness. Example:    \n    ```\n    mutation    score\n    M1S         1.0\n    F12I;L30K   2.0\n    G89A        0.06\n    ```   \n    Note: This file is supplied using the `--train` argument. If no separate test data is provided through the `--test` argument, this TSV file will be split into three sets (train, valid, and test) using the ratios specified by `--split_ratio` (three float numbers). If a separate test TSV file is provided, this TSV file will be split into two sets (train and valid) as specified by `--split_ratio` (two float numbers).\n3. **Local features** (`--local_feature`, optional). A binary file generated by CCMPred using the `-b` option (note that to use the `-b` option you need to install CCMPred from its latest GitHub branch instead of the release; you may also need to install `libmsgpack-dev`. See instructions [below](#generate-local-features-using-hhblits-and-ccmpred)). ECNet will extract local features from this file. This file is optional. 
If it is not provided, add the `--no_local_feature` flag when running `run_example.py` (or, equivalently, set `use_local_features=False` for the `ECNet` class) and ECNet won't use the local features. See below for instructions on generating this binary file using HHblits and CCMPred.    \n4. **Additional test TSV file** (`--test`, optional). This file has the same format as the `--train` TSV file.\n\nWe suggest that users tune hyperparameters for each new protein. Several hyperparameters are exposed as arguments, e.g., `d_embed`, `d_model`, `d_h`, `n_layers`, etc.\n\n## Generate local features using HHblits and CCMPred\n1. Install [HHsuite](https://github.com/soedinglab/hh-suite) and [CCMPred](https://github.com/soedinglab/CCMpred) following their instructions. Note that CCMPred should be installed from the latest branch instead of the release; otherwise the `-b` option is not available. Also, as CCMPred uses `msgpack` to create the binary file, you may also need to install `libmsgpack-dev` on your system if it is not available. For example, on Ubuntu, you can run `sudo apt update` then `sudo apt install libmsgpack-dev`.\n2. Prepare a FASTA file `example.fasta` of the wild-type sequence of the protein of interest.\n3. Search for homologous sequences of the wild-type sequence using `hhblits` in HHsuite. (There are multiple ways to search for homologous sequences and format the alignment. Below we describe one that uses hhblits. Other approaches are also feasible, e.g., using jackhmmer as described in the [DeepSequence](https://www.nature.com/articles/s41592-018-0138-4) paper.)\n    ```bash\n    hhblits -i example.fasta \\\n        -d ${path_to_hhblits_database} \\\n        -o example.hhr \\\n        -oa3m example.a3m \\\n        -n 3 \\\n        -id 99 \\\n        -cov 50 \\\n        -cpu 8\n    ```\n4. 
Reformat the a3m output of hhblits to PSICOV format (solution modified from [here](https://github.com/soedinglab/bbcontacts/blob/master/TUTORIAL.md#step-13-reformat-the-output-alignment)). To run CCMpred, the alignment must be reformatted to the \"PSICOV\" format used by CCMpred. We can first use the `reformat.pl` script from the `hh-suite/scripts` directory to get an alignment in fasta format and then the `convert_alignment.py` script from the `CCMpred/scripts` directory to get the PSICOV format:\n    ```bash\n    ${path_to_hh-suite}/scripts/reformat.pl example.a3m example.fas -r\n    python ${path_to_CCMpred}/scripts/convert_alignment.py example.fas fasta example.psc\n    ```\n5. Run CCMPred\n    ```bash \n    ccmpred example.psc example.mat -b example.braw -d 0\n    ```\n6. Use the argument `--local_feature example.braw` to provide the local features to ECNet.\n\n## Train on dataset A and test on dataset B\nThe following examples show how to train ECNet on dataset A (passed via `--train`) and test it on another dataset B (passed via `--test`).\n- Example 1: train on single-mutant fitness data of RRM ([source](https://rnajournal.cshlp.org/content/19/11/1537.long)), and predict for double-mutants\n    ```\n    CUDA_VISIBLE_DEVICES=0 python scripts/run_example.py \\\n        --train data/RRM_single.tsv \\\n        --test data/RRM_double.tsv \\\n        --fasta data/RRM.fasta \\\n        --split_ratio 0.9 0.1 \\\n        --local_feature data/RRM.braw \\\n        --output_dir ./output/RRM \\\n        --save_checkpoint \\\n        --n_ensembles 2 \\\n        --epochs 100\n    ```\n- Example 2: you can also load the trained model using the `--saved_model_dir` argument and predict for the test dataset:\n    ```\n    CUDA_VISIBLE_DEVICES=0 python scripts/run_example.py \\\n        --test data/RRM_double.tsv \\\n        --fasta data/RRM.fasta \\\n        --local_feature data/RRM.braw \\\n        --n_ensembles 2 \\\n        --output_dir ./output/RRM \\\n        
--saved_model_dir ./output/RRM\n    ```\n\n## Citation\n\u003e Luo, Y. et al. ECNet is an evolutionary context-integrated deep learning framework for protein engineering. *Nat Commun* **12**, 5743 (2021). https://doi.org/10.1038/s41467-021-25976-8\n\n```\n@article{luo2021ecnet,\n  doi = {10.1038/s41467-021-25976-8},\n  url = {https://doi.org/10.1038/s41467-021-25976-8},\n  year = {2021},\n  month = sep,\n  publisher = {Springer Science and Business Media {LLC}},\n  volume = {12},\n  number = {1},\n  author = {Yunan Luo and Guangde Jiang and Tianhao Yu and Yang Liu and Lam Vo and Hantian Ding and Yufeng Su and Wesley Wei Qian and Huimin Zhao and Jian Peng},\n  title = {{ECNet} is an evolutionary context-integrated deep learning framework for protein engineering},\n  journal = {Nature Communications}\n}\n```\n## Contact\nPlease submit GitHub issues or contact Yunan Luo (luoyunan[at]gmail[dot]com) for any questions related to the source code.\n","funding_links":[],"categories":["Design","蛋白质结构"],"sub_categories":["网络服务_其他"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fluoyunan%2FECNet","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fluoyunan%2FECNet","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fluoyunan%2FECNet/lists"}