{"id":13774179,"url":"https://github.com/songlab-cal/tape","last_synced_at":"2025-04-09T11:07:03.314Z","repository":{"id":41174121,"uuid":"226796593","full_name":"songlab-cal/tape","owner":"songlab-cal","description":"Tasks Assessing Protein Embeddings (TAPE), a set of five biologically relevant semi-supervised learning tasks spread across different domains of protein biology.","archived":false,"fork":false,"pushed_at":"2022-12-11T00:20:03.000Z","size":860,"stargazers_count":631,"open_issues_count":26,"forks_count":129,"subscribers_count":22,"default_branch":"master","last_synced_at":"2024-05-12T14:21:56.636Z","etag":null,"topics":["benchmark","dataset","deep-learning","language-modeling","protein-sequences","protein-structure","pytorch","semi-supervised-learning"],"latest_commit_sha":null,"homepage":"https://www.biorxiv.org/content/10.1101/676825v1","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"bsd-3-clause","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/songlab-cal.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2019-12-09T06:02:49.000Z","updated_at":"2024-05-12T08:22:35.000Z","dependencies_parsed_at":"2022-07-14T10:21:28.831Z","dependency_job_id":null,"html_url":"https://github.com/songlab-cal/tape","commit_stats":null,"previous_names":[],"tags_count":2,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/songlab-cal%2Ftape","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/songlab-cal%2Ftape/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/songlab-cal%2Ftape/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/
repositories/songlab-cal%2Ftape/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/songlab-cal","download_url":"https://codeload.github.com/songlab-cal/tape/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248027407,"owners_count":21035594,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["benchmark","dataset","deep-learning","language-modeling","protein-sequences","protein-structure","pytorch","semi-supervised-learning"],"created_at":"2024-08-03T17:01:24.398Z","updated_at":"2025-04-09T11:07:03.293Z","avatar_url":"https://github.com/songlab-cal.png","language":"Python","funding_links":[],"categories":["By Methodology","Datasets \u0026 Benchmarks","Benchmarks \u0026 Datasets","Deep Learning"],"sub_categories":["Deep Learning \u0026 Protein Language Models","Text + BioMulti","Clinical Trial"],"readme":"\n\n# Tasks Assessing Protein Embeddings (TAPE)\n\n![](https://github.com/songlab-cal/tape/workflows/Build/badge.svg)\n\nData, weights, and code for running the TAPE benchmark on a trained protein embedding. We provide a pretraining corpus, five supervised downstream tasks, pretrained language model weights, and benchmarking code. This code has been updated to use pytorch - as such previous pretrained model weights and code will not work. 
The previous tensorflow TAPE repository is still available at [https://github.com/songlab-cal/tape-neurips2019](https://github.com/songlab-cal/tape-neurips2019).\n\nThis repository is *not* an effort to maintain maximum compatibility and reproducibility with the original paper, but is instead meant to facilitate ease of use and future development (both for us, and for the community). Although we provide much of the same functionality, we have not tested every aspect of training on all models/downstream tasks, and we have also made some deliberate changes. Therefore, if your goal is to reproduce the results from our paper, please use the original code.\n\nOur paper is available at [https://arxiv.org/abs/1906.08230](https://arxiv.org/abs/1906.08230).\n\nSome documentation is incomplete. We will try to fill it in over time, but if there is something you would like an explanation for, please open an issue so we know where to focus our effort!\n\n**Update 09/26/2020:** We no longer recommend trying to train directly with TAPE's training code. It will likely still work for some time, but will not be updated for future pytorch versions. Internally, we have been working with different frameworks for training (specifically Pytorch Lightning and Fairseq). We strongly recommend using a framework like these, as it offloads the requirement of maintaining compatibility with Pytorch versions. TAPE models will continue to be available, and if the code is working for you, feel free to use it. 
However, we will not be fixing issues regarding multi-GPU errors, OOM errors, etc. during training.\n\n## Contents\n\n* [Installation](#installation)\n* [Examples](#examples)\n   * [Huggingface API for Loading Pretrained Models](#huggingface-api-for-loading-pretrained-models)\n   * [Embedding Proteins with a Pretrained Model](#embedding-proteins-with-a-pretrained-model)\n   * [Training a Language Model](#training-a-language-model)\n   * [Evaluating a Language Model](#evaluating-a-language-model)\n   * [Training a Downstream Model](#training-a-downstream-model)\n   * [Evaluating a Downstream Model](#evaluating-a-downstream-model)\n   * [List of Models and Tasks](#list-of-models-and-tasks)\n   * [Adding New Models and Tasks](#adding-new-models-and-tasks)\n* [Data](#data)\n   * [LMDB Data](#lmdb-data)\n   * [Raw Data](#raw-data)\n* [Leaderboard](#leaderboard)\n    * [Secondary Structure](#secondary-structure)\n    * [Contact Prediction](#contact-prediction)\n    * [Remote Homology Detection](#remote-homology-detection)\n    * [Fluorescence](#fluorescence)\n    * [Stability](#stability)\n* [Citation Guidelines](#citation-guidelines)\n\n## Installation\n\nWe recommend that you install `tape` into a python [virtual environment](https://virtualenv.pypa.io/en/latest/) using\n\n```bash\n$ pip install tape_proteins\n```\n\n## Examples\n\n### Huggingface API for Loading Pretrained Models\n\nWe build on the excellent [huggingface repository](https://github.com/huggingface/transformers) and use this as an API to define models, as well as to provide pretrained models. 
By using this API, pretrained models will be automatically downloaded when necessary and cached for future use.\n\n```python\nimport torch\nfrom tape import ProteinBertModel, TAPETokenizer\nmodel = ProteinBertModel.from_pretrained('bert-base')\ntokenizer = TAPETokenizer(vocab='iupac')  # iupac is the vocab for TAPE models, use unirep for the UniRep model\n\n# Pfam Family: Hexapep, Clan: CL0536\nsequence = 'GCTVEDRCLIGMGAILLNGCVIGSGSLVAAGALITQ'\ntoken_ids = torch.tensor([tokenizer.encode(sequence)])\noutput = model(token_ids)\nsequence_output = output[0]\npooled_output = output[1]\n\n# NOTE: pooled_output is *not* trained for the transformer, do not use\n# w/o fine-tuning. A better option for now is to simply take a mean of\n# the sequence output\n```\n\nCurrently available pretrained models are:\n\n* bert-base (Transformer model)\n* babbler-1900 ([UniRep](https://www.biorxiv.org/content/10.1101/589333v1) model)\n* xaa, xab, xac, xad, xae ([trRosetta](https://www.pnas.org/content/117/3/1496) model)\n\nIf there is a particular pretrained model that you would like to use, please open an issue and we will try to add it!\n\n### Embedding Proteins with a Pretrained Model\n\nGiven an input fasta file, you can generate a `.npz` file containing the protein embeddings via the `tape-embed` command.\n\nSuppose this is our input fasta file:\n\n```\n\u003eseq1\nGCTVEDRCLIGMGAILLNGCVIGSGSLVAAGALITQ\n\u003eseq2\nRTIKVRILHAIGFEGGLMLLTIPMVAYAMDMTLFQAILLDLSMTTCILVYTFIFQWCYDILENR\n```\n\nThen we could embed it with the UniRep babbler-1900 model like so:\n\n```bash\ntape-embed unirep my_input.fasta output_filename.npz babbler-1900 --tokenizer unirep\n```\n\nThere is no need to download the pretrained model manually - it will be automatically downloaded if needed. In addition, note the change of tokenizer to the `unirep` tokenizer. UniRep uses a different vocabulary, and so requires this tokenizer. 
If you get a cublas runtime error, please double check that you changed tokenizer correctly.\n\nThe embed function is fully batched and will automatically distribute across as many GPUs as the machine has available. On a Titan Xp, it can process around 200 sequences / second.\n\nOnce we have the output file, we can load it into numpy like so:\n\n```python\nimport numpy as np\n\narrays = np.load('output_filename.npz', allow_pickle=True)\n\nlist(arrays.keys())  # Will output the name of the keys in your fasta file (or if unnamed then '0', '1', ...)\n\narrays[\u003cprotein_id\u003e]  # Returns a dictionary with keys 'pooled' and 'avg', (or 'seq' if using the --full_sequence_embed flag)\n```\n\nBy default, to save memory, TAPE returns the average of the sequence embedding along with the pooled embedding generated through the pooling function. For some models (like UniRep), the pooled embedding is trained, and so can be used out of the box. For other models (like the transformer), the pooled embedding is not trained, and so the average embedding should be used. We will be looking into methods of self-supervised training the pooled embedding for all models in the future.\n\nIf you would like the full embedding rather than the average embedding, this can be specified to `tape-embed` by passing the `--full_sequence_embed` flag.\n\n### Training a Language Model\n\nTape provides two commands for training, `tape-train` and `tape-train-distributed`. The first command uses standard pytorch data distribution to distribute across all available GPUs. The second one uses `torch.distributed.launch`-style multiprocessing to distribute across the number of specified GPUs (and could also be used for distributing across multiple nodes). 
We generally recommend using the second command, as it can provide a 10-15% speedup, but both will work.\n\nTo train the transformer on masked language modeling, for example, you could run this\n\n```bash\ntape-train-distributed transformer masked_language_modeling --batch_size BS --learning_rate LR --fp16 --warmup_steps WS --nproc_per_node NGPU --gradient_accumulation_steps NSTEPS\n```\n\nThere are a number of features used in training:\n\n    * Distributed training via multiprocessing\n    * Half-precision training\n    * Gradient accumulation\n    * Gradient-allreduce post accumulation\n    * Automatic batch by sequence length\n\nThe first feature you are likely to need is the `gradient_accumulation_steps`. TAPE specifies a relatively high batch size (1024) by default. This is the batch size that will be used *per backwards pass*. This number will be divided by the number of GPUs as well as the gradient accumulation steps. So with a batch size of 1024, 2 GPUs, and 1 gradient accumulation step, you will do 512 examples per GPU. If you run out of memory (and you likely will), TAPE provides a clear error message and will tell you to increase the gradient accumulation steps.\n\nThere are additional features as well that are not talked about here. See `tape-train-distributed --help` for a list of all commands.\n\n### Evaluating a Language Model\n\nOnce you've trained a language model, you'll have a pretrained weight file located in the `results` folder. To evaluate this model, you can do one of two things. One option is to directly evaluate the language modeling accuracy / perplexity. `tape-train` will report the perplexity over the training and validation set at the end of each epoch. However, we find empirically that language modeling accuracy and perplexity are poor measures of performance on downstream tasks. 
Therefore, to evaluate the language model we strongly recommend training your model on one or all of our provided tasks.\n\n### Training a Downstream Model\n\nTraining a model on a downstream task can also be done with the `tape-train` command. Simply use the same syntax as with training a language model, adding the flag `--from_pretrained \u003cpath_to_your_saved_results\u003e`. To train a pretrained transformer on secondary structure prediction, for example, you would run\n\n```bash\ntape-train-distributed transformer secondary_structure \\\n\t--from_pretrained results/\u003cpath_to_folder\u003e \\\n\t--batch_size BS \\\n\t--learning_rate LR \\\n\t--fp16 \\\n  \t--warmup_steps WS \\\n  \t--nproc_per_node NGPU \\\n  \t--gradient_accumulation_steps NSTEPS \\\n  \t--num_train_epochs NEPOCH \\\n  \t--eval_freq EF \\\n  \t--save_freq SF\n```\n\nFor training a downstream model, you will likely need to experiment with hyperparameters to achieve the best results (optimal hyperparameters vary per-task and per-model). The set of parameters to consider are\n\n```\n* Batch size\n* Learning rate\n* Warmup steps\n* Num train epochs\n```\n\nThese can all have significant effects on performance, and by default are set to maximize performance on language modeling rather than downstream tasks. In addition, the `eval_freq` and `save_freq` parameters can be useful, as they reduce the frequency of running validation passes and saving the model, respectively. Since downstream task epochs are much shorter (and you're likely to need more of them), it makes sense to increase these values so that training takes less time.\n\n### Evaluating a Downstream Model\n\nTo evaluate your downstream task model, we provide the `tape-eval` command. This command will output your model predictions along with a set of metrics that you specify. At the moment, we support mean squared error (`mse`), mean absolute error (`mae`), Spearman's rho (`spearmanr`), and accuracy (`accuracy`). 
Precision @ L/5 will be added shortly.\n\nThe syntax for the command is\n\n```bash\ntape-eval MODEL TASK TRAINED_MODEL_FOLDER --metrics METRIC1 METRIC2 ...\n```\n\nso to evaluate a transformer trained on secondary structure, we can run\n\n```bash\ntape-eval transformer secondary_structure results/\u003cpath_to_trained_model\u003e --metrics accuracy\n```\n\nThis will report the overall accuracy, and will also dump a `results.pkl` file into the trained model directory for you to analyze however you like.\n\n### trRosetta\n\nWe have recently re-implemented the trRosetta model from Yang et al. (2020). A link to the original repository, which was used as a basis for this re-implementation, can be found [here](https://github.com/gjoni/trRosetta). We provide a pytorch implementation and dataset to allow you to play around with the model. Data is available [here](http://s3.amazonaws.com/songlabdata/proteindata/data_pytorch/trrosetta.tar.gz). This is the same as the data in the original paper, however we've added train / val split files to allow you to train your own model reproducibly. To use this model\n\n```python\nfrom tape import TRRosetta\nfrom tape.datasets import TRRosettaDataset\n\n# Download data and place it under `\u003cdata_path\u003e/trrosetta`\n\ntrain_data = TRRosettaDataset('\u003cdata_path\u003e', 'train')  # will subsample MSAs\nvalid_data = TRRosettaDataset('\u003cdata_path\u003e', 'valid')  # will not subsample MSAs\n\nmodel = TRRosetta.from_pretrained('xaa')  # valid choices are 'xaa', 'xab', 'xac', 'xad', 'xae'. 
Each corresponds to one of the ensemble models.\n\nbatch = train_data.collate_fn([train_data[0]])\nloss, predictions = model(**batch)\n```\n\nThe predictions can be saved as `.npz` files and then fed into the [structure modeling scripts](https://yanglab.nankai.edu.cn/trRosetta/download/) provided by the Yang Lab.\n\n\n### List of Models and Tasks\n\nThe available models are:\n\n- `transformer` (pretrained available)\n- `resnet`\n- `lstm`\n- `unirep` (pretrained available)\n- `onehot` (no pretraining required)\n- `trrosetta` (pretrained available)\n\nThe available standard tasks are:\n\n- `language_modeling`\n- `masked_language_modeling`\n- `secondary_structure`\n- `contact_prediction`\n- `remote_homology`\n- `fluorescence`\n- `stability`\n- `trrosetta` (can only be used with `trrosetta` model)\n\nThe available models and tasks can be found in `tape/datasets.py` and `tape/models/modeling*.py`.\n\n### Adding New Models and Tasks\n\nWe have made some efforts to make the new repository easier to understand and extend. See the `examples` folder for an example on how to add a new model and a new task to TAPE. If there are other examples you would like or if there is something missing in the current examples, please open an issue.\n\n## Data\nData should be placed in the `./data` folder, although you may also specify a different data directory if you wish.\n\nThe supervised data is around 120MB compressed and 2GB uncompressed.\nThe unsupervised Pfam dataset is around 7GB compressed and 19GB uncompressed. The data for training is hosted on AWS. By default we provide data as LMDB - see `tape/datasets.py` for examples on loading the data. If you wish to download all of TAPE, run `download_data.sh` to do so. 
We also provide links to each individual dataset below in both LMDB format and JSON format.\n\n### LMDB Data\n\n[Pretraining Corpus (Pfam)](http://s3.amazonaws.com/songlabdata/proteindata/data_pytorch/pfam.tar.gz) __|__ [Secondary Structure](http://s3.amazonaws.com/songlabdata/proteindata/data_pytorch/secondary_structure.tar.gz) __|__ [Contact (ProteinNet)](http://s3.amazonaws.com/songlabdata/proteindata/data_pytorch/proteinnet.tar.gz) __|__ [Remote Homology](http://s3.amazonaws.com/songlabdata/proteindata/data_pytorch/remote_homology.tar.gz) __|__ [Fluorescence](http://s3.amazonaws.com/songlabdata/proteindata/data_pytorch/fluorescence.tar.gz) __|__ [Stability](http://s3.amazonaws.com/songlabdata/proteindata/data_pytorch/stability.tar.gz)\n\n### Raw Data\n\nRaw data files are stored in JSON format for maximum portability. This data is JSON-ified, which removes certain constructs (in particular numpy arrays). As a result they cannot be directly loaded into the provided pytorch datasets (although the conversion should be quite easy by simply adding calls to `np.array`).\n\n[Pretraining Corpus (Pfam)](http://s3.amazonaws.com/songlabdata/proteindata/data_raw_pytorch/pfam.tar.gz) __|__ [Secondary Structure](http://s3.amazonaws.com/songlabdata/proteindata/data_raw_pytorch/secondary_structure.tar.gz) __|__ [Contact (ProteinNet)](http://s3.amazonaws.com/songlabdata/proteindata/data_raw_pytorch/proteinnet.tar.gz) __|__ [Remote Homology](http://s3.amazonaws.com/songlabdata/proteindata/data_raw_pytorch/remote_homology.tar.gz) __|__ [Fluorescence](http://s3.amazonaws.com/songlabdata/proteindata/data_raw_pytorch/fluorescence.tar.gz) __|__ [Stability](http://s3.amazonaws.com/songlabdata/proteindata/data_raw_pytorch/stability.tar.gz)\n\n\n## Leaderboard\n\nWe will soon have a leaderboard available for tracking progress on the core five TAPE tasks, so check back for a link here. See the main tables in our paper for a sense of where performance stands at this point. 
Publication on the leaderboard will be contingent on meeting the following citation guidelines.\n\nIn the meantime, here's a temporary leaderboard for each task. All reported models on this leaderboard use unsupervised pretraining.\n\n### Secondary Structure\n\n| Ranking | Model | Accuracy (3-class) |\n|:-:|:-:|:-:|\n| 1. | One Hot + Alignment | 0.80 |\n| 2. | LSTM | 0.75 |\n| 2. | ResNet | 0.75 |\n| 4. | Transformer | 0.73 |\n| 4. | Bepler | 0.73 |\n| 4. | Unirep | 0.73 |\n| 7. | One Hot | 0.69 |\n\n### Contact Prediction\n\n| Ranking | Model | L/5 Medium + Long Range |\n|:-:|:-:|:-:|\n| 1. | One Hot + Alignment | 0.64 |\n| 2. | Bepler | 0.40 |\n| 3. | LSTM | 0.39 |\n| 4. | Transformer | 0.36 |\n| 5. | Unirep | 0.34 |\n| 6. | ResNet | 0.29 |\n| 6. | One Hot | 0.29 |\n\n### Remote Homology Detection\n\n| Ranking | Model | Top 1 Accuracy |\n|:-:|:-:|:-:|\n| 1. | LSTM | 0.26 |\n| 2. | Unirep | 0.23 |\n| 3. | Transformer | 0.21 |\n| 4. | Bepler | 0.17 |\n| 4. | ResNet | 0.17 |\n| 6. | One Hot + Alignment | 0.09 |\n| 6. | One Hot | 0.09 |\n\n### Fluorescence\n\n| Ranking | Model | Spearman's rho |\n|:-:|:-:|:-:|\n| 1. | Transformer | 0.68 |\n| 2. | LSTM | 0.67 |\n| 2. | Unirep | 0.67 |\n| 4. | Bepler | 0.33 |\n| 5. | ResNet | 0.21 |\n| 6. | One Hot | 0.14 |\n\n### Stability\n\n| Ranking | Model | Spearman's rho |\n|:-:|:-:|:-:|\n| 1. | Transformer | 0.73 |\n| 1. | Unirep | 0.73 |\n| 1. | ResNet | 0.73 |\n| 4. | LSTM | 0.69 |\n| 5. | Bepler | 0.64 |\n| 6. | One Hot | 0.19 |\n\n## Citation Guidelines\n\nIf you find TAPE useful, please cite our corresponding paper. Additionally, __anyone using the datasets provided in TAPE must describe and cite all dataset components they use__. Producing these data is time and resource intensive, and we insist this be recognized by all TAPE users. For convenience, `data_refs.bib` contains all necessary citations. 
We also provide each individual citation below.\n\n__TAPE (Our paper):__\n```\n@inproceedings{tape2019,\nauthor = {Rao, Roshan and Bhattacharya, Nicholas and Thomas, Neil and Duan, Yan and Chen, Xi and Canny, John and Abbeel, Pieter and Song, Yun S},\ntitle = {Evaluating Protein Transfer Learning with TAPE},\nbooktitle = {Advances in Neural Information Processing Systems},\nyear = {2019}\n}\n```\n\n__Pfam (Pretraining):__\n```\n@article{pfam,\nauthor = {El-Gebali, Sara and Mistry, Jaina and Bateman, Alex and Eddy, Sean R and Luciani, Aur{\\'{e}}lien and Potter, Simon C and Qureshi, Matloob and Richardson, Lorna J and Salazar, Gustavo A and Smart, Alfredo and Sonnhammer, Erik L L and Hirsh, Layla and Paladin, Lisanna and Piovesan, Damiano and Tosatto, Silvio C E and Finn, Robert D},\ndoi = {10.1093/nar/gky995},\nfile = {::},\nissn = {0305-1048},\njournal = {Nucleic Acids Research},\nkeywords = {community,protein domains,tandem repeat sequences},\nnumber = {D1},\npages = {D427--D432},\npublisher = {Narnia},\ntitle = {{The Pfam protein families database in 2019}},\nurl = {https://academic.oup.com/nar/article/47/D1/D427/5144153},\nvolume = {47},\nyear = {2019}\n}\n```\n__SCOPe: (Remote Homology and Contact)__\n```\n@article{scop,\n  title={SCOPe: Structural Classification of Proteins—extended, integrating SCOP and ASTRAL data and classification of new structures},\n  author={Fox, Naomi K and Brenner, Steven E and Chandonia, John-Marc},\n  journal={Nucleic acids research},\n  volume={42},\n  number={D1},\n  pages={D304--D309},\n  year={2013},\n  publisher={Oxford University Press}\n}\n```\n__PDB: (Secondary Structure and Contact)__\n```\n@article{pdb,\n  title={The protein data bank},\n  author={Berman, Helen M and Westbrook, John and Feng, Zukang and Gilliland, Gary and Bhat, Talapady N and Weissig, Helge and Shindyalov, Ilya N and Bourne, Philip E},\n  journal={Nucleic acids research},\n  volume={28},\n  number={1},\n  pages={235--242},\n  year={2000},\n  
publisher={Oxford University Press}\n}\n```\n\n__CASP12: (Secondary Structure and Contact)__\n```\n@article{casp,\nauthor = {Moult, John and Fidelis, Krzysztof and Kryshtafovych, Andriy and Schwede, Torsten and Tramontano, Anna},\ndoi = {10.1002/prot.25415},\nissn = {08873585},\njournal = {Proteins: Structure, Function, and Bioinformatics},\nkeywords = {CASP,community wide experiment,protein structure prediction},\npages = {7--15},\npublisher = {John Wiley {\\\u0026} Sons, Ltd},\ntitle = {{Critical assessment of methods of protein structure prediction (CASP)-Round XII}},\nurl = {http://doi.wiley.com/10.1002/prot.25415},\nvolume = {86},\nyear = {2018}\n}\n```\n\n__NetSurfP2.0: (Secondary Structure)__\n```\n@article{netsurfp,\n  title={NetSurfP-2.0: Improved prediction of protein structural features by integrated deep learning},\n  author={Klausen, Michael Schantz and Jespersen, Martin Closter and Nielsen, Henrik and Jensen, Kamilla Kjaergaard and Jurtz, Vanessa Isabell and Soenderby, Casper Kaae and Sommer, Morten Otto Alexander and Winther, Ole and Nielsen, Morten and Petersen, Bent and others},\n  journal={Proteins: Structure, Function, and Bioinformatics},\n  year={2019},\n  publisher={Wiley Online Library}\n}\n```\n\n__ProteinNet: (Contact)__\n```\n@article{proteinnet,\n  title={ProteinNet: a standardized data set for machine learning of protein structure},\n  author={AlQuraishi, Mohammed},\n  journal={arXiv preprint arXiv:1902.00249},\n  year={2019}\n}\n```\n\n__Fluorescence:__\n```\n@article{sarkisyan2016,\n  title={Local fitness landscape of the green fluorescent protein},\n  author={Sarkisyan, Karen S and Bolotin, Dmitry A and Meer, Margarita V and Usmanova, Dinara R and Mishin, Alexander S and Sharonov, George V and Ivankov, Dmitry N and Bozhanova, Nina G and Baranov, Mikhail S and Soylemez, Onuralp and others},\n  journal={Nature},\n  volume={533},\n  number={7603},\n  pages={397},\n  year={2016},\n  publisher={Nature Publishing 
Group}\n}\n```\n\n__Stability:__\n```\n@article{rocklin2017,\n  title={Global analysis of protein folding using massively parallel design, synthesis, and testing},\n  author={Rocklin, Gabriel J and Chidyausiku, Tamuka M and Goreshnik, Inna and Ford, Alex and Houliston, Scott and Lemak, Alexander and Carter, Lauren and Ravichandran, Rashmi and Mulligan, Vikram K and Chevalier, Aaron and others},\n  journal={Science},\n  volume={357},\n  number={6347},\n  pages={168--175},\n  year={2017},\n  publisher={American Association for the Advancement of Science}\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsonglab-cal%2Ftape","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsonglab-cal%2Ftape","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsonglab-cal%2Ftape/lists"}