{"id":22969731,"url":"https://github.com/tbepler/prose","last_synced_at":"2025-09-11T20:33:26.408Z","repository":{"id":44403037,"uuid":"369158236","full_name":"tbepler/prose","owner":"tbepler","description":"Multi-task and masked language model-based protein sequence embedding models.","archived":false,"fork":false,"pushed_at":"2021-06-16T17:17:43.000Z","size":232,"stargazers_count":99,"open_issues_count":2,"forks_count":20,"subscribers_count":3,"default_branch":"main","last_synced_at":"2024-12-28T08:24:52.112Z","etag":null,"topics":["deep-learning","language-model","protein-embedding","protein-sequences","representation-learning","sequence-embedding"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/tbepler.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2021-05-20T09:47:59.000Z","updated_at":"2024-12-17T02:51:29.000Z","dependencies_parsed_at":"2022-07-15T05:46:16.389Z","dependency_job_id":null,"html_url":"https://github.com/tbepler/prose","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tbepler%2Fprose","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tbepler%2Fprose/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tbepler%2Fprose/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tbepler%2Fprose/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/tbepler","download_url":"https://codeload.github.com/tbepler/prose/tar.gz/refs/h
eads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":232666175,"owners_count":18557991,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["deep-learning","language-model","protein-embedding","protein-sequences","representation-learning","sequence-embedding"],"created_at":"2024-12-14T21:38:27.079Z","updated_at":"2025-01-06T02:57:42.543Z","avatar_url":"https://github.com/tbepler.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Protein Sequence Embeddings (ProSE)\nMulti-task and masked language model-based protein sequence embedding models.\n\nThis repository contains code and links to download pre-trained models and data accompanying our paper, [Learning the protein language: Evolution, structure, and function](https://doi.org/10.1016/j.cels.2021.05.017). This extends from previous work, [Learning protein sequence embeddings using information from structure](https://openreview.net/pdf?id=SygLehCqtm).\n\n## At a glance\n\nTrain bidirectional language model using the masked LM objective:\n```\npython train_prose_masked.py\n```\n\nTrain bidirectional language model using the masked LM objective _and_ structure tasks:\n```\npython train_prose_multitask.py\n```\n\nEmbed sequences using the pre-trained models:\n```\npython embed_sequences.py\n```\n\nThe embedding script accepts sequences in fasta format and writes embeddings out as an HDF5 file using the sequence names as keys. Each sequence will have one dataset in the HDF5. 
Optionally, embeddings can be aggregated over the sequence positions to generate a fixed-size embedding for each sequence using the --pool argument.\n\nFor example, to embed the demo sequences in data/demo.fa to a file named data/demo.h5 using average pooling over each sequence (first, follow the instructions below to download the pre-trained models and install the Python dependencies):\n```\npython embed_sequences.py --pool avg -o data/demo.h5 data/demo.fa\n```\n\nNote: your resulting demo.h5 may not match the provided demo.h5 exactly due to differences in rounding and non-determinism on different hardware, but your results should be close.\n\nThis uses the pre-trained multi-task model by default. To use a different model, set the --model flag.\n\nUse the --help flag to get complete usage information.\n\n\n## Setup instructions\n\n### Download the pre-trained embedding models\n\nThe pre-trained embedding models can be downloaded [here](http://bergerlab-downloads.csail.mit.edu/prose/saved_models.zip).\n\nThey should be unzipped in the project base directory. By default, prose looks for the pre-trained models in the saved_models/ directory.\n\n### Setup python environment\n\nThis code requires Python 3. I prefer Anaconda for ease of use. If you don't have conda installed already, get it [here](https://docs.conda.io/en/latest/miniconda.html).\n\n1. (Optional but recommended) Make an Anaconda environment for this project and activate it:\n```\nconda create -n prose python=3\nsource activate prose\n```\n\n2. 
Install the dependencies\n```\nconda env update --file environment.yml\n```\nor with pip\n```\npip install -r requirements.txt\n```\n\nSee the pytorch install [documentation](https://pytorch.org/get-started/locally/) for information on installing pytorch for different CUDA versions.\n\n## Datasets\n\nThe training datasets are available at the links below.\n- [SCOP data](http://bergerlab-downloads.csail.mit.edu/bepler-protein-sequence-embeddings-from-structure-iclr2019/scope.tar.gz)\n- UniProt data: UniRef90 is available on the UniProt [downloads website](https://www.uniprot.org/downloads)\n\n## Author\nTristan Bepler (\u003ctbepler@gmail.com\u003e)\n\n## References\n\nPlease cite the following references if you use this code or pre-trained models in your work.\n\nBepler, T., Berger, B. Learning the protein language: evolution, structure, and function. Cell Systems 12, 6 (2021). https://doi.org/10.1016/j.cels.2021.05.017\n\n\u003cdetails\u003e\u003csummary\u003eBibtex\u003c/summary\u003e\u003cp\u003e\n\n```\n@article{BEPLER2021654,\ntitle = {Learning the protein language: Evolution, structure, and function},\njournal = {Cell Systems},\nvolume = {12},\nnumber = {6},\npages = {654-669.e3},\nyear = {2021},\nissn = {2405-4712},\ndoi = {https://doi.org/10.1016/j.cels.2021.05.017},\nurl = {https://www.sciencedirect.com/science/article/pii/S2405471221002039},\nauthor = {Tristan Bepler and Bonnie Berger}\n}\n```\n\n\u003c/p\u003e\u003c/details\u003e\n\n\nBepler, T., Berger, B. Learning protein sequence embeddings using information from structure. International Conference on Learning Representations (2019). 
https://openreview.net/pdf?id=SygLehCqtm\n\n\n\u003cdetails\u003e\u003csummary\u003eBibtex\u003c/summary\u003e\u003cp\u003e\n\n```\n@inproceedings{\nbepler2018learning,\ntitle={Learning protein sequence embeddings using information from structure},\nauthor={Tristan Bepler and Bonnie Berger},\nbooktitle={International Conference on Learning Representations},\nyear={2019},\n}\n```\n\n\u003c/p\u003e\u003c/details\u003e\n\n\n## License\n\nThe source code and trained models are provided free for non-commercial use under the terms of the CC BY-NC 4.0 license. See the [LICENSE](LICENSE) file and/or https://creativecommons.org/licenses/by-nc/4.0/legalcode for more information.\n\n\n## Contact\n\nIf you have any questions or comments, or would like to report a bug, please file a GitHub issue or contact me at tbepler@gmail.com.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftbepler%2Fprose","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftbepler%2Fprose","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftbepler%2Fprose/lists"}