https://github.com/multiomics-analytics-group/course_protein_language_modeling
https://github.com/multiomics-analytics-group/course_protein_language_modeling
course education machine-learning protein-sequences slides
Last synced: about 2 months ago
JSON representation
- Host: GitHub
- URL: https://github.com/multiomics-analytics-group/course_protein_language_modeling
- Owner: Multiomics-Analytics-Group
- License: apache-2.0
- Created: 2023-07-04T13:53:38.000Z (almost 2 years ago)
- Default Branch: main
- Last Pushed: 2025-01-10T14:32:06.000Z (5 months ago)
- Last Synced: 2025-04-12T06:53:54.351Z (about 2 months ago)
- Topics: course, education, machine-learning, protein-sequences, slides
- Language: Jupyter Notebook
- Homepage:
- Size: 61.6 MB
- Stars: 33
- Watchers: 5
- Forks: 9
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
Protein Language Modeling Course

----
1. [Topics](topics.md)
2. [Course Structure](#course-structure)
3. [References](#references)
4. [Resources](#other-resources)## Course Structure
| **Theory** | | Link | Data |Models |
|--------------|-------------------------|------------------------------|-----------------|-----------------|
| | [Topics](topics.md) | [slides](slides/intro_pLM.pdf) | | |
| **Hands-on** | | | | |
| | **Sequence Analysis** | [Seq. Analysis Notebook](notebooks/seq_analysis.ipynb) [](https://colab.research.google.com/github/Multiomics-Analytics-Group/course_protein_language_modeling/blob/main/notebooks/seq_analysis.ipynb) | Exploring protein sequence and structure data | |
| | **Fine-tuning a model** | [Model Training Notebook](notebooks/model_training.ipynb) [](https://colab.research.google.com/github/Multiomics-Analytics-Group/course_protein_language_modeling/blob/main/notebooks/model_training.ipynb) | Taking an existing model and tuning it for other prediction/classification tasks | [Evolutionary Scale Modeling (ESM)](https://github.com/facebookresearch/esm) |
| | **Working with Embeddings** | [Embeddings Notebook](notebooks/embeddings.ipynb) [](https://colab.research.google.com/github/Multiomics-Analytics-Group/course_protein_language_modeling/blob/main/notebooks/embeddings.ipynb) | Accessing Protein representations (embeddings) generated by existing LMs | [ProTrans](https://github.com/agemagician/ProtTrans) |
| | **Predictions** | [pML Predictions Notebook](notebooks/prediction.ipynb) [](https://colab.research.google.com/github/Multiomics-Analytics-Group/course_protein_language_modeling/blob/main/notebooks/prediction.ipynb) | Using embeddings for predicting features or classifying sequences | [ProTrans](https://github.com/agemagician/ProtTrans) |
| | **Protein Design** | [Protein Design Notebook](notebooks/prot_design.ipynb)[](https://colab.research.google.com/github/Multiomics-Analytics-Group/course_protein_language_modeling/blob/main/notebooks/prot_design.ipynb) | *De novo* protein design and engineering using a LLM | [ProtGPT2](https://huggingface.co/nferruz/ProtGPT2), [ESM](https://github.com/facebookresearch/esm) |#### Working Locally
1. Install [Docker](https://docs.docker.com/engine/install/)
2. From the root of the repository in the terminal run:
```
docker compose up --build -d
```
3. Open JupyterLab in any browser via `0.0.0.0:8888`To stop run: `docker compose stop`
To stop and remove containers: `docker compose down`#### Working VM
1. Once you have a VM, download docker in it
- (e.g., [apt install](https://docs.docker.com/engine/install/ubuntu/#install-using-the-repository) via ssh)
- e.g. in VM shell run below
```
# Add Docker's official GPG key:
sudo apt-get update
sudo apt-get install ca-certificates curl
sudo install -m 0755 -d /etc/apt/keyrings
sudo curl -fsSL https://download.docker.com/linux/ubuntu/gpg -o /etc/apt/keyrings/docker.asc
sudo chmod a+r /etc/apt/keyrings/docker.asc# Add the repository to Apt sources:
echo \
"deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] https://download.docker.com/linux/ubuntu \
$(. /etc/os-release && echo "$VERSION_CODENAME") stable" | \
sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt-get update
sudo apt-get install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
```
2. Clone this repo into the VM or to keep things slim you can move only the below needed files onto VM (e.g., via `scp`)
- notebooks and data folders
- Dockerfile, docker-compose.yml
- requirements.txt
3. Then run `docker compose up --build -d`
- if you encounter permission denied you may need to run the above command with elevated permissions e.g. `sudo docker compose up --build -d`
- Note: for troubleshooting you can use `docker logs ` e.g., `docker logs jupyter`
4. The URL to the JupyterLab will be: `:8888` e.g. `12.26.38.408:8888`
- Note: may have to expose port 8888 via the VM's network security group
5. On your local device, open a browser of your choosing. In the address bar input the URL to JupyterLab. And you should be good to go :)
6. **Remember to stop the VM container once you are done**## References
----
1. [Neural Machine Translation by Jointly Learning to Align and Translate](https://arxiv.org/abs/1409.0473) Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio2. [Attention Is All You Need](https://arxiv.org/abs/1706.03762) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin
3. [MSA Transformer](https://www.biorxiv.org/content/10.1101/2021.02.12.430858v3) Roshan Rao, Jason Liu, Robert Verkuil, Joshua Meier, John F. Canny, Pieter Abbeel, Tom Sercu, Alexander Rives
4. [Transformer-based deep learning for predicting protein properties in the life sciences](https://elifesciences.org/articles/82819) Abel Chandra, Laura Tünnermann, Tommy Löfstedt, Regina Gratz
5. [BERTology Meets Biology: Interpreting Attention in Protein Language Models](https://arxiv.org/abs/2006.15222) Jesse Vig, Ali Madani, Lav R. Varshney, Caiming Xiong, Richard Socher, Nazneen Fatema Rajani
6. [Broadly applicable and accurate protein design by integrating structure prediction networks and diffusion generative models
](https://www.biorxiv.org/content/10.1101/2022.12.09.519842v2) Joseph L. Watson, David Juergens, Nathaniel R. Bennett, Brian L. Trippe, Jason Yim, Helen E. Eisenach, Woody Ahern, Andrew J. Borst, Robert J. Ragotte, Lukas F. Milles, Basile I. M. Wicky, Nikita Hanikel, Samuel J. Pellock, Alexis Courbet, William Sheffler, Jue Wang, Preetham Venkatesh, Isaac Sappington, Susana Vázquez Torres, Anna Lauko, Valentin De Bortoli, Emile Mathieu, Regina Barzilay, Tommi S. Jaakkola, Frank DiMaio, Minkyung Baek, David Baker7. [Large language models generate functional protein sequences across diverse families](https://www.nature.com/articles/s41587-022-01618-2) Ali Madani, Ben Krause, Eric R. Greene, Subu Subramanian, Benjamin P. Mohr, James M. Holton, Jose Luis Olmos Jr., Caiming Xiong, Zachary Z. Sun, Richard Socher, James S. Fraser & Nikhil Naik
8. [Learning functional properties of proteins with language models](https://www.nature.com/articles/s42256-022-00457-9)Serbulent Unsal, Heval Atas, Muammer Albayrak, Kemal Turhan, Aybar C. Acar & Tunca Doğan
9. [Learning the protein language: Evolution, structure, and function](https://www.sciencedirect.com/science/article/pii/S2405471221002039) Tristan Bepler, Bonnie Berger
10. [The language of proteins: NLP, machine learning & protein sequences](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8050421/) Dan Ofer, Nadav Brandes, Michal Linial
11. [Evolutionary-scale prediction of atomic-level protein structure with a language model](https://www.science.org/doi/10.1126/science.ade2574) ZEMING LIN, HALIL AKIN, ROSHAN RAO, BRIAN HIE, ZHONGKAI ZHU, WENTING LU, NIKITA SMETANIN, ROBERT VERKUIL, ORI KABELI, ..., ALEXANDER RIVES
12. [Generative power of a protein language model trained on multiple sequence alignments](https://elifesciences.org/articles/79854) Damiano Sgarbossa, Umberto Lupo, Anne-Florence Bitbol
13. [How Huge Protein Language Models Could Disrupt Structural Biology](https://towardsdatascience.com/how-huge-protein-language-models-could-disrupt-structural-biology-6b98193f880b)
14. [Embeddings from protein language models predict conservation and variant effects](https://link.springer.com/article/10.1007/s00439-021-02411-y) Céline Marquet, Michael Heinzinger, Tobias Olenyi, Christian Dallago, Kyra Erckert, Michael Bernhofer, Dmitrii Nechaev & Burkhard Rost
15. [Collectively encoding protein properties enriches protein language models](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-05031-z) Jingmin An & Xiaogang Weng
16. [ProGen: Language Modeling for Protein Generation](https://www.biorxiv.org/content/10.1101/2020.03.07.982272v2) Ali Madani, Bryan McCann, Nikhil Naik, Nitish Shirish Keskar, Namrata Anand, Raphael R. Eguchi, Po-Ssu Huang, Richard Socher
17. [Transformer protein language models are unsupervised structure learners](https://www.biorxiv.org/content/10.1101/2020.12.15.422761v1) Roshan Rao, Joshua Meier, Tom Sercu, Sergey Ovchinnikov, Alexander Rives
18. [Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences](https://www.pnas.org/doi/full/10.1073/pnas.2016239118) Alexander Rives, Joshua Meier, Tom Sercu and Rob Fergus
19. [NetSurfP-3.0: accurate and fast prediction of protein structural features by protein language models and deep learning](https://academic.oup.com/nar/article/50/W1/W510/6596854?) Magnus Haraldson Høie, Erik Nicolas Kiehl, Bent Petersen, Morten Nielsen, Ole Winther, Henrik Nielsen, Jeppe Hallgren, Paolo Marcatili
20. [Modeling Protein Using Large-scale Pretrain Language Model](https://arxiv.org/abs/2108.07435) Yijia Xiao, Jiezhong Qiu, Ziang Li, Chang-Yu Hsieh, Jie Tang [github](https://github.com/THUDM/ProteinLM)
21. [Deciphering antibody affinity maturation with language models and weakly supervised learning](https://arxiv.org/abs/2112.07782) Jeffrey A. Ruffolo, Jeffrey J. Gray, Jeremias Sulam [github](https://github.com/dohlee/antiberty-pytorch)
22. [Protein embeddings improve phage-host interaction prediction](https://www.biorxiv.org/content/10.1101/2023.02.26.530154v1) Mark Edward M. Gonzales, Jennifer C. Ureta, View ORCID ProfileAnish M.S. Shrestha [github](https://github.com/bioinfodlsu/phage-host-prediction)
23. [ProteinBERT: a universal deep-learning model of protein sequence and function](https://academic.oup.com/bioinformatics/article/38/8/2102/6502274) Nadav Brandes, Dan Ofer, Yam Peleg, Nadav Rappoport, Michal Linial [github](https://github.com/nadavbra/protein_bert)
24. [ProtGPT2 is a deep unsupervised language model for protein design](https://www.nature.com/articles/s41467-022-32007-7) Noelia Ferruz, Steffen Schmidt & Birte Höcker [Hugging Face](https://huggingface.co/nferruz/ProtGPT2?)
25. [Protein-Protein Interaction Prediction is Achievable with Large Language Models](https://www.biorxiv.org/content/10.1101/2023.06.07.544109v1.full) Logan Hallee, Jason P. Gleghorn
26. [Accurate prediction of virus-host protein-protein interactions via a Siamese neural network using deep protein sequence embeddings](https://www.sciencedirect.com/science/article/pii/S2666389922001568?via%3Dihub) Sumit Madan, Victoria Demina, Marcus Stapf, Oliver Ernst, Holger Fröhlich
27. [Structure-informed Language Models Are Protein Designers](https://arxiv.org/abs/2302.01649) Zaixiang Zheng, Yifan Deng, Dongyu Xue, Yi Zhou, Fei YE, Quanquan Gu
28. [Graph-BERT and language model-based framework for protein–protein interaction identification](https://www.nature.com/articles/s41598-023-31612-w) Kanchan Jha, Sourav Karmakar & Sriparna Saha
29. [Ankh: Optimized Protein Language Model Unlocks General-Purpose Modelling](https://arxiv.org/abs/2301.06568) Ahmed Elnaggar, Hazem Essam, Wafaa Salah-Eldin, Walid Moustafa, Mohamed Elkerdawy, Charlotte Rochereau, Burkhard Rost
30. [Contrastive learning in protein language space predicts interactions between drugs and protein targets](https://www.pnas.org/doi/10.1073/pnas.2220778120) Rohit Singh, Samuel Sledzieski, Bryan Bryson, Bonnie Berger
31. [De novo design of protein structure and function with RFdiffusion](https://www.nature.com/articles/s41586-023-06415-8) Joseph L. Watson, David Juergens, Nathaniel R. Bennett, Brian L. Trippe, Jason Yim, Helen E. Eisenach, Woody Ahern, Andrew J. Borst, Robert J. Ragotte, Lukas F. Milles, Basile I. M. Wicky, Nikita Hanikel, Samuel J. Pellock, Alexis Courbet, William Sheffler, Jue Wang, Preetham Venkatesh, Isaac Sappington, Susana Vázquez Torres, Anna Lauko, Valentin De Bortoli, Emile Mathieu, Sergey Ovchinnikov, Regina Barzilay, Tommi S. Jaakkola, Frank DiMaio, Minkyung Baek & David Baker
32. [Single-sequence protein structure prediction using a language model and deep learning](https://www.nature.com/articles/s41587-022-01432-w) Ratul Chowdhury, Nazim Bouatta, Surojit Biswas, Christina Floristean, Anant Kharkar, Koushik Roy, Charlotte Rochereau, Gustaf Ahdritz, Joanna Zhang, George M. Church, Peter K. Sorger & Mohammed AlQuraishi
33. [Before and after AlphaFold2: An overview of protein structure prediction](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10011655/) Letícia M. F. Bertoline, Angélica N. Lima, Jose E. Krieger, and Samantha K. Teixeira
34. [Evaluating Protein Transfer Learning with TAPE](https://arxiv.org/abs/1906.08230) Roshan Rao, Nicholas Bhattacharya, Neil Thomas, Yan Duan, Xi Chen, John Canny, Pieter Abbeel, Yun S. Song
35. [FLIP: Benchmark tasks in fitness landscape inference for proteins](https://www.biorxiv.org/content/10.1101/2021.11.09.467890v2) Christian Dallago, Jody Mou, Kadina E. Johnston, Bruce J. Wittmann, Nicholas Bhattacharya, Samuel Goldman, Ali Madani, Kevin K. Yang
36. [Standards, tooling and benchmarks to probe representation learning on proteins](https://openreview.net/forum?id=adODyN-eeJ8) Joaquin Gomez Sanchez, Sebastian Franz, Michael Heinzinger, Burkhard Rost, Christian Dallago
37. [Learning meaningful representations of protein sequences](https://www.nature.com/articles/s41467-022-29443-w) Nicki Skafte Detlefsen, Søren Hauberg & Wouter Boomsma
38. [Language modelling for biological sequences – curated datasets and baselines](https://www.biorxiv.org/content/10.1101/2020.03.09.983585v1) Jose Juan Almagro Armenteros, Alexander Rosenberg Johansen, Ole Winther, Henrik Nielsen
39. [Language models generalize beyond natural proteins](https://www.biorxiv.org/content/10.1101/2022.12.21.521521v1) Robert Verkuil, Ori Kabeli, Yilun Du, Basile I. M. Wicky, Lukas F. Milles, Justas Dauparas, David Baker, Sergey Ovchinnikov, Tom Sercu, Alexander Rives
40. [ChemBERTa: Large-Scale Self-Supervised Pretraining for Molecular Property Prediction](https://arxiv.org/abs/2010.09885) Seyone Chithrananda, Gabriel Grand, Bharath Ramsundar
41. [SETH predicts nuances of residue disorder from protein embeddings](https://www.frontiersin.org/articles/10.3389/fbinf.2022.1019597/full) Dagmar Ilzhöfer, Michael Heinzinger, Burkhard Rost
42. [Exploiting pretrained biochemical language models for targeted drug design](https://academic.oup.com/bioinformatics/article/38/Supplement_2/ii155/6702010) Gökçe Uludoğan, Elif Ozkirimli, Kutlu O Ulgen, Nilgün Karalı, Arzucan Özgür
## Other Resources
----
- [Protein Language Workshop](https://github.com/dimiboeckaerts/ProteinLanguageWorkshop)- [Awesome protein representation learning](https://github.com/LirongWu/awesome-protein-representation-learning)
- [Evolutionary Scale Modeling - ESM](https://github.com/facebookresearch/esm)
- [Huggingface ESM](https://huggingface.co/docs/transformers/model_doc/esm)
- [Ankh](https://github.com/agemagician/Ankh)
- [Protein Sequence Embeddings (ProSE)](https://github.com/tbepler/prose)
- [ProTrans](https://github.com/agemagician/ProtTrans)
- [ConPlex](https://github.com/samsledje/ConPLex)
- [Deep Learning with Proteins](https://github.com/huggingface/blog/blob/main/deep-learning-with-proteins.md)
- [Awesome AI-based for Protein Design](https://github.com/opendilab/awesome-AI-based-protein-design)
- [Amazon SageMaker Protein Classification](https://github.com/aws-samples/amazon-sagemaker-protein-classification/tree/main)
- [Huggingface Protein Language Modeling - pytorch](https://github.com/huggingface/notebooks/blob/main/examples/protein_language_modeling.ipynb)
- [Huggingface Protein Language Modeling - tensorflow](https://github.com/huggingface/notebooks/blob/main/examples/protein_language_modeling-tf.ipynb)
- [Building a Text Generation Model from Scratch](https://wingedsheep.com/building-a-language-model/)
- [Training the Transformer Model](https://machinelearningmastery.com/training-the-transformer-model/)
- [GearNet: Geometry-Aware Relational Graph Neural Network](https://github.com/DeepGraphLearning/GearNet)
- [UniLanguage](https://github.com/alrojo/UniLanguage)
- [SETH](https://github.com/DagmarIlz/SETH)