Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/salesforce/provis
Official code repository of "BERTology Meets Biology: Interpreting Attention in Protein Language Models."
https://github.com/salesforce/provis
Last synced: 1 day ago
JSON representation
Official code repository of "BERTology Meets Biology: Interpreting Attention in Protein Language Models."
- Host: GitHub
- URL: https://github.com/salesforce/provis
- Owner: salesforce
- License: other
- Created: 2020-06-15T14:22:26.000Z (over 4 years ago)
- Default Branch: master
- Last Pushed: 2023-06-12T21:27:47.000Z (over 1 year ago)
- Last Synced: 2025-01-07T07:12:07.286Z (8 days ago)
- Language: Python
- Homepage: https://arxiv.org/abs/2006.15222
- Size: 7.19 MB
- Stars: 304
- Watchers: 19
- Forks: 48
- Open Issues: 3
-
Metadata Files:
- Readme: README.md
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md
- Codeowners: CODEOWNERS
- Security: SECURITY.md
Awesome Lists containing this project
- StarryDivineSky - salesforce/provis
README
# BERTology Meets Biology: Interpreting Attention in Protein Language Models
This repository is the official implementation of [BERTology Meets Biology: Interpreting Attention in Protein Language Models](https://arxiv.org/abs/2006.15222).
## Table of Contents
- [ProVis Attention Visualizer](#provis-attention-visualizer)
* [Installation](#installation)
* [Execution](#execution)
- [Experiments](#experiments)
* [Installation](#installation-2)
* [Datasets](#datasets)
* [Attention Analysis](#attention-analysis)
+ [Tape BERT Model](#tape-bert-model)
+ [ProtTrans Models](#prottrans-models)
* [Probing Analysis](#probing-analysis)
+ [Training](#training)
+ [Reports](#reports)
- [License](#license)
- [Acknowledgments](#acknowledgments)
- [Citation](#citation)## ProVis Attention Visualizer
This section provides instructions for generating visualizations of attention projected onto 3D protein structure.
![Image](images/vis3d_binding_sites.png?raw=true) ![Image](images/vis3d_contact_map.png?raw=true)
### Installation
**General requirements**:
* Python >= 3.7```
pip install biopython==1.77
pip install tape-proteins==0.5
pip install jupyterlab==3.0.14
pip install nglview
jupyter-nbextension enable nglview --py --sys-prefix
```If you run into problems installing nglview, please refer to their
[installation instructions](https://github.com/arose/nglview#released-version) for additional installation details
and options.### Execution
```
cd /notebooks
jupyter notebook provis.ipynb
```If you get an error running the notebook, you may need to execute the notebook as follows:
```
jupyter notebook --NotebookApp.iopub_data_rate_limit=10000000
```
See nglview [installation instructions](https://github.com/arose/nglview#released-version) for more details.You may edit the notebook to choose other proteins, attention heads, etc. The visualization tool is based on the
excellent [nglview](https://github.com/arose/nglview) library.---
## Experiments
This section describes how to reproduce the experiments in the paper.
### Installation
```setup
cd
python setup.py develop
```To download additional required datasets from [TAPE](https://github.com/songlab-cal/tape):
```setup
cd /data
wget http://s3.amazonaws.com/songlabdata/proteindata/data_pytorch/secondary_structure.tar.gz
tar -xvf secondary_structure.tar.gz && rm secondary_structure.tar.gz
wget http://s3.amazonaws.com/songlabdata/proteindata/data_pytorch/proteinnet.tar.gz
tar -xvf proteinnet.tar.gz && rm proteinnet.tar.gz
```### Attention Analysis
The following steps will reproduce the attention analysis experiments and generate the reports currently found in
/reports/attention_analysis. This includes all experiments besides the probing experiments
(see [Probing Analysis](#probing-analysis)).Before performing steps, navigate to appropriate directory:
```
cd /protein_attention/attention_analysis
```#### Tape BERT Model
The following executes the attention analysis (may run for several hours):
```
sh scripts/compute_all_features_tape_bert.sh
```
The above script create a set of extract files in /data/cache corresponding to various properties
being analyzed. You may edit the script files to remove properties that you are not interested in. If you wish to run the
analysis without a GPU, you must specify the `--no_cuda` flag.The following generate reports based on the files created in previous step:
```
sh scripts/report_all_features_tape_bert.sh
```
If you removed steps from the analysis script above, you will need to update the reporting script accordingly.#### ProtTrans Models
In order to generate reports for the ProtTrans models, follow the instructions as for the TapeBert
model above, but substitute the following commands:**ProtBert:**
```
sh scripts/compute_all_features_prot_bert.sh
sh scripts/report_all_features_prot_bert.sh
```
**ProtBertBFD:**
```
sh scripts/compute_all_features_prot_bert_bfd.sh
sh scripts/report_all_features_prot_bert_bfd.sh
```**ProtAlbert:**
```
sh scripts/compute_all_features_prot_albert.sh
sh scripts/report_all_features_prot_albert.sh
```**ProtXLNet:**
```
sh scripts/compute_all_features_prot_xlnet.sh
sh scripts/report_all_features_prot_xlnet.sh
```### Probing Analysis
The following steps will recreate the figures from the probing analysis, currently found in /reports/probing
Navigate to directory:
```
cd /protein_attention/probing
```#### Training
Train diagnostic classifiers. Each script will write out an extract file with evaluation results. Note: each of these scripts may run for several hours.
```
sh scripts/probe_ss4_0_all
sh scripts/probe_ss4_1_all
sh scripts/probe_ss4_2_all
sh scripts/probe_sites.sh
sh scripts/probe_contacts.sh
```
#### Reports
```
python report.py
```## License
This project is licensed under BSD3 License - see the [LICENSE](LICENSE) file for details
## Acknowledgments
This project incorporates code from the following repo:
* https://github.com/songlab-cal/tape## Citation
When referencing this repository, please cite [this paper](https://arxiv.org/abs/2006.15222).
```
@misc{vig2020bertology,
title={BERTology Meets Biology: Interpreting Attention in Protein Language Models},
author={Jesse Vig and Ali Madani and Lav R. Varshney and Caiming Xiong and Richard Socher and Nazneen Fatema Rajani},
year={2020},
eprint={2006.15222},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2006.15222}
}
```