awesome-biochem-ai
Curated list of Deep Transformer applications in Biology and Chemistry
https://github.com/ratthachat/awesome-biochem-ai
Molecule Representation
- SELFIES - A string-based molecule representation introduced in 2020. It is well known that in molecule-generation applications, generated SMILES strings are often invalid because of either syntactic violations (e.g. incomplete brackets) or semantic violations (e.g. an atom with more bonds than its physical valence allows). SELFIES is designed to be robust to token mutation and token permutation, solving these problems of SMILES strings. Recent applications such as [ChemGPT (2022)](https://chemrxiv.org/engage/chemrxiv/article-details/627bddd544bdd532395fb4b5) also use the SELFIES representation.
- SMILES - The standard string-based molecule representation. Because a molecule becomes a string of characters, it fits naturally into sequence-model machine learning. Therefore, state-of-the-art language models like BERT, GPT and language-translation Transformers are commonly seen in the literature.
- Graph - A molecule can also be represented as a graph with one-hot atom vectors and adjacency bond matrices. Even a 3D-structure molecule can be represented by a graph, either using ad-hoc kNN adjacency or [directly embedding 3D coordinates into a graph (EquiBind, 2022)](https://arxiv.org/pdf/2202.05146.pdf). A code sketch of all three representations follows this list.
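A minimal sketch of the three representations above, assuming the `rdkit` and `selfies` packages are installed; phenol is just an arbitrary example molecule:

```python
# Minimal sketch: SMILES, SELFIES and graph views of the same molecule.
# Assumes `pip install rdkit selfies numpy`.
import numpy as np
from rdkit import Chem
import selfies as sf

smiles = "C1=CC=CC=C1O"                   # phenol as a SMILES string

# SMILES <-> SELFIES; SELFIES decoding is robust to token mutation/permutation.
selfies_str = sf.encoder(smiles)          # SELFIES token string
roundtrip = sf.decoder(selfies_str)       # back to a (valid) SMILES string

# Graph view: one-hot atom features plus an adjacency (bond) matrix.
mol = Chem.MolFromSmiles(smiles)
atom_symbols = sorted({atom.GetSymbol() for atom in mol.GetAtoms()})
one_hot = np.zeros((mol.GetNumAtoms(), len(atom_symbols)))
for atom in mol.GetAtoms():
    one_hot[atom.GetIdx(), atom_symbols.index(atom.GetSymbol())] = 1.0
adjacency = Chem.GetAdjacencyMatrix(mol)  # (num_atoms, num_atoms) 0/1 matrix

print(selfies_str)
print(roundtrip, one_hot.shape, adjacency.shape)
```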
Molecule Retrosynthesis Pathways
- BioNavi-NP (2022) - BioNavi-NP is based on the recent work of [Retro* (ICML 2020)](https://github.com/binghong-ml/retro_star), a framework for multi-step pathway prediction based on the [A*-algorithm](https://en.wikipedia.org/wiki/A*_search_algorithm). BioNavi-NP concretizes Retro* for natural-product (bio-synthesis) pathways by (1) using standard transformers as the one-step predictor, (2) pretraining on organic chemical pathways and then finetuning on bio-synthesis pathways, and (3) providing a [web-based extension](http://biopathnavi.qmclab.com/) which connects to other systems such as [Selenzyme](http://selenzyme.synbiochem.co.uk/) to predict an enzyme for each reaction in a pathway.
- Root-aligned SMILES (2022) - Proposes **R-SMILES**, an adjusted SMILES representation of the input molecules (i.e. the product and reactant SMILES are aligned by starting from the same root atom), so that the discrepancy (e.g. edit distance) between product and reactant SMILES becomes much smaller than with the default canonical SMILES. By simply applying vanilla encoder-decoder transformers with R-SMILES, the authors claim state-of-the-art results on the standard benchmark (USPTO). Note that there is a bug in Fig. 1 of the published journal paper, so readers should read [the updated arXiv version](https://arxiv.org/abs/2203.11444). A minimal rooted-SMILES sketch follows this list.
- AiZynthFinder (2020)
- RetroXpert (2020) - Note that the originally reported results are affected by an information leak, as noted in the GitHub repo.
- Graph-to-Graph Retrosynthesis (2020) - [Paper - ICML 2020](https://arxiv.org/pdf/2003.12725.pdf)
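A minimal sketch of the root-aligned SMILES idea from the R-SMILES entry above, using only RDKit; the choice of root atom here is illustrative, not the authors' exact alignment procedure:

```python
# Write the same molecule as different (non-canonical) SMILES strings, each
# starting from a chosen root atom. R-SMILES roots the product and reactant
# SMILES at corresponding atoms so that the two strings overlap heavily.
from rdkit import Chem

def rooted_smiles(smiles: str, root_atom_idx: int) -> str:
    """Return a non-canonical SMILES string starting at the given atom index."""
    mol = Chem.MolFromSmiles(smiles)
    return Chem.MolToSmiles(mol, rootedAtAtom=root_atom_idx, canonical=False)

phenol = "C1=CC=CC=C1O"
print(rooted_smiles(phenol, 0))   # written starting from a ring carbon
print(rooted_smiles(phenol, 6))   # the same molecule, written from the oxygen
```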
Atom Mapping
- RXNMapper (2021) - Uses the attention of SMILES-based transformers to perform atom-to-atom mapping between two SMILES strings (e.g. reactants and products); a usage sketch follows this list.
- Procrustes (2022) - A library of matrix-transformation (Procrustes analysis) methods that can also perform atom-to-atom mapping, as can be seen in the [tutorial here](https://procrustes.qcdevs.org/notebooks/Atom_Atom_Mapping.html). Unlike RXNMapper, which works with string-based molecules, Procrustes works with a graph-based representation of molecules.
- Maximum Common Substructure (MCS)
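A minimal RXNMapper usage sketch, assuming the `rxnmapper` package is installed; the esterification reaction below is just an arbitrary example:

```python
# Atom-to-atom mapping of a reaction SMILES with RXNMapper
# (assumes `pip install rxnmapper`).
from rxnmapper import RXNMapper

rxn_mapper = RXNMapper()
reactions = ["CC(=O)O.OCC>>CC(=O)OCC.O"]   # acetic acid + ethanol -> ethyl acetate
results = rxn_mapper.get_attention_guided_atom_maps(reactions)

print(results[0]["mapped_rxn"])    # reaction SMILES annotated with atom-map numbers
print(results[0]["confidence"])    # confidence score of the predicted mapping
```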
Molecule Similarity Metrics
- MAP4 (2020) - A molecular fingerprint proposed in the paper [One molecular fingerprint to rule them all: drugs, biomolecules, and the metabolome](https://jcheminf.biomedcentral.com/articles/10.1186/s13321-020-00445-4), claimed to be an improvement over existing [standard fingerprints](https://rdkit.org/UGM/2012/Landrum_RDKit_UGM.Fingerprints.Final.pptx.pdf). Once a molecule is converted to its fingerprint, we can use [Tanimoto similarity](https://en.wikipedia.org/wiki/Chemical_similarity) to measure the similarity of any two fingerprints (see the code sketch after this list).
- Graph Matching Network (ICML 2019) - A deep-learning model for approximate graph matching. However, the model needs to be trained to perform this task.
- Levenshtein distance - A classic string edit-distance metric, applicable directly to SMILES/SELFIES strings; see the [TextDistance repository](https://github.com/life4/textdistance) for implementations of this and many other string similarities.
- MCS-distance - A distance based on the maximum common substructure (MCS) shared by two molecular graphs.
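A minimal sketch of the more accessible metrics above (Tanimoto on Morgan fingerprints, an MCS-based overlap, and Levenshtein on SMILES), assuming `rdkit` and `textdistance` are installed; MAP4 lives in its own package and is not shown:

```python
# Three simple molecule-similarity measures (assumes `pip install rdkit textdistance`).
import textdistance
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem, rdFMCS

smiles_a, smiles_b = "CCO", "CCN"                      # ethanol vs. ethylamine
mol_a, mol_b = Chem.MolFromSmiles(smiles_a), Chem.MolFromSmiles(smiles_b)

# Tanimoto similarity on Morgan (ECFP-like) bit fingerprints.
fp_a = AllChem.GetMorganFingerprintAsBitVect(mol_a, 2, nBits=2048)
fp_b = AllChem.GetMorganFingerprintAsBitVect(mol_b, 2, nBits=2048)
print("Tanimoto:", DataStructs.TanimotoSimilarity(fp_a, fp_b))

# MCS overlap: fraction of atoms covered by the maximum common substructure.
mcs = rdFMCS.FindMCS([mol_a, mol_b])
print("MCS overlap:", mcs.numAtoms / max(mol_a.GetNumAtoms(), mol_b.GetNumAtoms()))

# Levenshtein edit distance computed directly on the SMILES strings.
print("Levenshtein:", textdistance.levenshtein(smiles_a, smiles_b))
```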
Libraries on Molecule AI
3D
- DeepPurpose
- ProtTrans (2021) - Protein language models trained on raw amino-acid characters without using other biochemistry knowledge (for example, AlphaFold2 uses Multiple Sequence Alignments). Pure protein strings are fed to BERT/T5 models and protein vectors are produced as a result. These protein vectors are in turn fed to standard MLPs for protein property-prediction tasks (see the code sketch after this list).
- Ankh (2023) - [paper](https://arxiv.org/abs/2301.06568) The 2023 protein language model by the main author of ProtTrans. Ankh, which employs a T5-like architecture, claims to be superior to both ProtTrans and ESM/ESM2 on standard protein property predictions.
- ChemicalX - A deep-learning library for drug-pair scoring models and applications.
- ESM2 and ESMFold (2022) - Amazing work from Meta AI which, similar to ProtTrans, uses only pure protein strings as input to the model. Beyond property prediction, ESMFold is shown to be capable of producing 3D protein folding structures comparable to AlphaFold2, with much faster running time.
- OmegaFold (2022) - [paper](https://www.biorxiv.org/content/10.1101/2022.07.21.500999v1.full.pdf) Another work similar to ESMFold which is able to predict 3D protein folding structure from pure protein strings.
- TorchDrug - Easy-to-use models and datasets: molecule graph datasets and models, molecule generative models, protein graph datasets, and protein models.
- DIG - Dive Into Graphs, a library covering graph generation, self-supervised learning, explainability, and 3D-graph models.
- DGL-LifeSci
- DeepChem - Originally starting from chemistry-focused models, DeepChem is now an extensive TensorFlow library supporting several other fields, e.g. materials science and life science.
- MolGraph
- ML for protein engineering seminar series (youtube channel)
- Hugging Face official implementation of ESMFold
- AlphaFold2 (2021) - Ground-breaking work from Google's DeepMind capable of predicting highly accurate 3D protein structures. The official implementation is in JAX; see a PyTorch replication [here](https://github.com/lucidrains/alphafold2). It can be run on a free Colab machine with [ColabFold](https://github.com/sokrypton/ColabFold).
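A minimal sketch of the "protein string → transformer embedding → MLP head" recipe described for ProtTrans/ESM2, using the small public ESM-2 checkpoint `facebook/esm2_t6_8M_UR50D` from the Hugging Face Hub; the MLP head and the property being predicted are placeholders:

```python
# Protein property-prediction skeleton: frozen protein language model + MLP head.
# Assumes `pip install torch transformers`.
import torch
from transformers import AutoModel, AutoTokenizer

checkpoint = "facebook/esm2_t6_8M_UR50D"        # small public ESM-2 model
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
encoder = AutoModel.from_pretrained(checkpoint)

sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # toy amino-acid string
inputs = tokenizer(sequence, return_tensors="pt")

with torch.no_grad():
    hidden = encoder(**inputs).last_hidden_state   # (1, seq_len, hidden_dim)
protein_vector = hidden.mean(dim=1)                # simple mean pooling

# Placeholder MLP head for some scalar property (e.g. a stability score).
mlp_head = torch.nn.Sequential(
    torch.nn.Linear(protein_vector.shape[-1], 128),
    torch.nn.ReLU(),
    torch.nn.Linear(128, 1),
)
print(mlp_head(protein_vector))
```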
Protein-Protein Docking
Protein-Ligand Binding
3D
- EquiBind (ICML 2022) - For the protein-ligand binding/docking problem, this work uses "direct prediction" of the binding site instead of the "sampling and scoring" used by traditional approaches.
- MIT news
Protein Design
3D
- RF Diffusion (2022) - By the Baker lab, University of Washington: a powerful new way to design proteins by combining structure-prediction networks and generative diffusion models. The team demonstrated extremely high computational success and tested hundreds of AI-generated proteins in the lab, finding that many may be useful as medications, vaccines, or even new nanomaterials. Read more on the [Baker lab blog](https://www.bakerlab.org/latest/).
Table of Contents
- Deep Learning for Molecules and Materials - A free online book on deep learning for molecules and materials.
Property and Interaction Prediction
- ChemBERTa (2020) - Applies BERT-style masked-language-model self-supervised learning to SMILES strings; [pretrained models on several millions of molecules are available](https://huggingface.co/models?sort=downloads&search=chem). A minimal usage sketch follows this list.
- MolGraph (2022) paper - Graph-based self-supervised pretraining counterparts include [GROVER (2020)](https://github.com/tencent-ailab/grover), [MolCLR (2021)](https://github.com/yuyangw/MolCLR) and [GeoGNN (2022)](https://github.com/PaddlePaddle/PaddleHelix/tree/dev/apps/pretrained_compound/ChemRL/GEM).
- Self-Attention DTI (2019) - Uses transformer-like self-attention to encode drugs and CNNs to encode protein strings.
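A minimal ChemBERTa-style usage sketch with Hugging Face `transformers`; the checkpoint name below is one of the publicly shared ChemBERTa models (an assumption, since the list only links a Hub search), and any SMILES masked language model works the same way:

```python
# Masked-language-model inference on a SMILES string
# (assumes `pip install transformers` plus a backend such as torch).
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="seyonec/ChemBERTa-zinc-base-v1")

# Ask the model to recover a masked token in aspirin's SMILES string.
masked_smiles = "CC(=O)Oc1ccccc1C(=O)<mask>"
for candidate in fill_mask(masked_smiles)[:3]:
    print(candidate["token_str"], candidate["score"])
```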
Enzymatic Reaction
- Metabolite Translator (2020) - Treats "molecule-string translation" as perfectly analogous to "language translation".
- EC Numbers
- Enzymatic Transformers (2021)
- Improved Substrate Encodings and Convolutional Pooling (2022)
- ECFP6 - Extended-connectivity fingerprints (Morgan radius 3) used with a ["Count" encoder](https://stackoverflow.com/questions/54809506/how-can-i-compute-a-count-morgan-fingerprint-as-numpy-array) to capture molecular substructure information (see the code sketch after this list).
- RXNAAMapper (2021) - Maps reaction SMILES tokens to enzyme amino acids (aa) to predict the active site of the enzyme. This work uses string-based representations for both molecules (SMILES) and enzymes (amino-acid strings).
- MolSyn Transformers (2022) - Encodes the enzyme amino-acid sequence using state-of-the-art [protein ESM transformers](https://github.com/facebookresearch/esm).
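A minimal sketch of the ECFP6 "count" encoding with RDKit: a hashed Morgan fingerprint of radius 3 whose slots store how many times each substructure occurs rather than a 0/1 bit:

```python
# Count-based ECFP6 (Morgan radius 3) fingerprint as a NumPy vector
# (assumes `pip install rdkit numpy`).
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem

mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")                # aspirin
counts = AllChem.GetHashedMorganFingerprint(mol, 3, nBits=2048)  # radius 3 = ECFP6

vec = np.zeros(2048, dtype=np.int64)
for bit_id, count in counts.GetNonzeroElements().items():
    vec[bit_id] = count

print(vec.sum(), int((vec > 1).sum()))   # total substructure hits, repeated ones
```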
Molecule Generation
Graph-based Motif-level
- JunctionTree VAE (ICML 2018) - A classic work on graph-based molecule generation in a tree-like manner. Unlike atom-level approaches, it can generate a whole ring of a molecule in one step.
- HGraph2Graph (ICML 2020) - An improved framework over JT-VAE by the same authors. The differences are: (1) HGraph2Graph allows large motifs, whereas JT-VAE uses only small motifs such as rings; (2) HGraph2Graph is cleverly designed to avoid the combinatorial problem in molecule generation, e.g. it remembers the specific atoms in each motif vocabulary that can be connected, so it avoids considering all connection possibilities for each motif. For details see this [nice video clip by the author](https://www.youtube.com/watch?v=Y5ZLbJDsuEU).
- MoLeR (ICLR 2022) - [[Repo](https://github.com/microsoft/molecule-generation)] From Microsoft; claimed to be better than HGraph2Graph, e.g. HGraph2Graph cannot make arbitrary cyclic structures. A quick motif-decomposition sketch follows this list.
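To get a feel for what "motif-level" building blocks look like, here is an illustrative fragmentation of a molecule into larger pieces with RDKit's BRICS rules; this is not the JT-VAE/HGraph2Graph/MoLeR vocabulary construction itself, just a quick way to see ring-sized motifs:

```python
# Fragment a molecule into motif-sized building blocks with BRICS rules
# (assumes `pip install rdkit`).
from rdkit import Chem
from rdkit.Chem import BRICS

mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")   # aspirin
motifs = sorted(BRICS.BRICSDecompose(mol))
print(motifs)   # fragments keep whole rings as single pieces
```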
Protein Generation
3D
- ProGen2 (2022) - Successor of ProGen (Language Modeling for Protein Engineering) by Salesforce. This work employs a GPT-like (decoder-only) architecture; model weights from 151M to 6.4B parameters are available in the repo.