Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

awesome-biochem-ai

Curated list of deep transformer applications in biology and chemistry
https://github.com/ratthachat/awesome-biochem-ai

  • Deep Learning for Molecules and Materials - a free online book covering deep learning for chemistry and materials science.
  • SMILES - a string-based molecule representation that lets molecules be treated as text for sequence-model machine learning. Therefore, state-of-the-art language models like BERT, GPT and language-translation Transformers are commonly seen in the literature.
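
A minimal sketch, assuming the RDKit package: parse a SMILES string and write it back in canonical form, a common preprocessing step before feeding SMILES to sequence models. The phenol string is a toy input.

```python
from rdkit import Chem

smiles = "C1=CC=CC=C1O"            # phenol, written non-canonically
mol = Chem.MolFromSmiles(smiles)   # returns None for an invalid SMILES string
print(Chem.MolToSmiles(mol))       # canonical form, e.g. 'Oc1ccccc1'
```
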
  • SELFIES - a string-based molecule representation introduced in 2020. It is well known that in molecule-generation applications, generated SMILES strings are often invalid because of either syntactic violations (e.g. incomplete brackets) or semantic violations (e.g. an atom with more bonds than its physical capacity). SELFIES is designed to be robust to token mutation and token permutation, solving these problems of SMILES strings. Recent applications such as [ChemGPT (2022)](https://chemrxiv.org/engage/chemrxiv/article-details/627bddd544bdd532395fb4b5) also use the SELFIES representation.
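
A minimal round-trip sketch, assuming the `selfies` package (this mirrors the usage shown in its README); any syntactically valid SELFIES string decodes to some valid molecule, which is the robustness property described above.

```python
import selfies as sf

benzene = "c1ccccc1"
encoded = sf.encoder(benzene)   # SMILES -> SELFIES
decoded = sf.decoder(encoded)   # SELFIES -> a guaranteed-valid SMILES
print(encoded)
print(decoded)
```
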
  • Graph - a molecule can be represented as a graph of one-hot atom vectors and adjacency bond matrices. Even a 3D molecular structure can be represented by a graph, either by using ad-hoc kNN adjacency or by [directly embedding 3D coordinates into a graph (EquiBind, 2022)](https://arxiv.org/pdf/2202.05146.pdf).
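
A minimal sketch of this graph encoding, assuming RDKit and NumPy; the three-element atom vocabulary is a toy assumption.

```python
import numpy as np
from rdkit import Chem

mol = Chem.MolFromSmiles("CCO")            # ethanol: C-C-O
elements = ["C", "N", "O"]                 # toy vocabulary for one-hot atoms
atom_feats = np.zeros((mol.GetNumAtoms(), len(elements)))
for atom in mol.GetAtoms():
    atom_feats[atom.GetIdx(), elements.index(atom.GetSymbol())] = 1.0

adjacency = Chem.GetAdjacencyMatrix(mol)   # (num_atoms, num_atoms) 0/1 matrix
print(atom_feats.shape, adjacency.shape)
```
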
  • ChemBERTa (2020) - a BERT-like language model for SMILES trained with self-supervised learning; [pretrained models on several million molecules are available](https://huggingface.co/models?sort=downloads&search=chem).
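
A minimal embedding sketch, assuming the `transformers` library and the community checkpoint `seyonec/ChemBERTa-zinc-base-v1` from the Hugging Face hub.

```python
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("seyonec/ChemBERTa-zinc-base-v1")
model = AutoModel.from_pretrained("seyonec/ChemBERTa-zinc-base-v1")

inputs = tokenizer("CCO", return_tensors="pt")             # a SMILES string
embedding = model(**inputs).last_hidden_state.mean(dim=1)  # one vector per molecule
print(embedding.shape)
```
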
  • MolGraph (2022) paper - graph-based self-supervised pretraining of molecules; see e.g. [GROVER (2020)](https://github.com/tencent-ailab/grover), [MolCLR (2021)](https://github.com/yuyangw/MolCLR) and [GeoGNN (2022)](https://github.com/PaddlePaddle/PaddleHelix/tree/dev/apps/pretrained_compound/ChemRL/GEM).
  • Self-Attention DTI (2019) - uses transformer-style self-attention to encode drugs and CNNs to encode protein strings.
  • Metabolite Translator (2020) - treats "molecule-string translation" as perfectly analogous to "language translation".
  • EC Numbers - Enzyme Commission numbers, a hierarchical numerical classification of enzymes by the chemical reactions they catalyze.
  • Enzymatic Transformers (2021)
  • Improved Substrate Encodings and Convolutional Pooling (2022)
  • ECFP6 - extended-connectivity fingerprint (equivalent to a Morgan fingerprint of radius 3), often used with a ["Count" encoder](https://stackoverflow.com/questions/54809506/how-can-i-compute-a-count-morgan-fingerprint-as-numpy-array) to capture molecular substructure information.
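
A minimal sketch following the linked Stack Overflow approach, assuming RDKit and NumPy; aspirin is a toy input.

```python
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")            # aspirin
fp = AllChem.GetHashedMorganFingerprint(mol, 3, nBits=2048)  # radius 3 ~ ECFP6

counts = np.zeros((2048,), dtype=np.int32)
DataStructs.ConvertToNumpyArray(fp, counts)   # substructure counts, not bits
print(counts.sum())
```
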
  • RXNAAMapper (2021) - maps reactions to enzyme amino acids (aa) to predict the active site of the enzyme. This work uses string-based representations for both molecules (SMILES) and enzymes (amino-acid strings).
  • MolSyn Transformers (2022) - encodes the enzyme amino-acid sequence using the state-of-the-art [protein ESM transformers](https://github.com/facebookresearch/esm).
  • BioNavi-NP (2022) - BioNavi-NP builds on the recent [Retro* (ICML 2020)](https://github.com/binghong-ml/retro_star), a framework for multi-step pathway prediction based on the [A* algorithm](https://en.wikipedia.org/wiki/A*_search_algorithm). BioNavi-NP instantiates Retro* for natural-product (biosynthesis) pathways by (1) using standard transformers as the one-step predictor, (2) pretraining on organic-chemistry pathways and then fine-tuning on biosynthesis pathways, and (3) providing a [web-based extension](http://biopathnavi.qmclab.com/) that connects to other systems such as [Selenzyme](http://selenzyme.synbiochem.co.uk/) to predict an enzyme for each reaction in a pathway.
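
A minimal sketch of the A*-style best-first search at the core of Retro*-like planners; `expand` and `heuristic` are hypothetical stand-ins for the learned one-step retrosynthesis model and value function.

```python
import heapq
from itertools import count

def a_star(start, is_solved, expand, heuristic):
    """Best-first search: expand(state) yields (next_state, step_cost) pairs,
    heuristic(state) estimates the remaining cost to a solved state."""
    tie = count()   # tiebreaker so states themselves are never compared
    frontier = [(heuristic(start), next(tie), 0.0, start, [start])]
    visited = set()
    while frontier:
        _, _, cost, state, path = heapq.heappop(frontier)
        if is_solved(state):
            return path, cost
        if state in visited:
            continue
        visited.add(state)
        for nxt, step_cost in expand(state):
            total = cost + step_cost
            heapq.heappush(frontier,
                           (total + heuristic(nxt), next(tie), total, nxt, path + [nxt]))
    return None, float("inf")
```
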
  • Root-aligned SMILES (2022) - proposes **R-SMILES**, an adjusted SMILES representation of the input molecules (i.e. the two molecules are aligned by starting from the same root atom), so that the discrepancy (e.g. edit distance) between product and reactant SMILES becomes much smaller than with the default canonical SMILES. By simply applying vanilla encoder-decoder transformers to R-SMILES, the authors claim state-of-the-art results on the standard benchmark (USPTO). Note that there is a bug in Fig. 1 of the published journal paper, so readers should read [the updated arXiv version](https://arxiv.org/abs/2203.11444).
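
A minimal sketch of the underlying idea, assuming RDKit: the same molecule yields different SMILES strings depending on the chosen root atom, which is what R-SMILES exploits to align product and reactant strings.

```python
from rdkit import Chem

mol = Chem.MolFromSmiles("CCO")
for root in range(mol.GetNumAtoms()):
    print(Chem.MolToSmiles(mol, rootedAtAtom=root, canonical=False))
# e.g. 'CCO', 'C(C)O', 'OCC' -- one molecule, three different roots
```
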
  • AiZynthFinder (2020)
  • RetroXpert (2020) - note that this work suffers from an information leak, as noted in the GitHub repo.
  • Graph-to-Graph Retrosynthesis (2020) - [Paper - ICML 2020](https://arxiv.org/pdf/2003.12725.pdf)
  • RXNMapper (2021) - uses attention-based transformers to perform atom-atom mapping between two SMILES strings.
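
A minimal usage sketch following the API documented in the repo, assuming the `rxnmapper` package; the esterification reaction is a toy input.

```python
from rxnmapper import RXNMapper

rxn_mapper = RXNMapper()
rxns = ["CC(=O)O.OCC>>CC(=O)OCC.O"]   # acetic acid + ethanol -> ethyl acetate
results = rxn_mapper.get_attention_guided_atom_maps(rxns)
print(results[0]["mapped_rxn"])       # reaction SMILES with atom-map numbers
```
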
  • Procrustes (2022) - a mathematical optimization library that can be applied to atom-atom mapping, as shown in the [tutorial here](https://procrustes.qcdevs.org/notebooks/Atom_Atom_Mapping.html). Unlike RXNMapper, which works with string-based molecules, Procrustes works with graph-based representations of molecules.
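
A minimal sketch of the underlying matching idea only (not the `procrustes` library's API), assuming NumPy and SciPy: recover an atom permutation by optimally assigning atoms between two 3D point sets.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

coords_a = np.random.rand(5, 3)        # toy 3D coordinates for 5 atoms
perm = np.random.permutation(5)
coords_b = coords_a[perm]              # same molecule with shuffled atom order

cost = np.linalg.norm(coords_a[:, None, :] - coords_b[None, :, :], axis=-1)
row, col = linear_sum_assignment(cost) # optimal one-to-one atom assignment
print(np.allclose(coords_a, coords_b[col]))  # True: mapping recovered
```
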
  • Maximum Common Substructure (MCS)
  • HGraph2Graph (2020)
  • MAP4 (2020) - the memorably titled paper [One molecular fingerprint to rule them all: drugs, biomolecules, and the metabolome](https://jcheminf.biomedcentral.com/articles/10.1186/s13321-020-00445-4) claims an improvement over existing [standard fingerprints](https://rdkit.org/UGM/2012/Landrum_RDKit_UGM.Fingerprints.Final.pptx.pdf). Once a molecule is converted to its fingerprint, we can use [Tanimoto similarity](https://en.wikipedia.org/wiki/Chemical_similarity) to measure the similarity of any two fingerprints.
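
A minimal sketch, assuming RDKit; standard Morgan bit vectors are used here as a stand-in, since MAP4 itself lives in a separate `map4` package.

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

fp1 = AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles("CCO"), 2, nBits=2048)
fp2 = AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles("CCN"), 2, nBits=2048)
print(DataStructs.TanimotoSimilarity(fp1, fp2))  # 1.0 identical, 0.0 disjoint
```
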
  • Levenshtein distance - a classic edit-distance measure of string similarity, directly applicable to SMILES strings; see the [TextDistance repository](https://github.com/life4/textdistance).
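
A minimal sketch, assuming the `textdistance` package; the two SMILES strings are toy inputs.

```python
import textdistance

a, b = "CC(=O)O", "CC(=O)N"  # acetic acid vs. acetamide
print(textdistance.levenshtein.distance(a, b))               # 1 edit
print(textdistance.levenshtein.normalized_similarity(a, b))  # in [0, 1]
```
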
  • MCS-distance
  • maximum common substructure
  • Graph Matching Network (ICML 2019) - a deep-learning model for approximate graph matching. However, the model must be trained to perform this task.
  • JunctionTree VAE (ICML 2018) - a classic work on graph-based molecule generation in a tree-like manner. Unlike atom-level approaches, it can generate a ring in a molecule in one step.
  • HGraph2Graph (ICML 2020) - an improved framework over JT-VAE by the same authors. The differences are: (1) HGraph2Graph can use large motifs, whereas JT-VAE uses only small motifs such as rings; (2) HGraph2Graph is cleverly designed to avoid the combinatorial problem in molecule generation, e.g. it remembers which specific atoms in each motif vocabulary can be connected, and hence avoids considering all connection possibilities within each motif. For details, see this [nice video clip by the author](https://www.youtube.com/watch?v=Y5ZLbJDsuEU).
  • MoLeR (ICLR 2022) - a graph-based molecule generator from Microsoft, claimed to improve on HGraph2Graph (e.g. HGraph2Graph cannot make arbitrary cyclic structures).
  • Equivariant Diffusion for Molecule Generation in 3D (ICML 2022)
  • DeepPurpose - a deep-learning toolkit for drug-target interaction prediction and related drug-discovery tasks.
  • TorchDrug - easy-to-use models and datasets: molecule graph datasets and models, molecule generative models, protein graph datasets, and protein models
  • DIG - a graph deep-learning library covering graph generation, self-supervised learning, explainability, and 3D-graph models
  • ChemicalX - a library of drug-pair scoring models and applications.
  • DGL-LifeSci - a DGL-based package for deep learning on graphs in life science.
  • DeepChem - originally started with chemistry-focused models, DeepChem is now an extensive TensorFlow library supporting several other fields, e.g. materials science and life science.
  • MolGraph
  • ML for protein engineering seminar series (YouTube channel)
  • AlphaFold2 (2021) - ground-breaking work by Google's DeepMind capable of predicting highly accurate 3D protein-folding structures. The official implementation is in JAX; see a PyTorch replication [here](https://github.com/lucidrains/alphafold2). It can be run on a free Colab machine with [ColabFold](https://github.com/sokrypton/ColabFold).
  • ProtTrans (2021) - protein language models trained on pure amino-acid characters without other biochemistry knowledge (for example, AlphaFold2 uses Multiple Sequence Alignment). Pure protein strings are fed to BERT/T5, which produce protein vectors; these vectors are in turn fed to standard MLPs for protein property-prediction tasks.
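
A minimal embedding sketch, assuming the `transformers` library and the `Rostlab/prot_bert` checkpoint from the Hugging Face hub (ProtBert expects space-separated amino-acid characters); the 10-residue sequence is a toy input.

```python
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("Rostlab/prot_bert", do_lower_case=False)
model = BertModel.from_pretrained("Rostlab/prot_bert")

sequence = " ".join("MKTAYIAKQR")                  # toy protein string
inputs = tokenizer(sequence, return_tensors="pt")
protein_vector = model(**inputs).last_hidden_state.mean(dim=1)
print(protein_vector.shape)                        # feed this to an MLP head
```
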
  • Ankh (2023) - [paper](https://arxiv.org/abs/2301.06568) The 2023 protein language model by the main author of ProtTrans. Ankh, which employs a T5-like architecture, claims to be superior to both ProtTrans and ESM/ESM2 on standard protein property predictions.
  • ESM2 and ESMFold (2022) - amazing work from Meta AI which, similar to ProtTrans, uses only pure protein strings as model input. Beyond property prediction, ESMFold is shown to be capable of producing 3D protein-folding structures comparable to AlphaFold2's, with a much faster running time.
  • Hugging Face official implementation of ESMFold
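
A minimal sketch, assuming the `transformers` library and the small `facebook/esm2_t6_8M_UR50D` checkpoint; the sequence is a toy input.

```python
from transformers import AutoTokenizer, EsmModel

tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t6_8M_UR50D")
model = EsmModel.from_pretrained("facebook/esm2_t6_8M_UR50D")

inputs = tokenizer("MKTAYIAKQR", return_tensors="pt")
embedding = model(**inputs).last_hidden_state.mean(dim=1)
print(embedding.shape)   # (1, 320) for this 8M-parameter model
```
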
  • OmegaFold (2022) - [paper](https://www.biorxiv.org/content/10.1101/2022.07.21.500999v1.full.pdf) Another work similar to ESMFold which is able to predict 3D protein folding structure from pure protein strings.
  • EquiDock (ICLR 2022)
  • EquiBind (ICML 2022) - for the protein-ligand binding/docking problem, these two works use "direct prediction" of the binding site instead of the "sampling and scoring" used by traditional approaches.
  • MIT news
  • ProGen2 (2022) - ProGen: Language Modeling for Protein Engineering, by Salesforce. This work employs a GPT-like (decoder-only) architecture. Model weights from 151M to 6.4B parameters are available in the repo.
  • RF Diffusion (2022) - by the Baker lab, University of Washington: a powerful new way to design proteins by combining structure-prediction networks and generative diffusion models. The team demonstrated extremely high computational success rates and tested hundreds of AI-generated proteins in the lab, finding that many may be useful as medications, vaccines, or even new nanomaterials. Read more on the [Baker lab blog](https://www.bakerlab.org/latest/).