Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

awesome-dl4g

(Soon to be) community-curated list of software packages and data resources for deep learning for genomics (DL4G)
https://github.com/ML4GLand/awesome-dl4g


  • Software packages

    • DL4G Packages

      • DragoNN - [TensorFlow] - Predictive modeling of regulatory genomics, nucleotide-resolution feature discovery, and simulations for systematic development and benchmarking. (2016)
      • pysster - [TensorFlow] - A Python package for training and interpretation of convolutional neural networks on biological sequence data. (2018)
      • GOPHER - [TensorFlow] - Scripts for data preprocessing, for training deep learning models that predict epigenetic function from DNA sequence, and for evaluating those models. (2022)
      • ENNGene - [TensorFlow] - An application that simplifies the local training of custom Convolutional Neural Network models on Genomic data via an easy to use Graphical User Interface. (2022)
      • DeepChem - [PyTorch, TensorFlow, JAX] - Open-source toolchain that democratizes the use of deep learning in drug discovery, materials science, quantum chemistry, and biology. (2019)
      • Kipoi - [PyTorch, TensorFlow] - An API and a repository of ready-to-use trained models for genomics. Also allows for usage via the command line or R. (2019)
      • Selene - [PyTorch] - Python library and command line interface for training deep neural networks from biological sequence data such as genomes. (2019)
      • DeepAccess - [TensorFlow] - Training and interpreting CNNs for predicting cell type-specific accessibility. (2021)
      • Janggu - [Keras] - Package that facilitates deep learning in the context of genomics. Janggu provides special genomics datasets and compatibility with NumPy, sklearn, and Keras. (2021)
      • EUGENe - [PyTorch Lightning] - An API for running DL4G workflows with sequence-to-function models. Uses SeqData to containerize sequence data and integrates functions for data loading, model training, and model interpretation from several libraries. (2022)
    • Data wrangling

      • Nucleus - Library of Python and C++ code designed to make it easy to read, write and analyze data in common genomics file formats like SAM and VCF.
      • scikit-bio - An open-source, BSD-licensed Python package providing data structures, algorithms, and educational resources for bioinformatics.
      • BioNumPy - A Python library for easy and efficient representation and analysis of biological data. (2022)
      • seqgra - A deep learning pipeline that incorporates the rule-based simulation of biological sequence data and the training and evaluation of models
      • simdna - This is a tool for generating simulated regulatory sequence for use in experiments/analyses.
      • genome-loader - Pipeline for efficient genomic data processing.
      • BioPython - Biopython is a set of freely available tools for biological computation written in Python by an international team of developers.
      • kipoiseq - Standard set of data-loaders for training and making predictions for DNA sequence-based models.
      • PyRanges - GenomicRanges and genomic Rle-objects for Python.
      • BedTools - Swiss-army knife of tools for a wide-range of genomics analysis tasks
    • Model zoos

      • kipoi models - Repository that hosts predictive models for genomics and serves as a model source for [Kipoi](https://kipoi.org/groups/).
      • HuggingFace Transformers - [PyTorch, TensorFlow, JAX] - Transformers provides APIs and tools to easily download and train state-of-the-art pretrained models. (2021)
    • Visualizations

      • vizsequence - Collecting commonly-repeated sequence visualization code here. (2019)
      • seqlogo - Python port of Bioconductor's seqLogo served by WebLogo. (2020)
      • logomaker - a Python package for generating publication-quality sequence logos. (2019)
      • TensorBoard - TensorFlow's visualization toolkit
    • Interpretability

      • TF-MoDISco - Biological motif discovery algorithm that differentiates itself by using attribution scores from a machine learning model.
      • fastISM - [Keras] - Keras implementation for fast in-silico saturated mutagenesis (ISM) for convolution-based architectures
      • yuzu - [PyTorch] - a compressed sensing-based approach that can make in-silico saturation mutagenesis calculations on DNA, RNA, and proteins an order of magnitude faster
      • Scrambler - Interpretation method for sequence-predictive models based on deep generative masking
      • DFIM - Epistatic feature interactions from neural network models of regulatory DNA sequence
      • Captum - [PyTorch] - General library for model interpretability in PyTorch
      • SHAP - SHapley Additive exPlanations, a game-theoretic approach to explaining the output of any machine learning model
      • ExpectedPatternEffect - [TensorFlow] - interpretation of trained DeepAccess models
      • Global importance analysis - model interpretability with global importance analysis
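
Several entries above (fastISM, yuzu) speed up in-silico saturation mutagenesis (ISM). As a rough illustration of the naive procedure they accelerate, here is a minimal pure-Python sketch. It assumes a model can be treated as a function from a sequence to a single score; `toy_score`, its motif, and the example sequence are hypothetical stand-ins and are not taken from any package in this list.

```python
from typing import Callable, Dict, List

ALPHABET = "ACGT"


def toy_score(seq: str) -> float:
    """Hypothetical stand-in for a trained model: counts exact motif matches."""
    motif = "TGACTCA"  # illustrative motif only
    windows = range(len(seq) - len(motif) + 1)
    return float(sum(seq[i:i + len(motif)] == motif for i in windows))


def saturation_mutagenesis(
    seq: str, score_fn: Callable[[str], float]
) -> List[Dict[str, float]]:
    """Naive ISM: score every single-base substitution against the reference.

    Returns one dict per position mapping each base to
    (mutant score - reference score). The reference base maps to 0.0.
    """
    ref_score = score_fn(seq)
    effects = []
    for pos, ref_base in enumerate(seq):
        row = {}
        for alt in ALPHABET:
            if alt == ref_base:
                row[alt] = 0.0
            else:
                mutant = seq[:pos] + alt + seq[pos + 1:]
                row[alt] = score_fn(mutant) - ref_score
        effects.append(row)
    return effects


seq = "AATGACTCAGG"
effects = saturation_mutagenesis(seq, toy_score)
for pos, row in enumerate(effects):
    print(pos, seq[pos], row)
```

The naive loop calls the scoring function once for the reference and once per substitution, so a sequence of length L over a four-letter alphabet costs 3L + 1 evaluations. That cost is the motivation for faster variants such as the ones listed above.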
    • Deep learning frameworks

      • TensorFlow - Developed by the Google Brain team (released in 2015), has a reputation as a well-documented framework with powerful visualization tools (TensorBoard) and an abundance of trained models (TensorFlow Hub). Also known to be complex and have a steep learning curve. Often used for deploying trained models to production (TensorFlow Serving). Version 2.0 was released in 2019.
      • Keras - An API written in Python to simplify training models. Passes low-level computations to a backend library, which is often TensorFlow.
      • PyTorch - Developed by Facebook AI (released in 2017), has a reputation for simplicity, ease of use, flexibility, efficient memory usage and dynamic computational graphs. Often used for prototyping models and for research.
      • JAX - JAX is Autograd and XLA, brought together for high-performance numerical computing and machine learning research
    • Utilities

      • MEME suite - Motif-based sequence analysis tools
      • HOMER - suite of tools for Motif Discovery and next-gen sequencing analysis
      • RayTune - Python library for experiment execution and hyperparameter tuning at any scale
  • Similar lists and collections

  • Models

    • Convolutional

      • DeepBind - with-PyTorch), [EUGENe](https://github.com/cartercompbio/EUGENe/blob/refactor_models/eugene/models/_sequence_to_function.py#L1087)] - One of the seminal convolution-based architectures, trained to predict the binding of transcription factors and RNA-binding proteins.
    • Hybrid

      • DanQ - cbcl/DanQ/blob/master/DanQ_train.py), [Selene](https://github.com/FunctionLab/selene/blob/master/models/danQ.py), [DeepATT](https://github.com/ljw-struggle/Bioinfor-DeepATT/blob/main/model/model.py), [evo_aug](https://github.com/nkl27/genomic_augmentations/blob/main/supervised.py)] - Trained on the same dataset as DeepSEA to predict binarized epigenomic tracks from ENCODE and Roadmap. Added a bidirectional LSTM layer after the convolutions and experimented with initializing convolutional filter weights with motifs.
  • Datasets and databases

    • RNA binding

      • RNAcompete - In vitro binding assay of 244 RNA-binding proteins (RBPs). The dataset is downloaded as a single TSV file with RNA probes as rows and RBPs as columns. Each entry in the table is an intensity measurement (normalized or raw) of the binding of one protein to one probe. There are over 244 RBP columns and 241,357 sequences spanning two sets (SetA and SetB).
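
The probe-by-RBP table layout described above can be read with the standard library alone. The sketch below is only an illustration under stated assumptions: the column names (`probe_id`, `sequence`, `RBP_1`, `RBP_2`), the missing-value markers, and every value in `TOY_TSV` are made up for the example and do not come from the real file.

```python
import csv
import io
from typing import List, TextIO, Tuple

# Made-up miniature table in the described layout: probes as rows,
# RBPs as columns. Names and intensities are placeholders, not real data.
TOY_TSV = (
    "probe_id\tsequence\tRBP_1\tRBP_2\n"
    "p1\tACGUACGU\t1.5\t0.2\n"
    "p2\tUUUUGCAU\t-0.3\tNaN\n"
    "p3\tGCAUGCAU\t2.1\t0.9\n"
)

MISSING = {"", "nan", "na"}  # assumed markers for missing intensities


def load_rbp_column(
    handle: TextIO, rbp: str, seq_col: str = "sequence"
) -> List[Tuple[str, float]]:
    """Return (probe sequence, intensity) pairs for one RBP column.

    Rows whose intensity for that RBP is missing are skipped.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    pairs = []
    for row in reader:
        value = row[rbp].strip()
        if value.lower() in MISSING:
            continue
        pairs.append((row[seq_col], float(value)))
    return pairs


pairs = load_rbp_column(io.StringIO(TOY_TSV), "RBP_2")
print(pairs)
```

Pulling one RBP column at a time gives a per-protein regression dataset of sequences and intensities. For the full table, with hundreds of columns and hundreds of thousands of rows, a dataframe library would likely be more practical than row-by-row parsing.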
  • Journal articles of general interest