https://kkrismer.github.io/seqgra/

seqgra: Synthetic rule-based biological sequence data generation for architecture evaluation and search
https://kkrismer.github.io/seqgra/

deep-learning interpretable-machine-learning neural-networks pytorch tensorflow

Last synced: 2 months ago
JSON representation

seqgra: Synthetic rule-based biological sequence data generation for architecture evaluation and search

Host: GitHub
URL: https://kkrismer.github.io/seqgra/
Owner: kkrismer
License: mit
Created: 2019-09-17T14:26:55.000Z (almost 6 years ago)
Default Branch: master
Last Pushed: 2022-05-30T18:35:51.000Z (about 3 years ago)
Last Synced: 2024-10-28T22:23:22.398Z (9 months ago)
Topics: deep-learning, interpretable-machine-learning, neural-networks, pytorch, tensorflow
Language: Python
Homepage: https://kkrismer.github.io/seqgra/
Size: 8.87 MB
Stars: 0
Watchers: 6
Forks: 2
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

Awesome Lists containing this project

awesome-dl4g - seqgra - A deep learning pipeline that incorporates the rule-based simulation of biological sequence data and the training and evaluation of models (Software packages / Data wrangling)

README

        # seqgra: Principled Selection of Neural Network Architectures for Genomics Prediction Tasks

[![license: MIT](https://img.shields.io/badge/license-MIT-blue.svg)](https://opensource.org/licenses/MIT) [![DOI](https://img.shields.io/badge/DOI-10.1093%2Fbioinformatics%2Fbtac101-blue.svg)](https://doi.org/10.1093/bioinformatics/btac101) [![PyPI version](https://badge.fury.io/py/seqgra.svg)](https://badge.fury.io/py/seqgra) [![Travis build status](https://travis-ci.com/kkrismer/seqgra.svg?branch=master)](https://travis-ci.com/kkrismer/seqgra)

https://kkrismer.github.io/seqgra/

## What is seqgra?

Sequence models based on deep neural networks have achieved state-of-the-art 

performance on regulatory genomics prediction tasks, such as chromatin 

accessibility and transcription factor binding. But despite their high 

accuracy, their contributions to a mechanistic understanding of the biology 

of regulatory elements is often hindered by the complexity of the predictive 

model and thus poor interpretability of its decision boundaries. To address 

this, we introduce seqgra, a deep learning pipeline that incorporates the 

rule-based simulation of biological sequence data and the training and 

evaluation of models, whose decision boundaries mirror the rules from the 

simulation process. The method can be used to (1) generate data under the 

assumption of a hypothesized model of genome regulation, (2) identify neural 

network architectures capable of recovering the rules of said model, and (3) 

analyze a model's predictive performance as a function of training set size, 

noise level, and the complexity of the rules behind the simulated data.

## Installation

seqgra is a Python package that is part of [PyPI](https://pypi.org/), 

the package repositories behind [pip](https://pip.pypa.io/en/stable/).

To install seqgra with pip, run:

```

pip install seqgra

```

To install seqgra directly from this repository, run:

```

git clone https://github.com/gifford-lab/seqgra

cd seqgra

pip install .

```

### System requirements

- Python 3.7 (or higher)

- *R 3.5 (or higher)*

    - *R package `ggplot2` 3.3.0 (or higher)*

    - *R package `gridExtra` 2.3 (or higher)*

    - *R package `scales` 1.1.0 (or higher)*

The ``tensorflow`` package is only required if TensorFlow models are used 

and will not be automatically installed by ``pip install seqgra``. Same is 

true for packages ``torch`` and ``pytorch-ignite``, which are only 

required if PyTorch models are used.

R is a soft dependency, in the sense that it is used to create a number 

of plots (grammar-model-agreement plots, 

grammar heatmaps, and motif similarity matrix plots) and if not available, 

these plots will be skipped.

seqgra depends upon the Python package [lxml](https://lxml.de/), which in turn 

depends on system libraries that are not always present. On a 

Debian/Ubuntu machine you can satisfy those requirements using:

```

sudo apt-get install libxml2-dev libxslt-dev

```

## Usage

Check out the following help pages:

* [Usage examples](https://kkrismer.github.io/seqgra/examples.html): seqgra example analyses with data definitions and model definitions

* [Command line utilities](https://kkrismer.github.io/seqgra/cmd.html): argument descriptions for `seqgra`, `seqgras`, `seqgrae`, and `seqgraa` commands

* [Data definition](https://kkrismer.github.io/seqgra/dd.html): detailed description of the data definition language that is used to formalize grammars

* [Model definition](https://kkrismer.github.io/seqgra/md.html): detailed description of the model definition language that is used to describe neural network architectures and hyperparameters for the optimizer, the loss, and the training process

* [Simulators, Learners, Evaluators, Comparators](https://kkrismer.github.io/seqgra/slec.html): brief descriptions of the most important classes

* [seqgra API reference](https://kkrismer.github.io/seqgra/seqgra.html): detailed description of the seqgra API

* [Source code](https://github.com/gifford-lab/seqgra): seqgra source code repository on GitHub 

## Citation

If you use seqgra in your work, please cite:

**seqgra: Principled Selection of Neural Network Architectures for Genomics Prediction Tasks**  

Konstantin Krismer, Jennifer Hammelman, and David K. Gifford  

Bioinformatics, Volume 38, Issue 9, 1 May 2022, Pages 2381–2388; DOI: https://doi.org/10.1093/bioinformatics/btac101

## Funding

We gratefully acknowledge funding from NIH grants 1R01HG008754 and 

1R01NS109217.

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://kkrismer.github.io/seqgra/

Awesome Lists containing this project

README