https://github.com/zavolanlab/mirflowz

Snakemake workflow for the mapping and quantification of miRNAs and isomiRs from miRNA-Seq libraries.

# _MIRFLOWZ_

_MIRFLOWZ_ is a [Snakemake][snakemake] workflow for mapping miRNAs and isomiRs.

## Table of Contents

1. [Installation](#installation)
   - [Cloning the repository](#cloning-the-repository)
   - [Dependencies](#dependencies)
   - [Setting up the virtual environment](#setting-up-the-virtual-environment)
   - [Testing your installation](#testing-your-installation)
2. [Usage](#usage)
   - [Preparing inputs](#preparing-inputs)
   - [Running the workflow](#running-the-workflow)
   - [Expected output files](#expected-output-files)
   - [Creating a Snakemake report](#creating-a-snakemake-report)
3. [Workflow description](#workflow-description)
   - [Prepare module](#prepare-module)
   - [Map module](#map-module)
   - [Quantify module](#quantify-module)
   - [ASCII-style pileups module](#ascii-style-pileups-module)
4. [Contributing](#contributing)
5. [License](#license)
6. [Contact](#contact)

## Installation

The workflow lives inside this repository and will be available for you to run
after following the installation instructions laid out in this section.

### Cloning the repository

Navigate to the desired path on your file system, then clone the repository and
change into it with:

```bash
git clone https://github.com/zavolanlab/mirflowz.git
cd mirflowz
```

### Dependencies

For improved reproducibility and reusability of the workflow, as well as an
easy means to run it on a high-performance computing (HPC) cluster managed,
e.g., by [Slurm][slurm], all steps of the workflow run inside isolated
environments ([Apptainer][apptainer] containers or [Conda][conda]
environments). As a consequence, running this workflow has only a few
individual dependencies. These are managed by the package manager Conda, which
needs to be installed on your system before proceeding.

If you do not already have Conda installed globally on your system,
we recommend that you install [Miniconda][miniconda-installation]. For faster
creation of the environment (and Conda environments in general), you can also
install [Mamba][mamba] on top of Conda. In that case, replace `conda` with
`mamba` in the commands below (particularly in `conda env create`).

### Setting up the virtual environment

Create and activate the environment with the necessary dependencies with Conda:

```bash
conda env create -f environment.yml
conda activate mirflowz
```

If you plan to run _MIRFLOWZ_ via Conda, we recommend running the following
command for faster environment creation, especially if you will run it on an
HPC cluster:

```bash
conda config --set channel_priority strict
```

If you would like to contribute to _MIRFLOWZ_ development, you may find it
useful to create your environment with the development dependencies:

```bash
conda env create -f environment.dev.yml
```

### Testing your installation

Several tests are provided to check the integrity of the installation. Follow
the instructions in this section to make sure the workflow is ready to use.

#### Run test workflow on local machine

Execute one of the following commands to run the test workflow on your local
machine:

- Test workflow on local machine with **Apptainer**:

```bash
bash test/test_workflow_local_with_apptainer.sh
```

- Test workflow on local machine with **Conda**:

```bash
bash test/test_workflow_local_with_conda.sh
```

#### Run test workflow on a cluster via Slurm

Execute one of the following commands to run the test workflow on a
Slurm-managed high-performance computing (HPC) cluster:

- Test workflow with **Apptainer**:

```bash
bash test/test_workflow_slurm_with_apptainer.sh
```

- Test workflow with **Conda**:

```bash
bash test/test_workflow_slurm_with_conda.sh
```

#### Rule graph

Execute the following command to generate a rule graph image for the workflow.
The output will be found in the `images/` directory in the repository root.

```bash
bash test/test_rule_graph.sh
```

You can see the rule graph below in the
[workflow description](#workflow-description) section.

#### Clean up test results

After successfully running the tests above, you can run the following command
to remove all artifacts generated by the test runs:

```bash
bash test/test_cleanup.sh
```

## Usage

Now that your virtual environment is set up and the workflow is deployed and
tested, you can go ahead and run the workflow on your samples.

### Preparing inputs

We suggest keeping all the input files for a given run (or hard links
pointing to them) inside a dedicated directory, for instance under the
_MIRFLOWZ_ root directory. This makes it easier to keep the data together,
set up Apptainer access to them and reproduce analyses.

#### 1. Prepare a sample table

Refer to `test/test_files/sample_table.tsv` to see what this file
must look like, or use it as a template.

```bash
touch path/to/your/sample/table.tsv
```
> Fill the sample table according to the following requirements:
>
> - `sample`. Arbitrary name for the miRNA sequencing library.
> - `sample_file`. Path to the miRNA sequencing library file. The path must be
> relative to the directory where the workflow will be run.
> - `adapter`. Sequence of the 3'-end adapter used during library preparation.
> - `format`. One of `fa`/`fasta` or `fq`/`fastq`, if the library file is in
> FASTA or FASTQ format, respectively.
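
The requirements above can be checked before starting a run. The following is
a minimal sketch, not part of _MIRFLOWZ_: it assumes the table is
tab-separated with a header row naming the four columns, and the function
name `check_sample_table`, the library names, the paths and the adapter
sequence are all hypothetical.

```python
import csv
import io

REQUIRED_COLUMNS = ("sample", "sample_file", "adapter", "format")
ALLOWED_FORMATS = {"fa", "fasta", "fq", "fastq"}


def check_sample_table(handle):
    """Return a list of human-readable problems found in a sample table."""
    reader = csv.DictReader(handle, delimiter="\t")
    header = reader.fieldnames or []
    missing = [name for name in REQUIRED_COLUMNS if name not in header]
    if missing:
        return ["missing column(s): " + ", ".join(missing)]
    problems = []
    # Data rows start on line 2 because line 1 is the (assumed) header.
    for line_no, row in enumerate(reader, start=2):
        for name in REQUIRED_COLUMNS:
            if not row[name]:
                problems.append(f"line {line_no}: empty value in column {name!r}")
        if row["format"] and row["format"] not in ALLOWED_FORMATS:
            problems.append(f"line {line_no}: unknown format {row['format']!r}")
    return problems


# Hypothetical two-library table; the second row uses an unsupported format.
example = io.StringIO(
    "sample\tsample_file\tadapter\tformat\n"
    "lib_1\tdata/lib_1.fa\tAACTGTAGGCACCATCAAT\tfa\n"
    "lib_2\tdata/lib_2.bam\tAACTGTAGGCACCATCAAT\tbam\n"
)
for problem in check_sample_table(example):
    print(problem)
```

For a real table, pass an open file handle instead of the `io.StringIO`
example.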

#### 2. Prepare genome resources

There are four files you must provide, plus an optional fifth:

1. A **`gzip`ped FASTA** file containing **reference sequences**, typically the
genome of the source/organism from which the library was extracted.

2. A **`gzip`ped GTF** file with matching **gene annotations** for the
reference sequences above.

> _MIRFLOWZ_ expects both the reference sequence and gene annotation files to
> follow [Ensembl][ensembl] style/formatting. If you obtained these files from
> a source other than Ensembl, you must ensure that they adhere to the
> expected format by converting them, if necessary.

3. An **uncompressed GFF3** file with **microRNA annotations** for the reference
sequences above.

> _MIRFLOWZ_ expects the miRNA annotations to follow [miRBase][mirbase]
> style/formatting. If you obtained this file from a source other than miRBase,
> you must ensure that it adheres to the expected format by converting it, if
> necessary.

4. An **uncompressed tab-separated file** with a **mapping between the
reference names** used in the miRNA annotation file (column 1; "UCSC style")
and in the gene annotations and reference sequence files (column 2; "Ensembl
style"). Values in column 1 are expected to be unique, no header is
expected, and any additional columns will be ignored. [This resource][chrMap]
provides such files for various organisms, and in the expected format.

5. **OPTIONAL**: A **BED6** file with regions for which to produce
[ASCII-style pileups][ascii-pileups]. If not provided, no pileups are
generated. See [here][bed-format] for the expected format.

> General note: If you want to process the genome resources before use (e.g.,
> filtering), you can do that, but make sure the formats of any modified
> resource files meet the formatting expectations outlined above!
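
The expectations for the reference name mapping table (item 4) can be
expressed as a short sketch. This is an illustration only, not the code
_MIRFLOWZ_ uses; the function name `read_chromosome_map` and the example
names are hypothetical.

```python
def read_chromosome_map(lines):
    """Parse a headerless, tab-separated reference name mapping table.

    Column 1 (UCSC style) is mapped to column 2 (Ensembl style); any
    additional columns are ignored. Duplicates in column 1 are rejected.
    """
    mapping = {}
    for line_no, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line:
            continue  # tolerate blank lines (an assumption of this sketch)
        fields = line.split("\t")
        if len(fields) < 2:
            raise ValueError(f"line {line_no}: expected at least two columns")
        ucsc_name, ensembl_name = fields[0], fields[1]
        if ucsc_name in mapping:
            raise ValueError(f"line {line_no}: duplicate name {ucsc_name!r}")
        mapping[ucsc_name] = ensembl_name
    return mapping


# Hypothetical table content; the third column is ignored.
table = ["chr1\t1\tignored\n", "chrM\tMT\n"]
print(read_chromosome_map(table))
```

Rejecting duplicates up front avoids silently overwriting one mapping with
another when reference names are translated.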

#### 3. Prepare a configuration file

We recommend creating a copy of the
[configuration file template](config/config_template.yaml):

```bash
cp config/config_template.yaml path/to/config.yaml
```

Open the new copy in your editor of choice and adjust the configuration
parameters to your liking. The template explains what each of the
parameters means and how you can meaningfully adjust it.

### Running the workflow

With all the required files in place, you can now run the workflow locally
via Apptainer with the following command:

```bash
snakemake \
--snakefile="path/to/Snakefile" \
--cores 4 \
--configfile="path/to/config.yaml" \
--software-deployment-method apptainer \
--apptainer-args "--bind ${PWD}/../" \
--printshellcmds \
--rerun-incomplete \
--verbose
```

Likewise, you can run the workflow locally via Conda with the following
command:

```bash
snakemake \
--snakefile="path/to/Snakefile" \
--cores 4 \
--configfile="path/to/config.yaml" \
--software-deployment-method conda \
--printshellcmds \
--rerun-incomplete \
--verbose
```

> **NOTE:** Depending on your working directory, you may not need the
> parameters `--snakefile` and `--configfile`. For instance, if the `Snakefile`
> is in the current working directory or in a `workflow/` directory beneath
> it, there is no need for the `--snakefile` parameter. Refer to
> the [Snakemake documentation][snakemakeDocu] for more information.

After successful execution of the workflow, results and logs will be found in
the `results/` and `logs/` directories, respectively.

### Expected output files

Upon successful execution of _MIRFLOWZ_, the tool automatically removes all
intermediate files generated during the process. The final outputs comprise:

1. A SAM file containing alignments intersecting a pri-miR locus. These
alignments intersect with extended start and/or end positions specified in the
provided pri-miR annotations. Please note that they may not contribute to the
final counting and may not appear in the final table.

2. A SAM file containing alignments intersecting a mature miRNA locus. Similar
to the previous file, these alignments intersect with extended start and/or end
positions specified in the provided miRNA annotations. They may not contribute
to the final counting and might be absent from the final table.

3. A BAM file containing the set of alignments contributing to the final
counting and its corresponding index file (`.bam.bai`).

4. Table(s) containing the counting data from all libraries for (iso)miRs
and/or pri-miRs. Each row corresponds to a miRNA species, and each column
represents a sample library. Each read is counted towards all the annotated
miRNA species it aligns to, with 1/n, where n is the number of genomic and/or
transcriptomic loci that read aligns to.

5. **OPTIONAL**. ASCII-style pileups of read alignments produced for individual
libraries, combinations of libraries and/or all libraries of a given run. The
exact number and nature of the outputs depends on the workflow
inputs/parameters. See the
[pileups section](pipeline_documentation.md/#pileup-workflow) for a detailed
description.

To retain all intermediate files, include `--no-hooks` in the workflow call.

```bash
snakemake \
--snakefile="path/to/Snakefile" \
--cores 4 \
--configfile="path/to/config.yaml" \
--software-deployment-method conda \
--printshellcmds \
--rerun-incomplete \
--no-hooks \
--verbose
```

After successful execution of the workflow, the intermediate files will be
found in the `results/intermediates` directory.

### Creating a Snakemake report

Snakemake provides the option to generate a detailed HTML report on runtime
statistics, workflow topology and results. If you want to create a Snakemake
report, you must run the following command:

```bash
snakemake \
--snakefile="path/to/Snakefile" \
--configfile="path/to/config.yaml" \
--report="snakemake_report.html"
```

> **NOTE:** The report creation must be done after running the workflow in
> order to have the runtime statistics and the results.

## Workflow description

_MIRFLOWZ_ consists of a main `Snakefile` and four functional modules. In the
`Snakefile`, the configuration file is validated and the various modules are
imported. In addition, handlers are set for both successful and failed runs:
if the workflow finishes without errors, all intermediate files are removed;
otherwise, a log file is created. To keep the intermediate files upon
completion, use the `--no-hooks` CLI argument when running the pipeline.

The modules [(1)](#prepare-module) process the genome resources,
[(2)](#map-module) map and [(3)](#quantify-module) quantify the reads, and
[(4)](#ascii-style-pileups-module) generate pileups, as described in detail
below.

> **NOTE:** _MIRFLOWZ_ uses the notation provided by miRBase (_i.e._,
> "miRNA primary transcript" for precursors and "miRNA" for the canonical
> mature miRNA). This implies that precursors are named "pri-miRs" across the
> workflow instead of "pre-miRs". This choice reflects the lack of a
> guarantee that "miRNA primary transcripts" are full pre-miR (and pre-miR
> only) sequences.

### Prepare module

The _MIRFLOWZ_ workflow initially processes and indexes the genome resources
provided by the user. The regions corresponding to mature miRNAs are extended
by a fixed but user-adjustable number of nucleotides on both sides to
accommodate isomiR species with shifted start and/or end positions. If
necessary, pri-miR loci are extended to adjust to the new miRNA coordinates.
In addition, to account for the different genomic locations at which a miRNA
sequence can be annotated, the names of these sequences are modified to have
the format `SPECIES-mir-NUMBER[LETTER]-#` for pri-miRs, and
`SPECIES-miR-NUMBER[LETTER]-#-ARM` or `SPECIES-miR-NUMBER[LETTER]-#` for mature
miRNAs with both or just one arm, respectively. Here, `#` is the paralog number
(replica/locus index), included when multiple loci express the same or similar
miRNAs, and `LETTER` denotes a sequence variant of the mature miRNA
(paralogous variant with similar but not identical sequences).
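
The coordinate extension described above can be sketched as follows. This is
an illustration under stated assumptions, not the workflow's own script:
coordinates are taken to be 1-based and inclusive (as in GFF3), the extended
locus is clamped to the chromosome boundaries, and the function names
`extend_locus` and `adjust_precursor` are hypothetical.

```python
def extend_locus(start, end, extension, chrom_length):
    """Extend a mature miRNA locus by `extension` nucleotides on both sides."""
    new_start = max(1, start - extension)         # do not run past position 1
    new_end = min(chrom_length, end + extension)  # nor past the chromosome end
    return new_start, new_end


def adjust_precursor(pri_start, pri_end, mature_loci):
    """Widen a pri-miR locus so that it covers all extended mature loci."""
    new_start = min([pri_start] + [start for start, _ in mature_loci])
    new_end = max([pri_end] + [end for _, end in mature_loci])
    return new_start, new_end


# Hypothetical precursor at 3-80 with two mature arms, extended by 6 nt each.
arms = [extend_locus(5, 26, 6, 1000), extend_locus(57, 79, 6, 1000)]
print(arms, adjust_precursor(3, 80, arms))
```

The precursor is only widened when an extended arm reaches beyond it, which
matches the "if necessary" wording above.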

### Map module

The user-provided short-read small RNA-seq libraries undergo quality filtering
(skipped if libraries are provided in FASTA rather than FASTQ), followed by
adapter removal. The resulting reads are independently mapped to both the
genome and the transcriptome using two distinct aligners: Segemehl and our
in-house tool Oligomap.

Segemehl implements a fast heuristic strategy that returns the alignment(s)
with the smallest edit distance. Oligomap, on the other hand, implements a
slower and more restricted approach that reports all the alignments with an
edit distance of at most 1. Combining the fast, flexible search with the
strict selection yields results of higher fidelity than either tool would
produce on its own.

Two merging steps are performed to collect all alignments in a single file. In
the first one, the transcriptome and genome mappings from both aligners are
fused, and only alignments with an NH (the number of reported alignments for
the read) smaller than the user-provided value are kept. In the second step,
transcriptomic coordinates are converted into genomic ones and the alignments
are combined into a single file. Duplicate alignments resulting from the
partially redundant mapping strategy are discarded, and only the best
alignments for each read are retained (_i.e._, the ones with the smallest edit
distance). Because aggregating the alignments changes the NH of each read, a
second filter based on the new NH is then applied. If a read aligns to more
loci than a specified threshold, it is removed for two reasons: (1)
performance, as the file size can increase rapidly, and (2) the fact that each
read contributes `1/N` to each count, where `N` is the number of genomic loci
it aligns to, so a large `N` makes the contribution negligible.
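
The deduplication, best-alignment selection and NH filtering of the second
merging step can be sketched as below. This is a simplified illustration, not
the workflow's implementation: alignments are represented as plain tuples
rather than SAM records, the function name `keep_best_alignments` is
hypothetical, and treating the threshold as inclusive is an assumption.

```python
from collections import defaultdict


def keep_best_alignments(alignments, max_hits):
    """Select the best alignments per read and drop reads with too many hits.

    `alignments` is an iterable of
    (read_id, chrom, start, strand, edit_distance) tuples.
    """
    by_read = defaultdict(set)
    for read_id, chrom, start, strand, distance in alignments:
        # A set discards duplicates reported by more than one aligner.
        by_read[read_id].add((chrom, start, strand, distance))

    kept = {}
    for read_id, hits in by_read.items():
        best = min(distance for _, _, _, distance in hits)
        best_hits = sorted(hit for hit in hits if hit[3] == best)
        # The number of remaining alignments plays the role of the new NH.
        if len(best_hits) <= max_hits:
            kept[read_id] = best_hits
    return kept


# Hypothetical input: r1 has a duplicated perfect hit, r2 has three equal hits.
example = [
    ("r1", "1", 100, "+", 0), ("r1", "1", 100, "+", 0), ("r1", "2", 50, "-", 1),
    ("r2", "1", 10, "+", 1), ("r2", "3", 20, "+", 1), ("r2", "4", 30, "-", 1),
]
print(keep_best_alignments(example, max_hits=2))
```

With `max_hits=2`, read `r2` is dropped because three equally good loci
remain after selection.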

A final filter is applied to further increase the classification accuracy and
reduce the number of multimappers (defined here as alignments of the same read
to different genomic loci with the same edit distance). IsomiRs are known to
contain more InDels than mismatches when compared to the canonical sequence
they derive from, as demonstrated by
[Saunders et al. (2017)][cite_saunders], [Neilsen et al. (2012)][cite_neilsen]
and [Schumauch et al. (2024)][cite_schumauch]. Therefore, only those
multimappers that contain at least as many InDels as mismatches are retained.
Note that some multimappers might still be present if the number of InDels is
the same across alignments.
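
One literal reading of this rule is sketched below. It is an assumption-laden
illustration rather than the workflow's actual script: InDels are counted as
inserted plus deleted bases in the CIGAR string, mismatches are derived as the
edit distance (SAM `NM` tag) minus the InDel count, and reads with a single
alignment are passed through untouched. The function names are hypothetical.

```python
import re

CIGAR_OPERATION = re.compile(r"(\d+)([MIDNSHP=X])")


def count_indels(cigar):
    """Number of inserted plus deleted bases in a CIGAR string."""
    return sum(
        int(length)
        for length, operation in CIGAR_OPERATION.findall(cigar)
        if operation in "ID"
    )


def filter_multimappers(alignments):
    """Keep multimappers with at least as many InDels as mismatches.

    `alignments` is a list of (cigar, edit_distance) pairs for ONE read,
    all assumed to share the same edit distance.
    """
    if len(alignments) < 2:
        return list(alignments)  # not a multimapper, nothing to filter
    kept = []
    for cigar, edit_distance in alignments:
        indels = count_indels(cigar)
        mismatches = edit_distance - indels
        if indels >= mismatches:
            kept.append((cigar, edit_distance))
    return kept


# Hypothetical read with one insertion-containing and one mismatch-only hit.
print(filter_multimappers([("21M1I", 1), ("22M", 1)]))
```

If both alignments carried one InDel each, both would pass, which is the
residual multimapping case mentioned above.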

### Quantify module

The filtered alignments are subsequently intersected with the user-provided,
pre-processed miRNA annotation files using BEDTools. Each alignment is
classified according to the miRNA species it fully intersects with, and this
classification is the basis for counting.

Counts are tabulated separately for reads consistent with miRNA
precursors, mature miRNAs and/or isomiRs, and all library counts are fused
into a single table. Note that a read is only counted towards a given
miRNA (or isomiR) species if one of its alignments falls fully within the
(previously extended) locus annotated for that miRNA. Specifically, reads
contribute `1/N` to each miRNA for which that is the case, where `N` is
the total number of genomic loci the read aligns to. Under this criterion, the
counts of a precursor include reads that intersect with its mature arm(s), its
hairpin sequence and/or the whole precursor itself.
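
The `1/N` weighting can be made concrete with a small sketch. It is
illustrative only: the input structure (one entry per genomic locus of a
read, holding the name of the miRNA locus the alignment falls fully within,
or `None`), the function name `tabulate_counts` and the miRNA names are
hypothetical.

```python
from collections import defaultdict
from fractions import Fraction


def tabulate_counts(read_hits):
    """Sum fractional read contributions per miRNA.

    `read_hits` maps a read ID to a list with one entry per genomic locus
    the read aligns to: a miRNA name, or None if the alignment does not
    fall fully within an annotated locus.
    """
    counts = defaultdict(Fraction)
    for hits in read_hits.values():
        if not hits:
            continue
        weight = Fraction(1, len(hits))  # N = all loci, annotated or not
        for name in {hit for hit in hits if hit is not None}:
            counts[name] += weight
    return dict(counts)


# Hypothetical reads: r1 maps to two paralogs, r3 to one miRNA and three
# unannotated loci.
example = {
    "r1": ["hsa-miR-1-1", "hsa-miR-1-2"],
    "r2": ["hsa-miR-1-1"],
    "r3": ["hsa-miR-21", None, None, None],
}
for name, count in sorted(tabulate_counts(example).items()):
    print(name, float(count))
```

Exact fractions are used here only to keep the arithmetic transparent;
floating-point numbers would serve equally well in practice.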

#### isomiRs notation

A sequence is considered an isomiR if it has a shift at either end, an
InDel or a mismatch in its sequence when compared to the canonical miRNA it
maps to and intersects with.

_MIRFLOWZ_ employs an unambiguous notation to classify isomiRs using the
format `miRNA_name|5p-shift|3p-shift|CIGAR|MD|READ_SEQ`, where `5p-shift` and
`3p-shift` represent the differences between the annotated mature miRNA
start and end positions and those of the read alignment, respectively.
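
A sketch of how such a label could be assembled is shown below. The field
order follows the format above; everything else is an assumption made for
illustration: the alignment is taken to be on the plus strand, shifts are
computed as alignment position minus annotated position and written as plain
integers, and the function name `isomir_label` and all example values are
hypothetical.

```python
def isomir_label(mirna_name, mirna_start, mirna_end,
                 aln_start, aln_end, cigar, md, read_seq):
    """Build a `miRNA_name|5p-shift|3p-shift|CIGAR|MD|READ_SEQ` label."""
    # Assumed sign convention: positive values mean the read starts or ends
    # downstream of the annotated position (plus strand only).
    shift_5p = aln_start - mirna_start
    shift_3p = aln_end - mirna_end
    fields = [mirna_name, str(shift_5p), str(shift_3p), cigar, md, read_seq]
    return "|".join(fields)


# Hypothetical read starting one nucleotide after the annotated 5' end.
print(isomir_label("hsa-miR-21-5p", 100, 121, 101, 121,
                   "21M", "21", "AGCTTATCAGACTGATGTTGA"))
```

Including the CIGAR and MD strings alongside the read sequence is what makes
the label unambiguous: two reads with identical shifts but different internal
edits receive different labels.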

### ASCII-style pileups module

Finally, to visualize the distribution of read alignments around miRNA
loci, ASCII-style alignment pileups are optionally generated for user-defined
regions of interest.

The schema below is a visual representation of the individual workflow steps
and how they are related:

> ![rule-graph][rule-graph]

> **NOTE:** For a detailed description of each rule, along with some
> examples, please refer to the
> [workflow documentation](pipeline_documentation.md).

## Contributing

_MIRFLOWZ_ is an open-source project which relies on community contributions.
You are welcome to participate by submitting bug reports or feature requests,
taking part in discussions, or proposing fixes and other code changes. Please
refer to the [contributing guidelines](CONTRIBUTING.md) if you are interested
in contributing.

## License

This project is covered by the [MIT License](LICENSE).

## Contact

For questions or suggestions regarding the code, please use the
[issue tracker][issue-tracker]. Do not hesitate to contact us via
[email][email] for any other inquiries.

© 2023 [Zavolab, Biozentrum, University of Basel][zavolab]

[apptainer]:
[ascii-pileups]:
[bed-format]:
[chrMap]:
[cite_neilsen]:
[cite_saunders]:
[cite_schumauch]:
[cluster execution]:
[conda]:
[email]:
[ensembl]:
[issue-tracker]:
[mamba]:
[miniconda-installation]:
[mirbase]:
[rule-graph]: images/rule_graph.svg
[slurm]:
[snakemake]:
[snakemakeDocu]:
[zavolab]: