https://github.com/sequana/downsampling

down sample NGS data
https://github.com/sequana/downsampling

Last synced: 12 months ago
JSON representation

down sample NGS data

Host: GitHub
URL: https://github.com/sequana/downsampling
Owner: sequana
License: bsd-3-clause
Created: 2020-03-09T21:00:18.000Z (over 6 years ago)
Default Branch: main
Last Pushed: 2023-12-20T12:43:58.000Z (over 2 years ago)
Last Synced: 2025-06-30T20:03:56.565Z (12 months ago)
Language: Python
Size: 144 KB
Stars: 0
Watchers: 2
Forks: 1
Open Issues: 1
Metadata Files:
- Readme: README.rst
- License: LICENSE

Awesome Lists containing this project

README

          

.. image:: https://badge.fury.io/py/sequana-downsampling.svg

     :target: https://pypi.python.org/pypi/sequana_downsampling

.. image:: http://joss.theoj.org/papers/10.21105/joss.00352/status.svg

    :target: http://joss.theoj.org/papers/10.21105/joss.00352

    :alt: JOSS (journal of open source software) DOI

.. image:: https://github.com/sequana/downsampling/actions/workflows/main.yml/badge.svg

   :target: https://github.com/sequana/downsampling/actions/workflows/main.yaml 

This is is the **downsampling** pipeline from the `Sequana `_ project

:Overview: downsample NGS data sets

:Input: a set of FastQ or FASTA files 

:Output: a set of downsampled files

:Status: production

:Citation(sequana): Cokelaer et al, (2017), ‘Sequana’: a Set of Snakemake NGS pipelines, Journal of Open Source Software, 2(16), 352, JOSS DOI doi:10.21105/joss.00352

:Citation(pipeline): 

    .. image:: https://zenodo.org/badge/DOI/10.5281/zenodo.4047837.svg

       :target: https://doi.org/10.5281/zenodo.4047837

Installation

~~~~~~~~~~~~

You must install Sequana first::

    pip install sequana

Then, just install this package::

    pip install sequana_downsampling

Usage

~~~~~

::

    sequana_downsampling --help

    sequana_downsampling --input-directory DATAPATHH

    sequana_downsampling --downsampling-method random --downsampling-max-entries 100

    sequana_downsampling --downsampling-method random_pct --downsampling-percent 10 --downsampling-input-format fasta --input-pattern "whatever*fasta"

Note that the current implementation handles fastq files (zipped or not) and

fasta files (uncompressed only)

This creates a directory with the pipeline and configuration file. You will then need 

to execute the pipeline::

    cd downsampling

    sh downsampling.sh  # for a local run

This launch a snakemake pipeline. If you are familiar with snakemake, you can 

retrieve the pipeline itself and its configuration files and then execute the pipeline yourself with specific parameters::

    snakemake -s downsampling.rules -c config.yaml --cores 4 --stats stats.txt

Or use `sequanix `_ interface.

Examples of a set of FastQ zipped files in the current directory:

    sequana_downsampling --run --downsampling-method random_pct 

    cd downsampling

    make clean

This will create a directory called **downsampling**, and randomly select 10% of

the input reads for each file with extension .fastq.gz in the current directory.

Since **-run** is used, the pipeline is executed automatically. The following

commands will enter into the directory and called a Makefile. This will clean

the directory for temporary files.

Requirements

~~~~~~~~~~~~

This pipelines requires the following executable(s):

- sequana

- pigz

.. .. image:: https://raw.githubusercontent.com/sequana/downsampling/master/sequana_pipelines/downsampling/dag.png

Details

~~~~~~~~~

This pipeline runs **downsampling** in parallel on the input fastq or fasta files (paired or not). If paired, the one-to-one mapping is conserved.

It can take as input a set of FastQ files, or FastA files. by

default, the pipeline with randomly select 1000 entries from each input files.

You can increase this number using --downsampling-max-entries option. If you

prefer to select a percentage of the entries instead, you can change the

downsamping method as follows::

    --downsampling-method random_pct

and change the value if needed (default is 10%)::

    --downsampling-percent 20

Note that input FastQ can be gzipped. Output files are gzipped. FastA input

files must be compressed for now

Rules and configuration details

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Here is the `latest documented configuration file `_

to be used with the pipeline. Each rule used in the pipeline may have a section in the configuration file. 

Changelog

~~~~~~~~~

========= ====================================================================

Version   Description

========= ====================================================================

0.8.5     * cope with R1/R2 paired data properly. Improved make file

0.8.4     * add missing MANIFEST to include missing requirements.txt

0.8.3     * comply with new API from sequana_pipetools 0.2.4

0.8.2     * add a --run option to execute the pipeline directly

0.8.1     * fix input and N in the random selection

0.8.0     **First release.**

========= ====================================================================

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/sequana/downsampling

Awesome Lists containing this project

README