https://github.com/mdshw5/strandex

Sample an approximate number of reads from a fastq file without reading the entire file

Host: GitHub
URL: https://github.com/mdshw5/strandex
Owner: mdshw5
License: mit
Created: 2015-05-15T23:54:47.000Z (over 10 years ago)
Default Branch: master
Last Pushed: 2017-06-20T00:40:22.000Z (over 8 years ago)
Last Synced: 2024-03-25T14:46:21.625Z (over 1 year ago)
Topics: bioinformatics, fastq, regex
Language: Python
Size: 22.5 KB
Stars: 11
Watchers: 2
Forks: 1
Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE


# strandex

**strand**-anchored reg**ex** for uniform sampling from FASTQ files (think **spandex**)

[![Build Status](https://travis-ci.org/mdshw5/strandex.svg?branch=master)](https://travis-ci.org/mdshw5/strandex)

[![PyPI](https://img.shields.io/pypi/v/strandex.svg?branch=master)](https://pypi.python.org/pypi/strandex)

[![Landscape](https://landscape.io/github/mdshw5/strandex/master/landscape.svg)](https://landscape.io/github/mdshw5/strandex/master)

[![codecov](https://codecov.io/gh/mdshw5/strandex/branch/master/graph/badge.svg)](https://codecov.io/gh/mdshw5/strandex)

## Why use this?

- You want only a few reads from a large FASTQ file (**downsampling**)

- You are constrained by I/O so that reading through the entire file is very slow

- You want to avoid sampling only the beginning or end of the file

- You want to expand a small FASTQ file to a specific number of reads (**upsampling**)

## Caveats

- For paired-end sampling, reads in both files must be in the same order and have the same length

- When `n` is approximately equal to the total number of reads available, some reads may be sampled more than once (sampling with replacement)
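
The paired-end caveat can be checked before sampling. The following is a minimal sketch, not part of strandex: the helper names (`read_records`, `mate_name`, `pairs_look_consistent`) are hypothetical, it only inspects the first `limit` records, and it assumes that "same length" means each pair of mates has sequences of equal length.

```python
from itertools import islice, zip_longest


def read_records(path, limit):
    """Yield up to `limit` FASTQ records as 4-tuples of lines (newlines stripped)."""
    with open(path) as handle:
        lines = (line.rstrip("\n") for line in handle)
        for _ in range(limit):
            record = tuple(islice(lines, 4))
            if len(record) < 4:
                return
            yield record


def mate_name(header):
    """Read name without the description or a trailing /1 or /2 mate suffix."""
    fields = header.split()
    name = fields[0] if fields else header
    return name[:-2] if name.endswith(("/1", "/2")) else name


def pairs_look_consistent(fastq1, fastq2, limit=1000):
    """Return True if the first `limit` records of both files pair up cleanly."""
    records1 = read_records(fastq1, limit)
    records2 = read_records(fastq2, limit)
    for rec1, rec2 in zip_longest(records1, records2):
        if rec1 is None or rec2 is None:
            return False  # one file has fewer records than the other
        if mate_name(rec1[0]) != mate_name(rec2[0]):
            return False  # mates are not in the same order
        if len(rec1[1]) != len(rec2[1]):
            return False  # mate sequences differ in length (assumed meaning)
    return True
```

Checking only a prefix of each file keeps the test cheap on large inputs, at the cost of missing problems further into the files.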

# Install

`pip install strandex`

# Examples

```
from strandex import FastqSampler

# Paired-end input: sample 100,000 read pairs
sampler = FastqSampler('read1.fastq', fastq2='read2.fastq', nreads=100000, seed=42)
for read1, read2 in sampler:
    # read1 and read2 are 4-line strings sampled from the paired input
    pass

# Single-end input: sample 100,000 reads
sampler = FastqSampler('read1.fastq', nreads=100000, seed=42)
for read1, read2 in sampler:
    # read1 is a 4-line string sampled from the input
    # read2 is None
    pass
```

Note that you may sample more reads *than are available in your input file*. In
that case, strandex samples the file with replacement, so the output will
contain some duplicate reads.
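
To illustrate why requesting more reads than the file contains yields duplicates, here is a minimal sketch of sampling by seeking to random byte offsets. It is not strandex's actual implementation: the function name `sample_by_seeking`, the `RECORD` pattern, and the `window` and `max_tries` parameters are all assumptions made for illustration.

```python
import os
import random
import re

# Hypothetical pattern for one 4-line FASTQ record. Requiring a letters-only
# sequence line and a '+' separator after the '@' header keeps a quality line
# that happens to begin with '@' from being mistaken for a header.
RECORD = re.compile(rb"^@[^\n]*\n[A-Za-z]+\n\+[^\n]*\n[^\n]+\n", re.MULTILINE)


def sample_by_seeking(path, nreads, seed=None, window=4096, max_tries=None):
    """Collect `nreads` records by jumping to random offsets in the file."""
    rng = random.Random(seed)
    size = os.path.getsize(path)
    if size == 0:
        return []
    tries_left = max_tries if max_tries is not None else 100 * nreads
    reads = []
    with open(path, "rb") as handle:
        while len(reads) < nreads and tries_left > 0:
            tries_left -= 1
            offset = rng.randrange(size)
            handle.seek(offset)
            if offset > 0:
                handle.readline()  # discard the (probably partial) current line
            match = RECORD.search(handle.read(window))
            if match:  # offsets near the end of the file may find nothing
                reads.append(match.group().decode("ascii"))
    return reads
```

Each draw is independent of the previous ones, so nothing stops two offsets from landing on the same record. When `nreads` exceeds the number of records in the file, repeats are unavoidable.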

# CLI script

```
usage: strandex [-h] [-fq2 FASTQ2] [-o2 OUT2] [-n NREADS] [-s SEED] fastq1 out

sample uniformly without reading an entire fastq file

positional arguments:
  fastq1                input fastq file
  out                   output fastq file

optional arguments:
  -h, --help            show this help message and exit
  -fq2 FASTQ2, --fastq2 FASTQ2
                        input fastq file read pairs
  -o2 OUT2, --out2 OUT2
                        output fastq file read pairs
  -n NREADS, --nreads NREADS
                        number of reads to sample from input (default: 1)
  -s SEED, --seed SEED  seed for random number generator (default: None)
  -t TRIM, --trim TRIM  trim reads to length -t (default: None)
```
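
When calling the script from a Python pipeline, the argument list can be assembled from the flags in the help text above. This is a small sketch: `build_strandex_command` is a hypothetical helper, not something strandex provides, and the file names are placeholders.

```python
def build_strandex_command(fastq1, out, fastq2=None, out2=None,
                           nreads=None, seed=None, trim=None):
    """Assemble an argv list using only the flags listed in the help text."""
    command = ["strandex"]
    if fastq2 is not None:
        command += ["--fastq2", fastq2]
    if out2 is not None:
        command += ["--out2", out2]
    if nreads is not None:
        command += ["--nreads", str(nreads)]
    if seed is not None:
        command += ["--seed", str(seed)]
    if trim is not None:
        command += ["--trim", str(trim)]
    # Positional arguments go last: input file, then output file
    command += [fastq1, out]
    return command


command = build_strandex_command(
    "read1.fastq", "sample1.fastq",
    fastq2="read2.fastq", out2="sample2.fastq",
    nreads=100000, seed=42,
)
print(" ".join(command))

# To run it (requires strandex to be installed and on PATH):
# import subprocess
# subprocess.run(command, check=True)
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with file names that contain spaces.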
