# TideHunter: efficient and sensitive tandem repeat detection from noisy long reads using seed-and-chain
[![Latest Release](https://img.shields.io/github/release/yangao07/TideHunter.svg?label=Release)](https://github.com/yangao07/TideHunter/releases/latest)
[![Github All Releases](https://img.shields.io/github/downloads/yangao07/TideHunter/total.svg?label=Download)](https://github.com/yangao07/TideHunter/releases)
[![BioConda Install](https://img.shields.io/conda/dn/bioconda/tidehunter.svg?style=flag&label=BioConda%20install)](https://anaconda.org/bioconda/tidehunter)
[![Published in Bioinformatics](https://img.shields.io/badge/Published%20in-Bioinformatics-blue.svg)](https://doi.org/10.1093/bioinformatics/btz376)
[![GitHub Issues](https://img.shields.io/github/issues/yangao07/TideHunter.svg?label=Issues)](https://github.com/yangao07/TideHunter/issues)
[![Build Status](https://img.shields.io/travis/yangao07/TideHunter/master.svg?label=Master)](https://travis-ci.org/yangao07/TideHunter)
[![License](https://img.shields.io/badge/License-MIT-black.svg)](https://github.com/yangao07/TideHunter/blob/master/LICENSE)

## Updates (v1.5.5)
* Output additional single-copy full-length sequence when 5' and 3' adapters are provided
* Copy number needs to be >= 2 for regular tandem repeats

## Getting started
Download the [latest release](https://github.com/yangao07/TideHunter/releases):
```
wget https://github.com/yangao07/TideHunter/releases/download/v1.5.5/TideHunter-v1.5.5.tar.gz
tar -zxvf TideHunter-v1.5.5.tar.gz && cd TideHunter-v1.5.5
```
Make from source and run with test data:
```
make; ./bin/TideHunter ./test_data/test_50x4.fa > cons.fa
```
Or, install via conda and run with test data:
```
conda install -c bioconda tidehunter
TideHunter ./test_data/test_50x4.fa > cons.fa
```
## Table of Contents

- [TideHunter: efficient and sensitive tandem repeat detection from noisy long reads using seed-and-chain](#tidehunter-efficient-and-sensitive-tandem-repeat-detection-from-noisy-long-reads-using-seed-and-chain)
- [Updates (v1.5.5)](#updates-v155)
- [Getting started](#getting-started)
- [Table of Contents](#table-of-contents)
- [Introduction](#introduction)
- [Installation](#installation)
  - [Installing TideHunter via conda](#installing-tidehunter-via-conda)
  - [Building TideHunter from source files](#building-tidehunter-from-source-files)
  - [Pre-built binary executable file for Linux/Unix](#pre-built-binary-executable-file-for-linuxunix)
- [Getting started with toy example in `test_data`](#getting-started-with-toy-example-in-test_data)
- [Usage](#usage)
  - [To generate consensus sequences in FASTA format](#to-generate-consensus-sequences-in-fasta-format)
  - [To generate consensus sequences in tabular format](#to-generate-consensus-sequences-in-tabular-format)
  - [To generate consensus sequences in FASTQ format](#to-generate-consensus-sequences-in-fastq-format)
  - [To generate full-length consensus sequences](#to-generate-full-length-consensus-sequences)
  - [To generate unit sequences in FASTA format](#to-generate-unit-sequences-in-fasta-format)
  - [To generate unit sequences in tabular format](#to-generate-unit-sequences-in-tabular-format)
- [Commands and options](#commands-and-options)
- [Input](#input)
  - [Adapter sequence](#adapter-sequence)
- [Output](#output)
  - [Tabular format](#tabular-format)
  - [FASTA format](#fasta-format)
  - [FASTQ format](#fastq-format)
  - [Unit sequences](#unit-sequences)
- [Contact](#contact)

## Introduction
TideHunter is an efficient and sensitive tandem repeat detection and
consensus calling tool designed for tandemly repeated
long-read sequences ([INC-seq](https://doi.org/10.1186/s13742-016-0140-7),
[R2C2](https://doi.org/10.1073/pnas.1806447115), [NanoAmpli-Seq](https://doi.org/10.1093/gigascience/giy140)).

It works with Pacific Biosciences (PacBio) and
Oxford Nanopore Technologies (ONT) sequencing data at error rates
up to 20% and places no limit on the maximum repeat pattern size.

## Installation

### Installing TideHunter via conda
On Linux/Unix and macOS, TideHunter can be installed via conda:
```
conda install -c bioconda tidehunter
```

### Building TideHunter from source files
You can also build TideHunter from source files.
Make sure you have gcc (>=6.4.0) and zlib installed before compiling.
It is recommended to download the latest release of TideHunter
from the [release page](https://github.com/yangao07/TideHunter/releases).
```
wget https://github.com/yangao07/TideHunter/releases/download/v1.5.5/TideHunter-v1.5.5.tar.gz
tar -zxvf TideHunter-v1.5.5.tar.gz
cd TideHunter-v1.5.5; make
```
Alternatively, use `git clone` to download the source code.
Include the `--recursive` option so that the code of [abPOA](https://github.com/yangao07/abPOA) is downloaded as well.
This gives you the latest version of TideHunter, which may still be under development.
```
git clone --recursive https://github.com/yangao07/TideHunter.git
cd TideHunter; make
```

### Pre-built binary executable file for Linux/Unix
If you run into any compilation issues, try the pre-built binary file:
```
wget https://github.com/yangao07/TideHunter/releases/download/v1.5.5/TideHunter-v1.5.5_x64-linux.tar.gz
tar -zxvf TideHunter-v1.5.5_x64-linux.tar.gz
```

## Getting started with toy example in `test_data`
```
TideHunter ./test_data/test_1000x10.fa > cons.fa
```

## Usage
#### To generate consensus sequences in FASTA format
```
TideHunter ./test_data/test_1000x10.fa > cons.fa
```
#### To generate consensus sequences in tabular format
```
TideHunter -f 2 ./test_data/test_1000x10.fa > cons.out
```
#### To generate consensus sequences in FASTQ format
```
TideHunter -f 3 ./test_data/test_1000x10.fa > cons.fq
```
#### To generate full-length consensus sequences
```
TideHunter -5 ./test_data/5prime.fa -3 ./test_data/3prime.fa ./test_data/full_length.fa > cons_full.fa
```
#### To generate unit sequences in FASTA format
```
TideHunter -u ./test_data/test_1000x10.fa > unit.fa
```
#### To generate unit sequences in tabular format
```
TideHunter -u -f 2 ./test_data/test_1000x10.fa > unit.out
```
## Commands and options
```
Usage: TideHunter [options] in.fa/fq > cons.fa

Options:
Seeding:
-k --kmer-length INT k-mer length (no larger than 16) [8]
-w --window-size INT window size, set as >1 to enable minimizer seeding [1]
-H --HPC-kmer use homopolymer-compressed k-mer [False]
Tandem repeat criteria:
-c --min-copy INT minimum copy number of tandem repeat (>=2) [2]
-e --max-diverg FLT maximum allowed divergence rate between two consecutive repeats [0.25]
-p --min-period INT minimum period size of tandem repeat (>=2) [30]
-P --max-period INT maximum period size of tandem repeat (<=4294967295) [10K]
Scoring parameters for partial order alignment:
-M --match INT match score [2]
-X --mismatch INT mismatch penalty [4]
-O --gap-open INT(,INT) gap opening penalty (O1,O2) [4,24]
-E --gap-ext INT(,INT) gap extension penalty (E1,E2) [2,1]
TideHunter provides three gap penalty modes, cost of a g-long gap:
- convex (default): min{O1+g*E1, O2+g*E2}
- affine (set O2 as 0): O1+g*E1
- linear (set O1 as 0): g*E1
Adapter sequence:
-5 --five-prime STR 5' adapter sequence (sense strand) [NULL]
-3 --three-prime STR 3' adapter sequence (anti-sense strand) [NULL]
-a --ada-mat-rat FLT minimum match ratio of adapter sequence [0.80]
Output:
-o --output STR output file [stdout]
-m --min-len INT only output consensus sequence with min. length of [30]
-r --min-cov FLOAT|INT only output consensus sequence with at least R supporting units for all bases: [0.00]
if r is fraction: R = r * total copy number
if r is integer: R = r
-u --unit-seq only output unit sequences of each tandem repeat, no consensus sequence [False]
-l --longest only output consensus sequence of tandem repeat that covers the longest read sequence [False]
-F --full-len only output full-length consensus sequence. [False]
full-length: consensus sequence contains both 5' and 3' adapter sequence
*Note* only effective when -5 and -3 are provided.
-s --single-copy output additional single-copy full-length consensus sequence. [False]
*Note* only effective when -F is set and -5 and -3 are provided.
-f --out-fmt INT output format [1]
- 1: FASTA
- 2: Tabular
- 3: FASTQ
- 4: Tabular with quality score
for [3] and [4], quality score of each base represents the ratio of the consensus coverage to the # total copies.
Computing resource:
-t --thread INT number of threads to use [4]

General options:
-h --help print this help usage information
-v --version show version number
```
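
The three gap penalty modes listed in the help text can be illustrated with a short Python sketch. The function name `gap_cost` and its signature are hypothetical and exist only for this illustration; the formulas and the default values (`-O 4,24`, `-E 2,1`) are taken from the option descriptions above, and TideHunter's internal implementation may differ.

```python
def gap_cost(g, o1=4, e1=2, o2=24, e2=1):
    """Cost of a g-long gap, following the formulas in the help text.

    Hypothetical helper for illustration only; the defaults mirror
    the documented defaults of -O (4,24) and -E (2,1).
    """
    if o1 == 0:
        # Linear mode (O1 set to 0): g*E1
        return g * e1
    if o2 == 0:
        # Affine mode (O2 set to 0): O1+g*E1
        return o1 + g * e1
    # Convex mode (default): min{O1+g*E1, O2+g*E2}
    return min(o1 + g * e1, o2 + g * e2)
```

With the default values, the two terms of the convex cost are equal at `g = 20` (4 + 2·20 = 24 + 20), so shorter gaps are charged with `O1`/`E1` and longer gaps with `O2`/`E2`.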

## Input
TideHunter works with FASTA, FASTQ, gzip'd FASTA (.fa.gz) and gzip'd FASTQ (.fq.gz) formats.

### Adapter sequence
Additional adapter sequence files can be provided to TideHunter with `-5` and `-3` options.

TideHunter uses adapter information to search for the full-length sequence from the generated consensus.

Once both adapters are found, TideHunter trims and reorients the consensus sequence.

## Output
By default, TideHunter outputs consensus sequences in FASTA format.
It can also produce tabular and FASTQ output (see the `-f` option).

### Tabular format
For tabular format, 11 columns will be generated for each consensus sequence:

| No. | Column name | Explanation |
|:---:| :--- | --- |
| 1 | readName | the original read name |
| 2 | repN | `N` is the ID number of the tandem repeat, within each read, starts from 0 |
| 3 | copyNum | copy number of the tandem repeat |
| 4 | readLen | length of the original long read |
| 5 | start | start coordinate of the tandem repeat, 1-based |
| 6 | end | end coordinate of the tandem repeat, 1-based |
| 7 | consLen | length of the consensus sequence |
| 8 | aveMatch | average percent of matches between each unit sequence and the consensus sequence (# matched bases / unit length)|
| 9 | fullLen | 0: not a full-length sequence, 1: sense strand full-length, 2: anti-sense strand full-length |
| 10 | subPos | start coordinates of all the tandem repeat unit sequences, followed by the end coordinate of the last unit sequence, separated by `,`; all coordinates are 1-based (see examples below) |
| 11 | consSeq | consensus sequence |

For example, here is the output for a non-full-length consensus sequence generated from [test_data/test_50x4.fa](test_data/test_50x4.fa):
```
test_50x4 rep0 4.0 300 51 250 50 100.0 0 59,109,159,208 CGATCGATCGGCATGCATGCATGCTAGTCGATGCATCGGGATCAGCTAGT
```

In this example, TideHunter identifies three consecutive tandem repeat units, [59,108], [109,158], [159,208], from the raw read, which is 300 bp long.
A 50 bp consensus sequence is generated from the three repeat units. TideHunter then extends the tandem repeat boundary to [51, 250] by aligning the consensus sequence back to the raw read on both sides of the three repeat units.
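
The relationship between `subPos` and the unit intervals in this example can be sketched in Python. The helper names `parse_tabular_line` and `unit_intervals` are hypothetical, and the sketch assumes that columns are separated by whitespace (for example, tabs) and that read names contain no whitespace.

```python
def parse_tabular_line(line):
    """Split one tabular-format consensus line into its 11 named columns."""
    fields = line.split()  # assumption: whitespace/tab-delimited columns
    if len(fields) != 11:
        raise ValueError(f"expected 11 columns, got {len(fields)}")
    (read_name, rep_id, copy_num, read_len, start, end,
     cons_len, ave_match, full_len, sub_pos, cons_seq) = fields
    return {
        "readName": read_name,
        "repN": rep_id,
        "copyNum": float(copy_num),
        "readLen": int(read_len),
        "start": int(start),
        "end": int(end),
        "consLen": int(cons_len),
        "aveMatch": float(ave_match),
        "fullLen": int(full_len),
        "subPos": [int(p) for p in sub_pos.split(",")],
        "consSeq": cons_seq,
    }


def unit_intervals(sub_pos):
    """Turn subPos into 1-based, inclusive (start, end) pairs, one per unit."""
    starts, last_end = sub_pos[:-1], sub_pos[-1]
    # Each unit ends right before the next one starts; the last ends at last_end.
    ends = [s - 1 for s in starts[1:]] + [last_end]
    return list(zip(starts, ends))


line = ("test_50x4 rep0 4.0 300 51 250 50 100.0 0 59,109,159,208 "
        "CGATCGATCGGCATGCATGCATGCTAGTCGATGCATCGGGATCAGCTAGT")
record = parse_tabular_line(line)
print(unit_intervals(record["subPos"]))  # [(59, 108), (109, 158), (159, 208)]
```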

Another example of the output for a full-length consensus sequence generated from [test_data/full_length.fa](test_data/full_length.fa):
```
8f2f7766-4b8e-4c0d-9e2b-caf0e5527b19 rep0 8.8 5231 31 5215 203 95.7 1 207,798,1386,1976,2563,3155,3746,4333,4930 ACTAATAAGATCAACAGAATCAGAGTAGATAGTTCCTTGATCGGAACCAAAGGACCCCGTGCCTCAATCTCTATCCTGATGTCATGGGAGTCCTAGCAAAGCTATAGACTCAAGCAAGGCTTGGGGTCCTTTATGGAACCCAAGGATGACTCAGCAATAAAATATTTTGGTTTTGGTTTATAAAAAAAAAAAAAAAAAAAAAA
```
In this example, the `consLen` (i.e., 203) is the length of the full-length consensus sequence excluding the 5' and 3' adapter sequences and the `subPos` (i.e., 207,798,1386,1976,2563,3155,3746,4333,4930) contains the coordinate information of the identified tandem repeat units.

### FASTA format
For FASTA output format, the read name and the comment provide detailed information of the detected tandem repeat,
i.e., columns 1 to 10 above.
The sequence is the consensus sequence.

The read name and comment of each consensus sequence have the following format:
```
>readName_repN_copyNum readLen_start_end_consLen_aveMatch_fullLen_subPos
```
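
As a minimal sketch, a header in this format can be split back into the tabular columns with a few lines of Python. The function name `parse_fasta_header` is hypothetical, and the example header is constructed here from the `test_50x4` tabular record shown above rather than copied from actual TideHunter output. Because read names may themselves contain `_`, the sketch splits the name part from the right.

```python
def parse_fasta_header(header):
    """Parse '>readName_repN_copyNum readLen_start_end_consLen_aveMatch_fullLen_subPos'."""
    name_part, comment = header.lstrip(">").split(None, 1)
    # Split from the right so underscores inside the read name are preserved.
    read_name, rep_id, copy_num = name_part.rsplit("_", 2)
    (read_len, start, end, cons_len,
     ave_match, full_len, sub_pos) = comment.strip().split("_")
    return {
        "readName": read_name,
        "repN": rep_id,
        "copyNum": float(copy_num),
        "readLen": int(read_len),
        "start": int(start),
        "end": int(end),
        "consLen": int(cons_len),
        "aveMatch": float(ave_match),
        "fullLen": int(full_len),
        "subPos": [int(p) for p in sub_pos.split(",")],
    }


# Header constructed for illustration from the tabular example above.
info = parse_fasta_header(">test_50x4_rep0_4.0 300_51_250_50_100.0_0_59,109,159,208")
print(info["readName"], info["repN"], info["subPos"])
```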

### FASTQ format
For FASTQ output format, the read name and comment are the same as described in [FASTA format](#fasta-format).
TideHunter calculates a customized Phred score as the base quality score of each consensus base:

$$Q = -10 \log_{10} p$$

Here, $p$ is the Sigmoid-smoothed consensus calling error rate for each base, which is computed from $c$ and $n$ using the Sigmoid function $S$:

$$S(x) = \frac{1}{1 + e^{-x}}$$

$c$ is the coverage of the consensus base and
$n$ is the number of total copies.
For example, if one base of the consensus sequence has 4 supporting copies and the total copy number is 5,
$c$ is 4 and $n$ is 5.

The Phred quality score is then shifted by 33 and converted to a character based on the ASCII value.
The quality scores range from 0 to 60 and the corresponding ASCII values range from 33 to 93.
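
The character encoding described above can be reversed with a short Python sketch. The function names are hypothetical. `error_probability` assumes the standard Phred relation between a score and an error probability and does not reproduce TideHunter's Sigmoid smoothing.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into integer Phred scores (shift of 33)."""
    return [ord(ch) - offset for ch in quality_string]


def error_probability(q):
    """Error probability for Phred score q, assuming the standard relation p = 10^(-q/10)."""
    return 10 ** (-q / 10)


# '!' is ASCII 33 (score 0) and ']' is ASCII 93 (score 60), the documented range.
print(phred_scores("!I]"))  # [0, 40, 60]
```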

### Unit sequences
TideHunter can output the unit sequences without performing the consensus calling step when option `-u/--unit-seq` is enabled. Then, only the following information will be output for the tabular format:

| No. | Column name | Explanation |
|:---:| :--- | --- |
| 1 | readName | the original read name |
| 2 | repN | `N` is the ID number of the tandem repeat, within each read, starts from 0 |
| 3 | subX | `X` is the ID number of the unit sequence, starts from 0 |
| 4 | unitSeq | unit sequence |

And for the FASTA format:
```
>readName_repN_subX
unitSeq X
>readName_repN_subY
unitSeq Y
```

## Contact
Yan Gao gaoy1@chop.edu

Yi Xing XINGYI@chop.edu

[github issues](https://github.com/yangao07/TideHunter/issues)