https://github.com/yangao07/tidehunter

TideHunter: efficient and sensitive tandem repeat detection from noisy long reads using seed-and-chain
Host: GitHub
URL: https://github.com/yangao07/tidehunter
Owner: yangao07
License: mit
Created: 2017-12-15T16:28:41.000Z (over 8 years ago)
Default Branch: main
Last Pushed: 2024-06-17T13:47:25.000Z (about 2 years ago)
Topics: long-reads, multiple-sequence-alignment, partial-order-alignment, seed-and-chain, tandem-repeats
Language: C
Homepage:
Size: 48.9 MB
Stars: 33
Watchers: 4
Forks: 4
Open Issues: 6
Metadata Files:
- Readme: README.md
- License: LICENSE

# TideHunter: efficient and sensitive tandem repeat detection from noisy long reads using seed-and-chain

[![Latest Release](https://img.shields.io/github/release/yangao07/TideHunter.svg?label=Release)](https://github.com/yangao07/TideHunter/releases/latest)

[![Github All Releases](https://img.shields.io/github/downloads/yangao07/TideHunter/total.svg?label=Download)](https://github.com/yangao07/TideHunter/releases)

[![BioConda Install](https://img.shields.io/conda/dn/bioconda/tidehunter.svg?style=flag&label=BioConda%20install)](https://anaconda.org/bioconda/tidehunter)

[![Published in Bioinformatics](https://img.shields.io/badge/Published%20in-Bioinformatics-blue.svg)](https://doi.org/10.1093/bioinformatics/btz376)

[![GitHub Issues](https://img.shields.io/github/issues/yangao07/TideHunter.svg?label=Issues)](https://github.com/yangao07/TideHunter/issues)

[![Build Status](https://img.shields.io/travis/yangao07/TideHunter/master.svg?label=Master)](https://travis-ci.org/yangao07/TideHunter)

[![License](https://img.shields.io/badge/License-MIT-black.svg)](https://github.com/yangao07/TideHunter/blob/master/LICENSE)

## Updates (v1.5.5)

* Output an additional single-copy full-length sequence when 5' and 3' adapters are provided

* Copy number needs to be >= 2 for regular tandem repeats

## Getting started

Download the [latest release](https://github.com/yangao07/TideHunter/releases):

```

wget https://github.com/yangao07/TideHunter/releases/download/v1.5.5/TideHunter-v1.5.5.tar.gz

tar -zxvf TideHunter-v1.5.5.tar.gz && cd TideHunter-v1.5.5

```

Make from source and run with test data:

```

make && ./bin/TideHunter ./test_data/test_50x4.fa > cons.fa

```

Or, install via conda and run with test data:

```

conda install -c bioconda tidehunter

TideHunter ./test_data/test_50x4.fa > cons.fa

```

## Table of Contents

- [TideHunter: efficient and sensitive tandem repeat detection from noisy long reads using seed-and-chain](#tidehunter-efficient-and-sensitive-tandem-repeat-detection-from-noisy-long-reads-using-seed-and-chain)

  - [Updates (v1.5.5)](#updates-v155)

  - [Getting started](#getting-started)

  - [Table of Contents](#table-of-contents)

  - [Introduction](#introduction)

  - [Installation](#installation)

    - [Installing TideHunter via conda](#installing-tidehunter-via-conda)

    - [Building TideHunter from source files](#building-tidehunter-from-source-files)

    - [Pre-built binary executable file for Linux/Unix](#pre-built-binary-executable-file-for-linuxunix)

  - [Getting started with toy example in `test_data`](#getting-started-with-toy-example-in-test_data)

  - [Usage](#usage)

      - [To generate consensus sequences in FASTA format](#to-generate-consensus-sequences-in-fasta-format)

      - [To generate consensus sequences in tabular format](#to-generate-consensus-sequences-in-tabular-format)

      - [To generate consensus sequences in FASTQ format](#to-generate-consensus-sequences-in-fastq-format)

      - [To generate full-length consensus sequences](#to-generate-full-length-consensus-sequences)

      - [To generate unit sequences in FASTA format](#to-generate-unit-sequences-in-fasta-format)

      - [To generate unit sequences in tabular format](#to-generate-unit-sequences-in-tabular-format)

  - [Commands and options](#commands-and-options)

  - [Input](#input)

    - [Adapter sequence](#adapter-sequence)

  - [Output](#output)

    - [Tabular format](#tabular-format)

    - [FASTA format](#fasta-format)

    - [FASTQ format](#fastq-format)

    - [Unit sequences](#unit-sequences)

  - [Contact](#contact)

## Introduction

TideHunter is an efficient and sensitive tandem repeat detection and consensus calling tool designed for tandemly repeated long-read sequences ([INC-seq](https://doi.org/10.1186/s13742-016-0140-7), [R2C2](https://doi.org/10.1073/pnas.1806447115), [NanoAmpli-Seq](https://doi.org/10.1093/gigascience/giy140)).
It works with Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT) sequencing data at error rates up to 20% and places no limit on the maximal repeat pattern size.

## Installation

### Installing TideHunter via conda

On Linux/Unix and macOS, TideHunter can be installed via conda:

```

conda install -c bioconda tidehunter

```

### Building TideHunter from source files

You can also build TideHunter from source files.

Make sure you have gcc (>=6.4.0) and zlib installed before compiling.

It is recommended to download the latest release of TideHunter 

from the [release page](https://github.com/yangao07/TideHunter/releases).

```

wget https://github.com/yangao07/TideHunter/releases/download/v1.5.5/TideHunter-v1.5.5.tar.gz

tar -zxvf TideHunter-v1.5.5.tar.gz

cd TideHunter-v1.5.5; make

```

Alternatively, use `git clone` to download the source code.
Include the `--recursive` option so that the source code of [abPOA](https://github.com/yangao07/abPOA) is downloaded as well.
This gives you the latest version of TideHunter, which might still be under development.

```

git clone --recursive https://github.com/yangao07/TideHunter.git

cd TideHunter; make

```

### Pre-built binary executable file for Linux/Unix 

If you run into any compilation issues, try the pre-built binary file:

```

wget https://github.com/yangao07/TideHunter/releases/download/v1.5.5/TideHunter-v1.5.5_x64-linux.tar.gz

tar -zxvf TideHunter-v1.5.5_x64-linux.tar.gz

```

## Getting started with toy example in `test_data`

```

TideHunter ./test_data/test_1000x10.fa > cons.fa

```

## Usage

#### To generate consensus sequences in FASTA format

```

TideHunter ./test_data/test_1000x10.fa > cons.fa

```

#### To generate consensus sequences in tabular format

```

TideHunter -f 2 ./test_data/test_1000x10.fa > cons.out

```

#### To generate consensus sequences in FASTQ format

```

TideHunter -f 3 ./test_data/test_1000x10.fa > cons.fq

```

#### To generate full-length consensus sequences

```

TideHunter -5 ./test_data/5prime.fa -3 ./test_data/3prime.fa ./test_data/full_length.fa > cons_full.fa

```

#### To generate unit sequences in FASTA format

```

TideHunter -u ./test_data/test_1000x10.fa > unit.fa

```

#### To generate unit sequences in tabular format

```

TideHunter -u -f 2 ./test_data/test_1000x10.fa > unit.out

```
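The invocations above can also be scripted. The following is a minimal Python sketch, assuming `TideHunter` is on your `PATH`; the helper names `tidehunter_cmd` and `run_tidehunter` are hypothetical and only the `-f`, `-u` and `-t` options documented in this README are used.

```python
import shlex
import subprocess


def tidehunter_cmd(reads, out_fmt=1, unit_seq=False, threads=4, exe="TideHunter"):
    """Build the argument list for one TideHunter run (hypothetical helper)."""
    if out_fmt not in (1, 2, 3, 4):
        raise ValueError("out_fmt must be 1 (FASTA), 2 (tabular), 3 (FASTQ) or 4")
    cmd = [exe, "-f", str(out_fmt), "-t", str(threads)]
    if unit_seq:
        cmd.append("-u")  # output unit sequences instead of the consensus
    cmd.append(reads)
    return cmd


def run_tidehunter(reads, out_path, **kwargs):
    """Run TideHunter and redirect its stdout to out_path, like `> out_path`."""
    with open(out_path, "w") as out:
        subprocess.run(tidehunter_cmd(reads, **kwargs), stdout=out, check=True)


if __name__ == "__main__":
    # Print the command line without executing it.
    print(shlex.join(tidehunter_cmd("./test_data/test_1000x10.fa", out_fmt=2, unit_seq=True)))
```

Passing the arguments as a list avoids shell quoting problems with unusual file names, and `check=True` raises an exception if TideHunter exits with a non-zero status.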

## Commands and options

```

Usage:   TideHunter [options] in.fa/fq > cons.fa

Options:

  Seeding:

    -k --kmer-length INT    k-mer length (no larger than 16) [8]

    -w --window-size INT    window size, set as >1 to enable minimizer seeding [1]

    -H --HPC-kmer           use homopolymer-compressed k-mer [False]

  Tandem repeat criteria:

    -c --min-copy    INT    minimum copy number of tandem repeat (>=2) [2]

    -e --max-diverg  INT    maximum allowed divergence rate between two consecutive repeats [0.25]

    -p --min-period  INT    minimum period size of tandem repeat (>=2) [30]

    -P --max-period  INT    maximum period size of tandem repeat (<=4294967295) [10K]

  Scoring parameters for partial order alignment:

    -M --match    INT       match score [2]

    -X --mismatch INT       mismatch penalty [4]

    -O --gap-open INT(,INT) gap opening penalty (O1,O2) [4,24]

    -E --gap-ext  INT(,INT) gap extension penalty (E1,E2) [2,1]

                            TideHunter provides three gap penalty modes, cost of a g-long gap:

                            - convex (default): min{O1+g*E1, O2+g*E2}

                            - affine (set O2 as 0): O1+g*E1

                            - linear (set O1 as 0): g*E1

  Adapter sequence:

    -5 --five-prime  STR    5' adapter sequence (sense strand) [NULL]

    -3 --three-prime STR    3' adapter sequence (anti-sense strand) [NULL]

    -a --ada-mat-rat FLT    minimum match ratio of adapter sequence [0.80]

  Output:

    -o --output      STR    output file [stdout]

    -m --min-len     INT    only output consensus sequence with min. length of [30]

    -r --min-cov  FLOAT|INT only output consensus sequence with at least R supporting units for all bases: [0.00]

                            if r is fraction: R = r * total copy number

                            if r is integer: R = r

    -u --unit-seq           only output unit sequences of each tandem repeat, no consensus sequence [False]

    -l --longest            only output consensus sequence of tandem repeat that covers the longest read sequence [False]

    -F --full-len           only output full-length consensus sequence. [False]

                            full-length: consensus sequence contains both 5' and 3' adapter sequence

                            *Note* only effective when -5 and -3 are provided.

    -s --single-copy        output additional single-copy full-length consensus sequence. [False]

                            *Note* only effective when -F is set and -5 and -3 are provided.

    -f --out-fmt     INT    output format [1]

                            - 1: FASTA

                            - 2: Tabular

                            - 3: FASTQ

                            - 4: Tabular with quality score

                              for [3] and [4], quality score of each base represents the ratio of the consensus coverage to the # total copies.

  Computing resource:

    -t --thread      INT    number of threads to use [4]

  General options:

    -h --help               print this help usage information

    -v --version            show version number

```
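To make the three gap penalty modes listed under `-O`/`-E` concrete, here is a small Python sketch that evaluates the cost of a `g`-long gap using the formulas from the help text. The function name `gap_cost` is hypothetical, and the defaults mirror the documented `[4,24]` and `[2,1]`; this illustrates the formulas only and is not TideHunter's alignment code.

```python
def gap_cost(g, o1=4, e1=2, o2=24, e2=1):
    """Cost of a g-long gap under the convex, affine, or linear mode."""
    if g < 1:
        raise ValueError("gap length must be at least 1")
    if o1 == 0:
        return g * e1  # linear: O1 set to 0
    if o2 == 0:
        return o1 + g * e1  # affine: O2 set to 0
    return min(o1 + g * e1, o2 + g * e2)  # convex (default)


if __name__ == "__main__":
    for g in (1, 10, 20, 30):
        print(g, gap_cost(g), gap_cost(g, o2=0), gap_cost(g, o1=0))
```

With the default parameters the two convex branches cost the same at `g = 20`; longer gaps use the second pair `(O2, E2)`, so they are penalized less steeply than under the affine mode.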

## Input

TideHunter works with FASTA, FASTQ, gzip'd FASTA(.fa.gz) and gzip'd FASTQ(.fq.gz) formats.

### Adapter sequence

Additional adapter sequence files can be provided to TideHunter with the `-5` and `-3` options.
TideHunter uses the adapter information to search for the full-length sequence in the generated consensus.
Once both adapters are found, TideHunter trims and reorients the consensus sequence.

## Output

By default, TideHunter outputs consensus sequences in FASTA format.
It can also produce output in tabular and FASTQ formats.

### Tabular format

For tabular format, 11 columns will be generated for each consensus sequence:

| No. | Column name | Explanation |
|:---:| :---        | ---         |
|  1  | readName    | the original read name |
|  2  | repN        | `N` is the ID number of the tandem repeat, within each read, starts from 0 |
|  3  | copyNum     | copy number of the tandem repeat |
|  4  | readLen     | length of the original long read |
|  5  | start       | start coordinate of the tandem repeat, 1-based |
|  6  | end         | end coordinate of the tandem repeat, 1-based |
|  7  | consLen     | length of the consensus sequence |
|  8  | aveMatch    | average percent of matches between each unit sequence and the consensus sequence (# matched bases / unit length) |
|  9  | fullLen     | 0: not a full-length sequence, 1: sense strand full-length, 2: anti-sense strand full-length |
| 10  | subPos      | start coordinates of all the tandem repeat unit sequences, followed by the end coordinate of the last tandem repeat unit sequence, separated by `,`; all coordinates are 1-based, see examples below |
| 11  | consSeq     | consensus sequence |

For example, here is the output for a non-full-length consensus sequence generated from [test_data/test_50x4.fa](test_data/test_50x4.fa), along with a diagram that illustrates all the coordinates in the output:

```

test_50x4 rep0  4.0 300 51  250 50  100.0 0 59,109,159,208  CGATCGATCGGCATGCATGCATGCTAGTCGATGCATCGGGATCAGCTAGT

```



In this example, TideHunter identifies three consecutive tandem repeat units, [59,108], [109,158], [159,208], from the raw read which is 300 bp long.

A consensus sequence with 50 bp is generated from the three repeat units. TideHunter further extends the tandem repeat boundary to [51, 250] by aligning the consensus sequence back to the raw read on both sides of the three repeat units.
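The relationship between `subPos` and the unit intervals can be checked with a short script. This is a minimal Python sketch, assuming the fields are whitespace-separated and read names contain no whitespace; `TideHunterRecord`, `parse_record` and `unit_intervals` are hypothetical names, not part of TideHunter.

```python
from typing import List, NamedTuple, Tuple

# The first example line from this section.
EXAMPLE = ("test_50x4 rep0  4.0 300 51  250 50  100.0 0 59,109,159,208  "
           "CGATCGATCGGCATGCATGCATGCTAGTCGATGCATCGGGATCAGCTAGT")


class TideHunterRecord(NamedTuple):
    read_name: str
    rep_id: str
    copy_num: float
    read_len: int
    start: int
    end: int
    cons_len: int
    ave_match: float
    full_len: int
    sub_pos: List[int]
    cons_seq: str


def parse_record(line: str) -> TideHunterRecord:
    """Split one tabular output line into typed fields."""
    f = line.split()
    if len(f) != 11:
        raise ValueError(f"expected 11 columns, got {len(f)}")
    return TideHunterRecord(
        f[0], f[1], float(f[2]), int(f[3]), int(f[4]), int(f[5]),
        int(f[6]), float(f[7]), int(f[8]),
        [int(x) for x in f[9].split(",")], f[10],
    )


def unit_intervals(sub_pos: List[int]) -> List[Tuple[int, int]]:
    """Turn subPos (unit starts plus the last end) into 1-based closed intervals."""
    starts, last_end = sub_pos[:-1], sub_pos[-1]
    # Each unit ends one base before the next unit starts.
    ends = [s - 1 for s in starts[1:]] + [last_end]
    return list(zip(starts, ends))


if __name__ == "__main__":
    rec = parse_record(EXAMPLE)
    print(unit_intervals(rec.sub_pos))  # [(59, 108), (109, 158), (159, 208)]
```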

Another example of the output for a full-length consensus sequence generated from [test_data/full_length.fa](test_data/full_length.fa):

```

8f2f7766-4b8e-4c0d-9e2b-caf0e5527b19  rep0  8.8  5231  31  5215  203 95.7  1 207,798,1386,1976,2563,3155,3746,4333,4930  ACTAATAAGATCAACAGAATCAGAGTAGATAGTTCCTTGATCGGAACCAAAGGACCCCGTGCCTCAATCTCTATCCTGATGTCATGGGAGTCCTAGCAAAGCTATAGACTCAAGCAAGGCTTGGGGTCCTTTATGGAACCCAAGGATGACTCAGCAATAAAATATTTTGGTTTTGGTTTATAAAAAAAAAAAAAAAAAAAAAA

```

In this example, the `consLen` (i.e., 203) is the length of the full-length consensus sequence excluding the 5' and 3' adapter sequences and the `subPos` (i.e., 207,798,1386,1976,2563,3155,3746,4333,4930) contains the coordinate information of the identified tandem repeat units.

### FASTA format

For FASTA output format, the read name and the comment provide detailed information about the detected tandem repeat, i.e., columns 1 to 10 above.

The sequence is the consensus sequence.

The read name and comment of each consensus sequence have the following format:

```

>readName_repN_copyNum readLen_start_end_consLen_aveMatch_fullLen_subPos

```
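Because read names may themselves contain `_` (for example `test_50x4`), the header is easiest to split from the right. The following Python sketch assumes the layout shown above; the function name `parse_header` is hypothetical, and the example header is assembled by hand from the tabular example earlier in this README rather than copied from real output.

```python
def parse_header(header: str) -> dict:
    """Parse '>readName_repN_copyNum readLen_start_..._subPos' into a dict."""
    name_part, comment = header.lstrip(">").split(None, 1)
    # Split from the right so underscores inside the read name are preserved.
    read_name, rep_id, copy_num = name_part.rsplit("_", 2)
    read_len, start, end, cons_len, ave_match, full_len, sub_pos = comment.strip().split("_")
    return {
        "readName": read_name,
        "repN": rep_id,
        "copyNum": float(copy_num),
        "readLen": int(read_len),
        "start": int(start),
        "end": int(end),
        "consLen": int(cons_len),
        "aveMatch": float(ave_match),
        "fullLen": int(full_len),
        "subPos": [int(x) for x in sub_pos.split(",")],
    }


if __name__ == "__main__":
    # Hypothetical header built from the documented format.
    print(parse_header(">test_50x4_rep0_4.0 300_51_250_50_100.0_0_59,109,159,208"))
```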

### FASTQ format

For FASTQ output format, the read name and comment are the same as described in [FASTA format](#fasta-format).

TideHunter calculates a customized Phred score as the base quality score of each consensus base:

$$Q = -10 \cdot \log_{10}(p)$$

Here, $p$ is the Sigmoid-smoothed consensus calling error rate for each base:

$$p = 1 - S\left(13.8 \cdot \left(1.25 \cdot \frac{n}{m} - 0.25\right)\right)$$

$S$ is the Sigmoid function:

$$S(x) = \frac{1}{1 + e^{-x}}$$

$n$ is the coverage of the consensus base and $m$ is the number of total copies.

For example, if one base of the consensus sequence has 4 supporting copies and the total copy number is 5, $n$ is 4 and $m$ is 5.

The Phred quality score is then shifted by 33 and converted to characters based on the ASCII value.

The quality scores range from 0 to 60 and the corresponding ASCII values range from 33 to 93.
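The scoring described above can be sketched in Python as follows. This is an illustration under stated assumptions, not TideHunter's actual code: the function names are hypothetical, and the smoothing constants (13.8, 1.25, 0.25) and the rounding step are assumptions that should be checked against the TideHunter/abPOA source. With these constants, the score runs from 0 to 60, which agrees with the range stated above.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def phred_quality(n_cov: int, n_total: int) -> int:
    """Phred score for a base supported by n_cov of n_total copies."""
    if n_total <= 0 or not 0 <= n_cov <= n_total:
        raise ValueError("need 0 <= n_cov <= n_total and n_total > 0")
    # Assumed smoothing constants; see the note above.
    x = 13.8 * (1.25 * n_cov / n_total - 0.25)
    p = 1.0 - sigmoid(x)  # smoothed consensus calling error rate
    return int(-10.0 * math.log10(p) + 0.499)  # assumed rounding to nearest int


def quality_char(n_cov: int, n_total: int) -> str:
    """Shift the Phred score by 33 and convert it to its ASCII character."""
    return chr(33 + phred_quality(n_cov, n_total))


if __name__ == "__main__":
    for n in range(6):
        print(n, 5, phred_quality(n, 5), quality_char(n, 5))
```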

### Unit sequences

TideHunter can output the unit sequences without performing the consensus calling step when option `-u/--unit-seq` is enabled. Then, only the following information will be output for the tabular format:

| No. | Column name | Explanation |
|:---:| :---        | ---         |
|  1  | readName    | the original read name |
|  2  | repN        | `N` is the ID number of the tandem repeat, within each read, starts from 0 |
|  3  | subX        | `X` is the ID number of the unit sequence, starts from 0 |
|  4  | unitSeq     | unit sequence |

And for the FASTA format:

```

>readName_repN_subX

unitSeq X

>readName_repN_subY

unitSeq Y

```
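For downstream analysis it is often convenient to collect the unit sequences of each tandem repeat together. The following Python sketch groups records from the unit-sequence FASTA output by `(readName, repN)`, assuming headers follow the `>readName_repN_subX` layout shown above. The function name `group_units` and the sample records are hypothetical.

```python
from collections import defaultdict


def group_units(fasta_text: str) -> dict:
    """Map (readName, repN) to the list of unit sequences, in file order."""
    groups = defaultdict(list)
    key = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            read_rep, _sub = name.rsplit("_", 1)    # drop the subX suffix
            key = tuple(read_rep.rsplit("_", 1))    # (readName, repN)
            groups[key].append("")
        elif key is not None:
            groups[key][-1] += line                 # sequence may span lines
    return dict(groups)


if __name__ == "__main__":
    # Hypothetical records, for illustration only.
    sample = ">read_1_rep0_sub0\nACGT\n>read_1_rep0_sub1\nACGA\n>read_1_rep1_sub0\nTTTT\n"
    print(group_units(sample))
```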

## Contact

Yan Gao gaoy1@chop.edu

Yi Xing XINGYI@chop.edu

[github issues](https://github.com/yangao07/TideHunter/issues)