{"id":34512382,"url":"https://github.com/yangao07/tidehunter","last_synced_at":"2025-12-24T04:10:33.842Z","repository":{"id":39800953,"uuid":"114390461","full_name":"yangao07/TideHunter","owner":"yangao07","description":"TideHunter: efficient and sensitive tandem repeat detection from noisy long reads using seed-and-chain","archived":false,"fork":false,"pushed_at":"2024-06-17T13:47:25.000Z","size":51238,"stargazers_count":33,"open_issues_count":6,"forks_count":4,"subscribers_count":4,"default_branch":"main","last_synced_at":"2025-12-09T07:59:44.287Z","etag":null,"topics":["long-reads","multiple-sequence-alignment","partial-order-alignment","seed-and-chain","tandem-repeats"],"latest_commit_sha":null,"homepage":"","language":"C","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/yangao07.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2017-12-15T16:28:41.000Z","updated_at":"2025-09-18T23:31:28.000Z","dependencies_parsed_at":"2023-01-19T18:03:15.935Z","dependency_job_id":"3a1bfc38-71e9-47a9-808e-46f2a0bb65cf","html_url":"https://github.com/yangao07/TideHunter","commit_stats":{"total_commits":115,"total_committers":5,"mean_commits":23.0,"dds":"0.27826086956521734","last_synced_commit":"ea0e886ab844dd60babba953d2e941f5cda0c6d5"},"previous_names":[],"tags_count":17,"template":false,"template_full_name":null,"purl":"pkg:github/yangao07/TideHunter","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/yangao07%2FTideHunter","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/yangao07%2FTideHunter/tags
","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/yangao07%2FTideHunter/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/yangao07%2FTideHunter/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/yangao07","download_url":"https://codeload.github.com/yangao07/TideHunter/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/yangao07%2FTideHunter/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":27994412,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-12-24T02:00:07.193Z","response_time":83,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["long-reads","multiple-sequence-alignment","partial-order-alignment","seed-and-chain","tandem-repeats"],"created_at":"2025-12-24T04:10:31.158Z","updated_at":"2025-12-24T04:10:33.826Z","avatar_url":"https://github.com/yangao07.png","language":"C","funding_links":[],"categories":[],"sub_categories":[],"readme":"# TideHunter: efficient and sensitive tandem repeat detection from noisy long reads using seed-and-chain\n[![Latest Release](https://img.shields.io/github/release/yangao07/TideHunter.svg?label=Release)](https://github.com/yangao07/TideHunter/releases/latest)\n[![Github All 
Releases](https://img.shields.io/github/downloads/yangao07/TideHunter/total.svg?label=Download)](https://github.com/yangao07/TideHunter/releases)\n[![BioConda Install](https://img.shields.io/conda/dn/bioconda/tidehunter.svg?style=flag\u0026label=BioConda%20install)](https://anaconda.org/bioconda/tidehunter)\n[![Published in Bioinformatics](https://img.shields.io/badge/Published%20in-Bioinformatics-blue.svg)](https://doi.org/10.1093/bioinformatics/btz376)\n[![GitHub Issues](https://img.shields.io/github/issues/yangao07/TideHunter.svg?label=Issues)](https://github.com/yangao07/TideHunter/issues)\n[![Build Status](https://img.shields.io/travis/yangao07/TideHunter/master.svg?label=Master)](https://travis-ci.org/yangao07/TideHunter)\n[![License](https://img.shields.io/badge/License-MIT-black.svg)](https://github.com/yangao07/TideHunter/blob/master/LICENSE)\n\u003c!--\n[![GitHub Downloads](https://img.shields.io/github/downloads/yangao07/TideHunter/total.svg?style=social\u0026logo=github\u0026label=Download)](https://github.com/yangao07/TideHunter/releases)\n--\u003e\n\n## Updates (v1.5.5)\n* Output additional single-copy full-length sequence when 5/3 adapters are provided\n* Copy number needs to be \u003e= 2 for regular tandem repeats\n\n\n## Getting started\nDownload the [latest release](https://github.com/yangao07/TideHunter/releases):\n```\nwget https://github.com/yangao07/TideHunter/releases/download/v1.5.5/TideHunter-v1.5.5.tar.gz\ntar -zxvf TideHunter-v1.5.5.tar.gz \u0026\u0026 cd TideHunter-v1.5.5\n```\nMake from source and run with test data:\n```\nmake; ./bin/TideHunter ./test_data/test_50x4.fa \u003e cons.fa\n```\nOr, install via conda and run with test data:\n```\nconda install -c bioconda tidehunter\nTideHunter ./test_data/test_50x4.fa \u003e cons.fa\n```\n## Table of Contents\n\n- [TideHunter: efficient and sensitive tandem repeat detection from noisy long reads using 
seed-and-chain](#tidehunter-efficient-and-sensitive-tandem-repeat-detection-from-noisy-long-reads-using-seed-and-chain)\n  - [Updates (v1.5.5)](#updates-v155)\n  - [Getting started](#getting-started)\n  - [Table of Contents](#table-of-contents)\n  - [Introduction](#introduction)\n  - [Installation](#installation)\n    - [Installing TideHunter via conda](#installing-tidehunter-via-conda)\n    - [Building TideHunter from source files](#building-tidehunter-from-source-files)\n    - [Pre-built binary executable file for Linux/Unix](#pre-built-binary-executable-file-for-linuxunix)\n  - [Getting started with toy example in `test_data`](#getting-started-with-toy-example-in-test_data)\n  - [Usage](#usage)\n      - [To generate consensus sequences in FASTA format](#to-generate-consensus-sequences-in-fasta-format)\n      - [To generate consensus sequences in tabular format](#to-generate-consensus-sequences-in-tabular-format)\n      - [To generate consensus sequences in FASTQ format](#to-generate-consensus-sequences-in-fastq-format)\n      - [To generate full-length consensus sequences](#to-generate-full-length-consensus-sequences)\n      - [To generate unit sequences in FASTA format](#to-generate-unit-sequences-in-fasta-format)\n      - [To generate unit sequences in tabular format](#to-generate-unit-sequences-in-tabular-format)\n  - [Commands and options](#commands-and-options)\n  - [Input](#input)\n    - [Adapter sequence](#adapter-sequence)\n  - [Output](#output)\n    - [Tabular format](#tabular-format)\n    - [FASTA format](#fasta-format)\n    - [FASTQ format](#fastq-format)\n    - [Unit sequences](#unit-sequences)\n  - [Contact](#contact)\n\n## \u003ca name=\"introduction\"\u003e\u003c/a\u003eIntroduction\nTideHunter is an efficient and sensitive tandem repeat detection and\nconsensus calling tool which is designed for tandemly repeated\nlong-read sequence ([INC-seq](https://doi.org/10.1186/s13742-016-0140-7),\n [R2C2](https://doi.org/10.1073/pnas.1806447115), 
[NanoAmpli-Seq](https://doi.org/10.1093/gigascience/giy140)). \n\nIt works with Pacific Biosciences (PacBio) and \nOxford Nanopore Technologies (ONT) sequencing data at error rates \nup to 20% and does not have any limitation of the maximal repeat pattern size.\n\n## \u003ca name=\"install\"\u003e\u003c/a\u003eInstallation\n\n### \u003ca name=\"conda\"\u003e\u003c/a\u003eInstalling TideHunter via conda\nOn Linux/Unix and Mac OS, TideHunter can be installed via\n```\nconda install -c bioconda tidehunter\n```\n\n### \u003ca name=\"build\"\u003e\u003c/a\u003eBuilding TideHunter from source files\nYou can also build TideHunter from source files.\nMake sure you have gcc (\u003e=6.4.0) and zlib installed before compiling.\nIt is recommended to download the latest release of TideHunter \nfrom the [release page](https://github.com/yangao07/TideHunter/releases).\n```\nwget https://github.com/yangao07/TideHunter/releases/download/v1.5.5/TideHunter-v1.5.5.tar.gz\ntar -zxvf TideHunter-v1.5.5.tar.gz\ncd TideHunter-v1.5.5; make\n```\nOr, you can use `git clone` command to download the source code. 
\nDon't forget to include the `--recursive` to download the codes of [abPOA](https://github.com/yangao07/abPOA).\nThis gives you the latest version of TideHunter, which might be still under development.\n```\ngit clone --recursive https://github.com/yangao07/TideHunter.git\ncd TideHunter; make\n```\n\n### \u003ca name=\"binary\"\u003e\u003c/a\u003ePre-built binary executable file for Linux/Unix \nIf you meet any compiling issue, please try the pre-built binary file:\n```\nwget https://github.com/yangao07/TideHunter/releases/download/v1.5.5/TideHunter-v1.5.5_x64-linux.tar.gz\ntar -zxvf TideHunter-v1.5.5_x64-linux.tar.gz\n```\n\n## \u003ca name=\"start\"\u003e\u003c/a\u003eGetting started with toy example in `test_data`\n```\nTideHunter ./test_data/test_1000x10.fa \u003e cons.fa\n```\n\n## \u003ca name=\"usage\"\u003e\u003c/a\u003eUsage\n#### \u003ca name=\"fasta_cons\"\u003e\u003c/a\u003eTo generate consensus sequences in FASTA format\n```\nTideHunter ./test_data/test_1000x10.fa \u003e cons.fa\n```\n#### \u003ca name=\"tab_cons\"\u003e\u003c/a\u003eTo generate consensus sequences in tabular format\n```\nTideHunter -f 2 ./test_data/test_1000x10.fa \u003e cons.out\n```\n#### \u003ca name=\"fq_cons\"\u003e\u003c/a\u003eTo generate consensus sequences in FASTQ format\n```\nTideHunter -f 3 ./test_data/test_1000x10.fa \u003e cons.fq\n```\n#### \u003ca name=\"full_cons\"\u003e\u003c/a\u003eTo generate full-length consensus sequences\n```\nTideHunter -5 ./test_data/5prime.fa -3 ./test_data/3prime.fa ./test_data/full_length.fa \u003e cons_full.fa\n```\n#### \u003ca name=\"fasta_unit\"\u003e\u003c/a\u003eTo generate unit sequences in FASTA format\n```\nTideHunter -u ./test_data/test_1000x10.fa \u003e unit.fa\n```\n#### \u003ca name=\"tab_unit\"\u003e\u003c/a\u003eTo generate unit sequences in tabular format\n```\nTideHunter -u -f 2 ./test_data/test_1000x10.fa \u003e unit.out\n```\n## \u003ca name=\"cmd\"\u003e\u003c/a\u003eCommands and options\n```\nUsage:   TideHunter 
[options] in.fa/fq \u003e cons.fa\n\nOptions:\n  Seeding:\n    -k --kmer-length INT    k-mer length (no larger than 16) [8]\n    -w --window-size INT    window size, set as \u003e1 to enable minimizer seeding [1]\n    -H --HPC-kmer           use homopolymer-compressed k-mer [False]\n  Tandem repeat criteria:\n    -c --min-copy    INT    minimum copy number of tandem repeat (\u003e=2) [2]\n    -e --max-diverg  INT    maximum allowed divergence rate between two consecutive repeats [0.25]\n    -p --min-period  INT    minimum period size of tandem repeat (\u003e=2) [30]\n    -P --max-period  INT    maximum period size of tandem repeat (\u003c=4294967295) [10K]\n  Scoring parameters for partial order alignment:\n    -M --match    INT       match score [2]\n    -X --mismatch INT       mismatch penalty [4]\n    -O --gap-open INT(,INT) gap opening penalty (O1,O2) [4,24]\n    -E --gap-ext  INT(,INT) gap extension penalty (E1,E2) [2,1]\n                            TideHunter provides three gap penalty modes, cost of a g-long gap:\n                            - convex (default): min{O1+g*E1, O2+g*E2}\n                            - affine (set O2 as 0): O1+g*E1\n                            - linear (set O1 as 0): g*E1\n  Adapter sequence:\n    -5 --five-prime  STR    5' adapter sequence (sense strand) [NULL]\n    -3 --three-prime STR    3' adapter sequence (anti-sense strand) [NULL]\n    -a --ada-mat-rat FLT    minimum match ratio of adapter sequence [0.80]\n  Output:\n    -o --output      STR    output file [stdout]\n    -m --min-len     INT    only output consensus sequence with min. 
length of [30]\n    -r --min-cov  FLOAT|INT only output consensus sequence with at least R supporting units for all bases: [0.00]\n                            if r is fraction: R = r * total copy number\n                            if r is integer: R = r\n    -u --unit-seq           only output unit sequences of each tandem repeat, no consensus sequence [False]\n    -l --longest            only output consensus sequence of tandem repeat that covers the longest read sequence [False]\n    -F --full-len           only output full-length consensus sequence. [False]\n                            full-length: consensus sequence contains both 5' and 3' adapter sequence\n                            *Note* only effective when -5 and -3 are provided.\n    -s --single-copy        output additional single-copy full-length consensus sequence. [False]\n                            *Note* only effective when -F is set and -5 and -3 are provided.\n    -f --out-fmt     INT    output format [1]\n                            - 1: FASTA\n                            - 2: Tabular\n                            - 3: FASTQ\n                            - 4: Tabular with quality score\n                              for [3] and [4], qualiy score of each base represents the ratio of the consensus coverage to the # total copies.\n  Computing resource:\n    -t --thread      INT    number of threads to use [4]\n\n  General options:\n    -h --help               print this help usage information\n    -v --version            show version number\n```\n\n## \u003ca name=\"input_output\"\u003e\u003c/a\u003eInput\nTideHunter works with FASTA, FASTQ, gzip'd FASTA(.fa.gz) and gzip'd FASTQ(.fq.gz) formats.\n\n### \u003ca name=\"adapter\"\u003e\u003c/a\u003eAdapter sequence\nAdditional adapter sequence files can be provided to TideHunter with `-5` and `-3` options.\n\nTideHunter uses adapter information to search for the full-length sequence from the generated consensus.\n\nOnce two adapters are found, 
TideHunter trims and reorients the consensus sequence.\n\n## \u003ca name=\"output\"\u003e\u003c/a\u003eOutput\nTideHunter outputs consensus sequences in FASTA format by default; \nit can also provide output in tabular format.\n\n### \u003ca name=\"tabular\"\u003e\u003c/a\u003eTabular format\nFor tabular format, 11 columns will be generated for each consensus sequence:\n\n| No. | Column name | Explanation | \n|:---:|   :---      | ---        |\n|  1  | readName    | the original read name |\n|  2  | repN        | `N` is the ID number of the tandem repeat, within each read, starts from 0 |\n|  3  | copyNum     | copy number of the tandem repeat |\n|  4  | readLen     | length of the original long read |\n|  5  | start       | start coordinate of the tandem repeat, 1-based |\n|  6  | end         | end coordinate of the tandem repeat, 1-based |\n|  7  | consLen     | length of the consensus sequence |\n|  8  | aveMatch    | average percent of matches between each unit sequence and the consensus sequence (# matched bases / unit length)|\n|  9  | fullLen     | 0: not a full-length sequence, 1: sense strand full-length, 2: anti-sense strand full-length |\n|  10  | subPos     | start coordinates of all the tandem repeat unit sequences, followed by the end coordinate of the last tandem repeat unit sequence, separated by `,`, all coordinates are 1-based, see examples below|\n| 11  | consSeq     | consensus sequence |\n\nFor example, here is the output for a non-full-length consensus sequence generated from [test_data/test_50x4.fa](test_data/test_50x4.fa) and a diagram that illustrates all the coordinates in the output:\n```\ntest_50x4 rep0  4.0 300 51  250 50  100.0 0 59,109,159,208  CGATCGATCGGCATGCATGCATGCTAGTCGATGCATCGGGATCAGCTAGT\n```\n\u003c!-- ![example](example_50x4.png) --\u003e\n\u003cimg src=\"img/example_50x4.png\" width=\"800\"\u003e\n\nIn this example, TideHunter identifies three consecutive tandem repeat units, [59,108], [109,158], [159,208], from the raw 
read which is 300 bp long.\nA consensus sequence with 50 bp is generated from the three repeat units. TideHunter further extends the tandem repeat boundary to [51, 250] by aligning the consensus sequence back to the raw read on both sides of the three repeat units.\n\nAnother example of the output for a full-length consensus sequence generated from [test_data/full_length.fa](test_data/full_length.fa):\n```\n8f2f7766-4b8e-4c0d-9e2b-caf0e5527b19  rep0  8.8  5231  31  5215  203 95.7  1 207,798,1386,1976,2563,3155,3746,4333,4930  ACTAATAAGATCAACAGAATCAGAGTAGATAGTTCCTTGATCGGAACCAAAGGACCCCGTGCCTCAATCTCTATCCTGATGTCATGGGAGTCCTAGCAAAGCTATAGACTCAAGCAAGGCTTGGGGTCCTTTATGGAACCCAAGGATGACTCAGCAATAAAATATTTTGGTTTTGGTTTATAAAAAAAAAAAAAAAAAAAAAA\n```\nIn this example, the `consLen` (i.e., 203) is the length of the full-length consensus sequence excluding the 5' and 3' adapter sequences and the `subPos` (i.e., 207,798,1386,1976,2563,3155,3746,4333,4930) contains the coordinate information of the identified tandem repeat units.\n\n### \u003ca name=\"fasta\"\u003e\u003c/a\u003eFASTA format\nFor FASTA output format, the read name and the comment provide detailed information of the detected tandem repeat, \ni.e., the above columns 1 \\~ 10.\nThe sequence is the consensus sequence.\n\nThe read name and comment of each consensus sequence have the following format:\n```\n\u003ereadName_repN_copyNum readLen_start_end_consLen_aveMatch_fullLen_subPos\n```\n\n### \u003ca name=\"fastq\"\u003e\u003c/a\u003eFASTQ format\nFor FASTQ output format, the read name and comment are the same as described in [FASTA format](#fasta).\nTideHunter calculated a customized Phred score as the base quality score of each consensus base:\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"img/phred.svg\"/\u003e\n  \u003c!-- \u003cimg src=\"https://latex.codecogs.com/svg.image?Q_{phred}=-10 \\cdot log_{10}(p)\"/\u003e --\u003e\n\u003c!-- \u003cimg 
src=\"https://render.githubusercontent.com/render/math?math=Q_{phred}=-10 \\cdot log_{10}(p)\"\u003e --\u003e\n\u003c/p\u003e\n\nHere, \u003cimg src=\"https://latex.codecogs.com/svg.image?p\"/\u003e is the Sigmoid-smoothed consensus calling error rate for each base:\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"img/p.svg\"/\u003e\n  \u003c!-- \u003cimg src=\"https://render.githubusercontent.com/render/math?math=p=1-S(N_{cons} / N_{total} \\cdot 21)\"\u003e, --\u003e\n  \u003c!-- \u003cimg src=\"https://latex.codecogs.com/svg.image?p=1-S(13.8 \\cdot (1.25 \\cdot N_{cons} / N_{total} - 0.25))\"/\u003e --\u003e\n\u003c/p\u003e\n\n\u003cimg src=\"https://latex.codecogs.com/svg.image?S\"/\u003e is the Sigmoid function:\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"img/sigmoid.svg\"/\u003e\n  \u003c!-- \u003cimg src=\"https://latex.codecogs.com/svg.image?S(x)=\\frac{1}{1+e^{-x}}\"/\u003e --\u003e\n\u003c/p\u003e\n\n\u003cimg src=\"https://latex.codecogs.com/svg.image?N_{cons}\"/\u003e is the coverage of the consensus base and\n\u003cimg src=\"https://latex.codecogs.com/svg.image?N_{total}\"/\u003e is the number of total copies. \nFor example, if one base of the consensus sequence has 4 supporting copies and the total copy number is 5,\n\u003cimg src=\"https://latex.codecogs.com/svg.image?N_{cons}\"/\u003e is 4 and \u003cimg src=\"https://latex.codecogs.com/svg.image?N_{total}\"/\u003e is 5.\n\nThe Phred quality score was then shifted by 33 and converted to characters based on the ASCII value.\nThe quality scores range from 0 to 60 and the corresponding ASCII values range from 33 to 93.\n\n### \u003ca name=\"unit\"\u003e\u003c/a\u003eUnit sequences\nTideHunter can output the unit sequences without performing the consensus calling step when option `-u/--unit-seq` is enabled. Then, only the following information will be output for the tabular format:\n\n\n| No. 
| Column name | Explanation | \n|:---:|   :---      | ---        |\n|  1  | readName    | the original read name |\n|  2  | repN        | `N` is the ID number of the tandem repeat, within each read, starts from 0 |\n|  3  | subX        | `X` is the ID number of the unit sequence, starts from 0 |\n|  4  | unitSeq     | unit sequence |\n\n\nAnd for the FASTA format:\n```\n\u003ereadName_repN_subX\nunitSeq X\n\u003ereadName_repN_subY\nunitSeq Y\n```\n\n## \u003ca name=\"contact\"\u003e\u003c/a\u003eContact\nYan Gao gaoy1@chop.edu\n\nYi Xing XINGYI@chop.edu\n\n[github issues](https://github.com/yangao07/TideHunter/issues)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fyangao07%2Ftidehunter","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fyangao07%2Ftidehunter","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fyangao07%2Ftidehunter/lists"}