{"id":38674952,"url":"https://github.com/wglab/nanorepeat","last_synced_at":"2026-01-17T10:00:53.862Z","repository":{"id":38834464,"uuid":"245545551","full_name":"WGLab/NanoRepeat","owner":"WGLab","description":"NanoRepeat: fast and accurate analysis of Short Tandem Repeats (STRs) from Oxford Nanopore sequencing data","archived":false,"fork":false,"pushed_at":"2026-01-13T06:22:30.000Z","size":857,"stargazers_count":20,"open_issues_count":6,"forks_count":3,"subscribers_count":9,"default_branch":"master","last_synced_at":"2026-01-13T08:35:14.752Z","etag":null,"topics":["bioinformatics","genome-analysis","genomics","nanopore-sequencing","pacbio-sequencing","repeat-detection","sequencing","short-tandem-repeats"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/WGLab.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2020-03-07T01:06:57.000Z","updated_at":"2026-01-13T06:15:28.000Z","dependencies_parsed_at":"2024-05-02T14:06:48.718Z","dependency_job_id":"5f0a4cb8-a47e-45fa-ad96-39cfad5a9e6e","html_url":"https://github.com/WGLab/NanoRepeat","commit_stats":null,"previous_names":[],"tags_count":14,"template":false,"template_full_name":null,"purl":"pkg:github/WGLab/NanoRepeat","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/WGLab%2FNanoRepeat","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/WGLab%2FNanoRepeat/tags","releases_url":"https:/
/repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/WGLab%2FNanoRepeat/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/WGLab%2FNanoRepeat/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/WGLab","download_url":"https://codeload.github.com/WGLab/NanoRepeat/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/WGLab%2FNanoRepeat/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28505570,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-17T06:57:29.758Z","status":"ssl_error","status_checked_at":"2026-01-17T06:56:03.931Z","response_time":85,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bioinformatics","genome-analysis","genomics","nanopore-sequencing","pacbio-sequencing","repeat-detection","sequencing","short-tandem-repeats"],"created_at":"2026-01-17T10:00:28.712Z","updated_at":"2026-01-17T10:00:53.828Z","avatar_url":"https://github.com/WGLab.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# NanoRepeat: quantification of Short Tandem Repeats (STRs) from long-read sequencing data (including ONT and PacBio)\n\n[![PyPI 
version](https://badge.fury.io/py/NanoRepeat.svg)](https://badge.fury.io/py/NanoRepeat)\n[![DOI](https://www.zenodo.org/badge/DOI/10.5281/zenodo.7024484.svg)](https://www.zenodo.org/record/7024484)\n\n## Table of Contents\n\n- [Installation](#installation)\n- [Usage](#usage)\n  - [Regular use cases](#regular-use-cases)\n  - [Joint quantification of two adjacent repeats](#joint_quantification)\n- [Citation](#citation)\n- [Limitation](#limitation)\n- [Contact Us](#contact-us)\n\n## Installation\n\n#### Prerequisites:\n\n\n```\nconda create -n nanorepeat python=3.9\nconda activate nanorepeat\ngit clone https://github.com/WGLab/NanoRepeat.git\ncd NanoRepeat\npip install .\n```\n\nTo install a stable version from the Python Package Index (PyPI) instead: \n\n```\nconda create -n nanorepeat python=3.9\nconda activate nanorepeat\npip install NanoRepeat\n```\nNotice: If you install NanoRepeat from a PyPI mirror, please check that the version in the mirror is up to date. \n\n## Usage\n\n### Regular use cases\n\nNanoRepeat can quantify STRs from targeted sequencing or whole-genome sequencing data. We will demonstrate the usage of NanoRepeat using an example data set, which can be downloaded using the following commands. \n\n```\nwget https://github.com/WGLab/NanoRepeat/releases/download/v1.3/NanoRepeat_v1.3_example_data.tar.bz2\ntar xjf NanoRepeat_v1.3_example_data.tar.bz2\n```\n\nAfter extracting the archive, you will see a `NanoRepeat_v1.3_example_data` folder containing two subfolders: `HG002` and `HTT_amplicon`. In this section, we will use the data under the `HG002` folder. 
\n\n```\n$ ls -1  ./NanoRepeat_v1.3_example_data/HG002/ \nGRCh37_chr1.fasta\nGRCh37_chr1.fasta.fai\nHG002_GRCh37_example_regions.bed\nhg002_Q20.20210805_3flowcells.hs37d5.example_regions.bam\nhg002_Q20.20210805_3flowcells.hs37d5.example_regions.bam.bai\n```\n\nYou can use the following command to run NanoRepeat: \n\n```\nnanoRepeat.py \\\n    -i path/to/NanoRepeat_v1.3_example_data/HG002/hg002_Q20.20210805_3flowcells.hs37d5.example_regions.bam \\\n    -t bam \\\n    -d ont_q20 \\\n    -r path/to/NanoRepeat_v1.3_example_data/HG002/GRCh37_chr1.fasta \\\n    -b path/to/NanoRepeat_v1.3_example_data/HG002/HG002_GRCh37_example_regions.bed \\\n    -c 4 \\\n    -o ./nanorepeat_output/HG002\n```\n\n`-i` specifies the input file, which can be in `fasta`, `fastq` or `bam` format. In this case our input file is `hg002_Q20.20210805_3flowcells.hs37d5.example_regions.bam`. It is a subset of an Oxford Nanopore whole-genome sequencing dataset of the [NIST/GIAB HG002 (GM24385/NA24385)](https://catalog.coriell.org/0/Sections/Search/Sample_Detail.aspx?Ref=NA24385\u0026Product=DNA) genome. The sequencing data was from the [Oxford Nanopore Technologies Benchmark Datasets](https://registry.opendata.aws/ont-open-data/) and reads from 15 example STR regions in `chr1` were extracted. These regions were selected because they overlap with the [HG002 SV benchmark set](https://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/analysis/NIST_SVs_Integration_v0.6/HG002_SVs_Tier1_v0.6.vcf.gz) and are heterozygous (i.e., two alleles have different repeat sizes). \n\n`-t` specifies the input file type. There are four valid values: bam, cram, fastq or fasta. In this case the input file is in a bam file. \n\n`-d` specifies the data type. There are five valid values: `ont_q20`, `ont_sup`, `ont`, `hifi`, and `clr`. `ont_q20` is for Oxford Nanopore sequencing with Q20+ chemistry (such as R10 flowcells). 
`ont_sup` is for Oxford Nanopore sequencing with R9 flowcells and basecalled in super accuracy mode. `ont` is for Oxford Nanopore sequencing with R9 flowcells and basecalled in fast mode or high accuracy mode. `hifi` is for PacBio HiFi/CCS reads. `clr` is for PacBio Continuous Long Reads (CLR). Default value: `ont`.\n\n`-r` specifies the reference genome file in `FASTA` format. In this case, `GRCh37_chr1.fasta` is chr1 of the GRCh37/hg19 reference genome. We used GRCh37 instead of GRCh38 because the HG002 SV benchmark set is based on GRCh37. \n\n`-b` specifies the tandem repeat regions that you are interested in. It is a tab-delimited text file in BED format. There are four required columns: `chromosome`, `start_position`, `end_position`, `repeat_unit_sequence`. In our case, `HG002_GRCh37_example_regions.bed` contains 15 STR regions in chr1 of GRCh37. The first 5 rows of `HG002_GRCh37_example_regions.bed` are shown below (the column names in the header row are added here for readability and are not part of the file). \n\n| chromosome | start_position | end_position | repeat_unit_sequence |\n|---|----------|-----------|---------|\n| 1 | 4599903  | 4599930   | TTTAG   |\n| 1 | 7923034  | 7923187   | TATTG   |\n| 1 | 8321418  | 8321465   | TTCC    |\n| 1 | 14459872 |  14459935 |  AAAG   |\n| 1 | 20934886 |  20934920 |  GTTTT  |\n\n**IMPORTANT NOTICE**\n1) Please note that in BED format, all chromosome positions start from 0. `start_position` is inclusive but `end_position` is NOT inclusive. Tip: If you have 1-based positions, simply decrease the value of `start_position` by 1 and leave `end_position` unchanged.\n\n2) NanoRepeat assumes that the sequence between `start_position` and `end_position` consists entirely of repeats of the motif specified in the fourth column. There should be neither non-repeat sequences nor other repeat motifs between `start_position` and `end_position`. If a region contains two consecutive repeats, you can specify them in two rows. \n\n`-c` specifies the number of CPU cores for alignment. \n\n`-o` specifies the output prefix. 
Please include the path to the output directory and the prefix of the output file names. In our case, the output prefix is `./nanorepeat_output/HG002`, which means the output directory is `./nanorepeat_output/` and the prefix of output file names is `HG002`.\n\n\nIf you run NanoRepeat successfully, you will see 90 output files (six files for each of the 15 regions). Output files of a single repeat region look like this: \n\n```\nHG002.1-7923034-7923187-TATTG.allele1.fastq\nHG002.1-7923034-7923187-TATTG.allele2.fastq\nHG002.1-7923034-7923187-TATTG.hist.png\nHG002.1-7923034-7923187-TATTG.phased_reads.txt\nHG002.1-7923034-7923187-TATTG.repeat_size.txt\nHG002.1-7923034-7923187-TATTG.summary.txt\n```\n\n`HG002.1-7923034-7923187-TATTG.allele1.fastq` and `HG002.1-7923034-7923187-TATTG.allele2.fastq` contain the reads assigned to each allele. \n\n`HG002.1-7923034-7923187-TATTG.hist.png` is a histogram showing the repeat size distribution. Each allele has a different color.\n\u003cp align=\"center\"\u003e\u003cimg src=\"images/HG002.1-7923034-7923187-TATTG.hist.png\" width=\"50%\"\u003e\u003c/p\u003e\n\n\n`HG002.1-7923034-7923187-TATTG.phased_reads.txt` shows the phasing results. The first 10 lines of `HG002.1-7923034-7923187-TATTG.phased_reads.txt` are shown below. 
\n\n```\n$ head HG002.1-7923034-7923187-TATTG.phased_reads.txt\n##RepeatRegion=1-7923034-7923187-TATTG\n#Read_Name\tAllele_ID\tPhasing_Confidence\tRepeat_Size\n746edfa7-715f-4e97-913e-ef73ed97135f\t1\tHIGH\t14.0\nd6355053-0ed2-438e-8469-28cabeb2aedf\t1\tHIGH\t17.0\n513a749a-6ffc-47c4-a499-9f9222e93abf\t1\tHIGH\t17.0\nfc8dc377-8772-4dc0-922d-ad694deec8d7\t1\tHIGH\t17.0\ncd847c0e-9fbf-4abf-8f0a-ea938026ef41\t1\tHIGH\t17.0\nf53bc376-69b4-4118-87e1-59379c640408\t1\tHIGH\t17.0\n9b70cd2a-c1df-447a-a7aa-b5ab8046115e\t1\tHIGH\t17.0\n6a9b6f5b-d59d-4dde-9adb-8e6ac91cc6e4\t1\tHIGH\t17.0\n```\n\nThe columns of the `*.phased_reads.txt` file: \n\n| Column | Description                                  |\n|:------:|----------------------------------------------|\n|    1   | Read_Name                                    |\n|    2   | Allele_ID                                    |\n|    3   | Phasing_Confidence (two values: HIGH or LOW) |\n|    4   | Repeat_Size                                  |\n\n`HG002.1-7923034-7923187-TATTG.repeat_size.txt` is the estimated repeat sizes of ALL reads. This file is similar to the `*.phased_reads.txt` file but it also includes reads that may be removed in the phasing process (e.g. reads considered as noisy reads or outliers)\n\n```\n$ head HG002.1-7923034-7923187-TATTG.repeat_size.txt\n##Repeat_Region=1-7923034-7923187-TATTG\n#Read_Name\tRepeat_Size\n746edfa7-715f-4e97-913e-ef73ed97135f\t14.0\nd6355053-0ed2-438e-8469-28cabeb2aedf\t17.0\ndadaf0a0-8797-47ca-a21b-259928edca7e\t48.0\n513a749a-6ffc-47c4-a499-9f9222e93abf\t17.0\n07f65d31-4023-4d86-beba-76fb88f2cf45\t48.0\n4e66c3d0-6f15-4ff7-a8a8-d5c95d57e73d\t48.0\nfc8dc377-8772-4dc0-922d-ad694deec8d7\t17.0\ncd847c0e-9fbf-4abf-8f0a-ea938026ef41\t17.0\n```\n\n`HG002.1-7923034-7923187-TATTG.summary.txt` gives the quantification of the repeat size. 
It has the following information: 1) repeat region; 2) number of detected alleles; 3) repeat size of each allele; 4) number of reads of each allele; 5) number of removed reads.\n\n```\n$ cat HG002.1-7923034-7923187-TATTG.summary.txt\nRepeat_Region=1-7923034-7923187-TATTG\tMethod=GMM\tNum_Alleles=2\tNum_Removed_Reads=0\tAllele1_Num_Reads=33\tAllele1_Repeat_Size=17\tAllele2_Num_Reads=19\tAllele2_Repeat_Size=48\n```\n\n### \u003ca name=\"joint_quantification\"\u003e\u003c/a\u003e Joint quantification of two adjacent STRs (such as the `CAG` and `CCG` repeats in the HTT gene)\n\nSometimes two STRs are next to each other. For example, in exon-1 of the human HTT gene, there are two adjacent STRs: `CAG` and `CCG`. The sequence structure is: (CAG)\u003csub\u003em\u003c/sub\u003e-CAA-CAG-CCG-CCA-(CCG)\u003csub\u003en\u003c/sub\u003e. NanoRepeat can jointly quantify the two STRs and provide phased results. In our experience, looking at both repeats helps generate better quantification results. \n\t\nWe will demonstrate the joint quantification using the same example dataset (described in the above section). If you have not downloaded the dataset, you can execute the following commands. \n\n```\nwget https://github.com/WGLab/NanoRepeat/releases/download/v1.3/NanoRepeat_v1.3_example_data.tar.bz2\ntar xjf NanoRepeat_v1.3_example_data.tar.bz2\n```\n\nAfter extracting the archive, you will see a `NanoRepeat_v1.3_example_data` folder containing two subfolders: `HG002` and `HTT_amplicon`. In this section, we will use the data under the `HTT_amplicon` folder. 
\n\t\nThe input fastq file is here: `./NanoRepeat_v1.3_example_data/HTT_amplicon/HTT_amplicon.fastq.gz`.\n\t\nThe reference fasta file is here: `./NanoRepeat_v1.3_example_data/HTT_amplicon/GRCh38_chr4.0_4Mb.fasta`.\n\nYou can use the following command to run `NanoRepeat-joint`:\n```\nnanoRepeat-joint.py  \\\n    -i ./NanoRepeat_v1.3_example_data/HTT_amplicon/HTT_amplicon.fastq.gz \\\n    -r ./NanoRepeat_v1.3_example_data/HTT_amplicon/GRCh38_chr4.0_4Mb.fasta \\\n    -1 chr4:3074876:3074933:CAG:200      \\\n    -2 chr4:3074946:3074966:CCG:20       \\\n    -o ./joint_quantification_output/HTT \\\n    -c 4\n```\n\n`-1` and `-2` specify the two repeat regions. The format of `-1`  and `-2` is `chrom:start_position:end_position:repeat_unit:max_size`. The start and end positions are 0-based (the first base on the chromosome is numbered 0). The start position is inclusive but the end position is not, which is the same as the [BED format](https://genome.ucsc.edu/FAQ/FAQformat.html#format1). For example, a region of the first 100 bases of chr1 is denoted as `chr1:0:100`.  `max_size` is the maximum repeat length that we consider. Please set `max_size` to a reasonable number. If `max_size` is too large (e.g. well beyond the maximum possible number), joint quantification might be slow.\n\n\nIf you run NanoRepeat successfully, you will see the following files in the `./joint_quantification_output` folder. \n\n```\nHTT.allele1.fastq\nHTT.allele2.fastq\nHTT.chr4-3074876-3074933-CAG.hist.png\nHTT.chr4-3074946-3074966-CCG.hist.png\nHTT.hist2d.png\nHTT.phased_reads.txt\nHTT.repeat_size.txt\nHTT.scatter.png\nHTT.summary.txt\n```\n\n`HTT.allele1.fastq` and `HTT.allele2.fastq` are the reads assigned to each allele. 
\n\n`HTT.chr4-3074876-3074933-CAG.hist.png` is a histogram showing the repeat size distribution of the first repeat (chr4-3074876-3074933-CAG).\n\n\u003cp align=\"center\"\u003e\u003cimg src=\"images/HTT.chr4-3074876-3074933-CAG.hist.png\" width=\"50%\"\u003e\u003c/p\u003e\n\n`HTT.chr4-3074946-3074966-CCG.hist.png` is a histogram showing the repeat size distribution of the second repeat (chr4-3074946-3074966-CCG). \n\n\u003cp align=\"center\"\u003e\u003cimg src=\"images/HTT.chr4-3074946-3074966-CCG.hist.png\" width=\"50%\"\u003e\u003c/p\u003e\n\n`HTT.hist2d.png` is a two-dimensional histogram showing the joint distribution of the two repeats. \n\n\u003cp align=\"center\"\u003e\u003cimg src=\"images/HTT.hist2d.png\" width=\"50%\"\u003e\u003c/p\u003e\n\n`HTT.scatter.png` is a scatter plot showing the joint distribution of the two repeats. The dotted lines indicate the 95% equi-probability surface of the Gaussian mixture models.\n\n\u003cp align=\"center\"\u003e\u003cimg src=\"images/HTT.scatter.png\" width=\"50%\"\u003e\u003c/p\u003e\n\n\n`HTT.phased_reads.txt` shows the phasing results. The first line is the path to the input FASTQ file. Lines 2-9 of the `HTT.phased_reads.txt` file are shown below (as a table). \n\n| #Read_Name | Allele_ID | Phasing_Confidence | chr4-3074876-3074933-CAG.Repeat_Size | chr4-3074946-3074966-CCG.Repeat_Size |\n|---|:---:|:---:|:---:|:---:|\n| ONT_read330 | 1 | HIGH | 13.5 | 8 |\n| ONT_read1284 | 1 | HIGH | 17 | 11.5 |\n| ONT_read579 | 1 | HIGH | 16 | 10 |\n| ONT_read838 | 1 | HIGH | 15.5 | 10 |\n| ONT_read520 | 1 | LOW | 25 | 13 |\n| ONT_read1066 | 1 | HIGH | 17.5 | 10 |\n| ONT_read1059 | 1 | HIGH | 16 | 10.5 |\n| ONT_read526 | 1 | HIGH | 17 | 10 |\n\nThe `*summary.txt` file gives the quantification of the repeat sizes. 
It has the following information: \n1) input file\n2) number of alleles\n3) number of reads for each allele\n4) quantification of repeat sizes of each allele\n\nThe content of `HTT.summary.txt` is shown below: \n\n| Input_FASTQ | path/to/HTT_amplicon.fastq.gz |\n|---|---|\n| Method | 2D-GMM |\n| Num_Alleles | 2 |\n| Num_Removed_Reads | 0 |\n| Allele1_Num_Reads | 733 |\n| Allele1_chr4-3074876-3074933-CAG.Repeat_Size | 17 |\n| Allele1_chr4-3074946-3074966-CCG.Repeat_Size | 10 |\n| Allele2_Num_Reads | 856 |\n| Allele2_chr4-3074876-3074933-CAG.Repeat_Size | 55 |\n| Allele2_chr4-3074946-3074966-CCG.Repeat_Size | 7 |\n\n\n## Citation\nIf you use NanoRepeat, please cite: \n\nFang L, Monteys AM, Dürr A, Keiser M, Cheng C, Harapanahalli A, et al. Haplotyping SNPs for allele-specific gene editing of the expanded huntingtin allele using long-read sequencing. Human Genetics and Genomics Advances. 2023;4(1):100146. DOI: https://doi.org/10.1016/j.xhgg.2022.100146.\n\n\nBibTeX format: \n\n```\n@article{FANG2023100146,\n\ttitle = {Haplotyping SNPs for allele-specific gene editing of the expanded huntingtin allele using long-read sequencing},\n\tjournal = {Human Genetics and Genomics Advances},\n\tvolume = {4},\n\tnumber = {1},\n\tpages = {100146},\n\tyear = {2023},\n\tissn = {2666-2477},\n\tdoi = {https://doi.org/10.1016/j.xhgg.2022.100146},\n\turl = {https://www.sciencedirect.com/science/article/pii/S266624772200063X},\n\tauthor = {Li Fang and Alex Mas Monteys and Alexandra Dürr and Megan Keiser and Congsheng Cheng and Akhil Harapanahalli and Pedro Gonzalez-Alegre and Beverly L. Davidson and Kai Wang},\n\tkeywords = {Huntington’s disease, long-read sequencing, CRISPR, SNP, repeat detection}\n}\n\n```\n## Limitation\nNanoRepeat can accurately quantify simple repeats, and imperfect repeats of approximately the same motif are OK. However, it cannot handle mixed repeats of different motifs (e.g., a mixture of `GCCA` and `AAATT`). 
\n\n## Contact Us\n\nIf you need any help from us, you are welcome to raise an issue at the issue page. You can also contact Dr. Li Fang (fangli9@sysu.edu.cn) or Dr. Kai Wang (wangk@chop.edu).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fwglab%2Fnanorepeat","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fwglab%2Fnanorepeat","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fwglab%2Fnanorepeat/lists"}