{"id":18448416,"url":"https://github.com/nextomics/nextpolish2","last_synced_at":"2025-04-09T18:18:44.378Z","repository":{"id":155707084,"uuid":"607950698","full_name":"Nextomics/NextPolish2","owner":"Nextomics","description":"Repeat-aware polishing genomes assembled using HiFi long reads","archived":false,"fork":false,"pushed_at":"2024-11-19T02:16:56.000Z","size":7686,"stargazers_count":82,"open_issues_count":0,"forks_count":3,"subscribers_count":5,"default_branch":"main","last_synced_at":"2025-04-09T18:18:38.316Z","etag":null,"topics":["genome-assembly","genome-polish","t2t-polish"],"latest_commit_sha":null,"homepage":"","language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Nextomics.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-03-01T02:06:46.000Z","updated_at":"2025-04-09T12:07:37.000Z","dependencies_parsed_at":"2024-07-12T10:16:01.924Z","dependency_job_id":"47ee758c-4879-4576-9ee0-cfe5b2918efc","html_url":"https://github.com/Nextomics/NextPolish2","commit_stats":{"total_commits":60,"total_committers":1,"mean_commits":60.0,"dds":0.0,"last_synced_commit":"4fec66bf19514963e3dc3fce7bb4ffe2e7edf973"},"previous_names":[],"tags_count":3,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Nextomics%2FNextPolish2","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Nextomics%2FNextPolish2/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Nextomics%2FNextPolish2/releases","manifests_url":"https://rep
os.ecosyste.ms/api/v1/hosts/GitHub/repositories/Nextomics%2FNextPolish2/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Nextomics","download_url":"https://codeload.github.com/Nextomics/NextPolish2/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248085325,"owners_count":21045139,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["genome-assembly","genome-polish","t2t-polish"],"created_at":"2024-11-06T07:15:50.048Z","updated_at":"2025-04-09T18:18:44.361Z","avatar_url":"https://github.com/Nextomics.png","language":"Rust","funding_links":[],"categories":[],"sub_categories":[],"readme":"[![install with bioconda](https://img.shields.io/badge/install%20with-bioconda-brightgreen.svg?style=flat)](http://bioconda.github.io/recipes/nextpolish2/README.html)\n[![Conda Version](https://img.shields.io/conda/v/bioconda/NextPolish2)](https://anaconda.org/bioconda/nextpolish2)\n# NextPolish2\n\nTelomere-to-telomere (T2T) genome assembly has emerged as a new hotspot in the field of genomics. Typically, a T2T genome is obtained from datasets that include both high-accuracy PacBio HiFi long reads and Oxford Nanopore Technologies (ONT) ultra-long reads. Although genomes assembled from HiFi long reads are of considerably higher quality, they still contain a handful of assembly errors in regions where HiFi long reads also struggle, such as homopolymer or low-complexity microsatellite regions. 
Additionally, the gap-filling step is typically performed using ONT ultra-long reads, which contain a certain number of errors. Hence, current T2T genome assemblies still require further improvement in terms of consensus accuracy. NextPolish2 can be used to fix these errors (SNV/Indel) in a high-quality assembly. Through its built-in phasing module, it corrects only the erroneous bases while maintaining the original haplotype consistency. Therefore, even in regions with complex repeat elements, NextPolish2 will still not produce overcorrections. In fact, in some cases it can reduce switching errors in heterozygous regions. NextPolish2 is not an upgraded version of NextPolish, but a complementary tool for the pursuit of extremely high-quality genome assemblies.\n\nIf you are concerned about the overcorrection problem, please refer to the [HG005 dataset benchmarking](#overcorrection) and the [NextPolish2 article](#cite) for more information.\n\n## Table of Contents\n\n- [Installation](#install)\n- [General usage](#usage)\n- [Getting help](#help)\n- [Citation](#cite)\n- [License](#license)\n- [Limitations](#limit)\n- [Benchmarking](#benchmark)\n- [Overcorrection](#overcorrection)\n- [FAQ](./doc/faq.md)\n\n### \u003ca name=\"install\"\u003e\u003c/a\u003eInstallation\n\n#### Installing from bioconda\n```sh\nconda install nextpolish2\n```\n#### Installing from source\n##### Dependencies\n\n`NextPolish2` is written in Rust. Try the command below (no root required) or see [here](https://www.rust-lang.org/tools/install) to install `Rust` first.\n```sh\ncurl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh\n```\n\n##### Download and install\n\n```sh\ngit clone --recursive git@github.com:Nextomics/NextPolish2.git\ncd NextPolish2 \u0026\u0026 cargo build --release\n```\n\n##### Test\n\n```sh\ncd test \u0026\u0026 bash hh.sh\n```\n\n### \u003ca name=\"usage\"\u003e\u003c/a\u003eGeneral usage\n\nNextPolish2 takes a genome assembly file, a HiFi mapping file 
and one or more k-mer dataset files from short reads as input and generates the polished genome.\n\n1. Prepare a HiFi mapping file (using [winnowmap](https://github.com/marbl/Winnowmap) or [minimap2](https://github.com/lh3/minimap2/)).\n\n```sh\n# winnowmap is preferred\nmeryl count k=15 output merylDB asm.fa.gz\nmeryl print greater-than distinct=0.9998 merylDB \u003e repetitive_k15.txt\nwinnowmap -t 5 -W repetitive_k15.txt -ax map-pb asm.fa.gz hifi.fasta.gz|samtools sort -o hifi.map.sort.bam -\n\n# or mapping using minimap2\n# minimap2 -ax map-hifi -t 5 asm.fa.gz hifi.fasta.gz|samtools sort -o hifi.map.sort.bam -\n\n# indexing\nsamtools index hifi.map.sort.bam\n```\n\n2. Prepare k-mer dataset files with [yak](https://github.com/lh3/yak) (we recommend using \u003e=60X short reads). Here we only produce 21-mer and 31-mer datasets; you can produce additional datasets with different k-mer sizes.\n\n```sh\n# Quality control and filtering.\n# fastp -5 -3 -n 0 -f 5 -F 5 -t 5 -T 5 -q 20 -i sr.R1.fastq.gz -I sr.R2.fastq.gz -o sr.R1.clean.fastq.gz -O sr.R2.clean.fastq.gz\n\n# produce a 21-mer dataset, remove -b 37 if you want to count singletons\n./yak/yak count -o k21.yak -k 21 -b 37 \u003c(zcat sr.R*.clean.fastq.gz) \u003c(zcat sr.R*.clean.fastq.gz)\n\n# produce a 31-mer dataset, remove -b 37 if you want to count singletons\n./yak/yak count -o k31.yak -k 31 -b 37 \u003c(zcat sr.R*.clean.fastq.gz) \u003c(zcat sr.R*.clean.fastq.gz) \n```\n***Important:*** To maximize correction accuracy, quality filtering steps (fastp) such as adapter removal, global or quality trimming, and read filtering are essential for short reads.\n\n3. Run NextPolish2.\n\n```sh\n./target/release/nextPolish2 -t 5 hifi.map.sort.bam asm.fa.gz k21.yak k31.yak \u003e asm.np2.fa\n\n# or try with -r\n# ./target/release/nextPolish2 -r -t 5 hifi.map.sort.bam asm.fa.gz k21.yak k31.yak \u003e asm.np2.fa\n```\n\n***Optional:*** If your genome is assembled via **trio binning**, 
you can discard reads whose haplotype differs from the reference before mapping; see [here](./doc/benchmark3.md) for an example.\n\n#### More options\n\nUse `./target/release/nextPolish2 -h` to see options.\n\n### \u003ca name=\"help\"\u003e\u003c/a\u003eGetting help\n\n#### Help\n\n   Feel free to raise an issue at the [issue page](https://github.com/Nextomics/NextPolish2/issues/new).\n\n   ***Note:*** Please ask questions on the issue page first. They are also helpful to other users.\n#### Contact\n   \n   For additional help, please send an email to huj\\_at\\_grandomics\\_dot\\_com.\n\n### \u003ca name=\"cite\"\u003e\u003c/a\u003eCitation\n\nJiang Hu, Zhuo Wang, Fan Liang, Shan-Lin Liu, Kai Ye, De-Peng Wang, NextPolish2: A Repeat-aware Polishing Tool for Genomes Assembled Using HiFi Long Reads, Genomics, Proteomics \u0026 Bioinformatics, 2024, qzad009, https://doi.org/10.1093/gpbjnl/qzad009\n\n### \u003ca name=\"license\"\u003e\u003c/a\u003eLicense\n\nNextPolish2 is freely available only for academic and other non-commercial use.\n\n### \u003ca name=\"limit\"\u003e\u003c/a\u003eLimitations\n\n1. NextPolish2 can only correct regions that are covered by mapped HiFi reads. For regions without HiFi read mappings (usually caused by a high error rate), you can try adjusting the mapping parameters.\n2. **The performance of NextPolish2 relies heavily on the quality of short reads. Please use high-quality short reads to avoid overcorrection errors, which can falsely improve estimated QV but reduce actual accuracy.**\n3. NextPolish2 can only fix some structural misassemblies.\n\n### \u003ca name=\"benchmark\"\u003e\u003c/a\u003eBenchmarking\n\n| Source                                           | Software           | QV      | Switch error rate (‱) |\n| :----------------------------------------------: | ------------------ | :-----: | :---------------------: |\n| [*A. 
thaliana*](./doc/benchmark1.md)             | Hifiasm  (primary) | 47.67   | 1.99                    |\n|^(simulated data, primary contigs)^               | NextPolish2        |**65.42**| **0.35**                |\n| [*A. thaliana*](./doc/benchmark2.md)             | Hifiasm  (primary) | 58.03   |                         |\n| ^(Col-XJTU, primary contigs)^                    | NextPolish2        |**64.26**|                         |\n| [*H. sapiens*](./doc/benchmark3.md)              | Hifiasm  (primary) | 60.25   | 0.15                    |\n| ^(HG002, primary contigs)^                       | NextPolish2        |**62.87**| **0.14**                |\n| [*H. sapiens*](./doc/benchmark3.md)              | Hifiasm  (trio)    | 59.77   | 0.21                    |\n|^(HG002, paternal contigs)^                       | NextPolish2        |**63.49**| **0.20**                |\n| [*H. sapiens*](./doc/benchmark3.md)              | Hifiasm  (trio)    | 59.78   | 0.33                    |\n|^(HG002, maternal contigs)^                       | NextPolish2        |**63.29**| **0.30**                |\n\n### \u003ca name=\"overcorrection\"\u003e\u003c/a\u003eOvercorrection\n\nIn addition to evaluating the overcorrection problem discussed in the [`NextPolish2` article](#cite), we used the HG005 data to further assess this issue. First, we assembled the HG005 genome with approximately 30x HiFi data using HiFiasm, followed by polishing the assembled genome with NextPolish2. To minimize the impact of evaluation method limitations, we employed three approaches to evaluate the genome's accuracy before and after polishing:\n\n1. **Merqury:** To assess the quality value (QV).\n2. **DeepVariant:** To count homozygous high-quality variants as potential errors.\n3. 
**Paftools.js:** To count variants between GRCh37 and HG005 that are not in the high-confidence benchmarking variants (GIAB) as potential errors.\n\nThe results demonstrated that `NextPolish2` improved the QV of the assembled genome from 53.8544 to 55.6257 and reduced the number of homozygous high-quality variants (potential errors) from 14,155 to 6,955. Additionally, `NextPolish2` increased the number of high-quality variants called from 2,520,470 to 2,522,817 and reduced the error rate from 0.029927% to 0.029575%.\n\nOverall, these results indicate that `NextPolish2` effectively reduces the error rate of the assembled HG005 genome. Detailed step-by-step instructions are available [here](./doc/benchmark5.md).\n\n### Star\nYou can track updates by clicking the **Star** button in the upper-right corner of the [GitHub page](https://github.com/Nextomics/NextPolish2).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnextomics%2Fnextpolish2","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fnextomics%2Fnextpolish2","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnextomics%2Fnextpolish2/lists"}