{"id":13752473,"url":"https://github.com/OpenGene/gencore","last_synced_at":"2025-05-09T19:32:15.133Z","repository":{"id":44454432,"uuid":"131948664","full_name":"OpenGene/gencore","owner":"OpenGene","description":"Generate duplex/single consensus reads to reduce sequencing noises and remove duplications","archived":false,"fork":false,"pushed_at":"2023-10-27T06:19:21.000Z","size":268,"stargazers_count":111,"open_issues_count":41,"forks_count":32,"subscribers_count":13,"default_branch":"master","last_synced_at":"2024-08-03T09:04:00.445Z","etag":null,"topics":["bioinformatics","consensus","deduplication","deep-sequencing","duplex","duplex-sequencing","duplication","ngs","sequencing","sequencing-error","sequencing-noise","somatic"],"latest_commit_sha":null,"homepage":"","language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/OpenGene.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-05-03T06:00:40.000Z","updated_at":"2024-07-24T08:15:41.000Z","dependencies_parsed_at":"2022-08-30T02:30:36.580Z","dependency_job_id":"eff61f30-2684-48f9-80fd-0cace48032b7","html_url":"https://github.com/OpenGene/gencore","commit_stats":null,"previous_names":[],"tags_count":16,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OpenGene%2Fgencore","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OpenGene%2Fgencore/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OpenGene%2Fgencore/releases","manifests_url":"https://repos.ecos
yste.ms/api/v1/hosts/GitHub/repositories/OpenGene%2Fgencore/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/OpenGene","download_url":"https://codeload.github.com/OpenGene/gencore/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":224880776,"owners_count":17385367,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bioinformatics","consensus","deduplication","deep-sequencing","duplex","duplex-sequencing","duplication","ngs","sequencing","sequencing-error","sequencing-noise","somatic"],"created_at":"2024-08-03T09:01:06.319Z","updated_at":"2024-11-16T05:30:35.118Z","avatar_url":"https://github.com/OpenGene.png","language":"C++","funding_links":[],"categories":["Ranked by starred repositories"],"sub_categories":[],"readme":"[![install with conda](\nhttps://anaconda.org/bioconda/gencore/badges/version.svg)](https://anaconda.org/bioconda/gencore)\n# gencore\nAn efficient tool to remove sequencing duplications and eliminate sequencing errors by generating consensus reads.\n* [What's gencore](#whats-gencore)\n* [Download, compile and install](#get-gencore)\n* [Why to use gencore](#why-to-use-gencore)\n* [Understand the output](#understand-the-output)\n* [How it works](#how-it-works)\n* [Command examples](#command-examples)\n* [UMI format](#umi-format)\n* [All options](#all-options)\n* [Read/cite gencore paper](#citation)\n\n# what's gencore?\n`gencore` is a tool for fast and powerful deduplication for paired-end next-generation sequencing (NGS) data. 
It is much faster and uses much less memory than Picard and other tools. It generates informative reports in both HTML and JSON formats. It is based on an algorithm for `generating consensus reads`, which is why it is named `gencore`.\n\nBasically, `gencore` groups the reads derived from the same original DNA template and merges them into a consensus read, which contains far fewer errors than the original reads.\n\n`gencore` supports data with unique molecular identifiers (UMIs). If your FASTQ data has UMIs integrated, you can use [fastp](https://github.com/OpenGene/fastp) to move the UMIs into the read query names, and then use `gencore` to generate consensus reads.\n\nThis tool can eliminate errors introduced by library preparation and sequencing, and consequently reduce false positives in downstream variant calling. It can also be used to remove duplicated reads. Since it generates consensus reads from duplicated reads, it outputs much cleaner data than conventional duplicate removers. 
***Due to these advantages, it is especially useful for processing ultra-deep sequencing data for cancer samples.***\n\n`gencore` accepts a sorted BAM/SAM with its corresponding reference fasta as input, and outputs an unsorted BAM/SAM.\n\n# take a quick glance of the informative report\n* Sample HTML report: http://opengene.org/gencore/gencore.html\n* Sample JSON report: http://opengene.org/gencore/gencore.json\n\n# try gencore to generate above reports\n* BAM file for testing: http://opengene.org/gencore/input.sorted.bam\n* BED file for testing: http://opengene.org/gencore/test.bed\n* Reference genome file: [ftp://ftp.ncbi.nlm.nih.gov/sra/reports/Assembly/GRCh37-HG19_Broad_variant/Homo_sapiens_assembly19.fasta](ftp://ftp.ncbi.nlm.nih.gov/sra/reports/Assembly/GRCh37-HG19_Broad_variant/Homo_sapiens_assembly19.fasta)\n* Command for testing: \n```shell\ngencore -i input.sorted.bam -o output.bam -r Homo_sapiens_assembly19.fasta -b test.bed --coverage_sampling=50000\n```\n* After the processing is finished, check the `gencore.html` and `gencore.json` in the working directory. 
The option `--coverage_sampling=50000` overrides the default setting (`coverage_sampling=10000`); it lowers the coverage sampling rate and so produces smaller report files.\n\n# quick examples\nThe simplest way\n```shell\ngencore -i input.sorted.bam -o output.bam -r hg19.fasta\n```\nWith a BED file to specify the capturing regions\n```shell\ngencore -i input.sorted.bam -o output.bam -r hg19.fasta -b test.bed\n```\nOnly output fragments with \u003e=2 supporting reads (useful for aggressive denoising)\n```shell\ngencore -i input.sorted.bam -o output.bam -r hg19.fasta -b test.bed -s 2\n```\n\n# get gencore\n## install with Bioconda\n[![install with conda](\nhttps://anaconda.org/bioconda/gencore/badges/version.svg)](https://anaconda.org/bioconda/gencore)\n```shell\nconda install -c bioconda gencore\n```\n## download binary \nThis binary is only for Linux systems: http://opengene.org/gencore/gencore\n```shell\n# this binary was compiled on CentOS, and tested on CentOS/Ubuntu\nwget http://opengene.org/gencore/gencore\nchmod a+x ./gencore\n```\n## or compile from source\n```shell\n# step 1: download and compile htslib from: https://github.com/samtools/htslib\n# step 2: get gencore source (you can also use a browser to download from master or releases)\ngit clone https://github.com/OpenGene/gencore.git\n\n# step 3: build\ncd gencore\nmake\n\n# step 4: install\nsudo make install\n```\n\n# why to use gencore?\nAs described above, gencore can eliminate errors introduced by library preparation and sequencing, and consequently it can greatly reduce false positives in downstream variant calling. Here is an example.\n\n## original BAM\n![image](http://www.opengene.org/gencore/original.png)   \n\n***This is an image showing a pileup of the original BAM. 
A lot of sequencing errors can be observed.***\n\n\n## gencore processed BAM\n![image](http://www.opengene.org/gencore/processed.png)   \n\n***This is an image showing the pileup of the gencore-processed BAM. It is much cleaner. Cheers!***\n\n# QC result reported by gencore\ngencore also performs some quality control while deduplicating and generating consensus reads. It reports the mapping rate, duplication rate, mismatch rate and some other statistics. In particular, gencore reports the coverage statistics of the input BAM file at genome scale, and in the capturing regions (if a BED file is specified).\n\ngencore reports the results in both HTML and JSON formats, for manual checking and downstream analysis. See the examples of the interactive [HTML](http://opengene.org/gencore/gencore.html) report and the [JSON](http://opengene.org/gencore/gencore.json) report.\n\n## coverage statistics in genome scale\n![image](http://www.opengene.org/gencore/coverage-genome.jpeg) \n\n## coverage statistics in capturing regions\n![image](http://www.opengene.org/gencore/coverage-bed.jpeg) \n\n# understand the output\ngencore outputs the following files:\n1. the processed BAM. In this BAM, each consensus read will have a tag `FR`, which means `forward read number of this consensus read`. If the read is a duplex consensus read, it will also have a tag `RR`, which means `reverse read number of this consensus read`. Downstream tools can read the `FR` and `RR` tags for further processing or variant calling. 
In following example, the first read is a single-stranded consensus sequence (only has a `FR` tag), and the second read is a duplex consensus sequence (has both `FR` and `RR` tags):\n```\nA00250:28:H2HC3DSX2:1:1117:3242:5321:UMI_GCT_CTA        161     chr12   25377992        60      143M    =       25378431        582\n     GCAATAATTTTTGTCAGAAAAATGCATTAAATGAATAACAGAATTTCTGTTGGCTTTCTGGGTATTGTCTTTCTTTAATGAGACCTTTCTCCAGAAATAAACACATCCTCAAAAAAATTCTGCCAAAGTAAAATTCTTCAAAT FFFFFFFFFFFFFFFFFFF,FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF NM:i:1  MD:Z:34G108     AS:i:138        XS:i:21 RG:Z:cfdna      FR:i:2\nA00250:28:H2HC3DSX2:1:2316:10547:25989:UMI_AAC_AGA      161     chr12   25377993        60      143M    =       25378462        612\n     CAATAATTTTTGTCAGAAAAATGCATTAAATGAATAACAGAATTTCTGTTGGCTTTCTGGGTATTGTCTTTCTTTAATGAGACCTTTCTCCAGAAATAAACACATCCTCAAAAAAATTCTGCCAAAGTAAAATTCTTCAAATA FFFFF:FFFFFFFFFFFFFFFFFFFFF:FF:FFFFFFFFFF,FFFFFFFFFFFF,:FFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFF:FFF,!FF:F:F:F,FFF,F:FFFF,,:F,FFFF:FF:,:FF:F,:, NM:i:1  MD:Z:33G67A41   AS:i:133        XS:i:21 RG:Z:cfdna      FR:i:1  RR:i:5\n```\n2. the JSON report. A json file contains lots of statistical informations.\n3. the HTML report. A html file visualizes the information of the JSON.\n4. the plain text output.\n\n# how it works\nimportant steps:\n1. clusters the reads by their mapping positions and UMIs (if UMIs are applicable).\n2. for each cluster, compares its supporting reads number (the number of reads/pairs for this DNA fragment) with the threshold specified by `supporting_reads`. If it passes, start to generate a consensus read for it.\n3. if the reads are paired, finds the overlapped region of each pair, and scores the bases in the overlapped regions according their concordance and base quality.\n4. for each base position at this cluster, computes the total scores of each different nucleotide (A/T/C/G/N).\n5. 
if there exists a major nucleotide with good quality, use this nucleotide for this position; otherwise, check the reference nucleotide from the reference genome (if a reference is specified).\n6. when checking the reference, if one or more high-quality reads are concordant with the reference genome, or all reads at this position are of low quality, use the reference nucleotide for this position.\n\n## the quality thresholds\n`gencore` uses 3 different thresholds, which can be specified by command-line options:\n\n| Quality threshold | Default Score | CMD option |\n| - | - | - |\n| High Quality | 30 (Q30) | --high_qual |\n| Moderate Quality | 20 (Q20) | --moderate_qual |\n| Low Quality | 15 (Q15) | --low_qual |\n\n## the scoring\n`gencore` assigns a score to each base in each read of a read cluster; the score represents the confidence in this base. The score is given by the following rules:\n\n| in overlapped region? | matched with its pair? | condition? | score for this base |\n| - | - | - | - |\n| NO | N/A | HIGH_QUAL \u003c= this_qual | 8 |\n| NO | N/A | MODERATE_QUAL \u003c= this_qual \u003c HIGH_QUAL | 6 |\n| NO | N/A | LOW_QUAL \u003c= this_qual \u003c MODERATE_QUAL | 4 |\n| NO | N/A | this_qual \u003c LOW_QUAL | 2 |\n| YES | YES | 2 * HIGH_QUAL \u003c= this_qual + pair_qual | 12 |\n| YES | YES | 2 * MODERATE_QUAL \u003c= this_qual + pair_qual \u003c 2 * HIGH_QUAL | 10 |\n| YES | YES | 2 * LOW_QUAL \u003c= this_qual + pair_qual \u003c 2 * MODERATE_QUAL | 8 |\n| YES | YES | this_qual + pair_qual \u003c 2 * LOW_QUAL | 6 |\n| YES | NO | HIGH_QUAL \u003c= this_qual - pair_qual | 5 |\n| YES | NO | MODERATE_QUAL \u003c= this_qual - pair_qual \u003c HIGH_QUAL | 3 |\n| YES | NO | LOW_QUAL \u003c= this_qual - pair_qual \u003c MODERATE_QUAL | 1 |\n| YES | NO | this_qual - pair_qual \u003c LOW_QUAL | 0 |\n\nIn this table:\n* `this_qual` is the quality of this base\n* `pair_qual` is the quality of the corresponding base in the overlapped region of a pair.\n* `HIGH_QUAL` is 
the quality threshold that can be specified by `--high_qual`\n* `MODERATE_QUAL` is the quality threshold that can be specified by `--moderate_qual`\n* `LOW_QUAL` is the quality threshold that can be specified by `--low_qual`\n\nIn the overlapped region, if a base and its pair are mismatched, its quality score will be adjusted to: `max(0, this_qual - pair_qual)`\n\n# command examples\nIf you want very clean data, you can keep only the clusters with 2 or more supporting reads (recommended for ultra-deep sequencing with a higher dup-rate):\n```\ngencore -i in.bam -o out.bam -r hg19.fa -s 2\n```\nIf you want to keep all the DNA fragments, you can set `supporting_reads` to 1 (with this setting, gencore can replace Picard `MarkDuplicates` for deduplication):\n```\ngencore -i in.bam -o out.bam -r hg19.fa -s 1\n```\n(Recommended) If you want to keep all the DNA fragments, and for each output read you want to discard all the low-quality unoverlapped mutations to obtain relatively clean data (recommended for dup-rate \u003c 50%):\n```\ngencore -i in.bam -o out.bam -r hg19.fa -s 1 --score_threshold=9\n```\nIf you want to obtain less but ultra-clean data, and your data has UMIs, you can enable the `duplex_only` option, and increase `supporting_reads` and `ratio_threshold`:\n```\ngencore -i in.bam -o out.bam -r hg19.fa --duplex_only -s 3 --ratio_threshold=0.9\n```\nPlease note that only UMI-integrated paired-end data can be used to generate duplex consensus sequences.\n\n# UMI format\n`gencore` supports calling consensus reads with or without UMIs. Although UMIs are not required, they are strongly recommended. If your FASTQ data has UMIs integrated, you can use [fastp](https://github.com/OpenGene/fastp) to move the UMIs into the read query names.  \n\nThe UMI should be at the tail of the query name. It can have a prefix like `UMI`, followed by an underscore. If the UMI has a prefix, it should be specified by `--umi_prefix` or `-u`. 
If the UMI prefix is `umi` or `UMI`, it can be automatically detected. The UMI can also have two parts, which are connected by an underscore.   \n\n## UMI examples\n* Read query name = `\"NB551106:8:H5Y57BGX2:1:13304:3538:1404:UMI_GAGCATAC\"`, prefix = `\"UMI\"`, umi = `\"GAGCATAC\"`\n* Read query name = `\"NB551106:8:H5Y57BGX2:1:13304:3538:1404:umi_GAGC_ATAC\"`, prefix = `\"umi\"`, umi = `\"GAGC_ATAC\"`\n* Read query name = `\"NB551106:8:H5Y57BGX2:1:13304:3538:1404:GAGCATAC\"`, prefix = `\"\"`, umi = `\"GAGCATAC\"`\n* Read query name = `\"NB551106:8:H5Y57BGX2:1:13304:3538:1404:GAGC_ATAC\"`, prefix = `\"\"`, umi = `\"GAGC_ATAC\"`\n\n# all options\n```\noptions:\n  -i, --in                       input sorted bam/sam file. STDIN will be read from if it's not specified (string [=-])\n  -o, --out                      output bam/sam file. STDOUT will be written to if it's not specified (string [=-])\n  -r, --ref                      reference fasta file name (should be an uncompressed .fa/.fasta file) (string)\n  -b, --bed                      bed file to specify the capturing region, none by default (string [=])\n  -x, --duplex_only              only output duplex consensus sequences, which means single stranded consensus sequences will be discarded.\n      --no_duplex                don't merge single stranded consensus sequences to duplex consensus sequences.\n  -u, --umi_prefix               the prefix for UMI, if it has. None by default. Check the README for the defails of UMI formats. (string [=auto])\n  -s, --supporting_reads         only output consensus reads/pairs that merged by \u003e= \u003csupporting_reads\u003e reads/pairs. The valud should be 1~10, and the default value is 1. (int [=1])\n  -a, --ratio_threshold          if the ratio of the major base in a cluster is less than \u003cratio_threshold\u003e, it will be further compared to the reference. 
The valud should be 0.5~1.0, and the default value is 0.8 (double [=0.8])\n  -c, --score_threshold          if the score of the major base in a cluster is less than \u003cscore_threshold\u003e, it will be further compared to the reference. The valud should be 1~20, and the default value is 6 (int [=6])\n  -d, --umi_diff_threshold       if two reads with identical mapping position have UMI difference \u003c= \u003cumi_diff_threshold\u003e, then they will be merged to generate a consensus read. Default value is 1. (int [=1])\n  -D, --duplex_diff_threshold    if the forward consensus and reverse consensus sequences have \u003c= \u003cduplex_diff_threshold\u003e mismatches, then they will be merged to generate a duplex consensus sequence, otherwise will be discarded. Default value is 2. (int [=2])\n      --high_qual                the threshold for a quality score to be considered as high quality. Default 30 means Q30. (int [=30])\n      --moderate_qual            the threshold for a quality score to be considered as moderate quality. Default 20 means Q20. (int [=20])\n      --low_qual                 the threshold for a quality score to be considered as low quality. Default 15 means Q15. (int [=15])\n      --coverage_sampling        the sampling rate for genome scale coverage statistics. Default 10000 means 1/10000. (int [=10000])\n  -j, --json                     the json format report file name (string [=gencore.json])\n  -h, --html                     the html format report file name (string [=gencore.html])\n      --debug                    output some debug information to STDERR.\n      --quit_after_contig        stop when \u003cquit_after_contig\u003e contigs are processed. Only used for fast debugging. Default 0 means no limitation. (int [=0])\n  -?, --help                     print this message\n```\n# citation\nThe gencore paper has been published in  BMC Bioinformatics: https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-3280-9. 
If you used gencore in your research work, please cite it as:\n\nChen, S., Zhou, Y., Chen, Y. et al. Gencore: an efficient tool to generate consensus reads for error suppressing and duplicate removing of NGS data. BMC Bioinformatics 20, 606 (2019) doi:10.1186/s12859-019-3280-9\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FOpenGene%2Fgencore","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FOpenGene%2Fgencore","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FOpenGene%2Fgencore/lists"}