{"id":16618569,"url":"https://github.com/brentp/indelope","last_synced_at":"2025-06-22T12:04:09.881Z","repository":{"id":66472200,"uuid":"110574300","full_name":"brentp/indelope","owner":"brentp","description":"find large indels (in the blind spot between GATK/freebayes and SV callers)","archived":false,"fork":false,"pushed_at":"2017-12-03T14:49:45.000Z","size":136,"stargazers_count":39,"open_issues_count":4,"forks_count":1,"subscribers_count":9,"default_branch":"master","last_synced_at":"2025-01-17T20:46:10.509Z","etag":null,"topics":["genome-assembly","genomics","k-mer-counting","local-assembly","nim-lang","variant-calling","vcf"],"latest_commit_sha":null,"homepage":null,"language":"Nim","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/brentp.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2017-11-13T16:43:01.000Z","updated_at":"2023-06-22T08:55:14.000Z","dependencies_parsed_at":"2023-02-22T12:00:48.851Z","dependency_job_id":null,"html_url":"https://github.com/brentp/indelope","commit_stats":null,"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/brentp%2Findelope","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/brentp%2Findelope/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/brentp%2Findelope/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/brentp%2Findelope/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/brentp","download_url":"https://codeload.github.com/brentp/indelope/tar.gz/r
efs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":242980782,"owners_count":20216285,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["genome-assembly","genomics","k-mer-counting","local-assembly","nim-lang","variant-calling","vcf"],"created_at":"2024-10-12T02:20:42.700Z","updated_at":"2025-03-11T05:45:27.628Z","avatar_url":"https://github.com/brentp.png","language":"Nim","funding_links":[],"categories":[],"sub_categories":[],"readme":"## indelope: find indels and SVs too small for structural variant callers and too large for GATK\n\n`indelope` was started with the goal of increasing the diagnostic rate in exomes. To do this it must be:\n\n+ fast : ~2.5 CPU-minutes per exome (25% slower than `samtools view -c`)\n+ easy-to-use : goes from BAM to VCF in a single command.\n+ novel : it does local assembly and then [aligns](https://github.com/lh3/ksw2) assembled contigs\n  to the genome to determine the event, and then does k-mer counting (not alignment) to genotype\n  without k-mer tables.\n+ accurate : because of the genotyping method, we know that called variants are not present\n  in the reference.\n\nThese features will help ensure that it is actually used (fast, easy-to-use) and that it finds new\nand valid variation.\n\nAs of November 2017, `indelope` is working -- it finds large indels that are clearly valid by visual\ninspection that are missed by GATK/freebayes/lumpy.\n\nAs of November 2017, I am still tuning. 
Here is a look at the progress:\n\n![image](https://user-images.githubusercontent.com/1739/33039303-c5ebb54c-cdf4-11e7-88be-bc3736bcdd25.png)\n\nNote that while `indelope` is steadily improving, it still is not as good as `scalpel`. More improvements\nare coming soon.\n\n`indelope` also works on whole genomes, but, for now, that is not the target use-case.\n\n## how it works\n\n`indelope` sweeps over a single bam and finds regions that are likely to harbor indels--reads that have\nmore than 1 cigar event and split-reads (work on split reads is in progress). As it finds these it increments\na counter for the genomic position of the event. Upon finding a gap in coverage, it goes back, finds any\nprevious position with sufficient evidence (this is a parameter) of an event, gathers reads that have been\naligned across that genomic position (and unaligned reads from that region) and does assembly on those reads.\nIt then aligns the assembled contigs to the genome using [ksw2](https://github.com/lh3/ksw2) and uses the CIGAR\nto determine the event as it's represented in the VCF. Any event will result in a novel k-mer not present in\nthe reference genome; `indelope` gets the k-mer of the reference genome at the event and the novel k-mer of\nthe alternate event. It again iterates through the reads that were originally aligned to the event and counts\nreference and alternate k-mers. Those counts are used for genotyping. Note that this reduces reference bias\nbecause we are aligning a contig (often \u003e400 bases) to the genome and never re-aligning the actual reads.\n\nAs `indelope` sweeps across the genome, it keeps the reads for each chunk in memory. A chunk bound is defined\nby a gap in coverage; this occurs frequently enough that the memory use is negligible. Once a new chunk is reached,\nall events from the previous chunk are called and then those reads are discarded. 
This method, along with the\nassembly method, makes `indelope` extremely fast--given 2 BAM decompression threads, it can call variants in an\nexome in ~1 minute (2.5 CPU-minutes).\n\n## assembly\n\nA read (contig) slides along another read (contig) to find the offset with the most matches. At each offset, if\nmore than $n mismatches are found, the next offset is attempted. This is efficient enough that comparing a random read to\na random (non-matching) contig of length $N will incur ~1.25 * $N equality (char vs. char) tests.\n\nWithin each contig `indelope` tracks the number of reads supporting each base. Given a sufficient number of\nreads supporting a contig, it can account for sequencing errors with a simple voting scheme. That is: if contig `a` has\nmore than 7 supporting reads at position x and contig `b` has fewer than 3 supporting reads there (and we know that\notherwise, `a` and `b` have no mismatches), we can vote to set the mismatch in `b` to the apparent match in `a`.\nThis allows us to first create contigs allowing no mismatches within the reads and then to combine and extend contigs\nusing this voting method.\n\n## installation and usage\n\nget a binary from [here](https://github.com/brentp/indelope/releases)\nand make sure that libhts.so is in your `LD_LIBRARY_PATH`\n\nthen run `./indelope -h` for usage. recommended is:\n\n```\nindelope --min-event-len 5 --min-reads 5 $fasta $bam \u003e $vcf\n```\n\n## to do\n\n+ somatic mode / filter mode. allow filtering on a set of k-mers from a parental genome (parent for \n  mendelian context or normal for somatic variation).\n\n+ use SA tag. (and possibly discordant reads)\n\n## see also\n\n+ [svaba](https://github.com/walaj/svaba) does local assembly, but then genotypes by alignment to those\n  assemblies. It is slower than `indelope`, but it is an extremely useful tool and has a series of\n  careful and insightful analyses in its paper. 
(highly recommend!!)\n\n+ [rufus](https://github.com/jandrewrfarrell/RUFUS) does k-mer based variant detection; Andrew described\n  to me the RUFUS assembly method that inspired the one now used in `indelope`.\n\n+ [lancet](https://github.com/nygenome/lancet), [scalpel](http://scalpel.sourceforge.net/),\n  [mate-clever](https://academic.oup.com/bioinformatics/article/29/24/3143/194997),  and [prosic2](https://github.com/prosic/prosic2) are all\n  great tools that are similar in spirit that are worth checking out (of those, AFAICT, only scalpel has a focus on germ-line variation).\n\n\n## notes and TODO\n\n# need a better way to combine contigs\n\nsometimes, can have 2 contigs, each of length ~ 80 and they overlap for 60 bases but cutoff is\ne.g. 65. Need a way to recover this as it happens a lot in low-coverage scenarios. maybe it can\nfirst combine, then trim (currently, it's trim, combine).\nThis should also allow more permissive overlaps if the correction list is empty.\n\ntrack a read/contig matches multiple contigs with the same match, mismatch count\n\n# CHM1/13 truth-set\n\nhttps://www.ncbi.nlm.nih.gov/biosample?Db=biosample\u0026DbFrom=bioproject\u0026Cmd=Link\u0026LinkName=bioproject_biosample\u0026LinkReadableName=BioSample\u0026ordinalpos=1\u0026IdsFromResult=316945\n\n\n~/.aspera/connect/bin/ascp -P33001 -QT -L- -l 1000M -i ~/.aspera/connect/etc/asperaweb_id_dsa.openssh era-fasp@fasp.sra.ebi.ac.uk:/vol1/ERA596/ERA596286/bam/CHM1_1.bam .\n~/.aspera/connect/bin/ascp -P33001 -QT -L- -l 1000M -i ~/.aspera/connect/etc/asperaweb_id_dsa.openssh era-fasp@fasp.sra.ebi.ac.uk:/vol1/ERA596/ERA596286/bam/CHM13_1.bam 
.\n\nftp://ftp.ncbi.nlm.nih.gov/pub/dbVar/data/Homo_sapiens/by_study/nstd137_Huddleston_et_al_2016/genotype/CHM1_final_genotypes.annotated.vcf.gz\nftp://ftp.ncbi.nlm.nih.gov/pub/dbVar/data/Homo_sapiens/by_study/nstd137_Huddleston_et_al_2016/genotype/CHM13_final_genotypes.annotated.vcf.gz\nftp://ftp.ncbi.nlm.nih.gov/pub/dbVar/data/Homo_sapiens/by_study/nstd137_Huddleston_et_al_2016/genotype/\n\n\n# contigs\n\n`min_overlap` in contig::best_match should be a float between 0 and 1 that will make sure that at least\nthat portion of the shortest contig overlaps the other.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbrentp%2Findelope","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbrentp%2Findelope","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbrentp%2Findelope/lists"}