{"id":13752487,"url":"https://github.com/lh3/fermikit","last_synced_at":"2026-02-13T14:59:51.055Z","repository":{"id":30210998,"uuid":"33761973","full_name":"lh3/fermikit","owner":"lh3","description":"De novo assembly based variant calling pipeline for Illumina short reads","archived":false,"fork":false,"pushed_at":"2020-11-30T22:57:56.000Z","size":7811,"stargazers_count":108,"open_issues_count":12,"forks_count":21,"subscribers_count":15,"default_branch":"master","last_synced_at":"2025-08-31T22:27:36.467Z","etag":null,"topics":["bioinformatics","denovo-assembly","genomics","variant-calling"],"latest_commit_sha":null,"homepage":"","language":"TeX","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/lh3.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2015-04-11T04:04:05.000Z","updated_at":"2025-08-17T21:17:27.000Z","dependencies_parsed_at":"2022-09-20T21:20:32.594Z","dependency_job_id":null,"html_url":"https://github.com/lh3/fermikit","commit_stats":null,"previous_names":[],"tags_count":2,"template":false,"template_full_name":null,"purl":"pkg:github/lh3/fermikit","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lh3%2Ffermikit","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lh3%2Ffermikit/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lh3%2Ffermikit/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lh3%2Ffermikit/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/lh3","download_url":"https://codeload.github.com/lh3/fermikit/tar.gz/refs/heads/master","sbom_url":"https://repos.eco
syste.ms/api/v1/hosts/GitHub/repositories/lh3%2Ffermikit/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29411138,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-02-13T06:24:03.484Z","status":"ssl_error","status_checked_at":"2026-02-13T06:23:12.830Z","response_time":78,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bioinformatics","denovo-assembly","genomics","variant-calling"],"created_at":"2024-08-03T09:01:06.566Z","updated_at":"2026-02-13T14:59:51.038Z","avatar_url":"https://github.com/lh3.png","language":"TeX","funding_links":[],"categories":["Ranked by starred repositories"],"sub_categories":[],"readme":"[![Build Status](https://travis-ci.org/lh3/fermikit.svg?branch=master)](https://travis-ci.org/lh3/fermikit)\n## Introduction\n\nFermiKit is a *de novo* assembly based variant calling pipeline for deep\nIllumina resequencing data. It assembles reads into unitigs, maps them to the\nreference genome and then calls variants from the alignment to an accuracy\ncomparable to conventional mapping based pipelines (see evaluation in the `tex`\ndirectory). The assembly does not only encode SNPs and short INDELs, but also\nretains long deletions, novel sequence insertions, translocations and copy\nnumbers. It is a heavily reduced representation of raw data. 
Storing,\ndistributing and analyzing assemblies is much faster and cheaper at an\nacceptable loss of information.\n\nFermiKit is not a prototype. It is a practical pipeline targeting large-scale\ndata and has been used to process hundreds of human samples. On a modern server\nwith 16 CPU cores, FermiKit can assemble 30-fold human reads in one day with\nabout 85GB RAM at the peak. The subsequent mapping and variant calling\ntake only half an hour.\n\n## Installation and Usage\n\nThe only library dependency of FermiKit is [zlib][zlib]. To compile on Linux or\nMac:\n```sh\ngit clone --recursive https://github.com/lh3/fermikit.git\ncd fermikit\nmake\n```\nThis creates a `fermikit/fermi.kit` directory containing all the executables.\nYou can copy the `fermi.kit` directory anywhere and invoke the pipeline by\nspecifying its absolute or relative path:\n```sh\n# assemble reads into unitigs (-s specifies the genome size and -l the read length)\nfermi.kit/fermi2.pl unitig -s3g -t16 -l150 -p prefix reads.fq.gz \u003e prefix.mak\nmake -f prefix.mak\n# call small variants and structural variations\nfermi.kit/run-calling -t16 bwa-indexed-ref.fa prefix.mag.gz | sh\n```\nThis generates `prefix.mag.gz` for the final assembly, `prefix.flt.vcf.gz`\nfor filtered SNPs and short INDELs, and `prefix.sv.vcf.gz` for long deletions,\nnovel sequence insertions and complex structural variations. 
If you have\nmultiple FASTQ files and want to trim adapters before assembly:\n```sh\nfermi.kit/fermi2.pl unitig -s3g -t16 -l150 -p prefix \\\n    \"fermi.kit/seqtk mergepe r1.fq r2.fq | fermi.kit/trimadap-mt -p4\" \u003e prefix.mak\n```\nIt is also possible to call SNPs and short INDELs from multiple BAMs at the\nsame time and produce a multi-sample VCF:\n```sh\nfermi.kit/htsbox pileup -cuf ref.fa pre1.srt.bam pre2.srt.bam \u003e out.raw.vcf\nfermi.kit/k8 fermi.kit/hapdip.js vcfsum -f out.raw.vcf \u003e out.flt.vcf\n```\n\n## Limitations\n\nFermiKit does not use paired-end information during assembly, which potentially\nleads to loss of power. In evaluations, the loss is minor for germline samples,\nand even without pair information, FermiKit is more sensitive to short INDELs\nand long deletions. Furthermore, with longer upcoming Illumina reads, it is\nactually preferable to merge overlapping ends in a pair before assembly and\ntreat the merged reads as regular single-end reads (see AllPaths-LG and\nDISCOVAR).\n\nAnother technical limitation of FermiKit is that the error correction phase\nmay take excessive RAM when the error rate is unusually high. In practice,\nthis concern is also minor. I have assembled ~270 human samples and none of\nthem required more than ~90GB RAM.\n\nRunning FermiKit twice on the same dataset with the same settings is likely to\nresult in two slightly different assemblies. Please see bfc/count.c for the\ncause in BFC. Unitig construction also has a random factor in\nmulti-threading mode. Nonetheless, FermiKit should call the same variants from\nthe same assembly.\n\n[zlib]: http://zlib.net\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flh3%2Ffermikit","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Flh3%2Ffermikit","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flh3%2Ffermikit/lists"}