{"id":13752167,"url":"https://github.com/lh3/kmer-cnt","last_synced_at":"2025-05-07T08:12:48.144Z","repository":{"id":139699847,"uuid":"242280098","full_name":"lh3/kmer-cnt","owner":"lh3","description":"Code examples of fast and simple k-mer counters for tutorial purposes","archived":false,"fork":false,"pushed_at":"2020-03-10T16:24:06.000Z","size":83,"stargazers_count":168,"open_issues_count":6,"forks_count":15,"subscribers_count":9,"default_branch":"master","last_synced_at":"2025-05-07T08:12:42.642Z","etag":null,"topics":["bioinformatics","genomics","k-mer-counting"],"latest_commit_sha":null,"homepage":"","language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/lh3.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-02-22T04:36:16.000Z","updated_at":"2025-03-23T08:14:18.000Z","dependencies_parsed_at":null,"dependency_job_id":"c0eb537b-d315-488c-b7cd-c51ec4885d7a","html_url":"https://github.com/lh3/kmer-cnt","commit_stats":null,"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lh3%2Fkmer-cnt","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lh3%2Fkmer-cnt/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lh3%2Fkmer-cnt/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lh3%2Fkmer-cnt/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/lh3","download_url":"https://codeload.github.com/lh3/kmer-cnt/tar.gz/re
fs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":252839296,"owners_count":21812090,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bioinformatics","genomics","k-mer-counting"],"created_at":"2024-08-03T09:01:00.749Z","updated_at":"2025-05-07T08:12:48.123Z","avatar_url":"https://github.com/lh3.png","language":"C++","funding_links":[],"categories":["Ranked by starred repositories"],"sub_categories":[],"readme":"## Getting Started\n\n```sh\ngit clone https://github.com/lh3/kmer-cnt\ncd kmer-cnt\nmake  # C++11 required to compile the two C++ implementations\nwget https://github.com/lh3/kmer-cnt/releases/download/v0.1/M_abscessus_HiSeq_10M.fa.gz\n./yak-count M_abscessus_HiSeq_10M.fa.gz \u003e kc-c4.out\n```\n\n## Introduction\n\nK-mer counting is the foundation of many mappers, assemblers and miscellaneous\ntools (e.g. genotypers, metagenomics profilers, etc). It is one of the most\nimportant classes of algorithms in Bioinformatics. Here we will implement basic\nk-mer counting algorithms but with advanced engineering tricks. We will see how\nfar better engineering can go.\n\nIn this repo, each `{kc,yak}-*.*` file implements a standalone k-mer counter.\nAs to other files: ketopt.h is a command line option parser; khashl.h is a\ngeneric hash table library in C; kseq.h is a fasta/fastq parser; kthread.{h,c}\nprovides two multi-threading models; robin\\_hood.h is a C++11 hash table\nlibrary.\n\n## Results\n\nWe provide eight k-mer counters, which are detailed below the result table. 
All\nimplementations count canonical k-mers, the lexicographically *smaller* k-mer\nbetween the k-mers on the two DNA strands.\n\nThe following table shows the timing and peak memory of different\nimplementations for counting 31-mers from 2.5 million pairs of 100bp reads\nsampled from the HiSeq *M. abscessus* 6G-0125-R dataset in [GAGE-B][gage-b].\nThey were run on a Linux server equipped with two EPYC 7301 CPUs and 512GB RAM.\n\n|Implementation                 |Limitation          |Elapsed time (s)|CPU time (s)|Peak RAM (GB)|\n|:------------------------------|:-------------------|---------------:|-----------:|------------:|\n|[kc-py1](kc-py1.py) + Python3.7|                    |           499.6|       499.5|         8.15|\n|[kc-py1](kc-py1.py) + Pypy7.3  |                    |          1220.8|      1220.8|        12.21|\n|[kc-cpp1](kc-cpp1.cpp)         |                    |           528.0|       527.9|         8.27|\n|[kc-cpp2](kc-cpp2.cpp)         |                    |           319.6|       319.6|         6.90|\n|[kc-c1](kc-c1.c)               |\u003c=32-mer            |            39.3|        38.3|         1.52|\n|[kc-c2](kc-c2.c)               |\u003c=32-mer; \u003c1024 count|           38.7|        37.9|         1.05|\n|[kc-c3](kc-c3.c)               |\u003c=32-mer; \u003c1024 count|           34.1|        38.7|         1.15|\n|[kc-c4](kc-c4.c) (2+4 threads) |\u003c=32-mer; \u003c1024 count|            7.5|        35.1|         1.27|\n|[yak-count](yak-count.c) (2+4; \u003e=2 count)|\u003c=32-mer; \u003c1024 count| 14.6|        54.8|         0.47|\n|[jellyfish2][jf] (16 threads)  |                    |            10.8|       163.9|         0.82|\n|[KMC3][KMC] (16 thr; in-mem)   |                    |             9.2|        36.2|         5.02|\n\n## Discussions\n\n* [kc-py1.py](kc-py1.py) is a basic Python3 implementation. It uses string\n  translate for fast complementing. Interestingly, pypy is much slower than\n  python3. 
Perhaps the official python3 comes with a better hash table\n  implementation. Just a guess. I often recommend pypy over python. I need to\n  be more careful about this recommendation in future.\n\n* [kc-cpp1.cpp](kc-cpp1.cpp) implements a basic counter in C++11 using STL's\n  [unordered\\_map][unordermap]. It is slower than python3. This is partly\n  because STL's hash table implementation is very inefficient. C++ does not\n  necessarily lead to a fast implementation.\n\n* [kc-cpp2.cpp](kc-cpp2.cpp) replaces `std::unordered_map` with Martin Ankerl's\n  [robin\\_hood][rhhash] hash table library, which is [among the\n  fastest][rhbench] hash table implementations. It is now faster than\n  kc-py1.py, though the performance gap is small.\n\n* [kc-c1.c](kc-c1.c) packs k-mers no longer than 32bp into 64-bit integers.\n  This dramatically improves speed and reduces the peak memory. Most practical\n  k-mer counters employ bit packing. Excluding library files, this counter has\n  fewer than 100 lines of code, not much more complex than the C++ or the python\n  implementations.\n\n* [kc-c2.c](kc-c2.c) uses an ensemble of hash tables to save 8 bits for\n  counter. This reduces the peak memory. The key advantage of using multiple\n  hash tables is to implement multithreading. See below.\n\n* [kc-c3.c](kc-c3.c) puts file reading and parsing into a separate thread. The\n  performance improvement is minor here, but it sets the stage for the next\n  multi-threaded implementation.\n\n* [kc-c4.c](kc-c4.c) is the fastest counter in this series. Due to the use of\n  an ensemble of hash tables in kc-c2, we can parallelize the insertion of a\n  batch of k-mers. It is much faster than the previous versions. Notably, kc-c4\n  also uses less CPU time. This is probably because batching helps data\n  locality.\n\n* [yak-count.c](yak-count.c) is adapted from [yak][yak] and uses the same kc-c4\n  algorithm. 
Similar to [BFCounter][BFCnt], it optionally adds a bloom filter\n  to filter out most singleton k-mers (k-mers occurring only once in the\n  input). Yak needs to update the bloom filter, read the input twice and count\n  twice. It is slower but uses less memory. Yak-count is the most complex\n  example in this repo, but it is still short. Its code is also better\n  organized. Command line: `-b30` (bloom filter with 1 billion bits).\n\n* [jellyfish2][jf] is probably the fastest in-memory k-mer counter to date. It\n  uses less memory and is more flexible than kc-c4, but it is slower and much more\n  complex. Command line: `count -m 31 -C -s 100000000 -o /dev/null -t 16`.\n\n* [KMC3][KMC] is one of the fastest k-mer counters. It uses minimizers and\n  relies on sorting. KMC3 is run in the in-memory mode here. The disk mode is\n  as fast. KMC3 is optimized for counting much larger datasets. Although it\n  uses more RAM here, it generally uses less RAM than jellyfish2 and other\n  in-memory counters given high-coverage human data. Command line: `-k31 -t16\n  -r -fa`.\n\n## Conclusions\n\nThe k-mer counters here are fairly basic implementations only using generic\nhash tables. Nonetheless, we show better engineering can carry the basic idea a\nlong way. If you want to implement your own k-mer counter,\n[yak-count.c](yak-count.c) could be a good starting point. It is fast and\nrelatively simple. By the way, if you have an efficient and simple k-mer\ncounter (implemented in a few files), please let me know. 
I will be happy to add it to the table.\n\n[jf]: http://www.genome.umd.edu/jellyfish.html\n[unordermap]: http://www.cplusplus.com/reference/unordered_map/unordered_map/\n[rhhash]: https://github.com/martinus/robin-hood-hashing\n[rhbench]: https://martin.ankerl.com/2019/04/01/hashmap-benchmarks-01-overview/\n[gage-b]: https://ccb.jhu.edu/gage_b/datasets/index.html\n[yak]: https://github.com/lh3/yak\n[BFCnt]: https://github.com/pmelsted/BFCounter\n[KMC]: https://github.com/refresh-bio/KMC\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flh3%2Fkmer-cnt","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Flh3%2Fkmer-cnt","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flh3%2Fkmer-cnt/lists"}