{"id":27928639,"url":"https://github.com/drbh/quemer","last_synced_at":"2025-05-07T02:41:00.465Z","repository":{"id":291427348,"uuid":"973366108","full_name":"drbh/quemer","owner":"drbh","description":"GPU accelerated k-mer counter","archived":false,"fork":false,"pushed_at":"2025-04-26T20:40:54.000Z","size":6,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-05-04T15:50:58.811Z","etag":null,"topics":["biology","cuda","gpu"],"latest_commit_sha":null,"homepage":"","language":"Cuda","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/drbh.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-04-26T20:40:52.000Z","updated_at":"2025-04-26T20:42:01.000Z","dependencies_parsed_at":"2025-05-04T15:51:01.356Z","dependency_job_id":"e8a8cc62-fe6c-4b42-bd00-6cd48642d83f","html_url":"https://github.com/drbh/quemer","commit_stats":null,"previous_names":["drbh/quemer"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/drbh%2Fquemer","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/drbh%2Fquemer/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/drbh%2Fquemer/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/drbh%2Fquemer/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/drbh","download_url":"https://codeload.github.com/drbh/quemer/tar.gz/refs/heads/main","host":{"name":"Gi
tHub","url":"https://github.com","kind":"github","repositories_count":252802429,"owners_count":21806499,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["biology","cuda","gpu"],"created_at":"2025-05-07T02:40:59.922Z","updated_at":"2025-05-07T02:41:00.451Z","avatar_url":"https://github.com/drbh.png","language":"Cuda","funding_links":[],"categories":[],"sub_categories":[],"readme":"# quemer\n### *pronounced \"k-mer\" [keɪ-mɜr]*\n\nGPU accelerated k-mer counter.\n\nBuilt on top of the [bqtools](https://github.com/arcinstitute/bqtools) and the `.bq` file format.\n\n## 📦 Installation\n\nbuild and copy to local bin\n\n```bash\nmake \u0026\u0026 PATH=\"$HOME/.local/bin:$PATH\" cp quemer ~/.local/bin/ \n```\n\n## 📋 Usage\n\nrun to see help\n```bash\nquemer\n# ┌───────────────────────────────────────┐\n# │       quemer - K-mer Counter          │\n# └───────────────────────────────────────┘\n# Usage: quemer \u003cbq_file\u003e \u003ck\u003e\n```\n\n## Performance\n\n\n| Tool      | Dataset                  | FASTQ Size | BQ Size | Processing Time |\n| --------- | ------------------------ | ---------- | ------- | --------------- |\n| quemer    | E. coli ERR4245144 8-mer | 4.5GB      | 579M    | 1.34s           |\n| jellyfish | E. coli ERR4245144 8-mer | 4.5GB      | -       | 7.48s           |\n\n\u003e [!NOTE]\n\u003e approximately 5.6x faster than jellyfish on my dev box, please do your own benchmarking.\n\n## Example\n\nRun the following commands to download the E. coli dataset, convert it to bq format, and run k-mer analysis. \n\nCounting 8-mers in the E. 
coli dataset (8.427M records) takes about 1.327 seconds on an RTX 4090.\n\n```bash\n# Download the E. coli dataset\ncurl -L -o data/ecoli.fastq.gz https://ftp.sra.ebi.ac.uk/vol1/fastq/ERR424/004/ERR4245144/ERR4245144_1.fastq.gz\ngunzip data/ecoli.fastq.gz\n\n# Convert to bq format\nbqtools encode data/ecoli.fastq -o data/ecoli.bq --policy c -T 32\n# 8426997 records written\n\n# Run k-mer analysis\ntime quemer data/ecoli.bq 8\n# ┌───────────────────────────────────────┐\n# │       quemer - K-mer Counter          │\n# └───────────────────────────────────────┘\n# ┌───────────────────────────────────────────────────┐\n# │ GPU: NVIDIA GeForce RTX 4090 (8.9)  RAM: 23.53 GB │\n# └───────────────────────────────────────────────────┘\n# Input: data/ecoli.bq (k=8) | 8426997 records × 251 bp (506.31 MB)\n# Packing sequences... done (888.52 ms, 2380.55 Mbp/s)\n# Transferring to GPU... done (36.82 ms, 21.60 GB/s)\n# Counting 2115176240 k-mers... done (44.83 ms, 47180.43 Mbp/s)\n# Retrieving results... done (65536/65536 non-zero)\n# Writing to data/ecoli.bq.k8.fa... 
done\n\n# ┌──────────────────────────────────────────────────────┐\n# │              K-MER COUNTING PERFORMANCE              │\n# ├──────────────────────────────┬───────────┬───────────┤\n# │ Operation                    │ Time (ms) │ % Total   │\n# ├──────────────────────────────┼───────────┼───────────┤\n# │ Host memory allocation       │    210.81 │   16.54%  │\n# │ Sequence packing             │    888.53 │   69.73%  │\n# │ GPU setup \u0026 transfer         │     37.09 │    2.91%  │\n# │ Kernel execution             │     44.83 │    3.52%  │\n# │ Result retrieval             │      0.12 │    0.01%  │\n# │ Output writing               │      3.88 │    0.30%  │\n# ├──────────────────────────────┼───────────┼───────────┤\n# │ TOTAL                        │   1274.18 │  100.00%  │\n# ├──────────────────────────────┴───────────┴───────────┤\n# │                  METRICS                             │\n# ├──────────────────────────────┬───────────────────────┤\n# │ Processing rate (Mbp/s)      │              1660.03  │\n# │ Throughput (GB/s)            │                 0.42  │\n# │ k-mer size                   │                    8  │\n# │ Records processed            │              8426997  │\n# └──────────────────────────────┴───────────────────────┘\n# quemer data/ecoli.bq 8  26.85s user 0.68s system 2075% cpu 1.327 total\n```\n\n## Comparing to other tools\n\nA very unscientific comparison with `jellyfish` follows.\n\nFirst we run `jellyfish` on 16 threads to count all 8-mers in the E. coli dataset.\n\nThen we dump the `.jf` file into a human-readable FASTA file. 
\n\n```bash\ntime jellyfish count -m 8 -s 100M -t 16 data/ecoli.fastq \n# jellyfish count -m 8 -s 100M -t 16 data/ecoli.fastq  82.29s user 0.62s system 1108% cpu 7.476 total\n\ntime jellyfish dump mer_counts.jf \u003e mer_counts_dumps.fa\n# jellyfish dump mer_counts.jf \u003e mer_counts_dumps.fa  0.01s user 0.01s system 96% cpu 0.017 total\n```\n\nThe top of the file looks like:\n\n```fasta\n\u003e785758\nAAAAAAAA\n\u003e93506\nAAAAAAAC\n\u003e81435\nAAAAAAAG\n\u003e111666\nAAAAAAAT\n```\n\nand with `quemer`:\n\n```fasta\n\u003e785758\nAAAAAAAA\n\u003e94615\nAAAAAAAC\n\u003e81435\nAAAAAAAG\n\u003e111666\nAAAAAAAT\n```\n\n**Note:** the counts differ because of a structural difference: `bq` files do not allow `N` characters.\n\nAbove we set the `--policy c` flag, which replaces all `N` characters with `C` characters. That is why the counts differ, specifically the higher count for `AAAAAAAC` (N's were replaced with C's).\n\n## Requirements\n- NVIDIA GPU (RTX 4090 recommended)\n- CUDA Toolkit\n\n## References\n- [K-mer (Wikipedia)](https://en.wikipedia.org/wiki/K-mer)\n- [European Nucleotide Archive](https://www.ebi.ac.uk/ena/browser/home)\n\n## TODO\n\n- [ ] Review sequence packing and host allocation (~86% of the overall time)\n- [ ] Add more tests\n- [ ] Expand to large k-mers\n- [ ] Improve overall performance\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdrbh%2Fquemer","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdrbh%2Fquemer","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdrbh%2Fquemer/lists"}