{"id":13639159,"url":"https://github.com/shenwei356/unikmer","last_synced_at":"2025-12-30T01:02:42.470Z","repository":{"id":57547370,"uuid":"144173371","full_name":"shenwei356/unikmer","owner":"shenwei356","description":"A versatile toolkit for k-mers with taxonomic information","archived":false,"fork":false,"pushed_at":"2024-08-06T07:39:45.000Z","size":6478,"stargazers_count":73,"open_issues_count":4,"forks_count":7,"subscribers_count":3,"default_branch":"master","last_synced_at":"2024-08-06T11:21:27.434Z","etag":null,"topics":["difference","golang","intersection","k-mer","kmer","set","unik","union","unique"],"latest_commit_sha":null,"homepage":"https://bioinf.shenwei.me/unikmer","language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/shenwei356.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-08-09T15:44:35.000Z","updated_at":"2024-08-06T07:39:48.000Z","dependencies_parsed_at":"2024-06-19T01:29:07.358Z","dependency_job_id":"85e02ef5-652b-4146-a8e9-f62721fea934","html_url":"https://github.com/shenwei356/unikmer","commit_stats":null,"previous_names":[],"tags_count":42,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/shenwei356%2Funikmer","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/shenwei356%2Funikmer/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/shenwei356%2Funikmer/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/shenwei356%2Funikmer/manifes
ts","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/shenwei356","download_url":"https://codeload.github.com/shenwei356/unikmer/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":223810279,"owners_count":17206728,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["difference","golang","intersection","k-mer","kmer","set","unik","union","unique"],"created_at":"2024-08-02T01:00:58.256Z","updated_at":"2025-12-30T01:02:42.436Z","avatar_url":"https://github.com/shenwei356.png","language":"Go","funding_links":[],"categories":["Sequence Analysis and Manipulation"],"sub_categories":[],"readme":"# unikmer: a versatile toolkit for k-mers with taxonomic information\n\nDocumentation: https://bioinf.shenwei.me/unikmer/\n\n`unikmer` is a toolkit for nucleic acid [k-mer](https://en.wikipedia.org/wiki/K-mer) analysis, \nproviding functions\nincluding set operations on k-mers (or k-mer sketches), optionally with\nTaxIds but without count information.\n\nK-mers are either encoded (k\u003c=32) or hashed ([k\u003c=64, using ntHash v1](https://github.com/bcgsc/ntHash/issues/41)) into `uint64`,\nand serialized in binary files with the extension `.unik`.\n\nTaxIds can be assigned when counting k-mers from genome sequences,\nand the LCA (Lowest Common Ancestor) is computed during set operations\nincluding computing union, intersection, set difference, unique and\nrepeated k-mers.\n\nRelated projects:\n\n- [kmers](https://github.com/shenwei356/kmers) provides bit-packed k-mer methods for this tool.\n- [unik](https://github.com/shenwei356/unik) provides 
k-mer serialization methods for this tool.\n- [sketches](https://pkg.go.dev/github.com/shenwei356/bio/sketches) provides generators/iterators for k-mer sketches \n([Minimizer](https://academic.oup.com/bioinformatics/article/20/18/3363/202143),\n [Scaled MinHash](https://f1000research.com/articles/8-1006),\n [Closed Syncmers](https://peerj.com/articles/10805/)).\n- [taxdump](https://github.com/shenwei356/bio/tree/master/taxdump) provides querying manipulations from NCBI Taxonomy taxdump files.\n\n\n\u003c!-- START doctoc generated TOC please keep comment here to allow auto update --\u003e\n\u003c!-- DON'T EDIT THIS SECTION, INSTEAD RE-RUN doctoc TO UPDATE --\u003e\n## Table of Contents\n\n- [Using cases](#using-cases)\n- [Installation](#installation)\n- [Commands](#commands)\n- [Binary file](#binary-file)\n- [Quick start](#quick-start)\n- [Support](#support)\n- [License](#license)\n\n\u003c!-- END doctoc generated TOC please keep comment here to allow auto update --\u003e\n\n## Using cases\n\n- Finding conserved regions in all genomes of a species.\n- Finding species/strain-specific sequences for designing probes/primers.\n\n## Installation\n\n1. Downloading [executable binary files](https://github.com/shenwei356/unikmer/releases).\n\n1. Via Bioconda [![Anaconda Cloud](https://anaconda.org/bioconda/unikmer/badges/version.svg)](https://anaconda.org/bioconda/unikmer) [![downloads](https://anaconda.org/bioconda/unikmer/badges/downloads.svg)](https://anaconda.org/bioconda/unikmer)\n\n        conda install -c bioconda unikmer\n\n## Commands\n\n[Usages](https://bioinf.shenwei.me/unikmer/usage)\n\n1. Counting\n\n        count           Generate k-mers (sketch) from FASTA/Q sequences\n\n1. Information\n\n        info            Information of binary files\n        num             Quickly inspect the number of k-mers in binary files\n\n1. 
Format conversion\n\n        view            Read and output binary format to plain text\n        dump            Convert plain k-mer text to binary format\n\n        encode          Encode plain k-mer texts to integers\n        decode          Decode encoded integers to k-mer texts\n        \n\n1. Set operations\n\n        concat          Concatenate multiple binary files without removing duplicates\n        inter           Intersection of k-mers in multiple binary files\n        common          Find k-mers shared by most of the binary files\n        union           Union of k-mers in multiple binary files\n        diff            Set difference of k-mers in multiple binary files\n\n1. Split and merge\n\n        sort            Sort k-mers to reduce the file size and accelerate downstream analysis\n        split           Split k-mers into sorted chunk files\n        tsplit          Split k-mers according to TaxId\n        merge           Merge k-mers from sorted chunk files\n\n1. Subset\n\n        head            Extract the first N k-mers\n        sample          Sample k-mers from binary files\n        grep            Search k-mers from binary files\n        filter          Filter out low-complexity k-mers\n        rfilter         Filter k-mers by taxonomic rank\n\n1. Searching on genomes\n\n        locate          Locate k-mers in genome\n        map             Mapping k-mers back to the genome and extract successive regions/subsequences\n\n1. 
Misc\n\n        autocompletion  Generate shell autocompletion script\n        version         Print version information and check for update\n\n## Binary file\n\n[![Go Reference](https://pkg.go.dev/badge/github.com/shenwei356/unik.svg)](https://pkg.go.dev/github.com/shenwei356/unik)\n\nK-mers (represented as `uint64` in RAM) are serialized in 8-Byte\n(or fewer Bytes for shorter k-mers in compact format,\nor far fewer Bytes for sorted k-mers) arrays and\noptionally compressed in gzip format, with the extension `.unik`.\nTaxIds are optionally stored next to k-mers in 4 or fewer Bytes.\n\n### Compression ratio comparison\n\nNo TaxIds are stored in this test.\n\n![cr.jpg](testdata/cr.jpg)\n\nlabel           |encoded-kmer\u003csup\u003ea\u003c/sup\u003e|gzip-compressed\u003csup\u003eb\u003c/sup\u003e|compact-format\u003csup\u003ec\u003c/sup\u003e|sorted\u003csup\u003ed\u003c/sup\u003e|comment\n:---------------|:----------------------:|:-------------------------:|:------------------------:|:----------------:|:------------------------------------------------------\n`plain`         |                        |                           |                          |                  |plain text\n`gzip`          |                        |✔                          |                          |                  |gzipped plain text\n`unik.default`  |✔                       |✔                          |                          |                  |gzipped encoded k-mers in fixed-length byte array\n`unik.compat`   |✔                       |✔                          |✔                         |                  |gzipped encoded k-mers in shorter fixed-length byte array\n`unik.sorted`   |✔                       |✔                          |                          |✔                 |gzipped sorted encoded k-mers\n\n\n- \u003csup\u003ea\u003c/sup\u003e One k-mer is encoded as `uint64` and serialized in 8 Bytes.\n- \u003csup\u003eb\u003c/sup\u003e The k-mer file is compressed in gzip format 
by default,\n  users can switch on the global option `-C/--no-compress` to output a non-compressed file.\n- \u003csup\u003ec\u003c/sup\u003e One k-mer is encoded as `uint64` and serialized in 8 Bytes by default.\n  However, fewer Bytes are needed for short k-mers, e.g., 4 Bytes are enough for\n  15-mers (30 bits). This makes the file more compact with a smaller file size,\n  controlled by the global option `-c/--compact`.\n- \u003csup\u003ed\u003c/sup\u003e One k-mer is encoded as `uint64`, all k-mers are sorted and compressed\n  using the varint-GB algorithm.\n- In all tests, the flag `--canonical` is on when running `unikmer count`.\n\n\n## Quick Start\n\n\n    # memusg measures elapsed time and peak RAM usage: https://github.com/shenwei356/memusg\n\n\n    # counting (only keep the canonical k-mers and compact output)\n    # memusg -t unikmer count -k 23 Ecoli-IAI39.fasta.gz -o Ecoli-IAI39.fasta.gz.k23 --canonical --compact\n    $ memusg -t unikmer count -k 23 Ecoli-MG1655.fasta.gz -o Ecoli-MG1655.fasta.gz.k23 --canonical --compact\n    elapsed time: 0.897s\n    peak rss: 192.41 MB\n\n\n    # counting (only keep the canonical k-mers and sort k-mers)\n    # memusg -t unikmer count -k 23 Ecoli-IAI39.fasta.gz -o Ecoli-IAI39.fasta.gz.k23.sorted --canonical --sort\n    $ memusg -t unikmer count -k 23 Ecoli-MG1655.fasta.gz -o Ecoli-MG1655.fasta.gz.k23.sorted --canonical --sort\n    elapsed time: 1.136s\n    peak rss: 227.28 MB\n    \n    \n    # counting and assigning global TaxIds\n    $ unikmer count -k 23 -K -s Ecoli-IAI39.fasta.gz -o Ecoli-IAI39.fasta.gz.k23.sorted   -t 585057\n    $ unikmer count -k 23 -K -s Ecoli-MG1655.fasta.gz -o Ecoli-MG1655.fasta.gz.k23.sorted -t 511145\n    $ unikmer count -k 23 -K -s A.muciniphila-ATCC_BAA-835.fasta.gz -o A.muciniphila-ATCC_BAA-835.fasta.gz.sorted -t 349741\n    \n    # counting minimizers and outputting in linear order\n    $ unikmer count -k 23 -W 5 -H -K -l A.muciniphila-ATCC_BAA-835.fasta.gz -o A.muciniphila-ATCC_BAA-835.fasta.gz.m\n\n    # view\n    $ 
unikmer view Ecoli-MG1655.fasta.gz.k23.sorted.unik --show-taxid | head -n 3\n    AAAAAAAAACCATCCAAATCTGG 511145\n    AAAAAAAAACCGCTAGTATATTC 511145\n    AAAAAAAAACCTGAAAAAAACGG 511145\n    \n    # view (hashed k-mers needs original FASTA/Q file)\n    $ unikmer view --show-code --genome A.muciniphila-ATCC_BAA-835.fasta.gz A.muciniphila-ATCC_BAA-835.fasta.gz.m.unik | head -n 3\n    CATCCGCCATCTTTGGGGTGTCG 1210726578792\n    AGCGCAAAATCCCCAAACATGTA 2286899379883\n    AACTGATTTTTGATGATGACTCC 3542156397282\n    \n    # find the positions of k-mers\n    $ unikmer locate -g A.muciniphila-ATCC_BAA-835.fasta.gz A.muciniphila-ATCC_BAA-835.fasta.gz.m.unik | head -n 5\n    NC_010655.1     2       25      ATCTTATAAAATAACCACATAAC 0       .\n    NC_010655.1     5       28      TTATAAAATAACCACATAACTTA 0       .\n    NC_010655.1     6       29      TATAAAATAACCACATAACTTAA 0       .\n    NC_010655.1     9       32      AAAATAACCACATAACTTAAAAA 0       .\n    NC_010655.1     13      36      TAACCACATAACTTAAAAAGAAT 0       .\n\n    # info\n    $ unikmer info *.unik -a -j 10\n    file                                              k  canonical  hashed  scaled  include-taxid  global-taxid  sorted  compact  gzipped  version     number  description\n    A.muciniphila-ATCC_BAA-835.fasta.gz.m.unik       23  ✓          ✓       ✕       ✕                            ✕       ✕        ✓        v5.0       860,900             \n    A.muciniphila-ATCC_BAA-835.fasta.gz.sorted.unik  23  ✓          ✕       ✕       ✕                    349741  ✓       ✕        ✓        v5.0     2,630,905             \n    Ecoli-IAI39.fasta.gz.k23.sorted.unik             23  ✓          ✕       ✕       ✕                    585057  ✓       ✕        ✓        v5.0     4,902,266             \n    Ecoli-IAI39.fasta.gz.k23.unik                    23  ✓          ✕       ✕       ✕                            ✕       ✓        ✓        v5.0     4,902,266             \n    Ecoli-MG1655.fasta.gz.k23.sorted.unik            23  ✓          
✕       ✕       ✕                    511145  ✓       ✕        ✓        v5.0     4,546,632             \n    Ecoli-MG1655.fasta.gz.k23.unik                   23  ✓          ✕       ✕       ✕                            ✕       ✓        ✓        v5.0     4,546,632             \n    \n    \n    # concat\n    $ memusg -t unikmer concat *.k23.sorted.unik -o concat.k23 -c\n    elapsed time: 1.020s\n    peak rss: 25.86 MB\n\n\n    \n    # union\n    $ memusg -t unikmer union *.k23.sorted.unik -o union.k23 -s\n    elapsed time: 3.991s\n    peak rss: 590.92 MB\n    \n    \n    # or sorting with limited memory.\n    # note that taxonomy database need some memory.\n    $ memusg -t unikmer sort *.k23.sorted.unik -o union2.k23 -u -m 1M\n    elapsed time: 3.538s\n    peak rss: 324.2 MB\n    \n    $ unikmer view -t union.k23.unik | md5sum \n    4c038832209278840d4d75944b29219c  -\n    $ unikmer view -t union2.k23.unik | md5sum \n    4c038832209278840d4d75944b29219c  -\n    \n    \n    # duplicate k-mers\n    # memusg -t unikmer sort *.k23.sorted.unik -o dup.k23 -d -m 1M # limit memory usage\n    $ memusg -t unikmer sort *.k23.sorted.unik -o dup.k23 -d\n    elapsed time: 1.143s\n    peak rss: 240.18 MB\n\n    \n    # intersection\n    $ memusg -t unikmer inter *.k23.sorted.unik -o inter.k23\n    elapsed time: 1.481s\n    peak rss: 399.94 MB\n    \n\n    # difference\n    $ memusg -t unikmer diff -j 10 *.k23.sorted.unik -o diff.k23 -s\n    elapsed time: 0.793s\n    peak rss: 338.06 MB\n\n\n    $ ls -lh *.unik\n    -rw-r--r-- 1 shenwei shenwei 6.6M Sep  9 17:24 A.muciniphila-ATCC_BAA-835.fasta.gz.m.unik\n    -rw-r--r-- 1 shenwei shenwei 9.5M Sep  9 17:24 A.muciniphila-ATCC_BAA-835.fasta.gz.sorted.unik\n    -rw-r--r-- 1 shenwei shenwei  46M Sep  9 17:25 concat.k23.unik\n    -rw-r--r-- 1 shenwei shenwei 9.2M Sep  9 17:27 diff.k23.unik\n    -rw-r--r-- 1 shenwei shenwei  11M Sep  9 17:26 dup.k23.unik\n    -rw-r--r-- 1 shenwei shenwei  18M Sep  9 17:23 
Ecoli-IAI39.fasta.gz.k23.sorted.unik\n    -rw-r--r-- 1 shenwei shenwei  29M Sep  9 17:24 Ecoli-IAI39.fasta.gz.k23.unik\n    -rw-r--r-- 1 shenwei shenwei  17M Sep  9 17:23 Ecoli-MG1655.fasta.gz.k23.sorted.unik\n    -rw-r--r-- 1 shenwei shenwei  27M Sep  9 17:25 Ecoli-MG1655.fasta.gz.k23.unik\n    -rw-r--r-- 1 shenwei shenwei  11M Sep  9 17:27 inter.k23.unik\n    -rw-r--r-- 1 shenwei shenwei  26M Sep  9 17:26 union2.k23.unik\n    -rw-r--r-- 1 shenwei shenwei  26M Sep  9 17:25 union.k23.unik\n\n    $ unikmer stats *.unik -a -j 10\n    file                                              k  canonical  hashed  scaled  include-taxid  global-taxid  sorted  compact  gzipped  version     number  description\n    A.muciniphila-ATCC_BAA-835.fasta.gz.m.unik       23  ✓          ✓       ✕       ✕                            ✕       ✕        ✓        v5.0       860,900             \n    A.muciniphila-ATCC_BAA-835.fasta.gz.sorted.unik  23  ✓          ✕       ✕       ✕                    349741  ✓       ✕        ✓        v5.0     2,630,905             \n    concat.k23.unik                                  23  ✓          ✕       ✕       ✓                            ✕       ✓        ✓        v5.0            -1             \n    diff.k23.unik                                    23  ✓          ✕       ✕       ✓                            ✓       ✕        ✓        v5.0     2,326,096             \n    dup.k23.unik                                     23  ✓          ✕       ✕       ✓                            ✓       ✕        ✓        v5.0     2,576,170             \n    Ecoli-IAI39.fasta.gz.k23.sorted.unik             23  ✓          ✕       ✕       ✕                    585057  ✓       ✕        ✓        v5.0     4,902,266             \n    Ecoli-IAI39.fasta.gz.k23.unik                    23  ✓          ✕       ✕       ✕                            ✕       ✓        ✓        v5.0     4,902,266             \n    Ecoli-MG1655.fasta.gz.k23.sorted.unik            23  ✓          ✕       ✕       ✕     
               511145  ✓       ✕        ✓        v5.0     4,546,632             \n    Ecoli-MG1655.fasta.gz.k23.unik                   23  ✓          ✕       ✕       ✕                            ✕       ✓        ✓        v5.0     4,546,632             \n    inter.k23.unik                                   23  ✓          ✕       ✕       ✓                            ✓       ✕        ✓        v5.0     2,576,170             \n    union2.k23.unik                                  23  ✓          ✕       ✕       ✓                            ✓       ✕        ✓        v5.0     6,872,728             \n    union.k23.unik                                   23  ✓          ✕       ✕       ✓                            ✓       ✕        ✓        v5.0     6,872,728\n\n    # -----------------------------------------------------------------------------------------\n\n    # mapping k-mers to genome\n    seqkit seq Ecoli-IAI39.fasta.gz -o Ecoli-IAI39.fasta\n    g=Ecoli-IAI39.fasta\n    f=inter.k23.unik\n    # mapping k-mers back to the genome and extract successive regions/subsequences\n    unikmer map -g $g $f -a | more\n    \n    \n    # using bwa\n    # to fasta\n    unikmer view $f -a -o $f.fa.gz\n    # make index\n    bwa index $g; samtools faidx $g\n    ncpu=12\n    ls $f.fa.gz \\\n        | rush -j 1 -v ref=$g -v j=$ncpu \\\n            'bwa aln -o 0 -l 17 -k 0 -t {j} {ref} {} \\\n                | bwa samse {ref} - {} \\\n                | samtools view -bS \u003e {}.bam; \\\n             samtools sort -T {}.tmp -@ {j} {}.bam -o {}.sorted.bam; \\\n             samtools index {}.sorted.bam; \\\n             samtools flagstat {}.sorted.bam \u003e {}.sorted.bam.flagstat; \\\n             /bin/rm {}.bam '  \n\n## Support\n\nPlease [open an issue](https://github.com/shenwei356/unikmer/issues) to report bugs,\npropose new functions or ask for help.\n\n## License\n\n[MIT 
License](https://github.com/shenwei356/unikmer/blob/master/LICENSE)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fshenwei356%2Funikmer","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fshenwei356%2Funikmer","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fshenwei356%2Funikmer/lists"}