{"id":24143851,"url":"https://github.com/dnbaker/dashing","last_synced_at":"2025-07-31T09:09:41.525Z","repository":{"id":31422703,"uuid":"127836686","full_name":"dnbaker/dashing","owner":"dnbaker","description":"Fast and accurate genomic distances using HyperLogLog","archived":false,"fork":false,"pushed_at":"2023-01-19T19:55:29.000Z","size":920013,"stargazers_count":160,"open_issues_count":26,"forks_count":12,"subscribers_count":11,"default_branch":"main","last_synced_at":"2025-04-02T13:48:57.333Z","etag":null,"topics":["hyperloglog","indexing","metagenomics","sketch-data-structures"],"latest_commit_sha":null,"homepage":"","language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/dnbaker.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2018-04-03T02:02:39.000Z","updated_at":"2024-07-29T23:47:50.000Z","dependencies_parsed_at":"2023-02-11T20:00:37.597Z","dependency_job_id":null,"html_url":"https://github.com/dnbaker/dashing","commit_stats":null,"previous_names":[],"tags_count":40,"template":false,"template_full_name":null,"purl":"pkg:github/dnbaker/dashing","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dnbaker%2Fdashing","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dnbaker%2Fdashing/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dnbaker%2Fdashing/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dnbaker%2Fdashing/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/dnbaker","download_url":"https://codeload.github.com/dnbaker/dashing/tar.gz/refs/heads/main","sbom_url"
:"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dnbaker%2Fdashing/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":268016880,"owners_count":24181656,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-07-31T02:00:08.723Z","response_time":66,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["hyperloglog","indexing","metagenomics","sketch-data-structures"],"created_at":"2025-01-12T05:45:43.057Z","updated_at":"2025-07-31T09:09:40.849Z","avatar_url":"https://github.com/dnbaker.png","language":"C++","readme":"# Dashing 🕺 [![Build Status](https://travis-ci.com/dnbaker/dashing.svg?branch=main)](https://travis-ci.com/dnbaker/dashing) [![install with bioconda](https://img.shields.io/badge/install%20with-bioconda-brightgreen.svg?style=flat)](http://bioconda.github.io/recipes/dashing/README.html)\n\ndashing sketches and computes distances between fasta and fastq data.\n\nOur paper is available [here](https://www.biorxiv.org/content/10.1101/501726v2) as a preprint and [here](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1875-0) at Genome Biology.\n\n# Use\n\nThe easiest way to use dashing is to download a binary release. These are located on the [release page of the `dashing-binaries` repo](https://github.com/dnbaker/dashing-binaries/tags).  
These archives contain three versions per executable, called `dashing_s128`, `dashing_s256`, and `dashing_s512`.  These work, respectively, on systems supporting SSE2, AVX2, and AVX512BW instructions.  If a binary with a higher number after the `s` fails to work on your system, try one with a lower number.  Note also that the binaries are gzipped, so you may need to `gunzip *` in the appropriate binary directory first.\n\n# Build\nClone this repository recursively, and use make.\n\n```bash\ngit clone --recursive https://github.com/dnbaker/dashing\ncd dashing \u0026\u0026 make dashing\n```\n\nIf you clone without submodules, this can be corrected by `git submodule update --init --recursive`.\n\nDashing is written in C++14, which means that it requires a relatively new compiler.\nDashing is tested under gcc`{5.4-9}`, but fails for gcc4, which is installed by default on many machines.\nFor OSX, we recommend using Homebrew to install gcc-8.\nOn Linux, we recommend package managers. (For instance, our Travis-CI Ubuntu example upgrades to a sufficiently new GCC using `sudo update-alternatives`.)\n\n# Usage\n\nTo see all usage options, use `./dashing \u003csubcommand\u003e`, for subcommand in `[sketch, dist, hll, union, printmat]`.\nOf most interest is probably the dist command, which can take either genomes or pre-built sketches as arguments.\n\n## dist\nFor the simplest case of unspaced, unminimized kmers for a set of genomes with `k = 31` and 13 threads:\n\n```\ndashing dist -k31 -p13 -Odistance_matrix.txt -osize_estimates.txt genome1.fna.gz genome2.fna genome3.fasta \u003c...\u003e\n```\n\nThe genomes can be omitted as positional arguments if `-F genome_paths.txt` is provided, where `genome_paths.txt` is a file containing a path to a genome per line.\nThis can avoid system limits on the number of arguments in a shell command.\n\nThese can be cached with `-c`, which saves the sketches for later use. 
These sketch filenames are based on spacing, kmer size, and sketch size, so there is no risk of overwriting each other.\n\n### dist (asymmetric mode)\n\n`dashing dist` performs all pairwise Jaccard index estimates by default. By providing the `-Q` flag, dashing performs a core\ncomparison operation between all queries and all references, where references are provided by `-F`.\n\nThis mode is required to compute containment.\n\nFor example:\n\n```\ndashing dist --containment-index -k21 -Odistmat.txt -ofsizes.txt -Q query_paths.txt -F ref_paths.txt\n```\n\nTo generate a full, asymmetric distance matrix, provide the same path to `-F` and `-Q`.\n\n\n\n## sketch\nThe sketch command largely mirrors dist, except that only sketches are computed.\n\n```\ndashing sketch -k31 -p13 -F genome_paths.txt\n```\n\n## hll\nThe hll command simply estimates the number of unique elements in a set of files. This can be useful for estimating downstream database sizes based on spacing schemes.\n\n## union\nThe union command takes a set of pre-sketched HLLs and performs unions between them. Currently, the sketches must be of the same size.\nWe may modify this in future releases to allow a merger of different sizes by flattening larger sketches to the smallest sketch size discovered.\nThis would involve a loss of precision from the larger models.\nThe union command currently doesn't support data structures besides HLLs, but we plan to make this change at a later date.\n\n\n## Features\n\nTransparently consumes uncompressed, zlib- or zstd-compressed files.\n\nCaching of sketches to disk (in compressed form)\n\nCalculation of a variety of (dis)similarity measures:\n1. Jaccard Similarity\n2. Mash distance\n3. Containment index \n4. Containment distance (log transformed containment index)\n5. Symmetric Containment Index (`\\frac{|A \\bigcap B|}{\\min{|A|,|B|}}`) (the maximum of the two containment indices)\n6. Symmetric Containment Distance (log transformed SCI)\n7. 
Intersection size\n\nAdditionally, supports all the above under the weighted/multiset Jaccard index via labeled w-shingling. (See Broder, 1997 \"On the Resemblance and Containment of Documents\" for more details.)\n\n#### Filtering\nFiltering of rare k-mer events via count-min sketch point query estimates. This is primarily desirable for raw sequencing datasets rather than genome assemblies. This is enabled with `-y/--countmin`, and the number of hashes (`--nhashes`), sketch size (`--cm-sketch-size`), and minimum count (`--min-count`) can all be controlled by command-line parameters.\n\n#### Encoding options\n1. Exact k-mer encoding (`k \u003c= 32`)\n2. Rolling hash encoding for any k\n3. Spaced seed encoding (Hamming weight \u003c= 32)\n4. Windowed/minimized k-mers\n\nSee [bonsai](https://github.com/dnbaker/bonsai) for more details on encoding.\n\n#### Output formats\n\nDashing defaults to upper triangular TSV matrix emission, but it also supports upper triangular PHYLIP format, packed binary encoding, and top k-nearest-neighbor emission formats.\n\nThese formats are supported for all symmetric measures (Mash distance, Jaccard, Intersection size, and their multiset equivalents), whereas asymmetric measures and nearest neighbor forms (all variations of containment) have two emission options: tabular and binary.\n\n\n## Alternative Data Structures\n\nDashing supports comparisons with a variety of data structures, which have speed and accuracy tradeoffs for given situations.\nHyperLogLog sketches are used by default; b-bit minhashing, bottom-k minhashing, bloom filters, and hash sets are also supported. 
\nUsing hash sets provides a ground truth at the expense of greatly increased runtime costs.\n\n```\nb-bit minhashing:             --use-bb-minhash\n\nbottom-k minhashing:          --use-range-minhash\n\nhash sets:                    --use-full-khash-sets\n\nbloom filters:                --use-bloom-filter\n\nWide HLL:                     --use-wide-hll\n```\n\nReferences:\n[SuperMinHash](https://arxiv.org/abs/1706.05698), modified. (Use 32-bit register instead of float between 0 and 1 to make use of more information.)\n[Bloom Filter Jaccard Index](https://www.ncbi.nlm.nih.gov/pubmed/17444629)\n\n\n## To Cite:\n\n```tex\n@Article{pmid31801633,\n   Author=\"Baker, D. N.  and Langmead, B. \",\n   Title=\"{{D}ashing: fast and accurate genomic distances with {H}yper{L}og{L}og}\",\n   Journal=\"Genome Biol.\",\n   Year=\"2019\",\n   Volume=\"20\",\n   Number=\"1\",\n   Pages=\"265\",\n   Month=\"12\"\n}\n```\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdnbaker%2Fdashing","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdnbaker%2Fdashing","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdnbaker%2Fdashing/lists"}