{"id":24143813,"url":"https://github.com/dnbaker/dashing2","last_synced_at":"2025-09-19T12:32:25.386Z","repository":{"id":38442604,"uuid":"364971115","full_name":"dnbaker/dashing2","owner":"dnbaker","description":"Dashing 2 is a fast toolkit for k-mer and minimizer encoding, sketching, comparison, and indexing.","archived":false,"fork":false,"pushed_at":"2024-04-17T16:23:32.000Z","size":811,"stargazers_count":59,"open_issues_count":18,"forks_count":7,"subscribers_count":5,"default_branch":"main","last_synced_at":"2024-04-21T06:14:47.262Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/dnbaker.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null}},"created_at":"2021-05-06T16:22:38.000Z","updated_at":"2024-04-04T12:47:33.000Z","dependencies_parsed_at":"2024-04-17T17:35:30.287Z","dependency_job_id":"4b28585f-d569-427c-85be-472fe06fef52","html_url":"https://github.com/dnbaker/dashing2","commit_stats":null,"previous_names":[],"tags_count":25,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dnbaker%2Fdashing2","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dnbaker%2Fdashing2/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dnbaker%2Fdashing2/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dnbaker%2Fdashing2/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/dnbaker","download_url":"https://codeload.github.com/dnbaker/dashing
2/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":233570525,"owners_count":18695859,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-01-12T05:45:34.443Z","updated_at":"2025-09-19T12:32:20.004Z","avatar_url":"https://github.com/dnbaker.png","language":"C++","funding_links":[],"categories":[],"sub_categories":[],"readme":"## Introduction\n\nDashing2 is the second version of the Dashing sequence sketching and comparison system.\n\nThere have been several major changes, but you can still get a quick start on comparing a group of sequence collections [here](#quickstart).\nFor instructions on parsing, see [Parsing](#parsing-code) below.\n[Installation instructions](#installation) can be found below.\n\n### New Features\n\ntl;dr --\n1. Faster and more accurate distances.\n2. Weight-informed sketching.\n3. Near-linear K-nearest neighbor and thresholded neighbor graphs via LSH tables.\n4. Minimizer sequence transduction.\n5. Faster/easier installation than Dashing1\n  1. Compilation in 45s in parallel, 2 min serial\n\nInput Formats -- See [inputs](#inputs) below for details.\n 1. Fastq/Fasta -- enhanced alphabet support\n   1. Default -- ACGT\n   2. Protein encoding - 20 characters (`--protein`)\n   3. Reduced alphabets\n     1. Protein - 14, 8, and 6-character alphabets for long-distance homology.\n   4. Optional -- generating 128-bit k-mers (`--long-kmers`)\n   5. All k-mer parsing can be winnowed by setting `--window-size` to be \u003e k\n   6. 
Seeds can be spaced by providing a `--spacing` option, which sets the number of ignored characters between the used characters of each seed.\n   7. Exact multiset comparisons (`--countdict`)\n   8. Weight-aware sketching -- multiset and probability distribution\n     1. We use BagMinHash for weighted sets (`--bagminhash` or `--multiset`), and ProbMinHash for discrete probability distributions (`--prob`)\n   9. Minimizer sequence transduction\n     1. By enabling `--seq`, a sequence of minimizer values is emitted as a string.\n     2. This can be used for simple minimizer generation, or the edit distances between these minimizer sequences can be compared in downstream analysis.\n 2. Splicing data\n   1. LeafCutter splicing output files `--leafcutter`\n 3. Interval Sets\n   1. BigWig files (`--bigwig`) [Weighted Interval Set]\n   2. BED files (`--bed`)\n 4. Binary dumps\n   1. Binary files made from 64-bit identifiers\n   2. Optional: paired weights for each feature\n   3. Can be sketched as sets, weighted sets, and discrete probability distributions.\n\n\nThere's also a two-step process: we can use the LSH index to generate probable candidates, then refine them with exact distance comparisons.\nThis can be enabled with the `--refine-exact` flag when using filtered exact-set (`--set`) or exact-multiset (`--countdict`).\n\nFor sketching algorithms that are not OrderMinHash (whether the default SetSketch, BagMinHash or ProbMinHash), this means that candidates are generated with fast, compressed sketches,\nbut final results are produced using the full hash values.\n\nIf `--edit-distance` has been enabled, `--refine-exact` and `--compute-edit-distance` cause candidates to be generated by LSH table querying, with final distances computed by exact edit distance.\nThis allows one to prune the edit distance computations with LSH but still get exact results out.\n\n## Outputs\n\nOutputs\n   1. 
All-pairs symmetric (default), resulting in a compressed distance matrix of size N-choose-2 = `(N (N - 1)) / 2`.\n   2. Rectangular comparisons (`--qfile/-Q` against `--ffile/-F`) by providing query/reference sets\n   3. Filtered output\n      1. Jaccard thresholded-results (`--similarity-threshold`) and top-k results (`--topk \u003ck\u003e`)\n        2. Dashing2 builds and queries an LSH index to avoid all-pairs comparisons when generating these neighbor-graphs.\n        3. This affords us near-linear time- and space- comparisons, which we can use for clustering, indexing, and summarization.\n   4. Asymmetric all-pairs (`--asymmetric-all-pairs`), which performs c(A, B) for all-pairs, resulting in an NxN matrix.\n\nAll of these can be emitted in binary format `--binary-output` to avoid parsing/print formatting costs.\n\n1. The matrices are emitted in flat, row-major format in float32.\n   1. If human-readable, the all-pairs symmetric is emitted as PHYLIP.\n   2. All-pairs asymmetric is emitted as a flat tsv.\n   3. Panel (rectangular) results are emitted as a flat tsv.\n2. Thresholded results are emitted in CSR-format; see `dashing2 cmp --help` for more details.\n   1. Human-readable thresholded results are emitted as a tsv, with potentially varying numbers of items per line.\n   2. 
Individual entries consist of the distance/similarity and the corresponding entity.\n\n\n## Inputs\n\nDashing2 sketches input datasets, and then compares resulting sketches.\n\nThe formats supported are:\n\n• FastA and FastQ\n  - Uncompressed or gzip-compressed is transparently processed\n  - Note: this supports both nucleotide and protein sketching.\n  - Canonicalization is off by default.\n  -       For k \u003c= 64, DNA/RNA uses exact encoding, and for higher k values, a rolling hash is used.\n  -       If protein is enabled (--enable-protein), then canonicalization does not apply and is not performed.\n  - There is currently no quality filtering; to filter by quality scores, mask bases below desired quality as Ns.\n• BigWig\n  - Is not stratified by chromosome\n• BED files\n  -  By default, BED files are sketched as sets of reference base/contig pairs.\n  -  These rows can be normalized (--normalize-intervals) to treat each *interval* as having the same weight,\n     or can be treated as-is to treat each contig/reference base pair as an item to sketch.\n  - Note: BED7 files have strand information, but Dashing2 does not attempt to stratify by strand\n  -     Convert these files into artificial BED files with the strand information included to work around this for now.\n• LeafCutter output files for splicing datasets\n\n# Sketching Formats\n## Sketching Algorithms\nBy default, we do set sketching (One-Permutation MinHash or SetSketch Minhash), but we can do multiset (BagMinHash) or discrete prob set (ProbMinHash) sketching.\nBagMinhash is probably most appropriate for genome assemblies (--multiset), and ProbMinhash (--prob) is most appropriate for splicing and expression datasets, due to normalization.\n\nWe also have untested support of edit distance LSH. It seems to work, but hasn't been carefully vetted.\n\n## Counting\nCounting is necessary for both BagMinHash and ProbMinHash. 
Default behavior, which works with sets and discards quantities, does not require it.\n\nFor counting, we either use a (single-row) CountSketch to approximate weighted set comparisons or we compute exact counts using a hash table.\nCounting is exact by default.\n\nTo enable count-sketch approximated counting, use the flag --countsketch-size [size]. The larger this parameter, the closer to exact the weighted sketching will be.\n\n## Exact sketching\nIn addition to creating sketches for m-mer sets, Dashing2 can compute full m-mer sets and m-mer count dictionaries.\nThis can be enabled with --set or --countdict, respectively. This will be slower, but exact.\n\n\n#### BED sketching\n##### Normalization\nDefault normalization treats each base pair in an interval as equally weighted. For SPACE\\_SET, this doesn't matter, but for weighted sketching, it does.\n\nThat is, an interval from 0:100-200 will have twice the weight of 0:100-150. If, instead, you want to treat each interval as weighing the same as any other interval, enable --normalize-intervals.\n\n\n#### LeafCutter sketching\nSimilar to BED, except that the normalize flag causes the splicing event weight to be the fraction of reads supporting the junction rather than the absolute number of reads.\n\nI.e., 3/5 would have weight 0.6 if --normalize-intervals is enabled and weight 3 otherwise.\n\n\nSketches can be:\n    SetSketch (One-permutation minhash or SetSketch)\n    BagMinHash\n    ProbMinHash\n    OrderMinHash\n\nHowever, OrderMinHash is only available for sequences.\n\nWhen handling multiplicities, you can use exact counting, which may be slower, or you can approximate the count vectors with a single-row count-sketch\nby setting opts.cssize\\_ \u003e 0.\n\nProbMinHash is usually significantly (2-20+x) faster than BagMinHash, although multiset Jaccard may be more appropriate for some problems.\nFor instance, expression and chromatin accessibility might be better considered discrete probability distributions and 
therefore fit ProbMinHash, whereas genomic sequences better match the multiset concept and benefit from BagMinHash, which is a MinHash algorithm for weighted sets.\n\n\n#### QuickStart\n\nWe expect most usage to involve the `sketch` subcommand; to sketch and perform distances, add `--cmpout \u003cfile\u003e`, where \u003cfile\u003e is the destination file or '-' to represent stdout.\n\n**Use 1 -- Sketch \\+ All-pairs Compare Sequence Collections**\n\n```\ndashing2 sketch [options] --cmpout \u003coutfile\u003e genome1.fa genome2.fa \u003c...\u003e\n# Alternate, if filenames are in F.txt\n# dashing2 sketch [options] --cmpout \u003coutfile\u003e -F F.txt\n```\n\nFull usage is found via `dashing2 --help` and `dashing2 \u003csubcommand\u003e --help`, where  \u003csubcommand\u003e is one of the dashing2 subcommands.\n\n`dashing2 sketch` performs sketching/summarization of a set of input files, or sequence-by-sequence processing of one or more sequence files.\nIt also optionally performs comparisons and emits results to `--cmpout`.\n\nAdding `--cache` causes Dashing2 to cache sketches to disk adjacent to the input files;\n     this location can be changed with `--outprefix`.\n\n\nWe support a variety of alphabets -- DNA, Protein, and reduced amino acid alphabets for long-range homology (--protein14, --protein8, --protein6).\n\nAlternate file-types supported include BigWigs (`--bigwig`), BED (`--bed`), and LeafCutter outputs (`--leafcutter`).\n\n\nClustering: scipy.cluster.hierarchy and fastcluster.hierarchy yield fast, concise clusterings, if distances are emitted.\n\nTo perform query-set vs reference-set comparison, see `-Q/--qfile` usage -- this yields a full rectangular matrix.\n\nThis is particularly useful for asymmetric similarities, such as containment.\n\n\n**Use 2 -- Sketch \\+ Top-k NN graphs**\n\nFor many applications, the quadratic space and time complexity of pairwise comparisons is prohibitive;\nwe use locality-sensitive hashing (LSH) table to only perform 
comparisons between candidates highly probable to be nearest neighbors.\n\nTo generate a K-NN graph over a collection of sequences, with k = 250:\n\n```\ndashing2 sketch \u003ccomparison options...\u003e --cmpout \u003coutfile\u003e --topk 250 -F F.txt\n```\n\nThe output file will be a table with the names and similarities for the nearest neighbors so generated; the corresponding binary output is in Compressed Sparse Row (CSR) format.\n\nClustering: This can be used for spectral clustering or community detection algorithms such as Louvain and Leiden.\n\n**Use 3 -- Sketch \\+ Jaccard-thresholded similarity graphs**\n\nAs an alternative to `--topk [k]`, one can select a Jaccard similarity threshold below which candidate pairs are ignored.\n\n```\ndashing2 sketch \u003ccomparison options...\u003e --cmpout \u003coutfile\u003e --similarity-threshold 0.7 -F F.txt\n```\n\nThe output file will be a table with the names and similarities for the neighbors so generated; the corresponding binary output is in Compressed Sparse Row (CSR) format.\n\nClustering: This can be used for spectral clustering or community detection algorithms such as Louvain and Leiden.\n\n\n\n**Use 4 -- Protein sequence similarity search**\n\nThe feature `--parse-by-seq` allows us to sketch and compute similarities between collections of sequences in a single file;\nin particular, this is useful for sequence files of protein sequences.\n\nBy default, similarity/distance is computed by k-mer set comparison in sketch space; however, edit distance is perhaps more useful for protein sequences.\nAlso, sketch sizes for proteins should be substantially smaller than for a full genome sequence.\n\n```\ndashing2 sketch -S256 --cmpout prot.k5.table --parse-by-seq -k5 {--protein,--protein14,--protein8,--protein6} uniref50.fa\n```\n\nIf `--edit-distance` is enabled, OrderMinHash (Guillaume Marcais, et al.) 
is used to build an LSH table, and comparisons are generated between OrderMinHash sketches;\nthis is only allowed in parse-by-seq mode, as it is undefined on a collection of sequences as opposed to a single sequence.\n\n```\ndashing2 sketch --topk 25 --edit-distance --compute-edit-distance -S256 --cmpout prot.top25.k5.table --parse-by-seq -k5 {--protein,--protein14,--protein8,--protein6} uniref50.fa\n```\n\nIf either `--refine-exact` or `--compute-edit-distance` is enabled, then the final distances will be computed via edit distance.\n\nThis is particularly useful if the result is Jaccard or top-k thresholded, allowing near-linear-time top-k nearest neighbor lists.\n\n\n**Use 5: weighted sketching of feature sets**\n\nThere's also `dashing2 wsketch`, which can be used for hashing weighted sets for comparisons.\nThis is for the case where there is a set of integral identifiers to sketch.\nSee `dashing2 wsketch --help` for usage and examples.\n\n\n**Use 6: Grouping sequences by edit distance**\n\nTo use OrderMinHash, parsing must be by sequence. Further, --edit-distance tells Dashing2 to use edit distance LSH. 
--compute-edit-distance instructs dashing2 to compute edit distance between candidate neighbors instead\nof only comparing the sketches (the LSH register values themselves).\n\nUse this to generate K-Nearest Neighbor graphs for edit distance.\n\n```\ndashing2 sketch -p8 --cmpout knn.edit-distance.tbl -k7 --parse-by-seq --edit-distance --compute-edit-distance input.fasta\n```\n\n\n**Use 7: Generate KNN graph using exact K-mer distances but using LSH pre-filtering.**\n\nYou can do this for exact k-mer sets:\n\n```\ndashing2 sketch --cache-sketches -p8 -F input_sequence_set.txt -k31  -S1024 --set --topk 25 -o input_sequence_set.topk.tsv\n```\n\nOr for exact weighted k-mer sets:\n```\ndashing2 sketch --cache-sketches -p8 -F input_sequence_set.txt -k31  -S1024 --countdict --topk 25 -o input_sequence_set.topk.tsv\n```\n\nFor these approaches, Dashing2 will use a bottom-k LSH index to generate candidates.\n\nYou can instead generate a list of neighbors filtered by similarity threshold rather than choosing its top-k nearest neighbors.\n\nSimply replace `--topk \u003cint\u003e` with `--similarity-threshold \u003cfloat\u003e`.\n\n### Installation\n\nThe easiest way to get started is to download a statically-linked binary in the [dashing2-binaries](https://github.com/dnbaker/dashing2-binaries) repo.\nThere are different folders for osx and linux, so select the latest version from the folder corresponding to your operating system.\nDashing2 has not been tested on Windows, and we do not support it.\n\n\nAlternatively, you can build from source with  `git clone --recursive https://github.com/dnbaker/dashing2 \u0026\u0026 cd dashing2 \u0026\u0026 make -j4`.\n\nDashing2 now requires C++20, and therefore needs a relatively recent compiler, but the binary will be smaller than the statically-linked options provided\nand the code may be more directly tailored to your architecture.\n\n## Versions + Configuration\n\n1. 
More than 2^32 items -\n`dashing2` uses 32-bit hash identifiers in LSH tables for speed and memory efficiency.\nTo index more than roughly 4.3 billion (2^32) items, use `dashing2-64`, which switches to 64-bit identifiers and hashes.\n\nThe default version of Dashing2 is dashing2, which uses 32-bit LSH keys and ID types in its NN tables;\nthis is faster and more memory-efficient, but less specific.\n\n2. Hardware cache size\nWhen comparing sketches, computations are grouped for better cache efficiency.\nA group size is selected to fit as many sketches as possible in cache.\nThe default cache size estimate is 4MB. To change this, set the `D2_CACHE_SIZE` environment variable.\n\n```sh\n# set cache size to 64 MB\nexport D2_CACHE_SIZE=67108864\n```\n\nAs an aside, these sketches are stored contiguously to reduce fragmentation compared to Dashing1.\n\n\n## Parsing code\n\nFor parsing binary output (--binary-output/--emit-binary), we provide [Python code](https://github.com/dnbaker/dashing2/blob/main/python/parse.py) for parsing into NumPy/SciPy matrices.\nThis can save on processing time, and it avoids differences due to formatting/parsing loss.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdnbaker%2Fdashing2","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdnbaker%2Fdashing2","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdnbaker%2Fdashing2/lists"}