{"id":27430987,"url":"https://github.com/bigmlcom/sketchy","last_synced_at":"2025-04-14T15:28:13.210Z","repository":{"id":8961825,"uuid":"10701580","full_name":"bigmlcom/sketchy","owner":"bigmlcom","description":"Sketching Algorithms for Clojure (bloom filter, min-hash, hyper-loglog, count-min sketch)","archived":false,"fork":false,"pushed_at":"2023-06-07T16:10:59.000Z","size":151,"stargazers_count":146,"open_issues_count":1,"forks_count":18,"subscribers_count":22,"default_branch":"master","last_synced_at":"2024-04-16T10:59:00.878Z","etag":null,"topics":["bloom-filter","clojure","count-min-sketch","hashing","hyperloglog","minhash","sketching"],"latest_commit_sha":null,"homepage":"","language":"Clojure","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/bigmlcom.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2013-06-15T03:22:18.000Z","updated_at":"2023-06-07T16:11:04.000Z","dependencies_parsed_at":"2022-09-14T10:11:30.979Z","dependency_job_id":null,"html_url":"https://github.com/bigmlcom/sketchy","commit_stats":null,"previous_names":[],"tags_count":2,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bigmlcom%2Fsketchy","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bigmlcom%2Fsketchy/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bigmlcom%2Fsketchy/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bigmlcom%2Fsketchy/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/bigmlcom","download_url":"https://codeload.github.com/bigmlcom/sketchy/t
ar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248905862,"owners_count":21181065,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bloom-filter","clojure","count-min-sketch","hashing","hyperloglog","minhash","sketching"],"created_at":"2025-04-14T15:28:12.555Z","updated_at":"2025-04-14T15:28:13.195Z","avatar_url":"https://github.com/bigmlcom.png","language":"Clojure","readme":"\n# Sketching Algorithms in Clojure\n\n## Installation\n\n`sketchy` is available as a Maven artifact from\n[Clojars](http://clojars.org/bigml/sketchy).\n\n[![Clojars Project](https://img.shields.io/clojars/v/bigml/sketchy.svg)](https://clojars.org/bigml/sketchy)\n\n## Overview\n\nThis library contains various sketching/hash-based algorithms useful\nfor building compact summaries of large datasets.\n\nAll the sketches are composed using vanilla Clojure data\nstructures. That means immutability and easy serialization but humble\nperformance. [stream-lib](https://github.com/addthis/stream-lib) is a\ngood alternative for those in need of speed.\n\nGeneral Utilities:\n- [MurmurHash](#murmurhash)\n- [Immutable Bitset](#immutable-bitset)\n\nSketching/hash-based algorithms:\n- [Bloom Filter](#bloom-filter)\n- [Min Hash](#min-hash)\n- [Hyper-LogLog](#hyper-loglog)\n- [Count-Min](#count-min)\n\nAs we review each section, feel free to follow along in the REPL. 
Note\nthat `bigml.sketchy.test.demo` loads *Hamlet* and *A Midsummer Night's\nDream* into memory for our code examples.\n\n```clojure\nuser\u003e (ns test\n        (:use [bigml.sketchy.test.demo])\n        (:require (bigml.sketchy [murmur :as murmur]\n                                 [bits :as bits]\n                                 [bloom :as bloom]\n                                 [min-hash :as min-hash]\n                                 [hyper-loglog :as hll]\n                                 [count-min :as count-min])))\n```\n\n## MurmurHash\n\nThe `bigml.sketchy.murmur` namespace makes it easy to generate seeded\n[Murmur hashes](http://en.wikipedia.org/wiki/MurmurHash).  Murmur hashes\nare popular as they are reasonably quick to produce and adequately\nrandom.\n\nThese Murmur hashes are all produced as 64 bit longs.  A simple example\nhashing the string \"foo\" to a long:\n\n```clojure\ntest\u003e (murmur/hash \"foo\")\n6231696022289519434\n```\n\nAnything that `clojure.core/hash` accepts may also be used with this\nhash fn:\n\n```clojure\ntest\u003e (murmur/hash {:foo \"bar\"})\n-7720779806311024803\n```\n\nAn optional seed parameter selects a unique hashing function. Anything\nthat's hashable by `clojure.core/hash` is valid as a seed.\n\n```clojure\ntest\u003e (murmur/hash \"foo\" 0)\n6231696022289519434\ntest\u003e (murmur/hash \"foo\" 42)\n-8820575662888368925\ntest\u003e (murmur/hash \"foo\" :bar)\n-8527955061573093315\n```\n\nThe `truncate` function can be used to truncate the number of bits\n(must be less than 64 and more than 0).\n\n```clojure\ntest\u003e (murmur/truncate (murmur/hash \"foo\") 32)\n3972535114\ntest\u003e (murmur/truncate (murmur/hash \"foo\") 16)\n4938\ntest\u003e (murmur/truncate (murmur/hash \"foo\") 8)\n74\n```\n\nIf you need multiple unique hashes for a value, `hash-seq` is a\nconvenience function for that.  
It applies an infinite sequence of\nunique hash functions (always in the same order), so `take` as many\nas you need.\n\n```clojure\ntest\u003e (take 3 (murmur/hash-seq \"foo\"))\n(6231696022289519434 -1965669315023635442 -4826411765733908310)\n```\n\n## Immutable Bitset\n\nBesides being my favorite name for a namespace, `bigml.sketchy.bits`\nprovides an immutable bitset supporting bit-level operations for any\nnumber of bits. The bitset is backed by a vector of longs.\n\nThe `create` function builds a bitset given the desired number of\nbits. Every bit will be initialized as clear (all zero).\n\nThe `set` function sets the bits at the given indices. The `test`\nfunction returns true if the bit at the given index is set.\n\n```clojure\ntest\u003e (def my-bits (-\u003e (bits/create 256)\n                       (bits/set 2 48 58 184 233)))\ntest\u003e (bits/test my-bits 47)\nfalse\ntest\u003e (bits/test my-bits 48)\ntrue\n```\n\nThe `set-seq` function returns the indices of every set\nbit. Alternatively, `clear-seq` returns all the clear bits.\n\n```clojure\ntest\u003e (bits/set-seq my-bits)\n(2 48 58 184 233)\n```\n\nThe `clear` function complements `set` by clearing the bits for the\ngiven indices. Similarly, the `flip` function reverses a bit's state.\n\n```clojure\ntest\u003e (bits/set-seq (bits/clear my-bits 48))\n(2 58 184 233)\ntest\u003e (bits/set-seq (bits/flip my-bits 48))\n(2 58 184 233)\n```\n\nMoreover, the namespace offers functions to `and` and `or` two\nbitsets. You can also measure `hamming-distance`,\n`jaccard-similarity`, or `cosine-similarity`.\n\n## Bloom Filter\n\n`bigml.sketchy.bloom` contains an implementation of a [Bloom\nfilter](http://en.wikipedia.org/wiki/Bloom_filter), useful for testing\nset membership. 
When checking set membership for an item, false\npositives are possible but false negatives are not.\n\nYou may `create` a Bloom filter by providing the expected number of\nitems to be inserted into the filter and the acceptable\nfalse positive rate.\n\nAfter creating the filter, you may either `insert` individual items or\nadd an entire collection of items `into` the Bloom filter.\n\n```clojure\ntest\u003e (def hamlet-bloom\n        (reduce bloom/insert\n                (bloom/create (count hamlet-tokens) 0.01)\n                hamlet-tokens))\n\ntest\u003e (def midsummer-bloom\n        (bloom/into (bloom/create (count midsummer-tokens) 0.01)\n                    midsummer-tokens))\n```\n\nItem membership is tested with `contains?`.\n\n```clojure\ntest\u003e (bloom/contains? hamlet-bloom \"puck\")\nfalse\ntest\u003e (bloom/contains? midsummer-bloom \"puck\")\ntrue\n```\n\nThe Bloom filters are also merge friendly as long as they are\ninitialized with the same parameters.\n\n```clojure\ntest\u003e (def summerham-bloom\n        (let [total (+ (count hamlet-tokens) (count midsummer-tokens))]\n          (bloom/merge (bloom/into (bloom/create total 0.01) midsummer-tokens)\n                       (bloom/into (bloom/create total 0.01) hamlet-tokens))))\ntest\u003e (bloom/contains? summerham-bloom \"puck\")\ntrue\ntest\u003e (bloom/contains? summerham-bloom \"yorick\")\ntrue\ntest\u003e (bloom/contains? 
summerham-bloom \"henry\")\nfalse\n```\n\n## Min-Hash\n\n`bigml.sketchy.min-hash` contains an implementation of the\n[MinHash](http://en.wikipedia.org/wiki/MinHash) algorithm, useful for\ncomparing the [Jaccard\nsimilarity](http://en.wikipedia.org/wiki/Jaccard_index) of two sets.\n\nThis implementation includes the improvements recommended in\n\"[Improved Densification of One Permutation Hashing](http://arxiv.org/abs/1406.4784)\",\nwhich greatly reduces the algorithmic complexity for building a MinHash.\n\nTo `create` a MinHash, you may provide a target error rate for\nsimilarity (default is 0.05). After that, you can either `insert`\nindividual values or add collections `into` the MinHash.\n\nIn the following example we break *A Midsummer Night's Dream* into two\nhalves (`midsummer-part1` and `midsummer-part2`) and build a MinHash\nfor each. We then compare the two parts together to see if they are\nmore similar than a MinHash of *Hamlet*.\n\nAs we'd expect, the two halves of *A Midsummer Night's Dream* are more\nalike than *Hamlet*.\n\n```clojure\ntest\u003e (def hamlet-hash (min-hash/into (min-hash/create) hamlet-tokens))\ntest\u003e (def midsummer1-hash (min-hash/into (min-hash/create) midsummer-part1))\ntest\u003e (def midsummer2-hash (min-hash/into (min-hash/create) midsummer-part2))\ntest\u003e (min-hash/jaccard-similarity midsummer1-hash midsummer2-hash)\n0.2852\ntest\u003e (min-hash/jaccard-similarity midsummer1-hash hamlet-hash)\n0.2012\n```\n\nThe MinHashes are merge friendly as long as they're initialized with\nthe same target error rate.\n\n```clojure\ntest\u003e (def midsummer-hash (min-hash/into (min-hash/create) midsummer-tokens))\ntest\u003e (min-hash/jaccard-similarity midsummer-hash\n                                   (min-hash/merge midsummer1-hash\n                                                   midsummer2-hash))\n1.0\n```\n\n## Hyper-LogLog\n\n`bigml.sketchy.hyper-loglog` contains an implementation of 
the\n[HyperLogLog](http://research.google.com/pubs/pub40671.html) sketch,\nuseful for estimating the number of distinct items in a set. This is a\ntechnique popular for tracking unique visitors over time.\n\nTo `create` a HyperLogLog sketch, you may provide a target error rate\nfor distinct item estimation (default is 0.05). After that, you can\neither `insert` individual values or add collections `into` the\nsketch.\n\n```clojure\ntest\u003e (def hamlet-hll (hll/into (hll/create 0.01) hamlet-tokens))\ntest\u003e (def midsummer-hll (hll/into (hll/create 0.01) midsummer-tokens))\ntest\u003e (count (distinct hamlet-tokens)) ;; actual\n4793\ntest\u003e (hll/distinct-count hamlet-hll)  ;; estimated\n4868\ntest\u003e (count (distinct midsummer-tokens)) ;; actual\n3034\ntest\u003e (hll/distinct-count midsummer-hll) ;; estimated\n3018\n```\n\nHyperLogLog sketches may be merged if they're initialized with the\nsame error rate.\n\n```clojure\ntest\u003e (count (distinct (concat hamlet-tokens midsummer-tokens))) ;; actual\n6275\ntest\u003e (hll/distinct-count (hll/merge hamlet-hll midsummer-hll)) ;; estimated\n6312\n```\n\nSimilar to MinHash, HyperLogLog sketches can also provide an estimate\nof the [Jaccard\nsimilarity](http://en.wikipedia.org/wiki/Jaccard_index) between two\nsets.\n\n```clojure\ntest\u003e (def midsummer1-hll (hll/into (hll/create 0.01) midsummer-part1))\ntest\u003e (def midsummer2-hll (hll/into (hll/create 0.01) midsummer-part2))\ntest\u003e (hll/jaccard-similarity midsummer1-hll midsummer2-hll)\n0.2833001988071571\ntest\u003e (hll/jaccard-similarity midsummer1-hll hamlet-hll)\n0.201231310466139\n```\n\n## Count-Min\n\n`bigml.sketchy.count-min` provides an implementation of the [Count-Min\nsketch](http://en.wikipedia.org/wiki/Count-Min_sketch), useful for\nestimating frequencies of arbitrary items in a stream.\n\nTo `create` a count-min sketch you may define the desired number of\nhash-bits and the number of independent hash functions.  
The total\nnumber of counters maintained by the sketch will be\n(2^hash-bits)*hashers, so choose these values carefully.\n\nAfter creating a sketch, you may either `insert` individual values or\nadd collections `into` the sketch.\n\nIn the example below we build a Count-Min sketch that uses 1500\ncounters to estimate frequencies for the 4800 unique tokens in\n*Hamlet*.\n\n```clojure\ntest\u003e (def hamlet-cm (count-min/into (count-min/create :hash-bits 9)\n                                     hamlet-tokens))\ntest\u003e (count (:counters hamlet-cm))\n1536\ntest\u003e ((frequencies hamlet-tokens) \"hamlet\")\n77\ntest\u003e (count-min/estimate-count hamlet-cm \"hamlet\")\n87\ntest\u003e ((frequencies hamlet-tokens) \"rosencrantz\")\n7\ntest\u003e (count-min/estimate-count hamlet-cm \"rosencrantz\")\n15\n```\n\nAs with the other sketching algorithms, Count-Min sketches may be\nmerged if they're initialized with the same parameters.\n\n```clojure\ntest\u003e (def midsummer1-cm (count-min/into (count-min/create :hash-bits 9)\n                                         midsummer-part1))\ntest\u003e (def midsummer2-cm (count-min/into (count-min/create :hash-bits 9)\n                                         midsummer-part2))\ntest\u003e ((frequencies midsummer-tokens) \"love\") ;; actual count\n98\ntest\u003e (count-min/estimate-count (count-min/merge midsummer1-cm midsummer2-cm)\n                                \"love\")\n104\n```\n\n## Contributing to this project\n\nSee doc/contributing.md\n\n## License\n\nCopyright (C) 2013 BigML Inc.\n\nDistributed under the Apache License, Version 2.0.\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbigmlcom%2Fsketchy","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbigmlcom%2Fsketchy","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbigmlcom%2Fsketchy/lists"}