{"id":19558495,"url":"https://github.com/mach-kernel/clj-rabin","last_synced_at":"2025-02-26T08:19:03.129Z","repository":{"id":209559242,"uuid":"724293261","full_name":"mach-kernel/clj-rabin","owner":"mach-kernel","description":null,"archived":false,"fork":false,"pushed_at":"2024-07-24T19:42:05.000Z","size":52,"stargazers_count":0,"open_issues_count":1,"forks_count":1,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-01-08T22:05:23.110Z","etag":null,"topics":["clojure","content-defined-chunking","rabin-karp"],"latest_commit_sha":null,"homepage":"","language":"Clojure","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/mach-kernel.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-11-27T19:38:18.000Z","updated_at":"2024-07-24T15:04:16.000Z","dependencies_parsed_at":"2023-12-03T00:23:42.262Z","dependency_job_id":"f0769a83-1707-4469-9137-7bf63b4d1466","html_url":"https://github.com/mach-kernel/clj-rabin","commit_stats":null,"previous_names":["mach-kernel/clj-rabin"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mach-kernel%2Fclj-rabin","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mach-kernel%2Fclj-rabin/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mach-kernel%2Fclj-rabin/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mach-kernel%2Fclj-rabin/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/mach-kernel","download_url":"https://codeload.github.com/mach-kernel/clj-rabin/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":240815049,"owners_count":19861993,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["clojure","content-defined-chunking","rabin-karp"],"created_at":"2024-11-11T04:47:18.267Z","updated_at":"2025-02-26T08:19:03.093Z","avatar_url":"https://github.com/mach-kernel.png","language":"Clojure","funding_links":[],"categories":[],"sub_categories":[],"readme":"# clj-rabin\n\nA Clojure implementation of a rolling Rabin hash + chunker.\n\n#### Hacking\n\nBring up a REPL:\n```\nlein with-profile +dev \n```\n\nUse the CDC notebook to try chunking a dataset:\n\n```clojure\n(comment\n  (def chunks\n    (atom nil))\n\n  (load-dataset! \"data/natural_images\" chunks)\n\n  (let [rows-\u003elong (fn [r] (into {} (map (fn [[k v]] [k (long v)])) r))\n        stats-all (chunk-ds-\u003eagg-stats @chunks)\n        stats-cdc (-\u003e @chunks (ds/unique-by-column :sha256) chunk-ds-\u003eagg-stats)\n        stats-per-file (-\u003e\u003e @chunks\n                            (rd/group-by-column-agg :file {:block-count (rd/count-distinct :sha256)})\n                            (rd/aggregate {:avg-blocks-per-file (rd/mean :block-count)}))\n\n        ; maps containing aggregate vals\n        all (first (map rows-\u003elong (ds/rows stats-all)))\n        cdc (first (map rows-\u003elong (ds/rows stats-cdc)))\n        blocks (first (map rows-\u003elong (ds/rows stats-per-file)))\n        reduced-bytes (- (:total-bytes all)\n                         (:total-bytes cdc))]\n    {:all all\n     :cdc cdc\n     :blocks blocks\n     :diff {:reduced-bytes reduced-bytes\n            :reduced-percent (-\u003e\u003e (:total-bytes all)\n                                  (/ reduced-bytes)\n                                  (* 100)\n                                  double)}}))\n```\n\n`@chunks` is a `tech.ml.dataset` with all the chunks from the data loaded. Each chunk also gets a SHA-256 hash to ensure that the block is actually unique:\n\n```clojure\n{:all {:total-bytes 359403192, :total-blocks 36628, :avg-block-size-bytes 9812},\n :cdc {:total-bytes 179670613, :total-blocks 18311, :avg-block-size-bytes 9812},\n :blocks {:avg-blocks-per-file 2},\n :diff {:reduced-bytes 179732579, :reduced-percent 50.00862068025261}}\n```\n\n##### Datasets tested\n\nRabin parameter overrides are shown, otherwise assume defaults from `clj-rabin.hash/default-ctx`.\n\n\n###### Audio\n\nOverall: terrible ratios\n\n[Million song dataset](https://www.kaggle.com/datasets/undefinenull/million-song-dataset-spotify-lastfm)\n\nAudio codec info:\n\n```\nInput #0, mp3, from 'MP3-Example/Blues/Blues-TRADWSG128F4259317.mp3':\n  Duration: 00:00:30.04, start: 0.025057, bitrate: 96 kb/s\n  Stream #0:0: Audio: mp3, 44100 Hz, stereo, fltp, 96 kb/s\n    Metadata:\n      encoder         : LAME3.99r\n    Side data:\n      replaygain: track gain - -5.900000, track peak - unknown, album gain - unknown, album peak - unknown, \n```\n\n```clojure\n{:all {:total-bytes 544824272, :total-blocks 977662, :avg-block-size-bytes 557},\n :cdc {:total-bytes 543861899, :total-blocks 43428, :avg-block-size-bytes 12523},\n :blocks {:avg-blocks-per-file 651},\n :diff {:reduced-bytes 962373, :reduced-percent 0.1766391567811795}}\n```\n\n[Indian Music Raga](https://www.kaggle.com/datasets/kcwaghmarewaghmare/indian-music-raga)\n\nAudio codec info:\n```\nInput #0, wav, from 'raga/asavari02.wav':\n  Duration: 00:03:46.82, bitrate: 705 kb/s\n  Stream #0:0: Audio: pcm_s16le ([1][0][0][0] / 0x0001), 44100 Hz, 1 channels, s16, 705 kb/s\n```\n\n```clojure\n{:all {:total-bytes 1105687312, :total-blocks 73031, :avg-block-size-bytes 15139},\n :cdc {:total-bytes 1105687295, :total-blocks 73014, :avg-block-size-bytes 15143},\n :blocks {:avg-blocks-per-file 890},\n :diff {:reduced-bytes 17, :reduced-percent 1.537505207439696E-6}}\n```\n\n###### Images\n\nOverall: awesome ratios on images\n\n[Natural images (jpeg)](https://www.kaggle.com/datasets/prasunroy/natural-images)\n\n```clojure\n{:all {:total-bytes 359403192, :total-blocks 36628, :avg-block-size-bytes 9812},\n :cdc {:total-bytes 179670613, :total-blocks 18311, :avg-block-size-bytes 9812},\n :blocks {:avg-blocks-per-file 2},\n :diff {:reduced-bytes 179732579, :reduced-percent 50.00862068025261}}\n```\n\n[Ripe and unripe tomatoes (jpeg)](https://www.kaggle.com/datasets/sumn2u/riped-and-unriped-tomato-dataset)\n\n```clojure\n{:all {:total-bytes 122864035, :total-blocks 151613, :avg-block-size-bytes 810},\n :cdc {:total-bytes 52075246, :total-blocks 29899, :avg-block-size-bytes 1741},\n :blocks {:avg-blocks-per-file 428},\n :diff {:reduced-bytes 70788789, :reduced-percent 57.6155495788495}}\n```\n\n#### Resources / Credits\n\n- [moinakg pcompress article](https://moinakg.wordpress.com/tag/rabin-fingerprint/).\n- [ncona](https://ncona.com/2017/06/the-rabin-karp-algorithm/)\n- [YADL](https://github.com/YADL/yadl/wiki/Rabin-Karp-for-Variable-Chunking)\n- [Horner's method](https://en.wikipedia.org/wiki/Horner%27s_method)\n- [how does rabin-karp choose breakpoint in variable-length chunking? (SO)](https://stackoverflow.com/questions/67101553/how-does-rabin-karp-choose-breakpoint-in-variable-length-chunking)\n  - [TTTD Paper](https://www.hpl.hp.com/techreports/2005/HPL-2005-30R1.pdf)\n- [Why is Rabin base a prime?](https://cs.stackexchange.com/a/28024)\n- [Choosing modulus in Rabin-Karp](https://cs.stackexchange.com/questions/10174/how-do-we-find-the-optimal-modulus-q-in-rabin-karp-algorithm)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmach-kernel%2Fclj-rabin","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmach-kernel%2Fclj-rabin","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmach-kernel%2Fclj-rabin/lists"}