{"id":24435287,"url":"https://github.com/bigmlcom/histogram","last_synced_at":"2025-04-04T06:06:52.088Z","repository":{"id":62431301,"uuid":"2038969","full_name":"bigmlcom/histogram","owner":"bigmlcom","description":"Streaming Histograms for Clojure/Java","archived":false,"fork":false,"pushed_at":"2024-06-12T13:16:01.000Z","size":350,"stargazers_count":155,"open_issues_count":3,"forks_count":25,"subscribers_count":16,"default_branch":"master","last_synced_at":"2025-03-28T05:08:30.588Z","etag":null,"topics":["clojure","data-summary","histogram","streaming"],"latest_commit_sha":null,"homepage":"","language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/bigmlcom.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2011-07-12T22:53:59.000Z","updated_at":"2025-01-27T03:24:05.000Z","dependencies_parsed_at":"2024-06-21T05:53:33.602Z","dependency_job_id":null,"html_url":"https://github.com/bigmlcom/histogram","commit_stats":{"total_commits":155,"total_committers":8,"mean_commits":19.375,"dds":"0.058064516129032295","last_synced_commit":"e471369e2b8b30793a7ad36896d774c21576ee2f"},"previous_names":[],"tags_count":26,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bigmlcom%2Fhistogram","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bigmlcom%2Fhistogram/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bigmlcom%2Fhistogram/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bigml
com%2Fhistogram/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/bigmlcom","download_url":"https://codeload.github.com/bigmlcom/histogram/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247128739,"owners_count":20888234,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["clojure","data-summary","histogram","streaming"],"created_at":"2025-01-20T17:18:57.425Z","updated_at":"2025-04-04T06:06:52.070Z","avatar_url":"https://github.com/bigmlcom.png","language":"Java","funding_links":[],"categories":["Science and Data Analysis"],"sub_categories":[],"readme":"# Overview\n\nThis project is an implementation of the streaming, one-pass\nhistograms described in Ben-Haim's [Streaming Parallel Decision\nTrees](http://jmlr.csail.mit.edu/papers/v11/ben-haim10a.html). Inspired\nby Tyree's [Parallel Boosted Regression\nTrees](http://research.engineering.wustl.edu/~tyrees/Publications_files/fr819-tyreeA.pdf),\nthe histograms are extended so that they may track multiple values.\n\nThe histograms act as an approximation of the underlying dataset. They\ncan be used for learning, visualization, discretization, or analysis.\nThe histograms may be built independently and merged, making them\nconvenient for parallel and distributed algorithms.\n\nWhile the core of this library is implemented in Java, it includes a\nfull featured Clojure wrapper. 
This readme focuses on the Clojure\ninterface, but Java developers can find documented methods in\n`com.bigml.histogram.Histogram`.\n\n# Installation\n\n`histogram` is available as a Maven artifact from\n[Clojars](http://clojars.org/bigml/histogram).\n\nFor [Leiningen](https://github.com/technomancy/leiningen):\n\n[![Clojars Project](http://clojars.org/bigml/histogram/latest-version.svg)](http://clojars.org/bigml/histogram)\n\nFor [Maven](http://maven.apache.org/):\n\n```xml\n\u003crepository\u003e\n  \u003cid\u003eclojars.org\u003c/id\u003e\n  \u003curl\u003ehttp://clojars.org/repo\u003c/url\u003e\n\u003c/repository\u003e\n\u003cdependency\u003e\n  \u003cgroupId\u003ebigml\u003c/groupId\u003e\n  \u003cartifactId\u003ehistogram\u003c/artifactId\u003e\n  \u003cversion\u003e4.1.2\u003c/version\u003e\n\u003c/dependency\u003e\n```\n\n# Basics\n\nIn the following examples we use [Incanter](http://incanter.org/) to\ngenerate data and for charting.\n\nThe simplest way to use a histogram is to `create` one and then\n`insert!` points.  In the example below, `ex/normal-data` refers to a\nsequence of 200K samples from a normal distribution (mean 0, variance\n1).\n\n```clojure\nuser\u003e (ns examples\n        (:use [bigml.histogram.core])\n        (:require (bigml.histogram.test [examples :as ex])))\nexamples\u003e (def hist (reduce insert! (create) ex/normal-data))\n```\n\nYou can use the `sum` fn to find the approximate number of points less\nthan a given threshold:\n\n```clojure\nexamples\u003e (sum hist 0)\n99814.63248\n```\n\nThe `density` fn gives us an estimate of the point density at the\ngiven location:\n\n```clojure\nexamples\u003e (density hist 0)\n80936.98291\n```\n\nThe `uniform` fn returns a list of points that separate the\ndistribution into equal population areas.  
Here's an example that\nproduces quartiles:\n\n```clojure\nexamples\u003e (uniform hist 4)\n(-0.66904 0.00229 0.67605)\n```\n\nArbitrary percentiles can be found using `percentiles`:\n\n```clojure\nexamples\u003e (percentiles hist 0.5 0.95 0.99)\n{0.5 0.00229, 0.95 1.63853, 0.99 2.31390}\n```\n\nWe can plot the sums and density estimates as functions.  The red line\nrepresents the sum, the blue line represents the density.  If we\nnormalize the values (dividing by 200K), these lines approximate the\n[cumulative distribution\nfunction](http://en.wikipedia.org/wiki/Cumulative_distribution_function)\nand the [probability density\nfunction](http://en.wikipedia.org/wiki/Probability_density_function)\nfor the normal distribution.\n\n```clojure\nexamples\u003e (ex/sum-density-chart hist) ;; also see (ex/cdf-pdf-chart hist)\n```\n![Histogram from normal distribution](img/normal.png)\n\nThe histogram approximates distributions using a constant number of\nbins. This bin limit is a parameter when creating a histogram\n(`:bins`, defaults to 64). A bin contains a `:count` of the points\nwithin the bin along with the `:mean` for the values in the bin. The\nedges of the bin aren't captured. Instead the histogram assumes that\npoints of a bin are distributed with half the points less than the bin\nmean and half greater. This explains the fractional sum in the example\nbelow:\n\n```clojure\nexamples\u003e (def hist (-\u003e (create :bins 3)\n                        (insert! 1)\n                        (insert! 2)\n                        (insert! 3)))\nexamples\u003e (bins hist)\n({:mean 1.0, :count 1} {:mean 2.0, :count 1} {:mean 3.0, :count 1})\nexamples\u003e (sum hist 2)\n1.5\n```\n\nAs mentioned earlier, the bin limit constrains the number of unique\nbins a histogram can use to capture a distribution. The histogram\nabove was created with a limit of just three bins. 
When we add a\nfourth unique value it will create a fourth bin and then merge the\nnearest two.\n\n```clojure\nexamples\u003e (bins (insert! hist 0.5))\n({:mean 0.75, :count 2} {:mean 2.0, :count 1} {:mean 3.0, :count 1})\n```\n\nA larger bin limit means a higher quality picture of the distribution,\nbut it also means a larger memory footprint.  In the chart below, the\nred line represents a histogram with 8 bins and the blue line\nrepresents 64 bins.\n\n```clojure\nexamples\u003e (ex/multi-pdf-chart\n           [(reduce insert! (create :bins 8) ex/mixed-normal-data)\n            (reduce insert! (create :bins 64) ex/mixed-normal-data)])\n```\n![8 and 64 bins histograms](img/bin-limit.png)\n\nAnother option when creating a histogram is to use *gap\nweighting*. When `:gap-weighted?` is true, the histogram is encouraged\nto spend more of its bins capturing the densest areas of the\ndistribution. For the normal distribution that means better resolution\nnear the mean and less resolution near the tails. The chart below\nshows a histogram without gap weighting in blue and with gap weighting\nin red.  Near the center of the distribution, red uses more bins and\nbetter captures the gaussian distribution's true curve.\n\n```clojure\nexamples\u003e (ex/multi-pdf-chart\n           [(reduce insert! (create :bins 8 :gap-weighted? true)\n                    ex/normal-data)\n            (reduce insert! (create :bins 8 :gap-weighted? false)\n                    ex/normal-data)])\n```\n![Gap weighting vs. No gap weighting](img/gap-weighting.png)\n\n# Merging\n\nA strength of the histograms is their ability to merge with one\nanother. Histograms can be built on separate data streams and then\ncombined to give a better overall picture.\n\nIn this example, the blue line shows a density distribution from a\nhistogram after merging 300 noisy histograms. 
The red shows one of the\noriginal histograms:\n\n```clojure\nexamples\u003e (let [samples (partition 1000 ex/mixed-normal-data)\n                hists (map #(reduce insert! (create) %) samples)\n                merged (reduce merge! (create) (take 300 hists))]\n            (ex/multi-pdf-chart [(first hists) merged]))\n```\n![Merged histograms](img/merging.png)\n\n# Targets\n\nWhile a simple histogram is nice for capturing the distribution of a\nsingle variable, it's often important to capture the correlation\nbetween variables. To that end, the histograms can track a second\nvariable called the *target*.\n\nThe target may be either numeric or categorical. The `insert!` fn is\noverloaded to accept either type of target. Each histogram bin will\ncontain information summarizing the target. For numeric targets the\nsum and sum-of-squares are tracked.  For categoricals, a map of\ncounts is maintained.\n\n```clojure\nexamples\u003e (-\u003e (create)\n              (insert! 1 9)\n              (insert! 2 8)\n              (insert! 3 7)\n              (insert! 3 6)\n              (bins))\n({:target {:sum 9.0, :sum-squares 81.0, :missing-count 0.0},\n  :mean 1.0,\n  :count 1}\n {:target {:sum 8.0, :sum-squares 64.0, :missing-count 0.0},\n  :mean 2.0,\n  :count 1}\n {:target {:sum 13.0, :sum-squares 85.0, :missing-count 0.0},\n  :mean 3.0,\n  :count 2})\nexamples\u003e (-\u003e (create)\n              (insert! 1 :a)\n              (insert! 2 :b)\n              (insert! 3 :c)\n              (insert! 3 :d)\n              (bins))\n({:target {:counts {:a 1.0}, :missing-count 0.0},\n  :mean 1.0,\n  :count 1}\n {:target {:counts {:b 1.0}, :missing-count 0.0},\n  :mean 2.0,\n  :count 1}\n {:target {:counts {:d 1.0, :c 1.0}, :missing-count 0.0},\n  :mean 3.0,\n  :count 2})\n```\n\nMixing target types isn't allowed:\n\n```clojure\nexamples\u003e (-\u003e (create)\n              (insert! 1 :a)\n              (insert! 
2 999))\nCan't mix insert types\n  [Thrown class com.bigml.histogram.MixedInsertException]\n```\n\n`insert-numeric!` and `insert-categorical!` allow target types to be\nset explicitly:\n\n```clojure\nexamples\u003e (-\u003e (create)\n              (insert-categorical! 1 1)\n              (insert-categorical! 1 2)\n              (bins))\n({:target {:counts {2 1.0, 1 1.0}, :missing-count 0.0}, :mean 1.0, :count 2})\n```\n\nThe `extended-sum` fn works similarly to `sum`, but returns a result\nthat includes the target information:\n\n```clojure\nexamples\u003e (-\u003e (create)\n              (insert! 1 :a)\n              (insert! 2 :b)\n              (insert! 3 :c)\n              (extended-sum 2))\n{:sum 1.5, :target {:counts {:c 0.0, :b 0.5, :a 1.0}, :missing-count 0.0}}\n```\n\nThe `average-target` fn returns the average target value given a\npoint. To illustrate, the following histogram captures a dataset where\nthe input field is a sample from the normal distribution while the\ntarget value is the sine of the input. The density is in red and the\naverage target value is in blue:\n\n```clojure\nexamples\u003e (def make-y (fn [x] (Math/sin x)))\nexamples\u003e (def hist (let [target-data (map (fn [x] [x (make-y x)])\n                                           ex/normal-data)]\n                      (reduce (fn [h [x y]] (insert! 
h x y))\n                              (create)\n                              target-data)))\nexamples\u003e (ex/pdf-target-chart hist)\n```\n![Numeric target](img/targets.png)\n\nContinuing with the same histogram, we can see that `average-target`\nproduces values close to original target:\n\n```clojure\nexamples\u003e (def view-target (fn [x] {:actual (make-y x)\n                                    :approx (:sum (average-target hist x))}))\nexamples\u003e (view-target 0)\n{:actual 0.0, :approx -0.00051}\nexamples\u003e  (view-target (/ Math/PI 2))\n{:actual 1.0, :approx 0.9968169965429206}\nexamples\u003e (view-target Math/PI)\n{:actual 0.0, :approx 0.00463}\n```\n\n# Missing Values\n\nInformation about missing values is captured whenever the input field\nor the target is `nil`. The `missing-bin` fn retrieves information\nsummarizing the instances with a missing input. For a basic histogram,\nthat is simply the count:\n\n```clojure\nexamples\u003e (-\u003e (create)\n              (insert! nil)\n              (insert! 7)\n              (insert! nil)\n              (missing-bin))\n{:count 2}\n```\n\nFor a histogram with a target, the `missing-bin` includes target\ninformation:\n\n```clojure\nexamples\u003e (-\u003e (create)\n              (insert! nil :a)\n              (insert! 7 :b)\n              (insert! nil :c)\n              (missing-bin))\n{:target {:counts {:a 1.0, :c 1.0}, :missing-count 0.0}, :count 2}\n```\n\nTargets can also be missing, in which case the target `missing-count`\nis incremented:\n\n```clojure\nexamples\u003e (-\u003e (create)\n              (insert! nil :a)\n              (insert! 7 :b)\n              (insert! nil nil)\n              (missing-bin))\n{:target {:counts {:a 1.0}, :missing-count 1.0}, :count 2}\n```\n\n# Array-backed Categorical Targets\n\nBy default a histogram with categorical targets stores the category\ncounts as Java HashMaps. Building and merging HashMaps can be\nexpensive. 
Alternatively the category counts can be backed by an\narray. This can give better performance but requires the set of\npossible categories to be declared when the histogram is created. To\ndo this, set the `:categories` parameter:\n\n```clojure\nexamples\u003e (def categories (map (partial str \"c\") (range 50)))\nexamples\u003e (def data (vec (repeatedly 100000\n                                     #(vector (rand) (str \"c\" (rand-int 50))))))\nexamples\u003e (doseq [hist [(create) (create :categories categories)]]\n            (time (reduce (fn [h [x y]] (insert! h x y))\n                          hist\n                          data)))\n\"Elapsed time: 1295.402 msecs\"\n\"Elapsed time: 516.72 msecs\"\n```\n\n# Group Targets\n\nGroup targets allow the histogram to track multiple targets at the\nsame time. Each bin contains a sequence of target\ninformation. Optionally, the target types in the group can be declared\nwhen creating the histogram. Declaring the types on creation allows\nthe targets to be missing in the first insert:\n\n```clojure\nexamples\u003e (-\u003e (create :group-types [:categorical :numeric])\n              (insert! 1 [:a nil])\n              (insert! 2 [:b 8])\n              (insert! 3 [:c 7])\n              (insert! 1 [:d 6])\n              (bins))\n({:target\n  ({:counts {:d 1.0, :a 1.0}, :missing-count 0.0}\n   {:sum 6.0, :sum-squares 36.0, :missing-count 1.0}),\n  :mean 1.0,\n  :count 2}\n {:target\n  ({:counts {:b 1.0}, :missing-count 0.0}\n   {:sum 8.0, :sum-squares 64.0, :missing-count 0.0}),\n  :mean 2.0,\n  :count 1}\n {:target\n  ({:counts {:c 1.0}, :missing-count 0.0}\n   {:sum 7.0, :sum-squares 49.0, :missing-count 0.0}),\n  :mean 3.0,\n  :count 1})\n```\n\n# Rendering\n\nThere are multiple ways to render the charts, see\n[examples.clj](test/bigml/histogram/test/examples.clj).\nAn example of rendering a single function, namely cumulative probability:\n\n```clojure\nexamples\u003e (def hist (reduce hst/insert! 
(hst/create) [1 1 2 3 4 4 4 5]))\nexamples\u003e (let [{:keys [min max]} (hst/bounds hist)]\n            (core/view (charts/function-plot (hst/cdf hist) min max)))\n```\n\n(`core` and `charts` are [Incanter namespaces](http://liebke.github.io/incanter/).)\n\nTo render multiple functions on the same chart, you would use\n`add-function` with the result of `function-plot`:\n\n```clojure\nexamples\u003e (let [{:keys [min max]} (hst/bounds hist)]\n            (core/view (-\u003e (charts/function-plot (hst/cdf hist) min max :legend true)\n                           (charts/add-function (hst/pdf hist) min max))))\n```\n\n# Performance-related concerns\n\n## Freezing a Histogram\n\nWhile the ability to adapt to non-stationary data streams is a\nstrength of the histograms, it is also computationally expensive. If\nyour data stream is stationary, you can increase the histogram's\nperformance by setting the `:freeze` parameter. After the number of\ninserts into the histogram has exceeded the `:freeze` parameter, the\nhistogram bins are locked into place. As the bin means no longer\nshift, inserts become computationally cheap. However, the quality of\nthe histogram can suffer if the `:freeze` parameter is too small.\n\n```clojure\nexamples\u003e (time (reduce insert! (create) ex/normal-data))\n\"Elapsed time: 333.5 msecs\"\nexamples\u003e (time (reduce insert! (create :freeze 1024) ex/normal-data))\n\"Elapsed time: 166.9 msecs\"\n```\n\n## Performance\n\nThere are two implementations of bin reservoirs (which support the\n`insert!` and `merge!` functions). Either of the two implementations,\n`:tree` and `:array`, can be explicitly selected with the `:reservoir`\nparameter.  The `:tree` option is useful for histograms with many bins\nas the insert time scales at `O(log n)` with respect to the number of\nbins. The `:array` option is good for a small number of bins since\ninserts are `O(n)` but there's a smaller overhead. 
If `:reservoir` is\nleft unspecified then `:array` is used for histograms with \u003c= 256 bins\nand `:tree` is used for anything larger.\n\n```clojure\nexamples\u003e (time (reduce insert! (create :bins 16 :reservoir :tree)\n                        ex/normal-data))\n\"Elapsed time: 554.478 msecs\"\nexamples\u003e (time (reduce insert! (create :bins 16 :reservoir :array)\n                        ex/normal-data))\n\"Elapsed time: 183.532 msecs\"\n```\n\nInsert times using reservoir defaults:\n\n![timing chart](https://docs.google.com/spreadsheet/oimg?key=0Ah2oAcudnjP4dG1CLUluRS1rcHVqU05DQ2Z4UVZnbmc\u0026oid=2\u0026zx=mppmmoe214jm)\n\n# License\n\nCopyright (C) 2013 BigML Inc.\n\nDistributed under the Apache License, Version 2.0.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbigmlcom%2Fhistogram","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbigmlcom%2Fhistogram","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbigmlcom%2Fhistogram/lists"}