{"id":16260169,"url":"https://github.com/twolodzko/histr","last_synced_at":"2025-04-08T13:50:14.958Z","repository":{"id":156440554,"uuid":"627981410","full_name":"twolodzko/histr","owner":"twolodzko","description":"📊 Streaming histograms implemented in Rust","archived":false,"fork":false,"pushed_at":"2023-10-19T06:24:58.000Z","size":1692,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-02-14T10:18:38.147Z","etag":null,"topics":["command-line","command-line-tool","rust","statistics","streaming-data"],"latest_commit_sha":null,"homepage":"","language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/twolodzko.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2023-04-14T16:20:48.000Z","updated_at":"2023-11-24T09:21:50.000Z","dependencies_parsed_at":null,"dependency_job_id":"aa400bc2-0d97-4969-b650-119efa2b3702","html_url":"https://github.com/twolodzko/histr","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/twolodzko%2Fhistr","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/twolodzko%2Fhistr/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/twolodzko%2Fhistr/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/twolodzko%2Fhistr/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/twolodzko","download_url":"https://codeload.github.com/twolodzko/histr/tar.gz/refs/heads/main","host":{"name":"GitHub","url":
"https://github.com","kind":"github","repositories_count":247855185,"owners_count":21007519,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["command-line","command-line-tool","rust","statistics","streaming-data"],"created_at":"2024-10-10T16:06:36.801Z","updated_at":"2025-04-08T13:50:14.935Z","avatar_url":"https://github.com/twolodzko.png","language":"Rust","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Streaming histograms\n\nAn implementation of the streaming histograms algorithm as described in\n*[A Streaming Parallel Decision Tree Algorithm]* by Yael Ben-Haim and Elad Tom-Tov (2010).\n\nThe streaming histogram is defined in terms of bins\n\n$$\n(p_1, m_1), (p_2, m_2), \\dots, (p_k, m_k)\n$$\n\nwhere $p_1 \u003c p_2 \u003c \\dots \u003c p_k$ are the means of the bins and $m_i$ are the counts of the values in the bin.\nThe sum of the counts is equal to the sample size used to create the histogram $\\sum_i m_i = N$.\n\nThe histogram is created by treating the newly arriving datapoint $x$ as a new bin $(x, 1)$. The new bins are added\nuntil reaching the pre-defined histogram size $k$. When the number of bins gets to $k+1$, the two bins with the\nsmallest difference between the means $p_{i+1} - p_i$ are merged by taking their weighted average\n\n$$\n\\Big( \\frac{p_i m_i + p_{i+1} m_{i+1}}{m_i + m_{i+1}},  m_i + m_{i+1}\\Big)\n$$\n\nwhich is taken as a new bin replacing them. 
Histograms can also be resized or merged by merging the closest bins.\n\n## Statistics\n\nThe [weighted mean] and [variance] of such bins can be used to approximate the sample mean and variance. Yael Ben-Haim and\nElad Tom-Tov (2010) also describe algorithms for approximating the sample quantiles and the empirical cumulative probability\ndistribution by applying the [trapezoidal rule] to interpolate between the bins.\n\n## Kernel density estimation\n\nAdditionally, a weighted [kernel density estimator][kde] may be used for approximating the probability density\nfunction of the data. The estimator is defined as\n\n$$\n\\hat f(x) = \\sum_{i=1}^k w_i K_h(x - p_i)\n$$\n\nwith weights $w_i = \\tfrac{m_i}{\\sum_j m_j}$ and the [kernel] $K_h$ having the [bandwidth] $h$. Kernel densities\nare closely related to histograms, and there is a correspondence between the bandwidth of the kernel density estimator\nand the width of the bins in the histogram.\n\n## Command line interface\n\nFor example, the following command pipes a tab-separated file to `histr`, skipping the header on the first line with\n`tail -n +2`. The histogram is saved to a file (`-o hist.msgpack`) and printed. 
The saved histogram\ncan later be loaded again (with `-l hist.msgpack`) and updated with new data.\n\n```shell\n$ tail -n +2 examples/old_faithful.tsv | histr -o hist.msgpack\nmean    count\n1.855946 56 ■■■■■■■■■■\n2.162333 27 ■■■■■\n2.436364 11 ■■\n2.912500 4  ■\n3.402125 8  ■\n3.674462 13 ■■\n3.987889 36 ■■■■■■\n4.297208 48 ■■■■■■■■■\n4.622364 55 ■■■■■■■■■■\n4.919000 14 ■■■\n```\n\nInstead of piping, the file could be passed directly as `histr examples/old_faithful.tsv`, but then a warning would be\nprinted to standard error because the first line (the column name) cannot be parsed.\n\n`histr` can be combined with other command line programs, for example, to build a histogram of response times from `ping`.\n\n```shell\n$ ping google.com -c 20 | sed -n 's/.*time=\\([0-9.]*\\).*/\\1/p' | histr -b 5\nmean    count\n8.965000 2  ■■\n10.13000 10 ■■■■■■■■■■\n11.20000 3  ■■■\n13.22500 4  ■■■■\n18.00000 1  ■\n```\n\nMore details can be found in `histr -h`, and some usage examples can be executed using the [Justfile] in this\nrepository with `just examples`.\n\n## Library\n\nHistr is also available as a Rust crate. It supports creating histograms from data or building them on-the-fly\nin a streaming manner. The histograms can be resized and merged with other histograms. The crate exposes methods for\ncalculating the basic statistics (mean, standard deviation, median, quantiles) from the histograms and for deriving\nempirical cumulative distribution functions or kernel density estimators from them. 
\n\n```rust\nuse histr::StreamHist;\nuse histr::KernelDensity;\n\n// initialize a histogram with 10 bins\nlet mut hist = StreamHist::with_capacity(10);\n// add some values to it\nhist.insert(1.13);\nhist.insert(2.67);\n// ...\n\n// calculate statistics\nprintln!(\"Mean = {}\", hist.mean());\n\n// convert it to a kernel density estimator\nlet kde = KernelDensity::from(hist.clone());\nprintln!(\"f({}) = {}\", 3.14, kde.density(3.14));\n\n// print the histogram as JSON\nprintln!(\"{}\", hist.to_json());\n```\n\nTo use it, [specify it in `Cargo.toml`] as:\n\n```toml\n[dependencies]\nhistr = { git = \"https://github.com/twolodzko/histr.git\" }\n```\n\n## Other implementations\n\nSimilar implementations are also available in [carsonfarmer/streamhist] (Python), [maki-nage/distogram] (Python),\n[VividCortex/gohistogram] (Go), [aaw/histosketch] (Go), [bigmlcom/histogram] (Java/Clojure), [aaw/histk] (C),\n[malor/bhtt] (Rust), [jettify/streamhist] (Rust), etc. They vary in maturity and features, and some do not implement\nthe approach described by Yael Ben-Haim and Elad Tom-Tov (2010) or diverge from it.\n\n\n [A Streaming Parallel Decision Tree Algorithm]: https://jmlr.csail.mit.edu/papers/v11/ben-haim10a.html\n [carsonfarmer/streamhist]: https://github.com/carsonfarmer/streamhist\n [maki-nage/distogram]: https://github.com/maki-nage/distogram\n [malor/bhtt]: https://github.com/malor/bhtt\n [jettify/streamhist]: https://github.com/jettify/streamhist\n [aaw/histk]: https://github.com/aaw/histk\n [bigmlcom/histogram]: https://github.com/bigmlcom/histogram\n [VividCortex/gohistogram]: https://github.com/VividCortex/gohistogram\n [aaw/histosketch]: https://github.com/aaw/histosketch\n [weighted mean]: https://en.wikipedia.org/wiki/Weighted_arithmetic_mean\n [variance]: https://en.wikipedia.org/wiki/Weighted_arithmetic_mean#Weighted_sample_variance\n [trapezoidal rule]: https://en.wikipedia.org/wiki/Trapezoidal_rule\n [kde]: 
https://en.wikipedia.org/wiki/Kernel_density_estimation\n [kernel]: https://en.wikipedia.org/wiki/Kernel_(statistics)\n [bandwidth]: https://stats.stackexchange.com/a/226239/35989\n [Justfile]: https://github.com/casey/just\n [specify it in `Cargo.toml`]: https://doc.rust-lang.org/cargo/reference/specifying-dependencies.html\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftwolodzko%2Fhistr","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftwolodzko%2Fhistr","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftwolodzko%2Fhistr/lists"}