{"id":30659293,"url":"https://github.com/tomkyle/binning","last_synced_at":"2025-10-21T06:46:15.068Z","repository":{"id":301149945,"uuid":"1008342092","full_name":"tomkyle/binning","owner":"tomkyle","description":"Determine optimal number of bins 𝒌 for histogram creation and optimal bin width 𝒉 using various statistical methods.","archived":false,"fork":false,"pushed_at":"2025-06-28T18:51:58.000Z","size":68,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-08-02T22:49:19.774Z","etag":null,"topics":["binning","data-analysis","distributions","doanes-rule","freedman-diaconis","histogram","histogram-binning","math","php-math","rice-rule","scotts-rule","square-root","statistics","sturges-rule","terrell-scotts-rule"],"latest_commit_sha":null,"homepage":"","language":"PHP","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/tomkyle.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-06-25T11:52:51.000Z","updated_at":"2025-06-28T18:51:32.000Z","dependencies_parsed_at":"2025-06-25T12:27:40.812Z","dependency_job_id":null,"html_url":"https://github.com/tomkyle/binning","commit_stats":null,"previous_names":["tomkyle/binning"],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/tomkyle/binning","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tomkyle%2Fbinning","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tomkyle%2Fbinning/tags","releases_url":"https://repos.ecosyste.ms/api/v
1/hosts/GitHub/repositories/tomkyle%2Fbinning/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tomkyle%2Fbinning/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/tomkyle","download_url":"https://codeload.github.com/tomkyle/binning/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tomkyle%2Fbinning/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":272982770,"owners_count":25025984,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-08-31T02:00:09.071Z","response_time":79,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["binning","data-analysis","distributions","doanes-rule","freedman-diaconis","histogram","histogram-binning","math","php-math","rice-rule","scotts-rule","square-root","statistics","sturges-rule","terrell-scotts-rule"],"created_at":"2025-08-31T12:47:24.082Z","updated_at":"2025-10-21T06:46:14.953Z","avatar_url":"https://github.com/tomkyle.png","language":"PHP","readme":"\n# tomkyle/binning\n\n[![Composer Version](https://img.shields.io/packagist/v/tomkyle/binning)](https://packagist.org/packages/tomkyle/binning )\n[![PHP version](https://img.shields.io/packagist/php-v/tomkyle/binning)](https://packagist.org/packages/tomkyle/binning )\n[![GitHub Actions Workflow 
Status](https://img.shields.io/github/actions/workflow/status/tomkyle/binning/php.yml)](https://github.com/tomkyle/binning/actions/workflows/php.yml)\n[![Packagist License](https://img.shields.io/packagist/l/tomkyle/binning)](LICENSE.txt)\n\n**Determine the optimal number of bins 𝒌 for histogram creation and the optimal bin width 𝒉 using various statistical methods. The unified interface includes implementations of well-known binning rules such as:**\n\n- Square Root Rule (1892)\n- Sturges’ Rule (1926)\n- Doane’s Rule (1976)\n- Scott’s Rule (1979)\n- Freedman-Diaconis Rule (1981)\n- Terrell-Scott’s Rule (1985)\n- Rice University Rule\n\n\n\n## Requirements\n\nThis library requires PHP 8.3 or newer. Support for older PHP versions, such as the PHP 7.2+ support that [markrogoyski/math-php](https://github.com/markrogoyski/math-php) provides, is not planned.\n\n\n\n## Installation\n\n```bash\ncomposer require tomkyle/binning\n```\n\n\n\n## Usage\n\nThe **BinSelection** class provides several methods for determining the optimal number of bins for histogram creation and the optimal bin width. You can either call specific methods directly or use the general `suggestBins()` and `suggestBinWidth()` methods with different strategies.\n\n\n\n### Determine Bin Width\n\nUse the **suggestBinWidth** method to get the *optimal bin width* based on the selected method. 
The method returns the bin width, often referred to as 𝒉, as a float value.\n\n```php\n\u003c?php\nuse tomkyle\\Binning\\BinSelection;\n\n$data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15];\n\n// Default method: Freedman-Diaconis Rule (1981)\n$h = BinSelection::suggestBinWidth($data);\n$h = BinSelection::suggestBinWidth($data, BinSelection::DEFAULT);\n\n// Explicitly set method\n$h = BinSelection::suggestBinWidth($data, BinSelection::FREEDMAN_DIACONIS);\n$h = BinSelection::suggestBinWidth($data, BinSelection::SCOTT);\n```\n\n\n\n### Determine Number of Bins\n\nUse the **suggestBins** method to get the *optimal number of bins* based on the selected method. The method returns the number of bins, often referred to as 𝒌, as an integer value.\n\n```php\n\u003c?php\nuse tomkyle\\Binning\\BinSelection;\n\n$data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15];\n\n// Defaults to Freedman-Diaconis Rule\n$k = BinSelection::suggestBins($data);\n$k = BinSelection::suggestBins($data, BinSelection::DEFAULT);\n\n// Square Root Rule (Pearson, 1892)\n$k = BinSelection::suggestBins($data, BinSelection::SQUARE_ROOT);\n$k = BinSelection::suggestBins($data, BinSelection::PEARSON);\n\n// Sturges' Rule (1926)\n$k = BinSelection::suggestBins($data, BinSelection::STURGES);\n\n// Doane's Rule (1976) in 2 variants for samples (default) or populations\n$k = BinSelection::suggestBins($data, BinSelection::DOANE);\n$k = BinSelection::suggestBins($data, BinSelection::DOANE, population: true); \n\n// Scott's Rule (1979)\n$k = BinSelection::suggestBins($data, BinSelection::SCOTT);\n\n// Freedman-Diaconis Rule (1981)\n$k = BinSelection::suggestBins($data, BinSelection::FREEDMAN_DIACONIS);\n\n// Terrell-Scott’s Rule (1985)\n$k = BinSelection::suggestBins($data, BinSelection::TERRELL_SCOTT);\n\n// Rice University Rule\n$k = BinSelection::suggestBins($data, BinSelection::RICE);\n```\n\n\n\n---\n\n\n\n### Explicit method calls\n\nYou can also call the specific methods directly to get the 
bin width 𝒉 or number of bins 𝒌.\n\n- Most of the methods return the number of bins 𝒌 as an *integer* value. \n- Two methods, **Scott’s Rule** and **Freedman-Diaconis Rule**, provide both 𝒌 and 𝒉 as an *array*. \n\nThe result array contains additional information like the data range 𝑹, the interquartile range ***IQR***, or standard deviation **stddev**, which can be useful for further analysis.\n\n\n\n---\n\n\n\n#### 1. Pearson’s Square Root Rule (1892)\n\nSimple rule using the square root of the sample size.\n\n$$\nk = \\left \\lceil \\sqrt{n} \\ \\right \\rceil \n$$\n\n```php\n$k = BinSelection::squareRoot($data);\n```\n\n\n\n---\n\n\n\n#### 2. Sturges’ Rule (1926)\n\nBased on the logarithm of the sample size. Good for normal distributions.\n\n$$\nk = 1 + \\left \\lceil \\ \\log_2(n) \\  \\right \\rceil\n$$\n\n```php\n$k = BinSelection::sturges($data);\n```\n\n\n\n---\n\n\n\n#### 3. Doane’s Rule (1976)\n\nImprovement of *Sturges*’ rule that accounts for data skewness.\n\n$$\nk = 1 + \\left\\lceil \\  \\log_2(n) + \\log_2\\left(1 + \\frac{|g_1|}{\\sigma_{g_1}}\\right) \\  \\right \\rceil \n$$\n\n```php\n// Using sample-based calculation (default)\n$k = BinSelection::doane($data);\n\n// Using population-based calculation\n$k = BinSelection::doane($data, population: true);\n```\n\n\n\n---\n\n\n\n#### 4. Scott’s Rule (1979)\n\nBased on the standard deviation and sample size. Good for continuous data.\n\n$$\nh = \\frac{3.49\\,\\hat{\\sigma}}{\\sqrt[3]{n}} \n$$\n\n$$\nR = \\max_i x_i - \\min_i x_i \n$$\n\n$$\nk = \\left \\lceil \\ \\frac{R}{h} \\ \\right \\rceil \n$$\n\nThe result is an array with keys `width`, `bins`, `range`, and `stddev`. Map them to variables like so:\n\n```php\nlist($h, $k, $R, $stddev) = BinSelection::scott($data);\n```\n\n\n\n---\n\n\n\n#### 5. Freedman-Diaconis Rule (1981)\n\nBased on the interquartile range (IQR). 
Robust against outliers.\n\n$$\n\\mathrm{IQR} = Q_3 - Q_1 \n$$\n\n$$\nh = 2 \\times \\frac{\\mathrm{IQR}}{\\sqrt[3]{n}} \n$$\n\n$$\nR = \\max_i x_i - \\min_i x_i \n$$\n\n$$\nk = \\left \\lceil \\frac{R}{h} \\right \\rceil \n$$\n\nThe result is an array with keys `width`, `bins`, `range`, and `IQR`. Map them to variables like so:\n\n```php\nlist($h, $k, $R, $IQR) = BinSelection::freedmanDiaconis($data);\n```\n\n\n\n---\n\n\n\n#### 6. Terrell-Scott’s Rule (1985)\n\nUses the cube root of the sample size and generally provides more bins than *Sturges*. This is the original *Rice Rule*:\n\n$$\nk = \\left \\lceil \\  \\sqrt[3]{2n} \\enspace \\right \\rceil = \\left \\lceil \\  (2n)^{1/3} \\  \\right \\rceil \n$$\n\n```php\n$k = BinSelection::terrellScott($data);\n```\n\n\n\n---\n\n\n\n#### 7. Rice University Rule\n\nUses the cube root of the sample size and generally provides more bins than *Sturges*. Formula as taught by David M. Lane at Rice University.\n\n**N.B.** This *Rice Rule* does not seem to be the original; *Terrell-Scott’s* (1985) appears to be. Also note that the two variants can yield different results under certain circumstances. Lane’s variant from the early 2000s is, however, more commonly cited:\n\n$$\nk = 2 \\times \\left \\lceil \\  \\sqrt[3]{n} \\enspace \\right \\rceil =  2 \\times \\left \\lceil \\  n^{1/3} \\  \\right \\rceil \n$$\n\n```php\n$k = BinSelection::rice($data);\n```\n\n\n\n---\n\n\n\n## Method Selection Guidelines\n\n| Rule                  | Strengths \u0026 Weaknesses                                       |\n| --------------------- | ------------------------------------------------------------ |\n| **Freedman–Diaconis** | Uses the IQR to set 𝒉, so it is robust against outliers and adapts to data spread. \u003cbr /\u003e⚠️ May over‐smooth heavily skewed or multi‐modal data when IQR is small. |\n| **Sturges’ Rule**     | Very simple, works well for roughly normal, moderate-sized datasets. 
\u003cbr /\u003e⚠️ Ignores outliers and underestimates bin count for large or skewed samples. |\n| **Rice Rule**         | Independent of data shape and easy to compute. \u003cbr /\u003e⚠️ Prone to over‐ or under‐smoothing when the distribution is heavy‐tailed or skewed. |\n| **Terrell–Scott**     | Similar approach to the *Rice Rule* but with asymptotically optimal MISE (mean integrated squared error) properties; gives more bins than Sturges and adapts better at large 𝒏. \u003cbr /\u003e⚠️ Still ignores skewness and outliers. |\n| **Square Root Rule**  | Simply the square root of the sample size, so it requires no distributional estimates. \u003cbr /\u003e⚠️ May produce too few bins for complex distributions, or too many for very noisy data. |\n| **Doane’s Rule**      | Extends *Sturges’ Rule* by adding a skewness correction, improving performance on asymmetric data. \u003cbr /\u003e⚠️ Requires estimating the third moment (skewness), which can be unstable for small 𝒏. |\n| **Scott’s Rule**      | Uses the standard deviation to minimize MISE, providing a good balance for unimodal, symmetric data. \u003cbr /\u003e⚠️ Sensitive to outliers (inflated $\\sigma$) and may underperform on skewed distributions. |\n\n\n\n## Literature\n\nRubia, J.M.D.L. (2024): \n**Rice University Rule to Determine the Number of Bins.**\nOpen Journal of Statistics, 14, 119-149.\nDOI: [10.4236/ojs.2024.141006](https://doi.org/10.4236/ojs.2024.141006) \n\nWikipedia: \n**Histogram / Number of bins and width**\nhttps://en.wikipedia.org/wiki/Histogram#Number_of_bins_and_width\n\n\n\n## Practical Example\n\n```php\n\u003c?php\nuse tomkyle\\Binning\\BinSelection;\n\n// Generate sample data (e.g., from measurements)\n$measurements = [\n\t12.3, 14.1, 13.8, 15.2, 12.9, 14.7, 13.1, 15.8, 12.5, 14.3,\n\t13.6, 15.1, 12.8, 14.9, 13.4, 15.5, 12.7, 14.2, 13.9, 15.0\n];\n\necho \"Data points: \" . count($measurements) . 
\"\\n\\n\";\n\n// Compare different methods\n$methods = [\n\t'Sturges’s Rule' =\u003e BinSelection::STURGES,\n\t'Rice University Rule' =\u003e BinSelection::RICE,\n\t'Terrell-Scott’s Rule' =\u003e BinSelection::TERRELL_SCOTT,\n\t'Square Root Rule' =\u003e BinSelection::SQUARE_ROOT,\n\t'Doane’s Rule' =\u003e BinSelection::DOANE,\n\t'Scott’s Rule' =\u003e BinSelection::SCOTT,\n\t'Freedman-Diaconis Rule' =\u003e BinSelection::FREEDMAN_DIACONIS,\n];\n\nforeach ($methods as $name =\u003e $method) {\n\t$bins = BinSelection::suggestBins($measurements, $method);\n\techo sprintf(\"%-18s: %2d bins\\n\", $name, $bins);\n}\n```\n\n\n\n## Error Handling\n\nAll methods throw an `InvalidArgumentException` for invalid inputs:\n\n```php\ntry {\n\t// This will throw an exception\n\t$bins = BinSelection::sturges([]);\n} catch (InvalidArgumentException $e) {\n\techo \"Error: \" . $e-\u003egetMessage();\n\t// Output: \"Dataset cannot be empty to apply the Sturges' Rule.\"\n}\n\ntry {\n\t// This will throw an exception  \n\t$bins = BinSelection::suggestBins($data, 'invalid-method');\n} catch (InvalidArgumentException $e) {\n\techo \"Error: \" . $e-\u003egetMessage();\n\t// Output: \"Unknown binning method: invalid-method\"\n}\n```\n\n\n\n\n\n## Development\n\n### Clone repo and install requirements\n\n```bash\n$ git clone git@github.com:tomkyle/binning.git\n$ cd binning\n$ composer install\n$ pnpm install\n```\n\n### Watch source and run various tests\n\nThis will watch for changes inside the **src/** and **tests/** directories and run a series of tests:\n\n1. Find and run the corresponding unit test with *PHPUnit*.\n2. Find possible bugs and documentation issues using *PHPStan*. \n3. 
Analyse code style and give hints on newer syntax using *Rector*.\n\n```bash\n$ npm run watch\n```\n\n**Run PHPUnit**\n\n```bash\n$ npm run phpunit\n```\n\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftomkyle%2Fbinning","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftomkyle%2Fbinning","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftomkyle%2Fbinning/lists"}