{"id":29830815,"url":"https://github.com/finite-sample/stableboost","last_synced_at":"2025-10-19T16:48:54.586Z","repository":{"id":304074508,"uuid":"1017660109","full_name":"finite-sample/stableboost","owner":"finite-sample","description":"Stable XGBoost","archived":false,"fork":false,"pushed_at":"2025-07-11T04:32:04.000Z","size":37,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-07-11T05:53:24.950Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/finite-sample.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-07-10T22:28:55.000Z","updated_at":"2025-07-11T04:32:08.000Z","dependencies_parsed_at":"2025-07-11T05:53:26.399Z","dependency_job_id":"b7cbd01a-5cce-4151-8d46-ffc196b3df9c","html_url":"https://github.com/finite-sample/stableboost","commit_stats":null,"previous_names":["finite-sample/stableboost"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/finite-sample/stableboost","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/finite-sample%2Fstableboost","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/finite-sample%2Fstableboost/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/finite-sample%2Fstableboost/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/finite-sample%2Fstableboost/manifests","owner_url":
"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/finite-sample","download_url":"https://codeload.github.com/finite-sample/stableboost/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/finite-sample%2Fstableboost/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":267668843,"owners_count":24124972,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-07-29T02:00:12.549Z","response_time":2574,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-07-29T10:11:34.182Z","updated_at":"2025-10-19T16:48:49.542Z","avatar_url":"https://github.com/finite-sample.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# StableBoost: Stable XGBoost Predictions Under Data Shuffling\n\n## 1 | Why It Matters\n\nEven with a fixed random seed, **shuffling the order of training rows changes XGBoost’s histogram bins** (`tree_method='hist'`).\nThose altered cut-points yield a different forest and, consequently, **different predictions on exactly the same data**.\nSymptoms in production:\n\n* Apparent “drift” after a routine retrain\n* Flaky regression tests on model outputs\n* Spurious monitoring alerts\n\n---\n\n## 2 | Root Causes of Instability\n\n1. 
**Multi-thread histogram binning is row-order sensitive**  \n   When `tree_method='hist'` **and** `n_jobs \u003e 1`, each thread builds a local\n   quantile sketch on its chunk of rows; merging those sketches makes the final\n   bin boundaries depend on how the chunks were formed—hence on row order.\n   Single-thread `hist` and `tree_method='exact'` avoid this effect.\n   \n2. **Row subsampling amplifies sensitivity**  \n   With `subsample \u003c 1`, every boosting round trains on only a sample of rows.  \n   Shuffling the dataset changes which rows fall into that sample—even under a\n   fixed `random_state`—so the gradient seen by each new tree differs.\n\n3. **Column subsampling is a smaller, second-order factor**  \n   With `colsample_bytree \u003c 1`, each tree sees a random subset of features.\n   Different feature subsets nudge split choices; the resulting drift is\n   typically an order of magnitude smaller than the first two causes, but still\n   measurable.\n\n---\n\n## 3 | Remedies\n\n| Concept                             | Example Implementation                                   | Impact                                                  | Drawbacks                                 |\n| ----------------------------------- | -------------------------------------------------------- | ------------------------------------------------------- | ----------------------------------------- |\n| Fix the seed                        | Set `random_state=...`                                   | Reduces randomness in sampling                          | No effect unless subsampling present      |\n| Eliminate subsampling               | Set `subsample=1`, `colsample*=1`                        | Removes stochasticity in data use                       | Slower training; higher overfit risk      |\n| Use deterministic tree construction | Use `tree_method='exact'`                                | Fully reproducible split decisions (if subsampling off) | Much slower; infeasible 
on large datasets |\n| Ensembling over multiple fits       | Average predictions across K shuffles                    | Smooths variance; improves stability                    | Higher training + inference cost          |\n| Use inherently stable learners      | CatBoost (ordered boosting); LightGBM deterministic mode | Near-zero drift out of the box                          | May require reengineering and tuning      |\n\n\u003e ℹ️ The `exact` method performs greedy split finding by checking all possible thresholds for each feature, with no binning or approximation. It eliminates histogram-induced variance, but subsampling can still introduce model differences unless disabled.\n\n---\n\n## 4 | Our Baseline \u0026 Metric\n\n**Experimental design**\n\n1. **Fixed train/test split** (75% / 25%)\n2. **No resampling** – same rows every time, only shuffled order\n3. Fit *K* independent XGB models (different permutations)\n4. Evaluate on the held-out test set\n\n**Stability metric**\n\nFor test observation *j* and *K* models:\n\n$$\\displaystyle \\text{RMSE}_j\n    = \\sqrt{2\\,\\text{Var}_{i}\\!\\bigl(\\hat p_{ij}\\bigr)}  \n\\quad\\Longrightarrow\\quad\n\\text{MeanRMSE}\n  = \\frac{1}{N_{\\text{test}}}\\sum_j \\text{RMSE}_j$$\n\n\u003e Interpreted as the **expected RMSE between predictions from two fresh retrains**.\n\n---\n\n## 5 | Illustrative Results (synthetic data, *K = 15*)\n\n| Variant                                   | Accuracy | ROC_AUC | Stability_RMSE |\n|-------------------------------------------|---------:|--------:|---------------:|\n| Single XGB (K=15)                         | 0.9285   | 0.9643  | 0.0313         |\n| Ensemble (K=15) × 5                       | 0.9310   | 0.9647  | 0.0072         |\n| XGB Random-Forest                         | 0.8957   | 0.9514  | 0.0086         |\n| XGB Exact (subsample = 1, colsample = 1)  | 0.9250   | 0.9632  | 0.0000         |\n\n\n*Row order alone produces a stability RMSE of about 0.03 in predicted probability; ensembling cuts it to about 0.007, and the exact method with subsampling disabled brings it to zero.*\n\n---\n\n## 6 | 
Clean Reproducible Notebook\n\n* [Notebook](https://github.com/finite-sample/stableboost/blob/main/stableboost.ipynb)\n\n---\n\n## 7 | Using This in Practice\n\n1. **Drop in** your real feature matrix in place of `make_classification`.\n2. Tune *K* for runtime vs. stability; RMSE shrinks ∼1/√K.\n3. Integrate the metric into CI and fail builds when `Stability_RMSE` exceeds your tolerance (e.g., 0.01).\n4. Optionally extend with bootstrap or CV resamples to capture full pipeline variance.\n5. For stricter determinism:\n\n   * Set `subsample=1`, `colsample_bytree=1`, and related knobs.\n   * Use `random_state` for all randomness.\n   * Consider `tree_method='exact'` if the dataset is small.\n\n## 8 | Authors\n\nVictor Shia and Gaurav Sood\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffinite-sample%2Fstableboost","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ffinite-sample%2Fstableboost","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffinite-sample%2Fstableboost/lists"}