{"id":16597855,"url":"https://github.com/szilard/gbm-tune","last_synced_at":"2025-03-06T19:42:19.129Z","repository":{"id":74911307,"uuid":"101002403","full_name":"szilard/GBM-tune","owner":"szilard","description":"Tuning GBMs (hyperparameter tuning) and impact on out-of-sample predictions","archived":false,"fork":false,"pushed_at":"2017-09-11T11:29:48.000Z","size":10860,"stargazers_count":21,"open_issues_count":3,"forks_count":3,"subscribers_count":5,"default_branch":"master","last_synced_at":"2025-01-17T04:46:05.711Z","etag":null,"topics":["gbm","gradient-boosting-machine","hyperparameter-optimization","machine-learning","overfitting"],"latest_commit_sha":null,"homepage":"","language":"HTML","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/szilard.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2017-08-22T00:06:18.000Z","updated_at":"2024-06-25T13:49:01.000Z","dependencies_parsed_at":"2023-07-04T04:42:01.332Z","dependency_job_id":null,"html_url":"https://github.com/szilard/GBM-tune","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/szilard%2FGBM-tune","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/szilard%2FGBM-tune/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/szilard%2FGBM-tune/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/szilard%2FGBM-tune/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owner
s/szilard","download_url":"https://codeload.github.com/szilard/GBM-tune/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":242277215,"owners_count":20101530,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["gbm","gradient-boosting-machine","hyperparameter-optimization","machine-learning","overfitting"],"created_at":"2024-10-12T00:06:51.018Z","updated_at":"2025-03-06T19:42:19.123Z","avatar_url":"https://github.com/szilard.png","language":"HTML","funding_links":[],"categories":[],"sub_categories":[],"readme":"\n## Tuning GBMs (hyperparameter tuning) and impact on out-of-sample predictions\n\nThe goal of this repo is to study the impact of having one dataset/sample (\"the dataset\") \nwhen training and tuning machine learning models in practice (or in competitions) \non the prediction accuracy on new data (that usually comes from a slightly different\ndistribution due to non-stationarity).\n\nTo keep things simple we focus on binary classification, use only one source dataset \nwith a mix of numeric and categorical features and no missing values, we don't perform feature engineering,\ntune only GBMs with `lightgbm` and random hyperparameter search (might also ensemble the best models later), and \nwe use only AUC as a measure of accuracy.\n\nUnlike in most practical applications or in competitions such as Kaggle, we create the following\nlaboratory/controlled environment that allows us to study the effects of sample variations in repeated \nexperiments. 
We pick a public dataset that spans several years (the well-known airline dataset).\nFrom this source data we pick 1 year of \"master\" data for training/tuning and the following 1 year for testing (hold-out).\nWe take samples of given sizes (e.g. 10K, 100K, 1M records) from the \"master\" training set and \nsamples of 100K from the \"master\" test set. \n\nWe choose a grid of hyperparameter values and take 100 random combinations from the grid.\nFor each hyperparameter combination we repeat the following resampling procedure 20 times:\nSplit the training set 80-10-10 into data used for (1) training, (2) validation for early stopping,\nand (3) evaluation for model selection. \nWe train the GBM models with early stopping and record the AUCs on the last split of data (3). We record \nthe average AUC and its standard deviation.\nFinally, we compute the AUC on the test set for the ensemble of the 20 models obtained\nwith resampling (simple average of their predictions).\n\nWe study the test AUC of the top performing hyperparameter combinations (selected based only on \nthe information from the resampling procedure without access to the test set). 
In fact, we resample\nthe test set itself as well, therefore we obtain averages and standard errors for the test AUC.\n\n\n\n### Train set size 100K records \n\nThe evaluation AUC of the 100 random hyperparameter trials vs their ranking\n(errorbars based on train 80-10-10 resampling):\n\n![](3-test_rs/fig-100K-AUCrs_rank.png)\n\nThe test AUC vs evaluation ranking (errorbars based on testset resampling):\n\n![](3-test_rs/fig-100K-AUCtest_rank.png)\n\nTest vs evaluation AUC (with errorbars based on train 80-10-10 and test resampling, respectively):\n\n![](3-test_rs/fig-100K-AUCcorr.png)\n\nThe top models selected by evaluation AUC are also top in test AUC, the correlation between\nevaluation/model selection AUC and test AUC is high (Pearson and rank correlation `~0.75`).\n\nA top model is:\n```\nnum_leaves = 1000\nlearning_rate = 0.03\nmin_data_in_leaf = 5\nfeature_fraction = 0.8\nbagging_fraction = 0.8\n```\n\nFor this combination, early stopping happens at `~200` trees in `~10 sec` for each resample (on a server with 16 cores/8 physical cores) \nleading to evaluation AUC `0.815` and test AUC `0.745` (the training data is coming from one given year, while the test\ndata is coming from the next year, therefore the decrease in prediction accuracy).\n\nThe runtime and number of trees for the different hyperparameter combinations vary, the total training time\nfor the 100 random hyperparameter trials with 20 train resamples each is `~6 hrs`, while adding prediction time \nfor 1 test set (initially) we have `~7 hrs` total runtime, while further on 20 resamples of the test set `~26 hrs`\ntotal run time (the experiment can be easily parallelized to multiple servers as the trials in the random\nsearch are completely independent).\n\nMore details [here](https://htmlpreview.github.io/?https://github.com/szilard/GBM-tune/blob/master/2-train_test_1each/analyze-100K-100.html) 
and\n[here](https://htmlpreview.github.io/?https://github.com/szilard/GBM-tune/blob/master/3-test_rs/analyze-100K.html).\n\n\n\n### Train set size 1M records \n\nThe correlation between evaluation/model selection AUC and test AUC is even higher (Pearson/rank correlation `~0.97`),\nand naturally the top models selected by evaluation AUC are also top in test AUC even more so.\n\nThe best models now have a larger `num_leaves` (as one would expect since there is more data and one can build deeper\ntrees without overfitting) and the early stopping stops later (more trees).\nRun time is approximately `10x` longer; the best evaluation and test AUCs are in the table below.\n\n\nSize    |  eval AUC      |  test AUC     | \n--------|----------------|---------------|\n10K     | 0.701 / 0.682  | 0.660 / 0.670 |\n100K    |   0.815        |   0.745       |\n1M      |   0.952        |   0.847       |\n\nMore details [here](https://htmlpreview.github.io/?https://github.com/szilard/GBM-tune/blob/master/2-train_test_1each/analyze-1M-100.html).\n\n\n\n### Train set size 10K records \n\nThe best models selected based on evaluation AUC are no longer the best models on test; the correlation is now low (`~0.25`).\n\n![](3-test_rs/fig-10K-AUCcorr.png)\n\nIt seems 10K is just *not enough data* for obtaining a good model out-of-sample \n(some of the variables have 100s of categories and some appear with low frequency,\nso this result is not completely unexpected), \nand even with cross validation there is some kind of *overfitting* here. \n\nThe best models based on evaluation AUC have deeper trees (evaluation AUC `0.701`, but low test AUC `0.660`), while\nthe best models on test have shallower trees (evaluation AUC `0.682`, test AUC `0.670`).\nTherefore one could reduce overfitting by restricting the space to shallower trees (effectively regularizing).\n\nAlso, the above evaluation AUC is a biased estimate for the AUC of the best model even on the training set (thanks @preko for\npointing this out). 
\n\nMore details [here](https://htmlpreview.github.io/?https://github.com/szilard/GBM-tune/blob/master/2-train_test_1each/analyze-10K-100.html) and\n[here](https://htmlpreview.github.io/?https://github.com/szilard/GBM-tune/blob/master/3-test_rs/analyze-10K.html).\n\n\n\n### Note on train set sizes\n\n...\n\n\n\n### Larger hyperparameter search (1000 random trials)\n\nWe ran 1000 random trials on 100K data (~60 hrs runtime on a server with 8 physical cores; one could parallelize the trials on different servers/more cores).\n\n![](2-train_test_1each/fig-100K-1000-AUCcorr.png)\n\nWith this many trials the best model (cross-validation) is a bit overfit to the training set and it is not the best model on the test set.\n\nTrials  |  eval AUC      |  test AUC     | \n--------|----------------|---------------|\n100     |   0.815        |   0.745       |\n1000    | 0.821 / 0.807  | 0.744 / 0.746 |\n\nThe best model on the test set (you cannot do that for model selection!!!) is also a bit overfit to the test set. If one does multiple\nresamples of the test set (which we did not do here because of the extra computational costs), the best model on the \"average\" test set\nwould still be (slightly) overfitted to the test distribution (since that's likely to be a bit different than the train distribution\nbecause of temporal separation).\n\nMore details [here](https://htmlpreview.github.io/?https://github.com/szilard/GBM-tune/blob/master/2-train_test_1each/analyze-100K-1000.html).\n\n\n\n### Variation of best model selected vs training sample\n\nWe repeat the above experiments with resampling the train set from the source data. This is to study\nthe sensitivity of the results w.r.t. a given sample. \n\nFor 100K records, the resample AUC for 2 train resamples with the same hyperparameter values:\n\n![](4-train_rs/fig-AUCcorr.png)\n\nThe correlation (both Pearson and rank) of the 2 above is `~0.95`.\n\nA similar graph and correlation are found if one looks at AUC on the test data. 
\n\nMore details [here](https://htmlpreview.github.io/?https://github.com/szilard/GBM-tune/blob/master/4-train_rs/analyze.html).\n\nTherefore, it seems that 100K records is enough to get similar best hyperparameter values \nthat do not depend too much on the given training sample. However, the test AUC shows some\nvariation, therefore the best models must be somewhat different. TODO: More research to clarify this.\n\n\n\n### Ensembles \n\nThe test AUC of the average of the top 10 models (of the 100 with random hyperparameter search, selected based on\nresample AUC) (in red) and the AUC of all the models (horizontal axis `rank` based on resample AUC):\n\n![](5-ensemble/fig-AUCens.png)\n\nMore details [here](https://htmlpreview.github.io/?https://github.com/szilard/GBM-tune/blob/master/5-ensemble/analyze.html).\n\nInterestingly, this simple ensemble does not beat the best model.\nTODO: Stacking or some other more sophisticated way of ensembling models (vs simple average of top 10 above).\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fszilard%2Fgbm-tune","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fszilard%2Fgbm-tune","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fszilard%2Fgbm-tune/lists"}