{"id":19469039,"url":"https://github.com/anothersamwilson/parbayesianoptimization","last_synced_at":"2025-04-24T03:47:50.958Z","repository":{"id":43043323,"uuid":"155993502","full_name":"AnotherSamWilson/ParBayesianOptimization","owner":"AnotherSamWilson","description":"Parallelizable Bayesian Optimization in R","archived":false,"fork":false,"pushed_at":"2022-10-18T14:05:24.000Z","size":2724,"stargazers_count":110,"open_issues_count":16,"forks_count":19,"subscribers_count":6,"default_branch":"master","last_synced_at":"2025-04-24T03:47:41.870Z","etag":null,"topics":["bayesian-inference","machine-learning","r"],"latest_commit_sha":null,"homepage":"","language":"R","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/AnotherSamWilson.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2018-11-03T14:57:56.000Z","updated_at":"2025-03-26T07:20:11.000Z","dependencies_parsed_at":"2022-09-16T12:01:58.714Z","dependency_job_id":null,"html_url":"https://github.com/AnotherSamWilson/ParBayesianOptimization","commit_stats":null,"previous_names":[],"tags_count":2,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AnotherSamWilson%2FParBayesianOptimization","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AnotherSamWilson%2FParBayesianOptimization/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AnotherSamWilson%2FParBayesianOptimization/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AnotherSamWilson%2FParBayesianOptimization/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/
owners/AnotherSamWilson","download_url":"https://codeload.github.com/AnotherSamWilson/ParBayesianOptimization/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":250560007,"owners_count":21450168,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bayesian-inference","machine-learning","r"],"created_at":"2024-11-10T18:45:55.770Z","updated_at":"2025-04-24T03:47:50.940Z","avatar_url":"https://github.com/AnotherSamWilson.png","language":"R","funding_links":[],"categories":[],"sub_categories":[],"readme":"\n\u003c!-- README.md is generated from README.Rmd. 
Please edit that file --\u003e\n\n[![Build\nStatus](https://api.travis-ci.org/AnotherSamWilson/ParBayesianOptimization.svg)](https://travis-ci.org/AnotherSamWilson/ParBayesianOptimization)\n[![CRAN\\_Status\\_Badge](http://www.r-pkg.org/badges/version/ParBayesianOptimization)](https://CRAN.R-project.org/package=ParBayesianOptimization)\n[![DEV\\_Version\\_Badge](https://img.shields.io/badge/Dev-1.2.5-blue.svg)](https://CRAN.R-project.org/package=ParBayesianOptimization)\n[![CRAN\\_Downloads](https://cranlogs.r-pkg.org/badges/grand-total/mltools)](https://CRAN.R-project.org/package=ParBayesianOptimization)\n[![Coverage\nStatus](https://codecov.io/gh/AnotherSamWilson/ParBayesianOptimization/branch/master/graph/badge.svg)](https://codecov.io/gh/AnotherSamWilson/ParBayesianOptimization/branch/master)\n\n# Parallelizable Bayesian Optimization\n\n\u003cimg src='vignettes/icon.png' align = 'right' height=\"300\" /\u003e\n\nThis README contains a thorough walkthrough of Bayesian optimization and\nthe syntax needed to use this package, with simple and complex examples.\nMore information can be found in the package vignettes and manual.\n\n## Table of Contents\n\n  - [01 -\n    Installation](https://github.com/AnotherSamWilson/ParBayesianOptimization#Installation)  \n  - [02 - Package\n    Process](https://github.com/AnotherSamWilson/ParBayesianOptimization#Package-Process)  \n  - [03 - Bayesian Optimization\n    Intuition](https://github.com/AnotherSamWilson/ParBayesianOptimization#Bayesian-Optimization-Intuition)  \n  - [04 - Simple\n    Example](https://github.com/AnotherSamWilson/ParBayesianOptimization#Simple-Example)  \n  - [05 - Hyperparameter\n    Tuning](https://github.com/AnotherSamWilson/ParBayesianOptimization#Hyperparameter-Tuning)  \n  - [06 - Running In\n    Parallel](https://github.com/AnotherSamWilson/ParBayesianOptimization#Running-In-Parallel)  \n  - [07 - Sampling Multiple Promising Points at\n    
Once](https://github.com/AnotherSamWilson/ParBayesianOptimization#Sampling-Multiple-Promising-Points-at-Once)  \n  - [08 - How Long Should it Run\n    For?](https://github.com/AnotherSamWilson/ParBayesianOptimization#how-long-should-it-run-for)  \n  - [09 - Setting Stopping\n    Criteria](https://github.com/AnotherSamWilson/ParBayesianOptimization#Setting-Time-Limits-and-Other-Halting-Criteria)\n\n## Installation\n\nYou can install the most recent stable version of\nParBayesianOptimization from CRAN with:\n\n``` r\ninstall.packages(\"ParBayesianOptimization\")\n```\n\nYou can also install the most recent development version from GitHub\nusing devtools:\n\n``` r\n# install.packages(\"devtools\")\ndevtools::install_github(\"AnotherSamWilson/ParBayesianOptimization\")\n```\n\n## Package Process\n\nMachine learning projects will commonly require a user to “tune” a\nmodel’s hyperparameters to find a good balance between bias and\nvariance. Several tools are available in a data scientist’s toolbox to\nhandle this task, the most blunt of which is a grid search. A grid\nsearch gauges the model performance over a pre-defined set of\nhyperparameters without regard for past performance. As models increase\nin complexity and training time, grid searches become unwieldy.\n\nIdeally, we would use the information from prior model evaluations to\nguide us in our future parameter searches. This is precisely the idea\nbehind Bayesian Optimization, in which our prior response distribution\nis iteratively updated based on our best guess of where the best\nparameters are. The `ParBayesianOptimization` package does exactly this\nin the following process:\n\n1.  Initial parameters are scored\n2.  Gaussian Process is fit/updated  \n3.  Parameter is found which maximizes an acquisition function  \n4.  This parameter is scored  \n5.  
Repeat steps 2-4 until a stopping criterion is met\n\n\u003ccenter\u003e\n\n\u003cimg src=\"vignettes/gpParBayesAnimationSmall.gif\" style=\"display: block; margin: auto;\" /\u003e\n\n\u003c/center\u003e\n\n## Bayesian Optimization Intuition\n\nAs an example, let’s say we are only tuning 1 hyperparameter in a\nrandom forest model, the number of trees, within the bounds \\[1,15000\\].\nWe have initialized the process by randomly sampling the scoring\nfunction 7 times, and get the following results:\n\n| Trees.In.Forest | Score |\n| --------------: | ----: |\n|            1000 |  0.30 |\n|            3000 |  0.31 |\n|            5000 |  0.14 |\n|            9000 |  0.40 |\n|           11000 |  0.40 |\n|           15000 |  0.16 |\n\nIn this example, Score can be generalized to any evaluation metric that we\nwant to *maximize* (negative RMSE, AUC, etc.). *Keep in mind, Bayesian\noptimization can be used to maximize* any *black box function,\nhyperparameter tuning is just a common use case*. Given these scores,\nhow do we go about determining the best number of trees to try next? As\nit turns out, Gaussian processes can give us a very good definition of\nour assumption about how the Score (model performance) is distributed\nover the hyperparameters. Fitting a Gaussian process to the data above,\nwe can see the expected value of Score across our parameter bounds, as\nwell as the uncertainty bands:\n\n\u003ccenter\u003e\n\n\u003cimg src=\"vignettes/round1.png\" width=\"648px\" style=\"display: block; margin: auto;\" /\u003e\n\n\u003c/center\u003e\n\nBefore we can select our next candidate parameter to run the scoring\nfunction on, we need to determine how we define a “good” parameter\ninside this prior distribution. This is done by maximizing different\n***acquisition functions*** within the Gaussian process. The acquisition\nfunction tells us how much ***utility*** there is at a certain\nunexplored space. 
In the chart above, the lower 3 graphs show examples of\ndifferent acquisition functions.\n\nOur expected improvement in the graph above is maximized at \\~10000. If\nwe run our process with the new `Trees in Forest = 10000`, we can update\nour Gaussian process for a new prediction about which would be best to\nsample next.\n\nThe utility functions that are maximized in this package are defined as\nfollows:\n\n\u003ccenter\u003e\n\n\u003cimg src=\"vignettes/UtilityFunctions.png\" width=\"648\" style=\"display: block; margin: auto;\" /\u003e\n\n\u003c/center\u003e\n\n## Simple Example\n\nIn this example, we are optimizing a simple function with 1 input and 1\noutput. We, the user, need to define the function that we want to\noptimize. This function should return, at a minimum, a list with a Score\nelement. You can also return other elements that you want to keep track\nof in each run of the scoring function, which we show in the section\n[Hyperparameter\nTuning](https://github.com/AnotherSamWilson/ParBayesianOptimization#Hyperparameter-Tuning).\n\n``` r\nsimpleFunction \u003c- function(x) dnorm(x,3,2)*1.5 + dnorm(x,7,1) + dnorm(x,10,2)\n\n# Find the x that maximizes our simpleFunction\nxmax \u003c- optim(8,simpleFunction,method = \"L-BFGS-B\",lower = 0, upper = 15,control = list(fnscale = -1))$par\n\n# Get a visual\nlibrary(ggplot2)\nggplot(data = data.frame(x=c(0,15)),aes(x=x)) + \n  stat_function(fun = simpleFunction) +\n  geom_vline(xintercept = xmax,linetype=\"dashed\") +\n  ggtitle(\"simpleFunction\") +\n  theme_bw()\n```\n\n![](man/figures/README-simpleFunction-1.png)\u003c!-- --\u003e\n\nWe can see that this function is maximized around x\\~7.023. We can use\n`bayesOpt` to find the global maximum of this function. We just need to\ndefine the bounds, and the initial parameters we want to sample:\n\n``` r\nbounds \u003c- list(x=c(0,15))\ninitGrid \u003c- data.frame(x=c(0,5,10))\n```\n\nHere, we run `bayesOpt`. 
The function begins by running `simpleFunction`\n3 times, and then fits a Gaussian process to the results in a process\ncalled [Kriging](https://en.wikipedia.org/wiki/Kriging). We then\ncalculate the `x` which maximizes our expected improvement, and run\n`simpleFunction` at this x. We then go through 1 more iteration of this:\n\n``` r\nlibrary(ParBayesianOptimization)\n\nFUN \u003c- function(x) list(Score = simpleFunction(x))\n\nset.seed(6)\noptObjSimp \u003c- bayesOpt(\n  FUN = FUN\n  , bounds = bounds\n  , initGrid = initGrid\n  , iters.n = 2\n)\n```\n\nLet’s see how close the algorithm got to the global maximum:\n\n``` r\ngetBestPars(optObjSimp)\n#\u003e $x\n#\u003e [1] 6.718184\n```\n\nThe process is getting pretty close\\! We were only about 3% shy of the\nglobal optimum:\n\n``` r\nsimpleFunction(getBestPars(optObjSimp)$x)/simpleFunction(7.023)\n#\u003e [1] 0.968611\n```\n\nLet’s run the process for a little longer:\n\n``` r\noptObjSimp \u003c- addIterations(optObjSimp,iters.n=3,verbose=0)\nsimpleFunction(getBestPars(optObjSimp)$x)/simpleFunction(7.023)\n#\u003e [1] 0.9958626\n```\n\nWe have now found an `x` very close to the global optimum.\n\n## Hyperparameter Tuning\n\nIn this example, we will be using the agaricus.train dataset provided in\nthe XGBoost package. Here, we load the packages, data, and create a\nfolds object to be used in the scoring function.\n\n``` r\nlibrary(\"xgboost\")\n\ndata(agaricus.train, package = \"xgboost\")\n\nFolds \u003c- list(\n    Fold1 = as.integer(seq(1,nrow(agaricus.train$data),by = 3))\n  , Fold2 = as.integer(seq(2,nrow(agaricus.train$data),by = 3))\n  , Fold3 = as.integer(seq(3,nrow(agaricus.train$data),by = 3))\n)\n```\n\nNow we need to define the scoring function. This function should, at a\nminimum, return a list with a `Score` element, which is the model\nevaluation metric we want to maximize. 
We can also retain other pieces\nof information created by the scoring function by including them as\nnamed elements of the returned list. In this case, we want to retain the\noptimal number of rounds determined by the `xgb.cv`:\n\n``` r\nscoringFunction \u003c- function(max_depth, min_child_weight, subsample) {\n\n  dtrain \u003c- xgb.DMatrix(agaricus.train$data,label = agaricus.train$label)\n  \n  Pars \u003c- list( \n      booster = \"gbtree\"\n    , eta = 0.001\n    , max_depth = max_depth\n    , min_child_weight = min_child_weight\n    , subsample = subsample\n    , objective = \"binary:logistic\"\n    , eval_metric = \"auc\"\n  )\n\n  xgbcv \u003c- xgb.cv(\n      params = Pars\n    , data = dtrain\n    , nround = 100\n    , folds = Folds\n    , early_stopping_rounds = 5\n    , maximize = TRUE\n    , verbose = 0\n  )\n\n  return(list(Score = max(xgbcv$evaluation_log$test_auc_mean)\n             , nrounds = xgbcv$best_iteration\n             )\n         )\n}\n```\n\nWe also need to tell our process the bounds it is allowed to search\nwithin:\n\n``` r\nbounds \u003c- list( \n    max_depth = c(1L, 5L)\n  , min_child_weight = c(0, 25)\n  , subsample = c(0.25, 1)\n)\n```\n\nWe are now ready to put this all into the `bayesOpt` function.\n\n``` r\nset.seed(0)\n\ntNoPar \u003c- system.time(\n  optObj \u003c- bayesOpt(\n      FUN = scoringFunction\n    , bounds = bounds\n    , initPoints = 4\n    , iters.n = 4\n    , iters.k = 1\n  )\n)\n```\n\nThe console informs us that the process initialized by running\n`scoringFunction` 4 times. It then fit a Gaussian process to the\nparameter-score pairs, found the global optimum of the acquisition\nfunction, and ran `scoringFunction` again. This process continued until\nwe had 8 parameter-score pairs. 
You can interrogate the `optObj` object\nto see the results:\n\n``` r\noptObj$scoreSummary\n#\u003e    Epoch Iteration max_depth min_child_weight subsample gpUtility acqOptimum inBounds Elapsed     Score nrounds errorMessage\n#\u003e 1:     0         1         2         1.670129 0.7880670        NA      FALSE     TRUE    0.14 0.9777163       2           NA\n#\u003e 2:     0         2         2        14.913213 0.8763154        NA      FALSE     TRUE    0.33 0.9763760      15           NA\n#\u003e 3:     0         3         4        18.833690 0.3403900        NA      FALSE     TRUE    0.43 0.9931657      18           NA\n#\u003e 4:     0         4         4         8.639925 0.5499186        NA      FALSE     TRUE    0.23 0.9981437       7           NA\n#\u003e 5:     1         5         4        21.871937 1.0000000 0.5857961       TRUE     TRUE    0.12 0.9945933       1           NA\n#\u003e 6:     2         6         4         0.000000 0.9439879 0.6668303       TRUE     TRUE    0.25 0.9990567       7           NA\n#\u003e 7:     3         7         5         1.395119 0.7071802 0.2973497       TRUE     TRUE    0.18 0.9984577       4           NA\n#\u003e 8:     4         8         5         0.000000 0.2500000 0.3221660       TRUE     TRUE    0.32 0.9994020      10           NA\n```\n\n``` r\ngetBestPars(optObj)\n#\u003e $max_depth\n#\u003e [1] 5\n#\u003e \n#\u003e $min_child_weight\n#\u003e [1] 0\n#\u003e \n#\u003e $subsample\n#\u003e [1] 0.25\n```\n\n## Running In Parallel\n\nThe process that the package uses to run in parallel is explained above.\nActually setting the process up to run in parallel is relatively simple,\nwe just need to take two extra steps. 
We need to load any packages and\nobjects required by `FUN` into the back ends, after registering our\ncluster:\n\n``` r\nlibrary(doParallel)\ncl \u003c- makeCluster(2)\nregisterDoParallel(cl)\nclusterExport(cl,c('Folds','agaricus.train'))\nclusterEvalQ(cl,expr= {\n  library(xgboost)\n})\n```\n\nWe can now run our process in parallel\\! Make sure you set `iters.k` to\nsome sensible value to take advantage of the parallelization setup.\nSince we have registered 2 cores, we set `iters.k` to 2:\n\n``` r\ntWithPar \u003c- system.time(\n  optObj \u003c- bayesOpt(\n      FUN = scoringFunction\n    , bounds = bounds\n    , initPoints = 4\n    , iters.n = 4\n    , iters.k = 2\n    , parallel = TRUE\n  )\n)\nstopCluster(cl)\nregisterDoSEQ()\n```\n\nWe managed to massively cut the process time by running the process on 2\ncores in parallel. However, keep in mind we only performed 2\noptimization steps, versus the 4 performed in the sequential example:\n\n``` r\ntWithPar\n#\u003e    user  system elapsed \n#\u003e    0.99    0.03    7.91\ntNoPar\n#\u003e    user  system elapsed \n#\u003e   24.13    2.40   21.70\n```\n\n## Sampling Multiple Promising Points at Once\n\nSometimes we may want to sample multiple promising points at the same\noptimization step (Epoch). This is especially effective if the process\nis being run in parallel. The `bayesOpt` function always samples the\nglobal optimum of the acquisition function, however it is also possible\nto tell it to sample local optimums of the acquisition function at the\nsame time.\n\nUsing the `acqThresh` parameter, you can specify the minimum percentage\nutility of the global optimum required for a different local optimum to\nbe considered. As an example, let’s say we are optimizing 1 input `x`,\nwhich is bounded between \\[0,1\\]. 
Our acquisition function may look like\nthe following:\n\n\u003cimg src=\"vignettes/UCB.png\" width=\"600px\" style=\"display: block; margin: auto;\" /\u003e\n\nIn this case, there are 3 promising candidate parameters: x \\~\n\\[0.318,0.541,0.782\\] with corresponding upper confidence bounds of y \\~\n\\[1.195,1.304,1.029\\], respectively. We may want to run our scoring\nfunction on several of the local maximums. If `acqThresh` is set to be\nbelow 1.029/1.304 \\~ 0.789 and `iters.k` is set to at least 3, the\nprocess would use all 3 of the local maximums as candidate parameter\nsets in the next round of scoring function runs.\n\n## How Long Should it Run For?\n\nGoing back to the example in [Simple\nExample](https://github.com/AnotherSamWilson/ParBayesianOptimization#Simple-Example),\n(if you let this run for a few more iterations and set `plotProgress =\nTRUE`) you will notice this chart is updated at each iteration:\n\n``` r\noptObjSimp \u003c- addIterations(optObjSimp,2,verbose=FALSE)\nplot(optObjSimp)\n```\n\n\u003cimg src=\"man/figures/README-plotObj-1.png\" style=\"display: block; margin: auto;\" /\u003e\n\nAs you thoroughly explore the parameter space, you reduce the\nuncertainty in the unexplored areas. As you reduce uncertainty, you tend\nto reduce utility, which can be thought of as the potential to find a\nbetter parameter set than the one you already have. Notice that the\nexpected improvement converged to 0 after iteration 5. If you see a\nsimilar pattern, you can be fairly certain that you have found an\n(approximately) global optimum.\n\n## Setting Time Limits and Other Halting Criteria\n\nMany times the scoring function can vary in its completion time. It may\nbe difficult for the user to forecast how long a single run will take,\nlet alone X sequential runs. For this reason, you can set a time limit.\nYou can also set a minimum utility limit, or you can set *both*, in\nwhich case the process stops when either condition is met. 
You can see\nhow the process stopped by viewing the `stopStatus` element in the\nreturned object:\n\n``` r\nset.seed(0)\n\ntNoPar \u003c- system.time(\n  optObj \u003c- bayesOpt(\n      FUN = scoringFunction\n    , bounds = bounds\n    , initPoints = 4\n    , iters.n = 400\n    , iters.k = 1\n    , otherHalting = list(timeLimit = 5)\n  )\n)\n\noptObj$stopStatus\n#\u003e [1] \"Time Limit - 5 seconds.\"\n#\u003e attr(,\"class\")\n#\u003e [1] \"stopEarlyMsg\"\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fanothersamwilson%2Fparbayesianoptimization","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fanothersamwilson%2Fparbayesianoptimization","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fanothersamwilson%2Fparbayesianoptimization/lists"}