{"id":19994051,"url":"https://github.com/kapsner/mlexperiments","last_synced_at":"2025-05-04T13:30:52.217Z","repository":{"id":108978536,"uuid":"541941792","full_name":"kapsner/mlexperiments","owner":"kapsner","description":"An extensible framework for reproducible machine learning experiments","archived":false,"fork":false,"pushed_at":"2025-03-05T07:31:35.000Z","size":531,"stargazers_count":5,"open_issues_count":0,"forks_count":2,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-04-29T01:51:23.138Z","etag":null,"topics":["cross-validation","experiment","hyperparameter-optimization","hyperparameter-tuning","machine-learning","nested","r","r-package"],"latest_commit_sha":null,"homepage":"https://github.com/kapsner/mlexperiments/wiki","language":"R","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/kapsner.png","metadata":{"files":{"readme":"README.md","changelog":"NEWS.md","contributing":null,"funding":null,"license":"LICENSE.md","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-09-27T06:45:12.000Z","updated_at":"2025-03-05T07:30:41.000Z","dependencies_parsed_at":"2024-07-14T14:15:41.530Z","dependency_job_id":null,"html_url":"https://github.com/kapsner/mlexperiments","commit_stats":null,"previous_names":[],"tags_count":4,"template":false,"template_full_name":"kapsner/rpkgTemplate","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kapsner%2Fmlexperiments","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kapsner%2Fmlexperiments/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kapsner%2Fmlexperiments/releases","manifests_url":"https://repos.ecosyste
.ms/api/v1/hosts/GitHub/repositories/kapsner%2Fmlexperiments/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/kapsner","download_url":"https://codeload.github.com/kapsner/mlexperiments/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":252341191,"owners_count":21732467,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cross-validation","experiment","hyperparameter-optimization","hyperparameter-tuning","machine-learning","nested","r","r-package"],"created_at":"2024-11-13T04:53:53.211Z","updated_at":"2025-05-04T13:30:52.209Z","avatar_url":"https://github.com/kapsner.png","language":"R","funding_links":[],"categories":[],"sub_categories":[],"readme":"# mlexperiments\n\n\u003c!-- badges: start --\u003e\n\n[![](https://img.shields.io/badge/lifecycle-experimental-orange.svg)](https://lifecycle.r-lib.org/articles/stages.html#experimental)\n[![](https://www.r-pkg.org/badges/version/mlexperiments)](https://cran.r-project.org/package=mlexperiments)\n[![CRAN\nchecks](https://badges.cranchecks.info/worst/mlexperiments.svg)](https://cran.r-project.org/web/checks/check_results_mlexperiments.html)\n[![](http://cranlogs.r-pkg.org/badges/grand-total/mlexperiments?color=blue)](https://cran.r-project.org/package=mlexperiments)\n[![](http://cranlogs.r-pkg.org/badges/last-month/mlexperiments?color=blue)](https://cran.r-project.org/package=mlexperiments)\n[![Dependencies](https://tinyverse.netlify.app/badge/mlexperiments)](https://cran.r-project.org/package=mlexperiments)\n[![R 
build\nstatus](https://github.com/kapsner/mlexperiments/workflows/R%20CMD%20Check%20via%20%7Btic%7D/badge.svg)](https://github.com/kapsner/mlexperiments/actions)\n[![R build\nstatus](https://github.com/kapsner/mlexperiments/workflows/lint/badge.svg)](https://github.com/kapsner/mlexperiments/actions)\n[![R build\nstatus](https://github.com/kapsner/mlexperiments/workflows/test-coverage/badge.svg)](https://github.com/kapsner/mlexperiments/actions)\n[![](https://codecov.io/gh/https://github.com/kapsner/mlexperiments/branch/main/graph/badge.svg)](https://app.codecov.io/gh/https://github.com/kapsner/mlexperiments)\n\u003c!-- badges: end --\u003e\n\nThe `mlexperiments` R package provides an extensible framework for\nreproducible machine learning (ML) experiments, namely:\n\n- Hyperparameter tuning: with the R6 class\n  `mlexperiments::MLTuneParameters`, to optimize the hyperparameters in\n  a k-fold cross-validation with one of the two strategies\n  - Grid search\n  - Bayesian optimization (using the\n    [`ParBayesianOptimization`](https://github.com/AnotherSamWilson/ParBayesianOptimization)\n    R package)\n- K-fold Cross-validation (CV): with the R6 class\n  `mlexperiments::MLCrossValidation`, to validate one hyperparameter\n  setting\n- Nested k-fold cross validation: with the R6 class\n  `mlexperiments::MLNestedCV`, which basically combines the two\n  experiments above to perform a hyperparameter optimization on an inner\n  CV loop, and to validate the best hyperparameter setting on an outer\n  CV loop\n\nThe package provides a minimal wrapper for these ML experiments, and -\nwith few adjustments - users can prepare different learner algorithms so\nthat they can be used with `mlexperiments`.\n\nAdditional learner algorithms are available via the R packages\n[`mllrnrs`](https://github.com/kapsner/mllrnrs) and\n[`mlsurvlrnrs`](https://github.com/kapsner/mlsurvlrnrs).\n\n## Installation\n\nTo install `mlexperiments` simply run\n\n``` r\n#| eval: 
false\ninstall.packages(\"mlexperiments\")\n```\n\nTo install the development version, run\n\n``` r\n#| eval: false\ninstall.packages(\"remotes\")\nremotes::install_github(\"kapsner/mlexperiments\")\n```\n\n## Purpose and Background\n\nThe `mlexperiments` package aims to provide as much flexibility as\npossible while allowing machine learning experiments to be performed\nwith different learner algorithms through a common interface. The use of a\ncommon interface ensures, for example, the comparability of experiments\nthat were performed with different learner algorithms, since they use\nthe same underlying code for computing cross-validation folds, etc.\nFurthermore, the common interface also makes it possible to quickly\nexchange the learner algorithms.\n\nThe package was developed with the aim of leaving as much flexibility\nas possible to users. This includes,\nfor example, the need to pass certain learner-specific arguments\nto their fitting-functions or predict-functions (for example, some\n`xgboost`- or `lightgbm` users prefer to use `early_stopping` during the\ncross-validation while others like to optimize the number of boosting\niterations in a grid search). Thus, it was decided not to hard-code\nlearner-specific arguments wherever possible. Instead, some general fields\nwere added to the R6 classes of the experiments to be able to pass such\narguments, e.g., to the learners’ fitting-functions and\npredict-functions, respectively.\n\nThis flexibility might come at the expense of intuitive usability as\nusers first need to define their `mlexperiments`-specific learner\nfunctions according to their needs. However, for users who did not use\nthe R language’s well-established machine learning frameworks in their\nexperiments (e.g. 
[`tidymodels`](https://www.tidymodels.org/),\n[`caret`](https://topepo.github.io/caret/), and\n[`mlr3`](https://mlr3.mlr-org.com/)), this might not be such a big\nchallenge, as they may already have been writing code\n\n- to perform a hyperparameter tuning (using a grid search or even a\n  Bayesian optimization)\n- to validate a set of hyperparameters using a resampling strategy\n  (e.g., a k-fold cross-validation)\n- to fit a model with some training data\n- to apply a fitted model to predict the outcome in previously unseen data\n\nThe `mlexperiments` R package provides a standardized interface to\ndefine these steps inside of R functions by placing some restrictions on\nthe inputs and outputs of these functions.\n\nSome basic learners are included in the `mlexperiments` package,\nmainly to provide a set of baseline learners that can be used for\ncomparison throughout experiments (e.g., wrappers for `stats::lm()` and\n`stats::glm()`). Some more learners are prepared for use with\n`mlexperiments` in the R package\n[`mllrnrs`](https://github.com/kapsner/mllrnrs). Generally, the\nflexibility of the `mlexperiments` package requires that users have a\ndeeper understanding of the algorithms they use, including the\nhyperparameters that can be optimized.\n\nHowever, `mlexperiments` does not aim to provide a ready-to-use interface\nfor many learner algorithms. 
Instead, users are encouraged to prepare\nthe algorithms they want to use with `mlexperiments` according to their\ntasks, needs, experience, and personal preferences.\n\nDetails on how to prepare an algorithm for use with `mlexperiments` can\nbe found in the [package\nvignette](https://github.com/kapsner/mlexperiments/wiki/mlexperiments_starter).\nUsers who want to use a new algorithm with `mlexperiments` are also\nencouraged to dive into the available implementations, especially\n[`LearnerKnn`](R/LearnerKnn.R) and [`LearnerRpart`](R/LearnerRpart.R),\nin order to get an understanding of the functioning and the flexibility\nof the framework. Furthermore, there is a\n[wiki](https://github.com/kapsner/mlexperiments/wiki) to demonstrate the\napplication of some basic learners to common tasks.\n\nThe initial idea for this package was born when working on the project\nwork for my Medical Data Science Certificate study program. I wanted to\napply different machine learning algorithms to survival data and\ncouldn’t find a framework for machine learning experiments to analyze\nsurvival data with the algorithms `xgboost`, `glmnet` and `ranger`.\nWhile all of the three big frameworks for machine learning in R,\n[`tidymodels`](https://www.tidymodels.org/),\n[`caret`](https://topepo.github.io/caret/), and\n[`mlr3`](https://mlr3.mlr-org.com/), support hyperparameter\ntuning and (nested) cross-validation, none of those frameworks had\nimplemented stable interfaces for all of these three algorithms that\ncould be executed on survival data at the time of starting with the\nproject work (end of April 2022). For\n[`tidymodels`](https://www.tidymodels.org/), the add-on package\n[`censored`](https://censored.tidymodels.org/) addresses survival\nanalysis, but only supported the `glmnet` algorithm in April 2022. 
For\n[`mlr3`](https://mlr3.mlr-org.com/), the add-on package\n[`mlr3proba`](https://github.com/mlr-org/mlr3proba) addresses survival\nanalysis, with many learners capable of conducting survival analysis\navailable in the package\n[`mlr3extralearners`](https://mlr3extralearners.mlr-org.com/articles/learners/test_overview.html),\nincluding implementations for all of the three algorithms I wanted to\nuse. In contrast, the developer and maintainer of\n[`caret`](https://topepo.github.io/caret/) stated in a [comment on\nGitHub](https://github.com/topepo/caret/issues/959) that all efforts\nregarding survival analysis would be made in its successor framework,\n[`tidymodels`](https://www.tidymodels.org/). Thus, I initially decided\nto implement my analysis with [`mlr3`](https://mlr3.mlr-org.com/) /\n[`mlr3proba`](https://github.com/mlr-org/mlr3proba). However, when\nactually starting to implement things, I realized that in the meantime\n[`mlr3proba`](https://github.com/mlr-org/mlr3proba) had unfortunately\nbeen [archived on CRAN on\n2022-05-16](https://cran.r-project.org/web/packages/mlr3proba/index.html).\nFor the sake of stability throughout the project work, I finally decided\nto implement the whole logic myself as it “just includes some for loops\nand summarizing results” :joy: :joy:. In the end, implementing a common\ninterface for the three algorithms to perform survival analysis was a\nvery time-consuming effort. 
This was even more the case when trying to\nmake the code as generic and re-usable as possible, to generalize it to\ntasks other than survival analysis, as well as to allow for adding\n(potentially) any other learner.\n\nThe results of these efforts are:\n\n- the [`mlexperiments`](https://github.com/kapsner/mlexperiments) R\n  package, providing\n  - R6 classes to perform the machine learning experiments\n    (hyperparameter tuning, cross-validation, and nested\n    cross-validation)\n  - some base learners (`LearnerLm`, `LearnerGlm`, `LearnerRpart`, and\n    `LearnerKnn`)\n  - an R6 class to inherit new learners from (`MLLearnerBase`)\n  - as well as functions\n    - to validate the equality of folds used between different\n      experiments (`mlexperiments::validate_fold_equality()`)\n    - to apply learners to new data and predict the outcome\n      (`mlexperiments::predictions()`)\n    - to calculate performance measures with these predictions\n      (`mlexperiments::performance()`)\n    - and a utility function to select performance metrics from the\n      [`mlr3measures`](https://cran.r-project.org/web/packages/mlr3measures/index.html)\n      R package\n- the [`mllrnrs`](https://github.com/kapsner/mllrnrs) R package, which\n  enhances `mlexperiments` with some learner wrappers for algorithms I\n  commonly use. They were separated into their own package in order to\n  reduce overall maintenance load and to avoid having lots of\n  dependencies in the\n  [`mlexperiments`](https://github.com/kapsner/mlexperiments) R package.\n  Implemented learners are:\n  - LearnerGlmnet\n  - LearnerXgboost\n  - LearnerLightgbm\n  - LearnerRanger\n- the [`mlsurvlrnrs`](https://github.com/kapsner/mlsurvlrnrs) R package,\n  which enhances `mlexperiments` with some learner wrappers for survival\n  analysis. 
Implemented learners are:\n  - LearnerSurvCoxPHCox\n  - LearnerSurvGlmnetCox\n  - LearnerSurvRangerCox\n  - LearnerSurvRpartCox\n  - LearnerSurvXgboostCox\n  - LearnerSurvXgboostAft\n  - LearnerSurvSurvivalsvm\n\n## Examples\n\n### Preparations\n\nFirst of all, load the data and transform it into a matrix, and define\nthe training data and the target variable.\n\n``` r\nlibrary(mlexperiments)\nlibrary(mlbench)\n\ndata(\"DNA\")\ndataset \u003c- DNA |\u003e\n  data.table::as.data.table() |\u003e\n  na.omit()\n\nseed \u003c- 123\nfeature_cols \u003c- colnames(dataset)[1:180]\n\ntrain_x \u003c- model.matrix(\n  ~ -1 + .,\n  dataset[, .SD, .SDcols = feature_cols]\n)\ntrain_y \u003c- dataset[, get(\"Class\")]\n\nncores \u003c- ifelse(\n  test = parallel::detectCores() \u003e 4,\n  yes = 4L,\n  no = ifelse(\n    test = parallel::detectCores() \u003c 2L,\n    yes = 1L,\n    no = parallel::detectCores()\n  )\n)\nif (isTRUE(as.logical(Sys.getenv(\"_R_CHECK_LIMIT_CORES_\")))) {\n  # on cran\n  ncores \u003c- 2L\n}\n```\n\n### Hyperparameter Tuning\n\n#### Bayesian Tuning\n\nFor the Bayesian hyperparameter optimization, it is required to define a\ngrid with some hyperparameter combinations that is used for initializing\nthe Bayesian process. Furthermore, the borders (allowed extreme values)\nof the hyperparameters that are actually optimized need to be defined in\na list. 
Finally, further arguments that are passed to the function\n`ParBayesianOptimization::bayesOpt()` can be defined as well.\n\n``` r\nparam_list_knn \u003c- expand.grid(\n  k = seq(4, 68, 8),\n  l = 0,\n  test = parse(text = \"fold_test$x\")\n)\n\nknn_bounds \u003c- list(k = c(2L, 80L))\n\noptim_args \u003c- list(\n  iters.n = ncores,\n  kappa = 3.5,\n  acq = \"ucb\"\n)\n```\n\nThen, the created objects need to be assigned to the corresponding\nfields of the R6 class `mlexperiments::MLTuneParameters`:\n\n``` r\nknn_tune_bayesian \u003c- mlexperiments::MLTuneParameters$new(\n  learner = LearnerKnn$new(),\n  strategy = \"bayesian\",\n  ncores = ncores,\n  seed = seed\n)\n\nknn_tune_bayesian$parameter_bounds \u003c- knn_bounds\nknn_tune_bayesian$parameter_grid \u003c- param_list_knn\nknn_tune_bayesian$split_type \u003c- \"stratified\"\nknn_tune_bayesian$optim_args \u003c- optim_args\n\n# set data\nknn_tune_bayesian$set_data(\n  x = train_x,\n  y = train_y\n)\n\nresults \u003c- knn_tune_bayesian$execute(k = 3)\nhead(results)\n#\u003e    Epoch setting_id  k gpUtility acqOptimum inBounds Elapsed      Score metric_optim_mean errorMessage l\n#\u003e 1:     0          1  4        NA      FALSE     TRUE   2.009 -0.2247332         0.2247332           NA 0\n#\u003e 2:     0          2 12        NA      FALSE     TRUE   2.273 -0.1600753         0.1600753           NA 0\n#\u003e 3:     0          3 20        NA      FALSE     TRUE   2.376 -0.1381042         0.1381042           NA 0\n#\u003e 4:     0          4 28        NA      FALSE     TRUE   2.323 -0.1403013         0.1403013           NA 0\n#\u003e 5:     0          5 36        NA      FALSE     TRUE   2.128 -0.1315129         0.1315129           NA 0\n#\u003e 6:     0          6 44        NA      FALSE     TRUE   2.339 -0.1258632         0.1258632           NA 0\n```\n\n#### Grid Search\n\nTo carry out the hyperparameter optimization with a grid search, only\nthe `parameter_grid` is required:\n\n``` r\nknn_tune_grid \u003c- 
mlexperiments::MLTuneParameters$new(\n  learner = LearnerKnn$new(),\n  strategy = \"grid\",\n  ncores = ncores,\n  seed = seed\n)\n\nknn_tune_grid$parameter_grid \u003c- param_list_knn\nknn_tune_grid$split_type \u003c- \"stratified\"\n\n# set data\nknn_tune_grid$set_data(\n  x = train_x,\n  y = train_y\n)\n\nresults \u003c- knn_tune_grid$execute(k = 3)\nhead(results)\n#\u003e    setting_id metric_optim_mean  k l\n#\u003e 1:          1         0.2187696  4 0\n#\u003e 2:          2         0.1597615 12 0\n#\u003e 3:          3         0.1349655 20 0\n#\u003e 4:          4         0.1406152 28 0\n#\u003e 5:          5         0.1318267 36 0\n#\u003e 6:          6         0.1258632 44 0\n```\n\n### Cross-Validation\n\nFor the cross-validation experiments\n(`mlexperiments::MLCrossValidation`, and `mlexperiments::MLNestedCV`), a\nnamed list with the in-sample row indices of the folds is required.\n\n``` r\nfold_list \u003c- splitTools::create_folds(\n  y = train_y,\n  k = 3,\n  type = \"stratified\",\n  seed = seed\n)\nstr(fold_list)\n#\u003e List of 3\n#\u003e  $ Fold1: int [1:2124] 1 2 3 4 5 7 9 10 11 12 ...\n#\u003e  $ Fold2: int [1:2124] 1 2 3 6 8 9 11 13 16 17 ...\n#\u003e  $ Fold3: int [1:2124] 4 5 6 7 8 10 12 14 15 16 ...\n```\n\nFurthermore, a specific hyperparameter setting that should be validated\nwith the cross-validation needs to be selected:\n\n``` r\nknn_cv \u003c- mlexperiments::MLCrossValidation$new(\n  learner = LearnerKnn$new(),\n  fold_list = fold_list,\n  seed = seed\n)\n\nbest_grid_result \u003c- knn_tune_grid$results$best.setting\nbest_grid_result\n\nknn_cv$learner_args \u003c- best_grid_result[-1]\n\nknn_cv$predict_args \u003c- list(type = \"response\")\nknn_cv$performance_metric \u003c- metric(\"bacc\")\nknn_cv$return_models \u003c- TRUE\n\n# set data\nknn_cv$set_data(\n  x = train_x,\n  y = train_y\n)\n\nresults \u003c- knn_cv$execute()\nhead(results)\n#\u003e     fold performance  k l\n#\u003e 1: Fold1   0.8912781 68 0\n#\u003e 2: Fold2   
0.8832388 68 0\n#\u003e 3: Fold3   0.8657147 68 0\n```\n\n### Nested Cross-Validation\n\nLast but not least, the hyperparameter optimization and validation can\nbe combined in a nested cross-validation. In each fold of the so-called\n“outer” cross-validation loop, the hyperparameters are optimized on the\nin-sample observations with one of the two strategies: Bayesian\noptimization or grid search. Both of these strategies are implemented\nagain with a “nested” (“inner”) cross-validation. The best\nhyperparameter setting as identified by the inner cross-validation is\nthen used to fit a model with all in-sample observations of the outer\ncross-validation loop and finally validate it on the respective\nout-sample observations.\n\nThe experiment classes must be parameterized as described above.\n\n#### Inner Bayesian Optimization\n\n``` r\nknn_cv_nested_bayesian \u003c- mlexperiments::MLNestedCV$new(\n  learner = LearnerKnn$new(),\n  strategy = \"bayesian\",\n  fold_list = fold_list,\n  k_tuning = 3L,\n  ncores = ncores,\n  seed = seed\n)\n\nknn_cv_nested_bayesian$parameter_grid \u003c- param_list_knn\nknn_cv_nested_bayesian$parameter_bounds \u003c- knn_bounds\nknn_cv_nested_bayesian$split_type \u003c- \"stratified\"\nknn_cv_nested_bayesian$optim_args \u003c- optim_args\n\nknn_cv_nested_bayesian$predict_args \u003c- list(type = \"response\")\nknn_cv_nested_bayesian$performance_metric \u003c- metric(\"bacc\")\n\n# set data\nknn_cv_nested_bayesian$set_data(\n  x = train_x,\n  y = train_y\n)\n\nresults \u003c- knn_cv_nested_bayesian$execute()\nhead(results)\n#\u003e     fold performance  k l\n#\u003e 1: Fold1   0.8912781 68 0\n#\u003e 2: Fold2   0.8832388 68 0\n#\u003e 3: Fold3   0.8657147 68 0\n```\n\n#### Inner Grid Search\n\n``` r\nknn_cv_nested_grid \u003c- mlexperiments::MLNestedCV$new(\n  learner = LearnerKnn$new(),\n  strategy = \"grid\",\n  fold_list = fold_list,\n  k_tuning = 3L,\n  ncores = ncores,\n  seed = seed\n)\n\nknn_cv_nested_grid$parameter_grid \u003c- 
param_list_knn\nknn_cv_nested_grid$split_type \u003c- \"stratified\"\n\nknn_cv_nested_grid$predict_args \u003c- list(type = \"response\")\nknn_cv_nested_grid$performance_metric \u003c- metric(\"bacc\")\n\n# set data\nknn_cv_nested_grid$set_data(\n  x = train_x,\n  y = train_y\n)\n\nresults \u003c- knn_cv_nested_grid$execute()\nhead(results)\n#\u003e     fold performance  k l\n#\u003e 1: Fold1   0.8959736 52 0\n#\u003e 2: Fold2   0.8832388 68 0\n#\u003e 3: Fold3   0.8657147 68 0\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkapsner%2Fmlexperiments","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fkapsner%2Fmlexperiments","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkapsner%2Fmlexperiments/lists"}