{"id":32209412,"url":"https://github.com/thie1e/cutpointr","last_synced_at":"2026-02-25T15:31:05.660Z","repository":{"id":44921675,"uuid":"74686042","full_name":"Thie1e/cutpointr","owner":"Thie1e","description":"Optimal cutpoints in R: determining and validating optimal cutpoints in binary classification","archived":false,"fork":false,"pushed_at":"2025-03-13T13:40:33.000Z","size":12040,"stargazers_count":89,"open_issues_count":8,"forks_count":13,"subscribers_count":3,"default_branch":"master","last_synced_at":"2025-10-19T19:58:37.306Z","etag":null,"topics":["bootstrapping","cutpoint-optimization","r","roc-curve"],"latest_commit_sha":null,"homepage":"https://cran.r-project.org/package=cutpointr","language":"R","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Thie1e.png","metadata":{"files":{"readme":"README.Rmd","changelog":"NEWS.md","contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2016-11-24T15:41:00.000Z","updated_at":"2025-09-18T11:27:42.000Z","dependencies_parsed_at":"2025-09-08T13:36:20.089Z","dependency_job_id":"cb705951-3e92-49ee-a725-927fcb24f51d","html_url":"https://github.com/Thie1e/cutpointr","commit_stats":null,"previous_names":[],"tags_count":12,"template":false,"template_full_name":null,"purl":"pkg:github/Thie1e/cutpointr","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Thie1e%2Fcutpointr","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Thie1e%2Fcutpointr/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Thie1
e%2Fcutpointr/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Thie1e%2Fcutpointr/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Thie1e","download_url":"https://codeload.github.com/Thie1e/cutpointr/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Thie1e%2Fcutpointr/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":280389373,"owners_count":26322516,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-22T02:00:06.515Z","response_time":63,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bootstrapping","cutpoint-optimization","r","roc-curve"],"created_at":"2025-10-22T06:04:26.321Z","updated_at":"2025-10-22T06:05:57.947Z","avatar_url":"https://github.com/Thie1e.png","language":"R","funding_links":[],"categories":[],"sub_categories":[],"readme":"---\noutput: \n    github_document:\n        toc: true\n        toc_depth: 2\n---\n\n\u003c!-- README.md is generated from README.Rmd. 
Please edit that file --\u003e\n\n```{r, echo = FALSE}\nknitr::opts_chunk$set(\n  collapse = TRUE,\n  comment = \"#\u003e\",\n  fig.path = \"man/figures/README-\"\n)\n```\n\n```{r, include = FALSE, echo = FALSE}\nlibrary(ggplot2)\nlibrary(dplyr)\nlibrary(tidyr)\nlibrary(purrr)\n```\n\n# cutpointr\n\n[![AppVeyor Build Status](https://ci.appveyor.com/api/projects/status/github/Thie1e/cutpointr?branch=master\u0026svg=true)](https://ci.appveyor.com/project/Thie1e/cutpointr)\n[![Project Status: Inactive – The project has reached a stable, usable state but is no longer being actively developed; support/maintenance will be provided as time allows.](https://www.repostatus.org/badges/latest/inactive.svg)](https://www.repostatus.org/#inactive)\n[![codecov](https://codecov.io/github/thie1e/cutpointr/branch/master/graphs/badge.svg)](https://codecov.io/github/thie1e/cutpointr) \n[![CRAN_Release_Badge](http://www.r-pkg.org/badges/version-ago/cutpointr)](https://CRAN.R-project.org/package=cutpointr)\n\n**cutpointr** is an R package for tidy calculation of \"optimal\" cutpoints. It \nsupports several methods for calculating cutpoints and includes several \nmetrics that can be maximized or minimized by selecting a cutpoint. Some of these\nmethods are designed to be more robust than the simple empirical optimization \nof a metric. 
Additionally,\n**cutpointr** can automatically bootstrap the variability of the optimal \ncutpoints and return out-of-bag estimates of various performance metrics.\n\n## Installation\n\nYou can install **cutpointr** from CRAN using the menu in RStudio or simply:\n\n```{r CRAN, eval = FALSE}\ninstall.packages(\"cutpointr\")\n```\n\n## Example\n\nFor example, the optimal cutpoint for the included data set is 2 when maximizing the sum of sensitivity and specificity.\n\n```{r}\nlibrary(cutpointr)\ndata(suicide)\nhead(suicide)\ncp \u003c- cutpointr(suicide, dsi, suicide, \n                method = maximize_metric, metric = sum_sens_spec)\n```\n\n```{r}\nsummary(cp)\n```\n\n```{r}\nplot(cp)\n```\n\nWhen considering the optimality of a cutpoint, we can only make a judgement based\non the sample at hand. Thus, the estimated cutpoint may not be optimal within \nthe population or on unseen data, which is why we sometimes put the \"optimal\" in\nquotation marks.\n\n`cutpointr` makes assumptions about the direction of the dependency between \n`class` and `x`, if `direction` and / or `pos_class` or `neg_class` are not\nspecified. The same result as above can be achieved by manually defining `direction` and\nthe positive / negative classes which is slightly faster, since the classes and direction\ndon't have to be determined:\n\n```{r}\nopt_cut \u003c- cutpointr(suicide, dsi, suicide, direction = \"\u003e=\", pos_class = \"yes\",\n                     neg_class = \"no\", method = maximize_metric, metric = youden)\n```\n\n`opt_cut` is a data frame that returns the input data and the ROC curve\n(and optionally the bootstrap results) in a \nnested tibble. Methods for summarizing and plotting the data and\nresults are included (e.g. `summary`, `plot`, `plot_roc`, `plot_metric`)\n\nTo inspect the optimization, the function of metric values per cutpoint can be\nplotted using `plot_metric`, if an optimization function was used that returns\na metric column in the `roc_curve` column. 
For example, the `maximize_metric`\nand `minimize_metric` functions do so:\n\n```{r}\nplot_metric(opt_cut)\n```\n\nPredictions for new data can be made using `predict`:\n\n```{r}\npredict(opt_cut, newdata = data.frame(dsi = 0:5))\n```\n\n\n## Features\n\n- Calculation of optimal cutpoints in binary classification tasks\n- Tidy output, integrates well with functions from the tidyverse\n- Functions for plotting ROC curves, metric distributions and more\n- Bootstrapping for simulating the cutpoint variability and for obtaining \nout-of-bag estimates of various metrics (as a form of internal validation)\nwith optional parallelisation\n- Multiple methods for calculating cutpoints\n- Multiple metrics can be chosen for maximization / minimization\n- Tidyeval\n\n# Calculating cutpoints\n\n## Method functions for cutpoint estimation\n\nThe included methods for calculating cutpoints are:\n\n- `maximize_metric`: Maximize the metric function\n- `minimize_metric`: Minimize the metric function\n- `maximize_loess_metric`: Maximize the metric function after LOESS smoothing\n- `minimize_loess_metric`: Minimize the metric function after LOESS smoothing\n- `maximize_gam_metric`: Maximize the metric function after smoothing via Generalized Additive Models\n- `minimize_gam_metric`: Minimize the metric function after smoothing via Generalized Additive Models\n- `maximize_boot_metric`: Bootstrap the optimal cutpoint when maximizing a metric\n- `minimize_boot_metric`: Bootstrap the optimal cutpoint when minimizing a metric\n- `oc_manual`: Specify the cutoff value manually\n- `oc_mean`: Use the sample mean as the \"optimal\" cutpoint\n- `oc_median`: Use the sample median as the \"optimal\" cutpoint\n- `oc_youden_kernel`: Maximize the Youden-Index after kernel smoothing\nthe distributions of the two classes\n- `oc_youden_normal`: Maximize the Youden-Index parametrically\nassuming normally distributed data in both classes\n\n## Metric functions\n\nThe included metrics to be used with the 
minimization and maximization methods \nare:\n\n- `accuracy`: Fraction correctly classified\n- `abs_d_sens_spec`: The absolute difference of sensitivity and specificity\n- `abs_d_ppv_npv`: The absolute difference between positive predictive\nvalue (PPV) and negative predictive value (NPV)\n- `roc01`: Distance to the point (0,1) on ROC space\n- `cohens_kappa`: Cohen's Kappa\n- `sum_sens_spec`: sensitivity + specificity\n- `sum_ppv_npv`: The sum of positive predictive value (PPV) and negative\npredictive value (NPV)\n- `prod_sens_spec`: sensitivity * specificity\n- `prod_ppv_npv`: The product of positive predictive value (PPV) and \nnegative predictive value (NPV)\n- `youden`: Youden- or J-Index = sensitivity + specificity - 1\n- `odds_ratio`: (Diagnostic) odds ratio\n- `risk_ratio`: risk ratio (relative risk)\n- `p_chisquared`: The p-value of a chi-squared test on the confusion\nmatrix\n- `misclassification_cost`: The sum of the misclassification cost of\nfalse positives and false negatives. 
Additional arguments: cost_fp, cost_fn\n- `total_utility`: The total utility of true / false positives / negatives.\nAdditional arguments: utility_tp, utility_tn, cost_fp, cost_fn\n- `F1_score`: The F1-score (2 * TP) / (2 * TP + FP + FN)\n- `metric_constrain`: Maximize a selected metric given a minimal value of \nanother selected metric\n- `sens_constrain`: Maximize sensitivity given a minimal value of specificity\n- `spec_constrain`: Maximize specificity given a minimal value of sensitivity\n- `acc_constrain`: Maximize accuracy given a minimal value of sensitivity\n\nFurthermore, the following functions are included which can be used as metric\nfunctions but are more useful for plotting purposes, for example in \n`plot_cutpointr`, or for defining new metric functions: \n`tp`, `fp`, `tn`, `fn`, `tpr`, `fpr`, `tnr`, `fnr`, `false_omission_rate`,\n`false_discovery_rate`, `ppv`, `npv`, `precision`, `recall`, `sensitivity`, and `specificity`.\n\nThe inputs to the arguments\n`method` and `metric` are functions so that user-defined functions can easily\nbe supplied instead of the built-in ones.\n\n\n## Separate subgroups and bootstrapping\n\nCutpoints can be separately estimated on subgroups that are defined by a third variable,\n`gender` in this case. Additionally, \nif `boot_runs` is larger than zero, `cutpointr` will carry out the usual cutpoint\ncalculation on the full sample, just as before, and also on \n`boot_runs` bootstrap samples. This offers a way of gauging the out-of-sample\nperformance of the cutpoint estimation method. 
If a subgroup is given, \nthe bootstrapping is carried out separately for every\nsubgroup which is also reflected in the plots and output.\n\n```{r, cache=TRUE}\nset.seed(12)\nopt_cut \u003c- cutpointr(suicide, dsi, suicide, boot_runs = 1000)\nopt_cut\n```\n\nThe returned object has the additional column `boot` which is a nested tibble that\nincludes the cutpoints per bootstrap sample along with the metric calculated using \nthe function in `metric` and\nvarious default metrics. The \nmetrics are suffixed by `_b` to indicate in-bag results or `_oob` to indicate\nout-of-bag results:\n\n```{r}\nopt_cut$boot\n```\n\nThe summary and plots include additional elements that summarize or display the\nbootstrap results:\n\n```{r}\nsummary(opt_cut)\nplot(opt_cut)\n```\n\n\n### Parallelized bootstrapping\n\nUsing `foreach` and `doRNG` the bootstrapping can be parallelized easily. The\n**doRNG** package is being used to make the bootstrap sampling reproducible.\n\n```{r, cache=TRUE}\nif (suppressPackageStartupMessages(require(doParallel) \u0026 require(doRNG))) {\n  cl \u003c- makeCluster(2) # 2 cores\n  registerDoParallel(cl)\n  registerDoRNG(12) # Reproducible parallel loops using doRNG\n  opt_cut \u003c- cutpointr(suicide, dsi, suicide, gender, pos_class = \"yes\",\n                 direction = \"\u003e=\", boot_runs = 1000, allowParallel = TRUE)\n  stopCluster(cl)\n  opt_cut\n}\n```\n\n\n# More robust cutpoint estimation methods\n\n## Bootstrapped cutpoints\n\nIt has been shown that bagging can substantially improve performance of a wide range of types of models in regression as well as in classification tasks. This method is available for cutpoint estimation via the `maximize_boot_metric` and `minimize_boot_metric` functions. If one of these functions is used as `method`, `boot_cut` bootstrap samples are drawn, the cutpoint optimization is carried out in each one and a summary (e.g. 
the mean) of the resulting optimal cutpoints on the bootstrap samples is returned as the optimal cutpoint in `cutpointr`. Note that if bootstrap validation is run, i.e. if `boot_runs` is larger than zero, an outer bootstrap will be executed. In the bootstrap validation routine `boot_runs` bootstrap samples are generated and each one is again bootstrapped `boot_cut` times. This may lead to long run times, so activating the built-in parallelization may be advisable. \n\nThe advantages of bootstrapping the optimal cutpoint are that the procedure doesn't possess parameters that have to be tuned, unlike the LOESS smoothing, that it doesn't rely on distributional assumptions, unlike the Normal method, and that it is applicable to any metric that can be used with `minimize_metric` or `maximize_metric`, unlike the Kernel method. Furthermore, just as Random Forests cannot be overfit by increasing the number of trees, the bootstrapped cutpoints cannot be overfit by running an excessive number of `boot_cut` repetitions. \n\n```{r, cache=TRUE}\nset.seed(100)\ncutpointr(suicide, dsi, suicide, gender, \n          method = maximize_boot_metric,\n          boot_cut = 200, summary_func = mean,\n          metric = accuracy, silent = TRUE)\n```\n\n\n## LOESS smoothing for selecting a cutpoint\n\nWhen using `maximize_metric` and `minimize_metric` the optimal cutpoint is \nselected by searching for the maximum or minimum of the metric function. For \nexample, we may want to minimize the misclassification cost. 
Since false \nnegatives (a suicide attempt was not anticipated) can be regarded as much more \nsevere than false positives we can set the cost of a false negative `cost_fn`\nfor example to ten times the cost of a false positive.\n\n```{r}\nopt_cut \u003c- cutpointr(suicide, dsi, suicide, gender, method = minimize_metric,\n                     metric = misclassification_cost, cost_fp = 1, cost_fn = 10)\n```\n\n```{r}\nplot_metric(opt_cut)\n```\n\nAs this \"optimal\" cutpoint may depend on minor differences between the \npossible cutoffs, smoothing of the function of metric values by\ncutpoint value might be desirable, especially in small samples. The\n`minimize_loess_metric` and `maximize_loess_metric` functions can be used\nto smooth the function so that the optimal cutpoint is selected based on the\nsmoothed metric values. Options to modify the smoothing, which is implemented using\n`loess.as` from the **fANCOVA** package, include:\n\n- `criterion`: the criterion for automatic smoothing parameter selection: \"aicc\" denotes bias-corrected AIC criterion, \"gcv\" denotes generalized cross-validation.\n- `degree`: the degree of the local polynomials to be used. 
It can be 0, 1 or 2.\n- `family`: if \"gaussian\" fitting is by least-squares, and if \"symmetric\" a re-descending M estimator is used with Tukey's biweight function.\n- `user.span`: the user-defined parameter which controls the degree of smoothing.\n\nUsing parameters for the LOESS smoothing of `criterion = \"aicc\"`, `degree = 2`, \n`family = \"symmetric\"`, and `user.span = 0.7` we get the following smoothed\nversions of the above metrics:\n\n```{r}\nopt_cut \u003c- cutpointr(suicide, dsi, suicide, gender, \n                     method = minimize_loess_metric,\n                     criterion = \"aicc\", family = \"symmetric\", \n                     degree = 2, user.span = 0.7,\n                     metric = misclassification_cost, cost_fp = 1, cost_fn = 10)\n```\n\n```{r}\nplot_metric(opt_cut)\n```\n\n\nThe optimal cutpoint for the female subgroup changes to 3. Note, though, that there \nare no reliable rules for selecting the \"best\" smoothing parameters. Notably,\nthe LOESS smoothing is sensitive to the number of unique cutpoints. A large \nnumber of unique cutpoints generally leads to a more volatile curve of \nmetric values by cutpoint value, even after smoothing. Thus, the curve\ntends to be undersmoothed in that scenario. The unsmoothed metric\nvalues are returned in `opt_cut$roc_curve` in the column\n`m_unsmoothed`.\n\n\n## Smoothing via Generalized Additive Models for selecting a cutpoint\n\nIn a similar fashion, the function of metric values per cutpoint can be smoothed \nusing Generalized Additive Models with smooth terms. Internally, `mgcv::gam` \ncarries out the smoothing which can be customized via the arguments\n`formula` and `optimizer`, see `help(\"gam\", package = \"mgcv\")`. Most importantly,\nthe GAM can be specified by altering the default formula, for example the \nsmoothing function could be configured to apply cubic regression splines (`\"cr\"`)\nas the smooth term. 
As the `suicide` data has only very few unique cutpoints,\nit is not very suitable for showcasing the GAM smoothing, so we will use two \nclasses of the `iris` data here. In this case, the purely empirical method and\nthe GAM smoothing lead to identical cutpoints, but in practice the GAM smoothing\ntends to be more robust, especially with larger data. An attractive feature of\nthe GAM smoothing is that the default values tend to work quite well and\nusually require no tuning, eliminating researcher degrees of freedom. \n\n```{r}\nlibrary(ggplot2)\nexdat \u003c- iris\nexdat \u003c- exdat[exdat$Species != \"setosa\", ]\nopt_cut \u003c- cutpointr(exdat, Petal.Length, Species,\n                     method = minimize_gam_metric,\n                     formula = m ~ s(x.sorted, bs = \"cr\"),\n                     metric = abs_d_sens_spec)\nplot_metric(opt_cut)\n```\n\n\n### Parametric method assuming normality\n\nThe Normal method in `oc_youden_normal` is a parametric method for maximizing the Youden-Index or equivalently the sum of $Se$ and $Sp$. It relies on the assumption that the predictor for both the negative and positive observations is normally distributed. In that case it can be shown that the optimal cutpoint $c^*$ that maximizes the Youden-Index is\n\n$$c^* = \\frac{(\\mu_P \\sigma_N^2 - \\mu_N \\sigma_P^2) - \\sigma_N \\sigma_P \\sqrt{(\\mu_N - \\mu_P)^2 + (\\sigma_N^2 - \\sigma_P^2) \\log(\\sigma_N^2 / \\sigma_P^2)}}{\\sigma_N^2 - \\sigma_P^2}$$\n\nwhere the negative class is normally distributed with $\\sim N(\\mu_N, \\sigma_N^2)$ and the positive class is independently normally distributed with $\\sim N(\\mu_P, \\sigma_P^2)$. If $\\sigma_N$ and $\\sigma_P$ are equal, the expression can be simplified to $c^* = \\frac{\\mu_N + \\mu_P}{2}$. However, the `oc_youden_normal` method in **cutpointr** always assumes unequal standard deviations. 
Since this method does not select a cutpoint from the observed predictor values, it is questionable which values for $Se$ and $Sp$ should be reported. Here, the Youden-Index can be calculated as \n\n$$J = \\Phi(\\frac{c^* - \\mu_N}{\\sigma_N}) - \\Phi(\\frac{c^* - \\mu_P}{\\sigma_P})$$\n\nif the assumption of normality holds. However, because several methods do not select cutpoints from the available observations, and to unify the reporting of metrics across these methods, **cutpointr** reports all metrics, e.g. $Se$ and $Sp$, based on the empirical observations. \n\n\n```{r}\ncutpointr(suicide, dsi, suicide, gender, method = oc_youden_normal)\n```\n\n### Nonparametric kernel method\n\nA nonparametric alternative is the Kernel method [@fluss_estimation_2005]. Here, the empirical distribution functions are smoothed using the Gaussian kernel functions $\\hat{F}_N(t) = \\frac{1}{n} \\sum^n_{i=1} \\Phi(\\frac{t - y_i}{h_y})$ and $\\hat{G}_P(t) = \\frac{1}{m} \\sum^m_{i=1} \\Phi(\\frac{t - x_i}{h_x})$ for the negative and positive classes respectively. Following Silverman's plug-in \"rule of thumb\" the bandwidths are selected as $h_y = 0.9 * \\min\\{s_y, iqr_y/1.34\\} * n^{-0.2}$ and $h_x = 0.9 * \\min\\{s_x, iqr_x/1.34\\} * m^{-0.2}$ where $s$ is the sample standard deviation and $iqr$ is the interquartile range. It has been demonstrated that AUC estimation is rather insensitive to the choice of the bandwidth procedure [@faraggi_estimation_2002] and thus the plug-in bandwidth estimator has also been recommended for cutpoint estimation. The `oc_youden_kernel` function in **cutpointr** uses a Gaussian kernel and the direct plug-in method for selecting the bandwidths. The kernel smoothing is done via the `bkde` function from the **KernSmooth** package [@wand_kernsmooth:_2013]. 
\n\nAgain, there is a way to calculate the Youden-Index from the results of this method [@fluss_estimation_2005] which is\n\n$$\\hat{J} = \\max_c \\{\\hat{F}_N(c) - \\hat{G}_P(c) \\}$$\n\nbut as before we prefer to report all metrics based on applying the cutpoint that was estimated using the Kernel method to the empirical observations. \n\n```{r}\ncutpointr(suicide, dsi, suicide, gender, method = oc_youden_kernel)\n```\n\n# Additional features\n\n## Calculating only the ROC curve \n\nWhen running `cutpointr`, a ROC curve is by default returned in the column `roc_curve`.\nThis ROC curve can be plotted using `plot_roc`. Alternatively, if only the\nROC curve is desired and no cutpoint needs to be calculated, the ROC curve\ncan be created using `roc()` and plotted using `plot_cutpointr`.\nThe `roc` function, unlike `cutpointr`, does not determine `direction`, `pos_class` or `neg_class`\nautomatically.\n\n```{r, fig.width=4, fig.height=3}\nroc_curve \u003c- roc(data = suicide, x = dsi, class = suicide,\n    pos_class = \"yes\", neg_class = \"no\", direction = \"\u003e=\")\nauc(roc_curve)\nhead(roc_curve)\nplot_roc(roc_curve)\n```\n\n\n\n## Midpoints\n\nSo far - which is the default in `cutpointr` - we have considered all unique values of the predictor as possible cutpoints. An alternative could be to use a sequence of equidistant values instead, for example in the case of the `suicide` data all integers in $[0, 10]$. However, with very sparse data and small intervals between the candidate cutpoints (i.e. a 'dense' sequence like `seq(0, 10, by = 0.01)`) this leads to the uninformative evaluation of large ranges of cutpoints that all result in the same metric value. A more elegant alternative, not only for the case of sparse data, that is supported by **cutpointr** is the use of the mean of the optimal cutpoint and the next lowest (if `direction = \"\u003e=\"`) or the next highest (if `direction = \"\u003c=\"`) predictor value in the data. 
The result is an optimal cutpoint that is equal to the cutpoint that would be obtained using an infinitely dense sequence of candidate cutpoints, and it is usually cheaper to compute. This behavior can be activated by setting `use_midpoints = TRUE`, which is not the default. If we use this setting, we obtain an optimal cutpoint of 1.5 for the complete sample on the `suicide` data instead of 2 when maximizing the sum of sensitivity and specificity.\n\nAssume the following small data set:\n\n```{r}\ndat \u003c- data.frame(outcome = c(\"neg\", \"neg\", \"neg\", \"pos\", \"pos\", \"pos\", \"pos\"),\n                  pred    = c(1, 2, 3, 8, 11, 11, 12))\n```\n\nSince the distance of the optimal cutpoint (8) to the next lowest \nobservation (3) is rather large we arrive at a range of possible cutpoints that\nall maximize the metric. In the case of this kind of sparseness it might for example be\ndesirable to classify a new observation with a predictor value of 4 as belonging\nto the negative class. If `use_midpoints` is set to `TRUE`, the mean of the \noptimal cutpoint and the next lowest observation is returned as the optimal\ncutpoint, if `direction = \"\u003e=\"`. The mean of the optimal cutpoint and the next\nhighest observation is returned as the optimal cutpoint, if `direction = \"\u003c=\"`.\n\n```{r}\nopt_cut \u003c- cutpointr(dat, x = pred, class = outcome, use_midpoints = TRUE)\nplot_x(opt_cut)\n```\n\nA simulation demonstrates more clearly that setting `use_midpoints = TRUE` avoids\nbiasing the cutpoints. 
To simulate the bias of the metric functions, the \npredictor values of both classes were drawn from normal distributions with \nconstant standard deviations of 10, a constant mean of the negative class of 100\nand higher mean values of the positive class that are selected in such a way \nthat optimal Youden-Index values of 0.2, 0.4, 0.6, and 0.8 result in the population.\nSamples of 9 different sizes were drawn and the cutpoints that maximize the \nYouden-Index were estimated. The simulation was repeated 10000 times. As can be\nseen by the mean error, `use_midpoints = TRUE` eliminates the bias that is \nintroduced by otherwise selecting the value of an observation as the optimal\ncutpoint. If `direction = \"\u003e=\"`, as in this case, the observation that \nrepresents the optimal cutpoint is the highest possible cutpoint that leads to the\noptimal metric value and thus the biases are positive. The methods `oc_youden_normal`\nand `oc_youden_kernel` are always unbiased, as they don't select a cutpoint \nbased on the ROC-curve or the function of metric values per cutpoint.\n\n```{r, echo = FALSE}\nplotdat_nomidpoints \u003c- structure(list(sim_nr = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, \n2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 4L, \n4L, 4L, 4L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 6L, \n6L, 6L, 6L, 6L, 6L, 6L, 6L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 8L, \n8L, 8L, 8L, 8L, 8L, 8L, 8L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 10L, \n10L, 10L, 10L, 10L, 10L, 10L, 10L, 11L, 11L, 11L, 11L, 11L, 11L, \n11L, 11L, 12L, 12L, 12L, 12L, 12L, 12L, 12L, 12L, 13L, 13L, 13L, \n13L, 13L, 13L, 13L, 13L, 14L, 14L, 14L, 14L, 14L, 14L, 14L, 14L, \n15L, 15L, 15L, 15L, 15L, 15L, 15L, 15L, 16L, 16L, 16L, 16L, 16L, \n16L, 16L, 16L, 17L, 17L, 17L, 17L, 17L, 17L, 17L, 17L, 18L, 18L, \n18L, 18L, 18L, 18L, 18L, 18L, 19L, 19L, 19L, 19L, 19L, 19L, 19L, \n19L, 20L, 20L, 20L, 20L, 20L, 20L, 20L, 20L, 21L, 21L, 21L, 21L, \n21L, 21L, 21L, 21L, 22L, 22L, 22L, 22L, 22L, 22L, 22L, 22L, 
23L, \n23L, 23L, 23L, 23L, 23L, 23L, 23L, 24L, 24L, 24L, 24L, 24L, 24L, \n24L, 24L, 25L, 25L, 25L, 25L, 25L, 25L, 25L, 25L, 26L, 26L, 26L, \n26L, 26L, 26L, 26L, 26L, 27L, 27L, 27L, 27L, 27L, 27L, 27L, 27L, \n28L, 28L, 28L, 28L, 28L, 28L, 28L, 28L, 29L, 29L, 29L, 29L, 29L, \n29L, 29L, 29L, 30L, 30L, 30L, 30L, 30L, 30L, 30L, 30L, 31L, 31L, \n31L, 31L, 31L, 31L, 31L, 31L, 32L, 32L, 32L, 32L, 32L, 32L, 32L, \n32L, 33L, 33L, 33L, 33L, 33L, 33L, 33L, 33L, 34L, 34L, 34L, 34L, \n34L, 34L, 34L, 34L, 35L, 35L, 35L, 35L, 35L, 35L, 35L, 35L, 36L, \n36L, 36L, 36L, 36L, 36L, 36L, 36L), method = structure(c(1L, \n2L, 3L, 4L, 5L, 6L, 7L, 8L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 1L, \n2L, 3L, 4L, 5L, 6L, 7L, 8L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 1L, \n2L, 3L, 4L, 5L, 6L, 7L, 8L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 1L, \n2L, 3L, 4L, 5L, 6L, 7L, 8L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 1L, \n2L, 3L, 4L, 5L, 6L, 7L, 8L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 1L, \n2L, 3L, 4L, 5L, 6L, 7L, 8L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 1L, \n2L, 3L, 4L, 5L, 6L, 7L, 8L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 1L, \n2L, 3L, 4L, 5L, 6L, 7L, 8L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 1L, \n2L, 3L, 4L, 5L, 6L, 7L, 8L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 1L, \n2L, 3L, 4L, 5L, 6L, 7L, 8L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 1L, \n2L, 3L, 4L, 5L, 6L, 7L, 8L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 1L, \n2L, 3L, 4L, 5L, 6L, 7L, 8L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 1L, \n2L, 3L, 4L, 5L, 6L, 7L, 8L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 1L, \n2L, 3L, 4L, 5L, 6L, 7L, 8L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 1L, \n2L, 3L, 4L, 5L, 6L, 7L, 8L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 1L, \n2L, 3L, 4L, 5L, 6L, 7L, 8L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 1L, \n2L, 3L, 4L, 5L, 6L, 7L, 8L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 1L, \n2L, 3L, 4L, 5L, 6L, 7L, 8L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L), .Label = c(\"emp\", \n\"normal\", \"loess\", \"boot\", \"spline\", \"spline_20\", \"kernel\", \"gam\"\n), class = \"factor\"), n = c(30, 30, 30, 30, 30, 30, 30, 30, 50, \n50, 50, 50, 50, 50, 50, 50, 75, 75, 75, 
75, 75, 75, 75, 75, 100, \n100, 100, 100, 100, 100, 100, 100, 150, 150, 150, 150, 150, 150, \n150, 150, 250, 250, 250, 250, 250, 250, 250, 250, 500, 500, 500, \n500, 500, 500, 500, 500, 750, 750, 750, 750, 750, 750, 750, 750, \n1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 30, 30, 30, 30, \n30, 30, 30, 30, 50, 50, 50, 50, 50, 50, 50, 50, 75, 75, 75, 75, \n75, 75, 75, 75, 100, 100, 100, 100, 100, 100, 100, 100, 150, \n150, 150, 150, 150, 150, 150, 150, 250, 250, 250, 250, 250, 250, \n250, 250, 500, 500, 500, 500, 500, 500, 500, 500, 750, 750, 750, \n750, 750, 750, 750, 750, 1000, 1000, 1000, 1000, 1000, 1000, \n1000, 1000, 30, 30, 30, 30, 30, 30, 30, 30, 50, 50, 50, 50, 50, \n50, 50, 50, 75, 75, 75, 75, 75, 75, 75, 75, 100, 100, 100, 100, \n100, 100, 100, 100, 150, 150, 150, 150, 150, 150, 150, 150, 250, \n250, 250, 250, 250, 250, 250, 250, 500, 500, 500, 500, 500, 500, \n500, 500, 750, 750, 750, 750, 750, 750, 750, 750, 1000, 1000, \n1000, 1000, 1000, 1000, 1000, 1000, 30, 30, 30, 30, 30, 30, 30, \n30, 50, 50, 50, 50, 50, 50, 50, 50, 75, 75, 75, 75, 75, 75, 75, \n75, 100, 100, 100, 100, 100, 100, 100, 100, 150, 150, 150, 150, \n150, 150, 150, 150, 250, 250, 250, 250, 250, 250, 250, 250, 500, \n500, 500, 500, 500, 500, 500, 500, 750, 750, 750, 750, 750, 750, \n750, 750, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000), mean_err = c(0.532157164015659, \n0.0344907054484091, 1.09430750651166, 0.847845162156675, 1.72337372126503, \n0.893756658507988, 0.0430309247027736, 0.785821459035346, 0.368063404388512, \n0.0256197760404459, 0.54480529648463, 0.54385597929651, 0.657325657699579, \n0.578611116865437, 0.0400491342691897, 0.515688005217413, 0.256713912589642, \n0.0444582875885996, 0.326975493112402, 0.371128780921122, 0.473515115741104, \n0.389519558405289, 0.105044360789378, 0.301924717299333, 0.207750921776918, \n-0.00318128936770314, 0.215170156089776, 0.27218780048926, 0.260519564021842, \n0.236792923882582, 0.0209319074923902, 0.232957055204834, 
0.0726605917614469, \n-0.00282823355849125, 0.0753216783313991, 0.147121931849656, \n0.0986417724955371, 0.10048009778446, -0.0117861260923649, 0.0650845904350442, \n0.0985144485083747, 0.00601003227249322, 0.107439979908118, 0.120777421732797, \n0.098470489820427, 0.0940946984227826, 0.0340166854141625, 0.107851118082414, \n0.0249685210781582, -0.00275219600614378, 0.0258069390207201, \n0.0303381972274654, -0.000994602151869198, 0.00196854833683764, \n-0.0172319489562159, 0.0230957871932473, 0.00787486424680835, \n-0.018438041997315, -0.000808033567394628, -0.00151904153864496, \n-0.0258118523805697, -0.020984156892953, -0.0411584927473141, \n-0.0462075435919094, 0.0481149217843661, -0.0115241085997692, \n0.0278045708419731, 0.0358588316426625, 0.0424473909450939, 0.0379773233328197, \n0.0298772985879321, 0.0494939492036379, 0.561286668337778, 0.0210874502384648, \n0.607711822769155, 0.944733906256477, 1.32069801051061, 0.623604782428556, \n0.0138075109769806, 0.640859854412358, 0.284873604303057, -0.0170357985365701, \n0.273426417633118, 0.524432737895336, 0.355003110979807, 0.312837607951434, \n-0.0316296929553873, 0.270109834098986, 0.174110335581819, 0.0253101719615279, \n0.199956222742702, 0.375485416120771, 0.278956745944806, 0.244245525945888, \n0.0325126314233263, 0.253352659868514, 0.154840760004461, 0.00231589709639472, \n0.154412480165179, 0.264847742842386, 0.196572744185608, 0.182934783774992, \n0.0207139021497755, 0.17351041376412, 0.14190910348156, -0.000766834484010096, \n0.159975205214477, 0.191222128926019, 0.119768252112669, 0.12033372914036, \n-0.00429047209392067, 0.120982527821078, 0.0756304869484406, \n-0.00890884219048113, 0.0727782693168392, 0.118690444738942, \n0.0814898789647033, 0.0799724348001957, 0.0182926240912726, 0.0887155007804252, \n0.00799604720502299, 0.000599148388616836, -0.00567769035990384, \n0.0358412514670032, 0.0308474979074875, 0.0341668723768997, -0.000180318451026095, \n0.0180733341290925, 0.00456876236626807, 
0.00150574966485876, \n0.011152095953916, 0.0176039119729626, 0.00608274255434991, 0.0146257828313115, \n-0.0108877417404102, 0.00341198000323035, -0.00198459880370283, \n0.0026551895445694, 0.00199040664534129, 0.0150165794544221, \n0.00646287144368147, 0.00999205240904708, -0.00850278571195971, \n0.000833666619266177, 0.714730067273087, -0.00916546079360956, \n0.662799490366986, 1.18552468844156, 1.25901933062308, 0.672701515532179, \n0.0311066140197676, 0.699068058809396, 0.451908043813962, 0.0131716226592205, \n0.429551369887738, 0.697928133235757, 0.445367768988423, 0.408463448982185, \n0.0318707241721211, 0.406284953982951, 0.257736307754364, -0.00525924423719458, \n0.236977000055322, 0.429144726596141, 0.291381752107184, 0.267557606428613, \n0.0103657879176852, 0.254728590646094, 0.187783398487578, -0.00216064381479362, \n0.209025103860707, 0.318293592390017, 0.216751610346408, 0.195630579126633, \n-0.00355723971644246, 0.174111826408428, 0.151010324964235, 0.0152409223899092, \n0.159002511320467, 0.214643583389694, 0.136211731513269, 0.138948149207635, \n0.00736196817594524, 0.115637867729083, 0.0491348055596302, -0.00133957946235943, \n0.0507437758212659, 0.103956325245849, 0.0641182216839426, 0.0721933081297794, \n-0.0124376134651938, 0.0632317888879588, 0.0322195712438111, \n0.00170122889182022, 0.0287526766624194, 0.0589662164030242, \n0.0348535721709848, 0.039527944642463, -0.00617539706415593, \n0.0274246010641889, 0.0325877909680824, 0.00530528253248245, \n0.0221776555499961, 0.0389702052631117, 0.0221602091288215, 0.0254478639695596, \n-0.0016189234058987, 0.0197144417326668, -0.00632485262604172, \n-0.00364979854195596, -0.00276076468388984, 0.0126267527301874, \n0.0123592498266038, 0.0154921632247644, -0.00591512196680815, \n0.0098685016547149, 1.19276916750486, -0.0296831640401583, 0.99406393593888, \n1.95758669445116, 1.45842790978446, 0.916899913902239, -0.0240222410217233, \n1.00771193034927, 0.748151091428865, 0.001671855025917, 
0.665180535306263, \n1.1777049634557, 0.578603609273264, 0.546625362141714, 0.0292152981607387, \n0.615230814912951, 0.417886753756131, 0.00324593885807739, 0.406076310942717, \n0.732191741449251, 0.352684769616612, 0.326901376027897, -0.000759357576337989, \n0.350075431324921, 0.310927617707656, -0.0107255472998434, 0.28102101085112, \n0.514683023356017, 0.24913510139508, 0.235155452507568, -0.0220885572014814, \n0.243370611433649, 0.209652330609093, -0.00502865663759991, 0.2172246261346, \n0.356540958804122, 0.172121720418057, 0.17487914828986, 0.00365942442127361, \n0.176594681455494, 0.126956927057327, -0.00270525933073803, 0.120116234221594, \n0.210827536708082, 0.101520409193932, 0.101379097920023, 0.00238043252144371, \n0.113027315928011, 0.0598624378953727, -0.00538838415690431, \n0.0568400730102315, 0.0978115258288965, 0.0454207684906316, 0.0473140143579152, \n-0.00165813015281622, 0.0521772135812508, 0.0530224090961669, \n-0.000416993405198653, 0.0353236911458531, 0.0605493601241619, \n0.0316204159297213, 0.0344789374555544, -0.00446984887315054, \n0.0328807595695966, 0.0396438546423947, -0.00331466369719113, \n0.0379029847219126, 0.0572435100638761, 0.0253269328104989, 0.0235663211070417, \n0.00220241478536399, 0.0307132312422208), youden = c(0.2, 0.2, \n0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, \n0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, \n0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, \n0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, \n0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, \n0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, \n0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, \n0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, \n0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, \n0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, \n0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 
0.4, 0.4, 0.4, 0.4, 0.6, \n0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, \n0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, \n0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, \n0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, \n0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, \n0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, \n0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, \n0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, \n0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, \n0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, \n0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8\n)), row.names = c(NA, -288L), group_sizes = c(8L, 8L, 8L, 8L, \n8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, \n8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L\n), biggest_group_size = 8L, class = c(\"grouped_df\", \"tbl_df\", \n\"tbl\", \"data.frame\"), groups = structure(list(sim_nr = 1:36, \n    .rows = list(1:8, 9:16, 17:24, 25:32, 33:40, 41:48, 49:56, \n        57:64, 65:72, 73:80, 81:88, 89:96, 97:104, 105:112, 113:120, \n        121:128, 129:136, 137:144, 145:152, 153:160, 161:168, \n        169:176, 177:184, 185:192, 193:200, 201:208, 209:216, \n        217:224, 225:232, 233:240, 241:248, 249:256, 257:264, \n        265:272, 273:280, 281:288)), row.names = c(NA, -36L), class = c(\"tbl_df\", \n\"tbl\", \"data.frame\"), .drop = TRUE))\n```\n\n```{r, echo = FALSE}\nlibrary(dplyr)\nggplot(plotdat_nomidpoints, \n       aes(x = n, y = mean_err, color = method, shape = method)) + \n    geom_line() + geom_point() +\n    facet_wrap(~ youden, scales = \"fixed\") +\n    scale_shape_manual(values = 1:nlevels(plotdat_nomidpoints$method)) +\n    scale_x_log10(breaks = c(30, 50, 75, 100, 150, 250, 500, 750, 1000)) +\n    ggtitle(\"Bias of all methods when use_midpoints = FALSE\",\n            
\"normally distributed data, 10000 repetitions of simulation\")\n```\n\n\n## Finding all cutpoints with acceptable performance\n\nBy default, most packages only return the \"best\" cutpoint and disregard other cutpoints with quite similar performance, even if the performance differences are minuscule. **cutpointr** makes this process more explicit via the `tol_metric` argument. For example, if all cutpoints are of interest that achieve at least an accuracy within `0.05` of the optimally achievable accuracy, `tol_metric` can be set to `0.05` and also those cutpoints will be returned. \n\nIn the case of the `suicide` data and when maximizing the sum of sensitivity and specificity, empirically the cutpoints 2 and 3 lead to quite similar performances. If `tol_metric` is set to `0.05`, both will be returned.\n\n```{r}\nopt_cut \u003c- cutpointr(suicide, dsi, suicide, metric = sum_sens_spec, \n                     tol_metric = 0.05, break_ties = c)\nlibrary(tidyr)\nopt_cut %\u003e% \n    select(optimal_cutpoint, sum_sens_spec) %\u003e% \n    unnest(cols = c(optimal_cutpoint, sum_sens_spec))\n```\n\n\n## Manual and mean / median cutpoints\n\nUsing the `oc_manual` function the optimal cutpoint will not be determined \nbased on, for example, a metric but is instead set manually using the \n`cutpoint` argument. This is useful for supplying and evaluating cutpoints that were found\nin the literature or in other external sources. \n\nThe `oc_manual` function could also be used to set the cutpoint to the sample\nmean using `cutpoint = mean(data$x)`. 
However, this may introduce a bias into the\nbootstrap validation procedure, since the actual mean of the population is\nnot known and thus the mean to be used as the cutpoint should be automatically determined in every resample.\nTo do so, the `oc_mean` and `oc_median` functions can be used.\n\n```{r}\nset.seed(100)\nopt_cut_manual \u003c- cutpointr(suicide, dsi, suicide, method = oc_manual, \n                       cutpoint = mean(suicide$dsi), boot_runs = 30)\nset.seed(100)\nopt_cut_mean \u003c- cutpointr(suicide, dsi, suicide, method = oc_mean, boot_runs = 30)\n```\n\n\n\n## Nonstandard evaluation via tidyeval\n\nThe arguments to `cutpointr` do not need to be enclosed in quotes. This is \npossible thanks to nonstandard evaluation of the arguments, which are \nevaluated on `data`. \n\nFunctions that use nonstandard evaluation are often not suitable for \nprogramming with. The use of nonstandard evaluation may lead to scoping \nproblems and subsequent obvious as well as possibly subtle errors. \n**cutpointr** uses tidyeval internally and accordingly the same rules as \nfor programming with `dplyr` apply. Arguments can be unquoted with `!!`:\n\n```{r, eval = FALSE}\nmyvar \u003c- \"dsi\"\ncutpointr(suicide, !!myvar, suicide)\n```\n\n\n## ROC curve and optimal cutpoint for multiple variables\n\nAlternatively, we can map the standard evaluation version `cutpointr` to \nthe column names. If `direction` and / or `pos_class` and `neg_class` are unspecified, these parameters\nwill automatically be determined by **cutpointr** so that the AUC values for all\nvariables will be $\u003e 0.5$.\n\nWe could do this manually, e.g. using `purrr::map`, but to make this task more convenient \n`multi_cutpointr` can be used\nto achieve the same result. 
It maps multiple predictor columns to \n`cutpointr`, by default all numeric columns except for the class column.\n\n```{r}\nmcp \u003c- multi_cutpointr(suicide, class = suicide, pos_class = \"yes\", \n                use_midpoints = TRUE, silent = TRUE) \nsummary(mcp)\n```\n\n\n## Accessing `data`, `roc_curve`, and `boot` \n\nThe object returned by `cutpointr` is of the classes `cutpointr`, `tbl_df`,\n`tbl`, and `data.frame`. Thus, it can be handled like a usual data frame. The\ncolumns `data`, `roc_curve`, and `boot` consist of nested data frames, which means that\nthese are list columns whose elements are data frames. They can either be accessed\nusing `[` or by using functions from the tidyverse. If subgroups were given, \nthe output contains one row per subgroup and the function \nthat accesses the data should be mapped to every row or the data should be \ngrouped by subgroup.\n\n```{r, cache=TRUE}\n# Extracting the bootstrap results\nset.seed(123)\nopt_cut \u003c- cutpointr(suicide, dsi, suicide, gender, boot_runs = 1000)\n\n# Using base R to summarise the result of the bootstrap\nsummary(opt_cut$boot[[1]]$optimal_cutpoint)\nsummary(opt_cut$boot[[2]]$optimal_cutpoint)\n\n# Using dplyr and tidyr\nlibrary(tidyr)\nopt_cut %\u003e% \n  group_by(subgroup) %\u003e% \n  select(boot) %\u003e% \n  unnest(boot) %\u003e% \n  summarise(sd_oc_boot = sd(optimal_cutpoint),\n            m_oc_boot  = mean(optimal_cutpoint),\n            m_acc_oob  = mean(acc_oob))\n```\n\n\n\n## Adding metrics to the result of cutpointr() or roc()\n\nBy default, the output of `cutpointr` includes the optimized metric and several\nother metrics. The `add_metric` function adds further metrics. 
\nHere, we're adding the negative predictive value (NPV) and\nthe positive predictive value (PPV) at the optimal cutpoint per subgroup:\n\n```{r}\ncutpointr(suicide, dsi, suicide, gender, metric = youden, silent = TRUE) %\u003e% \n    add_metric(list(ppv, npv)) %\u003e% \n    select(subgroup, optimal_cutpoint, youden, ppv, npv)\n```\n\nIn the same fashion, additional metric columns can be added to a `roc_cutpointr`\nobject:\n\n```{r}\nroc(data = suicide, x = dsi, class = suicide, pos_class = \"yes\",\n    neg_class = \"no\", direction = \"\u003e=\") %\u003e% \n  add_metric(list(cohens_kappa, F1_score)) %\u003e% \n  select(x.sorted, tp, fp, tn, fn, cohens_kappa, F1_score) %\u003e% \n  head()\n```\n\n\n## User-defined functions\n\n### method\n\nUser-defined functions can be supplied to `method`, which is the function that\nis responsible for returning the optimal cutpoint.\nTo define a new method function, create a function that may take\nas input(s):\n\n- `data`: A `data.frame` or `tbl_df`\n- `x`: (character) The name of the predictor variable\n- `class`: (character) The name of the class variable\n- `metric_func`: A function for calculating a metric, e.g. accuracy. Note\n that the method function does not necessarily have to accept this argument\n- `pos_class`: The positive class\n- `neg_class`: The negative class\n- `direction`: `\"\u003e=\"` if the positive class has higher x values, `\"\u003c=\"` otherwise\n- `tol_metric`: (numeric) In the built-in methods, all cutpoints will be returned that lead to a metric\nvalue in the interval [m_max - tol_metric, m_max + tol_metric] where\nm_max is the maximum achievable metric value. 
This can be used to return\nmultiple decent cutpoints and to avoid floating-point problems.\n- `use_midpoints`: (logical) In the built-in methods, if TRUE (default FALSE) the returned optimal\ncutpoint will be the mean of the optimal cutpoint and the next highest\nobservation (for direction = \"\u003e=\") or the next lowest observation\n(for direction = \"\u003c=\") which avoids biasing the optimal cutpoint.\n- `...`: Further arguments that are passed to `metric` or that can be captured\ninside of `method`\n\nThe function should return a data frame or tibble with\none row, the column `optimal_cutpoint`, and an optional column with an arbitrary name\nwith the metric value at the optimal cutpoint.\n\nFor example, a function for choosing the cutpoint as the mean of the independent\nvariable could look like this:\n\n```{r, eval = FALSE}\nmean_cut \u003c- function(data, x, ...) {\n    oc \u003c- mean(data[[x]])\n    return(data.frame(optimal_cutpoint = oc))\n}\n```\n\nIf a `method` function does not return a metric column, the default `sum_sens_spec`, the sum of sensitivity and \nspecificity, is returned as the extra metric column in addition to accuracy, \nsensitivity and specificity.\n\nSome `method` functions that make use of the additional arguments (that are \ncaptured by `...`) are already included in **cutpointr**, see\nthe list at the top. Since these functions are arguments to `cutpointr` \ntheir code can be accessed by simply typing their name, see for example\n`oc_youden_normal`. \n\n### metric\n\nUser-defined `metric` functions can be used as well. They are mainly useful in\nconjunction with `method = maximize_metric`, `method = minimize_metric`, or one of\nthe other minimization and maximization functions. \nIn case of a different `method` function, `metric` will only be used as the main\nout-of-bag metric when plotting the result. 
The `metric` function should \naccept the following inputs as vectors:\n\n- `tp`: Vector of true positives\n- `fp`: Vector of false positives\n- `tn`: Vector of true negatives\n- `fn`: Vector of false negatives\n- `...`: Further arguments\n\nThe function should return a numeric vector, a matrix, or a `data.frame` with one column. If the column is named, the name will be included in the output and plots. Avoid using names that are identical to the column names that are by default returned by **cutpointr**, as such names will be prefixed by `metric_` in the output. The inputs (`tp`, `fp`, `tn`, and `fn`) are vectors. \nThe code of the included metric functions can be accessed by simply typing their name.\n\nFor example, this is the `misclassification_cost` metric function:\n\n```{r}\nmisclassification_cost\n```\n\n\n# Plotting\n\n**cutpointr** includes several convenience functions for plotting data from a \n`cutpointr` object. These include:\n\n- `plot_cutpointr`: General purpose plotting function for cutpointr or roc_cutpointr objects\n- `plot_cut_boot`: Plot the bootstrapped distribution of optimal cutpoints\n- `plot_metric`: If `maximize_metric` or `minimize_metric` was used this function\nplots all possible cutoffs on the x-axis vs. the respective metric values on\nthe y-axis. If bootstrapping was run, a confidence interval based on the \nbootstrapped distribution of metric values at each cutpoint can be displayed. \nTo display no confidence interval set `conf_lvl = 0`.\n- `plot_metric_boot`: Plot the distribution of out-of-bag metric values\n- `plot_precision_recall`: Plot the precision recall curve\n- `plot_sensitivity_specificity`: Plot all cutpoints vs. 
sensitivity and specificity\n- `plot_roc`: Plot the ROC curve\n- `plot_x`: Plot the distribution of the predictor variable\n\n```{r, fig.width=4, fig.height=3}\nset.seed(102)\nopt_cut \u003c- cutpointr(suicide, dsi, suicide, gender, method = minimize_metric,\n                     metric = abs_d_sens_spec, boot_runs = 200, silent = TRUE)\nopt_cut\nplot_cut_boot(opt_cut)\nplot_metric(opt_cut, conf_lvl = 0.9)\nplot_metric_boot(opt_cut)\nplot_precision_recall(opt_cut)\nplot_sensitivity_specificity(opt_cut)\nplot_roc(opt_cut)\n```\n\nAll plot functions, except for the standard plot method that returns\na composed plot, return `ggplot` objects\nthat can be further modified. For example, changing labels, title, and the theme\ncan be achieved this way:\n\n```{r, fig.width=4, fig.height=3}\np \u003c- plot_x(opt_cut)\np + ggtitle(\"Distribution of dsi\") + theme_minimal() + xlab(\"Depression score\")\n```\n\n## Flexible plotting function\n\nUsing `plot_cutpointr`, any metric can be chosen to be plotted on the x- or\ny-axis and results of `cutpointr()` as well as `roc()` can be plotted.\nIf a `cutpointr` object is to be plotted, it is thus irrelevant which `metric` \nfunction was chosen for cutpoint estimation. Any metric that can be calculated\nbased on the ROC curve can be subsequently plotted as only the true / false\npositives / negatives over all cutpoints are needed.\nThat way, not only the above plots can be produced, but also any \ncombination of two metrics (or metric functions) and / or cutpoints. The built-in\nmetric functions as well as user-defined functions or anonymous functions can\nbe supplied to `xvar` and `yvar`. If bootstrapping was run, confidence intervals\ncan be plotted around the y-variable. This is especially useful if the cutpoints,\navailable via the `cutpoints` function, are placed on the x-axis. \nNote that confidence intervals can only be correctly plotted if the values of \n`xvar` are constant across bootstrap samples. 
For example, confidence intervals \nfor TPR by FPR (a ROC curve) cannot be plotted easily, as the values of the false \npositive rate vary per bootstrap sample.\n\n```{r, fig.width=4, fig.height=3, cache=TRUE}\nset.seed(500)\noc \u003c- cutpointr(suicide, dsi, suicide, boot_runs = 1000, \n                metric = sum_ppv_npv) # metric irrelevant for plot_cutpointr\nplot_cutpointr(oc, xvar = cutpoints, yvar = sum_sens_spec, conf_lvl = 0.9)\nplot_cutpointr(oc, xvar = fpr, yvar = tpr, aspect_ratio = 1, conf_lvl = 0)\nplot_cutpointr(oc, xvar = cutpoint, yvar = tp, conf_lvl = 0.9) + geom_point()\n```\n\n\n## Manual plotting\n\nSince `cutpointr` returns a `data.frame` with the original data, bootstrap\nresults, and the ROC curve in nested tibbles, these data can be conveniently \nextracted and plotted manually. The relevant\nnested tibbles are in the columns `data`, `roc_curve` and `boot`. The following\nis an example of accessing and plotting the grouped data.\n\n```{r, fig.width=4, fig.height=3, cache=TRUE}\nset.seed(123) \nopt_cut \u003c- cutpointr(suicide, dsi, suicide, gender, boot_runs = 1000)\n\nopt_cut %\u003e% \n    select(data, subgroup) %\u003e% \n    unnest(cols = data) %\u003e% \n    ggplot(aes(x = suicide, y = dsi)) + \n    geom_boxplot(alpha = 0.3) + facet_grid(~subgroup)\n```\n\n\n# Benchmarks\n\nTo offer a comparison to established solutions,\n**cutpointr** 1.0.0 will be benchmarked against `optimal.cutpoints` \nfrom **OptimalCutpoints** 1.1-4, **ThresholdROC** 2.7 and custom functions based on\n**ROCR** 1.0-7 and **pROC** 1.15.0. By generating data of different sizes, \nthe benchmarks will offer a comparison of the scalability of the different \nsolutions. \n\nUsing `prediction` and `performance` from the **ROCR** package and `roc` from the\n**pROC** package, we can write functions for computing the cutpoint that maximizes the sum of sensitivity and\nspecificity. 
**pROC** has a built-in function to optimize a few metrics:\n\n```{r, eval = FALSE}\n# Return cutpoint that maximizes the sum of sensitivity and specificity\n# ROCR package\nrocr_sensspec \u003c- function(x, class) {\n    pred \u003c- ROCR::prediction(x, class)\n    perf \u003c- ROCR::performance(pred, \"sens\", \"spec\")\n    sens \u003c- slot(perf, \"y.values\")[[1]]\n    spec \u003c- slot(perf, \"x.values\")[[1]]\n    cut \u003c- slot(perf, \"alpha.values\")[[1]]\n    cut[which.max(sens + spec)]\n}\n\n# pROC package\nproc_sensspec \u003c- function(x, class) {\n    r \u003c- pROC::roc(class, x, algorithm = 2, levels = c(0, 1), direction = \"\u003c\")\n    pROC::coords(r, \"best\", ret=\"threshold\", transpose = FALSE)[1]\n}\n```\n\nThe benchmarking will be carried out using the **microbenchmark** package and randomly\ngenerated data. The values of the `x` predictor variable are drawn from a normal distribution\nwhich leads to a lot more unique values than were encountered before in the \n`suicide` data. Accordingly, the search for an optimal cutpoint is much more \ndemanding if all possible cutpoints are evaluated.\n\nBenchmarks are run for sample sizes of 100, 1000, 1e4, 1e5, 1e6, and 1e7.\nFor low sample sizes **cutpointr** is slower than the other\nsolutions. While this should be of low practical importance, **cutpointr** scales\nmore favorably with increasing sample size. The speed disadvantage in small\nsamples that leads to the lower limit of around 25ms is mainly due to the nesting\nof the original data and the results that makes the compact output of `cutpointr`\npossible. This observation is emphasized by the fact that `cutpointr::roc` is \nquite fast also in small samples. For sample sizes \u003e 1e5 **cutpointr**\nis a little faster than the functions based on **ROCR** and **pROC**. Both of these \nsolutions are generally faster than **OptimalCutpoints** and **ThresholdROC** with the exception of\nsmall samples. 
**OptimalCutpoints** and **ThresholdROC** had to be excluded from benchmarks with \nmore than 1e4 observations due to high memory requirements and/or excessive run times, rendering\nthe use of these packages in larger samples impractical.\n\n\n```{r, eval = FALSE, echo = FALSE}\nlibrary(OptimalCutpoints)\nlibrary(ThresholdROC)\nn \u003c- 100\nset.seed(123)\ndat \u003c- data.frame(x = rnorm(n), y = sample(c(0:1), size = n, replace = TRUE))\nx_pos \u003c- dat$x[dat$y == 1]\nx_neg \u003c- dat$x[dat$y == 0]\nbench_100 \u003c- microbenchmark::microbenchmark(\n    cutpointr(dat, x, y, pos_class = 1, neg_class = 0,\n              direction = \"\u003e=\", metric = youden, break_ties = mean),\n    rocr_sensspec(dat$x, dat$y),\n    proc_sensspec(dat$x, dat$y),\n    optimal.cutpoints(X = \"x\", status = \"y\", tag.healthy = 0, methods = \"Youden\",\n                      data = dat),\n    thres2(k1 = x_neg, k2 = x_pos, rho = 0.5,\n           method = \"empirical\", ci = FALSE),\n    times = 100, unit = \"ms\"\n)\n\nn \u003c- 1000\nset.seed(123)\ndat \u003c- data.frame(x = rnorm(n), y = sample(c(0:1), size = n, replace = TRUE))\nx_pos \u003c- dat$x[dat$y == 1]\nx_neg \u003c- dat$x[dat$y == 0]\nbench_1000 \u003c- microbenchmark::microbenchmark(\n    cutpointr(dat, x, y, pos_class = 1, neg_class = 0,\n              direction = \"\u003e=\", metric = youden, break_ties = mean),\n    rocr_sensspec(dat$x, dat$y),\n    proc_sensspec(dat$x, dat$y),\n    optimal.cutpoints(X = \"x\", status = \"y\", tag.healthy = 0, methods = \"Youden\",\n                      data = dat),\n    thres2(k1 = x_neg, k2 = x_pos, rho = 0.5,\n           method = \"empirical\", ci = FALSE),\n    times = 100, unit = \"ms\"\n)\n\nn \u003c- 10000\nset.seed(123)\ndat \u003c- data.frame(x = rnorm(n), y = sample(c(0:1), size = n, replace = TRUE))\nx_pos \u003c- dat$x[dat$y == 1]\nx_neg \u003c- dat$x[dat$y == 0]\nbench_10000 \u003c- microbenchmark::microbenchmark(\n    cutpointr(dat, x, y, pos_class = 1, neg_class = 
0,\n              direction = \"\u003e=\", metric = youden, break_ties = mean, silent = TRUE),\n    rocr_sensspec(dat$x, dat$y),\n    optimal.cutpoints(X = \"x\", status = \"y\", tag.healthy = 0, methods = \"Youden\",\n                      data = dat),\n    proc_sensspec(dat$x, dat$y),\n    thres2(k1 = x_neg, k2 = x_pos, rho = 0.5,\n           method = \"empirical\", ci = FALSE),\n    times = 100\n)\n\nn \u003c- 1e5\nset.seed(123)\ndat \u003c- data.frame(x = rnorm(n), y = sample(c(0:1), size = n, replace = TRUE))\nbench_1e5 \u003c- microbenchmark::microbenchmark(\n    cutpointr(dat, x, y, pos_class = 1, neg_class = 0,\n              direction = \"\u003e=\", metric = youden, break_ties = mean),\n    rocr_sensspec(dat$x, dat$y),\n    proc_sensspec(dat$x, dat$y),\n    times = 100, unit = \"ms\"\n)\n\nn \u003c- 1e6\nset.seed(123)\ndat \u003c- data.frame(x = rnorm(n), y = sample(c(0:1), size = n, replace = TRUE))\nbench_1e6 \u003c- microbenchmark::microbenchmark(\n    cutpointr(dat, x, y, pos_class = 1, neg_class = 0,\n              direction = \"\u003e=\", metric = youden, break_ties = mean),\n    rocr_sensspec(dat$x, dat$y),\n    proc_sensspec(dat$x, dat$y),\n    times = 30, unit = \"ms\"\n)\n\nn \u003c- 1e7\nset.seed(123)\ndat \u003c- data.frame(x = rnorm(n), y = sample(c(0:1), size = n, replace = TRUE))\nbench_1e7 \u003c- microbenchmark::microbenchmark(\n    cutpointr(dat, x, y, pos_class = 1, neg_class = 0,\n              direction = \"\u003e=\", metric = youden, break_ties = mean),\n    rocr_sensspec(dat$x, dat$y),\n    proc_sensspec(dat$x, dat$y),\n    times = 30, unit = \"ms\"\n)\n\nresults \u003c- rbind(\n    data.frame(time = summary(bench_100)$median,\n               Solution = summary(bench_100)$expr,\n               n = 100),\n    data.frame(time = summary(bench_1000)$median,\n               Solution = summary(bench_1000)$expr,\n               n = 1000),\n    data.frame(time = summary(bench_10000)$median,\n               Solution = 
summary(bench_10000)$expr,\n               n = 10000),\n    data.frame(time = summary(bench_1e5)$median,\n               Solution = summary(bench_1e5)$expr,\n               n = 1e5),\n    data.frame(time = summary(bench_1e6)$median,\n               Solution = summary(bench_1e6)$expr,\n               n = 1e6),\n    data.frame(time = summary(bench_1e7)$median,\n               Solution = summary(bench_1e7)$expr,\n               n = 1e7)\n)\nresults$Solution \u003c- as.character(results$Solution)\nresults$Solution[grep(pattern = \"cutpointr\", x = results$Solution)] \u003c- \"cutpointr\"\nresults$Solution[grep(pattern = \"rocr\", x = results$Solution)] \u003c- \"ROCR\"\nresults$Solution[grep(pattern = \"optimal\", x = results$Solution)] \u003c- \"OptimalCutpoints\"\nresults$Solution[grep(pattern = \"proc\", x = results$Solution)] \u003c- \"pROC\"\nresults$Solution[grep(pattern = \"thres\", x = results$Solution)] \u003c- \"ThresholdROC\"\n\nresults$task \u003c- \"Cutpoint Estimation\"\n```\n\n```{r, echo = FALSE}\n# These are the original results on our system\n# dput(results)\nresults \u003c- structure(list(time = c(4.5018015, 1.812802, 0.662101, 2.2887015, \n1.194301, 4.839401, 2.1764015, 0.981001, 45.0568005, 36.2398515, \n8.5662515, 5.667101, 2538.612001, 4.031701, 2503.8012505, 45.384501, \n43.118751, 37.150151, 465.003201, 607.023851, 583.0950005, 5467.332801, \n7850.2587, 7339.356101), Solution = c(\"cutpointr\", \"ROCR\", \"pROC\", \n\"OptimalCutpoints\", \"ThresholdROC\", \"cutpointr\", \"ROCR\", \"pROC\", \n\"OptimalCutpoints\", \"ThresholdROC\", \"cutpointr\", \"ROCR\", \"OptimalCutpoints\", \n\"pROC\", \"ThresholdROC\", \"cutpointr\", \"ROCR\", \"pROC\", \"cutpointr\", \n\"ROCR\", \"pROC\", \"cutpointr\", \"ROCR\", \"pROC\"), n = c(100, 100, \n100, 100, 100, 1000, 1000, 1000, 1000, 1000, 10000, 10000, 10000, \n10000, 10000, 1e+05, 1e+05, 1e+05, 1e+06, 1e+06, 1e+06, 1e+07, \n1e+07, 1e+07), task = c(\"Cutpoint Estimation\", \"Cutpoint Estimation\", 
\n\"Cutpoint Estimation\", \"Cutpoint Estimation\", \"Cutpoint Estimation\", \n\"Cutpoint Estimation\", \"Cutpoint Estimation\", \"Cutpoint Estimation\", \n\"Cutpoint Estimation\", \"Cutpoint Estimation\", \"Cutpoint Estimation\", \n\"Cutpoint Estimation\", \"Cutpoint Estimation\", \"Cutpoint Estimation\", \n\"Cutpoint Estimation\", \"Cutpoint Estimation\", \"Cutpoint Estimation\", \n\"Cutpoint Estimation\", \"Cutpoint Estimation\", \"Cutpoint Estimation\", \n\"Cutpoint Estimation\", \"Cutpoint Estimation\", \"Cutpoint Estimation\", \n\"Cutpoint Estimation\")), row.names = c(NA, -24L), class = \"data.frame\")\n```\n\n\n```{r, eval = FALSE}\n# ROCR package\nrocr_roc \u003c- function(x, class) {\n    pred \u003c- ROCR::prediction(x, class)\n    perf \u003c- ROCR::performance(pred, \"sens\", \"spec\")\n    return(NULL)\n}\n\n# pROC package\nproc_roc \u003c- function(x, class) {\n    r \u003c- pROC::roc(class, x, algorithm = 2, levels = c(0, 1), direction = \"\u003c\")\n    return(NULL)\n}\n```\n\n\n\n```{r, eval = FALSE, echo = FALSE}\nn \u003c- 100\nset.seed(123)\ndat \u003c- data.frame(x = rnorm(n), y = sample(c(0:1), size = n, replace = TRUE))\nbench_100 \u003c- microbenchmark::microbenchmark(\n    cutpointr::roc(dat, \"x\", \"y\", pos_class = 1,\n                   neg_class = 0, direction = \"\u003e=\"),\n    rocr_roc(dat$x, dat$y),\n    proc_roc(dat$x, dat$y),\n    times = 100, unit = \"ms\"\n)\nn \u003c- 1000\nset.seed(123)\ndat \u003c- data.frame(x = rnorm(n), y = sample(c(0:1), size = n, replace = TRUE))\nbench_1000 \u003c- microbenchmark::microbenchmark(\n    cutpointr::roc(dat, \"x\", \"y\", pos_class = 1, neg_class = 0,\n                   direction = \"\u003e=\"),\n    rocr_roc(dat$x, dat$y),\n    proc_roc(dat$x, dat$y),\n    times = 100, unit = \"ms\"\n)\nn \u003c- 10000\nset.seed(123)\ndat \u003c- data.frame(x = rnorm(n), y = sample(c(0:1), size = n, replace = TRUE))\nbench_10000 \u003c- microbenchmark::microbenchmark(\n    cutpointr::roc(dat, \"x\", 
\"y\", pos_class = 1, neg_class = 0,\n                   direction = \"\u003e=\"),\n    rocr_roc(dat$x, dat$y),\n    proc_roc(dat$x, dat$y),\n    times = 100, unit = \"ms\"\n)\nn \u003c- 1e5\nset.seed(123)\ndat \u003c- data.frame(x = rnorm(n), y = sample(c(0:1), size = n, replace = TRUE))\nbench_1e5 \u003c- microbenchmark::microbenchmark(\n    cutpointr::roc(dat, \"x\", \"y\", pos_class = 1, neg_class = 0,\n                   direction = \"\u003e=\"),\n    rocr_roc(dat$x, dat$y),\n    proc_roc(dat$x, dat$y),\n    times = 100, unit = \"ms\"\n)\nn \u003c- 1e6\nset.seed(123)\ndat \u003c- data.frame(x = rnorm(n), y = sample(c(0:1), size = n, replace = TRUE))\nbench_1e6 \u003c- microbenchmark::microbenchmark(\n    cutpointr::roc(dat, \"x\", \"y\", pos_class = 1, neg_class = 0,\n                   direction = \"\u003e=\"),\n    rocr_roc(dat$x, dat$y),\n    proc_roc(dat$x, dat$y),\n    times = 30, unit = \"ms\"\n)\nn \u003c- 1e7\nset.seed(123)\ndat \u003c- data.frame(x = rnorm(n), y = sample(c(0:1), size = n, replace = TRUE))\nbench_1e7 \u003c- microbenchmark::microbenchmark(\n    cutpointr::roc(dat, \"x\", \"y\", pos_class = 1, neg_class = 0,\n                   direction = \"\u003e=\"),\n    rocr_roc(dat$x, dat$y),\n    proc_roc(dat$x, dat$y),\n    times = 30, unit = \"ms\"\n)\n\nresults_roc \u003c- rbind(\n    data.frame(time = summary(bench_100)$median,\n               Solution = summary(bench_100)$expr,\n               n = 100),\n    data.frame(time = summary(bench_1000)$median,\n               Solution = summary(bench_1000)$expr,\n               n = 1000),\n    data.frame(time = summary(bench_10000)$median,\n               Solution = summary(bench_10000)$expr,\n               n = 10000),\n    data.frame(time = summary(bench_1e5)$median,\n               Solution = summary(bench_1e5)$expr,\n               n = 1e5),\n    data.frame(time = summary(bench_1e6)$median,\n               Solution = summary(bench_1e6)$expr,\n               n = 1e6),\n    data.frame(time = 
summary(bench_1e7)$median,\n               Solution = summary(bench_1e7)$expr,\n               n = 1e7)\n)\nresults_roc$Solution \u003c- as.character(results_roc$Solution)\nresults_roc$Solution[grep(pattern = \"cutpointr\", x = results_roc$Solution)] \u003c- \"cutpointr\"\nresults_roc$Solution[grep(pattern = \"rocr\", x = results_roc$Solution)] \u003c- \"ROCR\"\nresults_roc$Solution[grep(pattern = \"proc\", x = results_roc$Solution)] \u003c- \"pROC\"\nresults_roc$task \u003c- \"ROC curve calculation\"\n```\n\n```{r, echo = FALSE}\n# Our results\nresults_roc \u003c- structure(list(time = c(0.7973505, 1.732651, 0.447701, 0.859301, \n2.0358515, 0.694802, 1.878151, 5.662151, 3.6580505, 11.099251, \n42.8208515, 35.3293005, 159.8100505, 612.471901, 610.4337005, \n2032.693551, 7806.3854515, 7081.897251), Solution = c(\"cutpointr\", \n\"ROCR\", \"pROC\", \"cutpointr\", \"ROCR\", \"pROC\", \"cutpointr\", \"ROCR\", \n\"pROC\", \"cutpointr\", \"ROCR\", \"pROC\", \"cutpointr\", \"ROCR\", \"pROC\", \n\"cutpointr\", \"ROCR\", \"pROC\"), n = c(100, 100, 100, 1000, 1000, \n1000, 10000, 10000, 10000, 1e+05, 1e+05, 1e+05, 1e+06, 1e+06, \n1e+06, 1e+07, 1e+07, 1e+07), task = c(\"ROC curve calculation\", \n\"ROC curve calculation\", \"ROC curve calculation\", \"ROC curve calculation\", \n\"ROC curve calculation\", \"ROC curve calculation\", \"ROC curve calculation\", \n\"ROC curve calculation\", \"ROC curve calculation\", \"ROC curve calculation\", \n\"ROC curve calculation\", \"ROC curve calculation\", \"ROC curve calculation\", \n\"ROC curve calculation\", \"ROC curve calculation\", \"ROC curve calculation\", \n\"ROC curve calculation\", \"ROC curve calculation\")), row.names = c(NA, \n-18L), class = \"data.frame\")\n```\n\n```{r, echo = FALSE}\nresults_all \u003c- dplyr::bind_rows(results, results_roc)\n\nggplot(results_all, aes(x = n, y = time, col = Solution, shape = Solution)) +\n  geom_point(size = 3) + geom_line() +\n  scale_y_log10(breaks = c(0.5, 1, 2, 3, 5, 10, 25, 100, 250, 
1000, 5000, 1e4, 15000)) +\n  scale_x_log10(breaks = c(100, 1000, 1e4, 1e5, 1e6, 1e7)) +\n  ylab(\"Median Time (milliseconds, log scale)\") + xlab(\"Sample Size (log scale)\") +\n  theme_bw() +\n  theme(legend.position = \"bottom\", \n        legend.key.width = unit(0.8, \"cm\"), \n        panel.spacing = unit(1, \"lines\")) +\n  facet_grid(~task)\n```\n\n\n```{r, echo = FALSE}\nres_table \u003c- tidyr::spread(results_all, Solution, time) %\u003e% \n  arrange(task)\nknitr::kable(res_table)\n```","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fthie1e%2Fcutpointr","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fthie1e%2Fcutpointr","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fthie1e%2Fcutpointr/lists"}