https://github.com/tidymodels/tidyposterior

Bayesian comparisons of models using resampled statistics

Host: GitHub
URL: https://github.com/tidymodels/tidyposterior
Owner: tidymodels
License: other
Created: 2017-10-15T17:39:33.000Z (about 7 years ago)
Default Branch: main
Last Pushed: 2024-10-17T15:29:44.000Z (2 months ago)
Last Synced: 2024-12-11T00:01:34.407Z (11 days ago)
Language: R
Homepage: https://tidyposterior.tidymodels.org
Size: 32.3 MB
Stars: 103
Watchers: 8
Forks: 10
Open Issues: 6
Metadata Files:
- Readme: README.Rmd
- Changelog: NEWS.md
- License: LICENSE
- Code of conduct: .github/CODE_OF_CONDUCT.md

README

---
output: github_document
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)
```

# tidyposterior 

[![R-CMD-check](https://github.com/tidymodels/tidyposterior/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/tidymodels/tidyposterior/actions/workflows/R-CMD-check.yaml)

[![Codecov test coverage](https://codecov.io/gh/tidymodels/tidyposterior/branch/main/graph/badge.svg)](https://app.codecov.io/gh/tidymodels/tidyposterior?branch=main)

[![CRAN_Status_Badge](http://www.r-pkg.org/badges/version/tidyposterior)](https://CRAN.r-project.org/package=tidyposterior)

[![Downloads](http://cranlogs.r-pkg.org/badges/tidyposterior)](https://CRAN.r-project.org/package=tidyposterior)

![](https://img.shields.io/badge/lifecycle-maturing-blue.svg)

This package can be used to conduct _post hoc_ analyses of resampling results generated by models. 

For example, if two models are evaluated with the root mean squared error (RMSE) using 10-fold cross-validation, there are 10 paired statistics. These can be used to make comparisons between models without involving a test set. 
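As a rough sketch of why these statistics are treated as paired, the base R snippet below simulates per-fold RMSE values for two hypothetical models (`model_a` and `model_b`, names and numbers invented for illustration) that share a fold-to-fold effect, then compares them fold by fold:

``` r
# Simulated per-fold RMSE values, for illustration only (not real results)
set.seed(1)
fold_effect <- rnorm(10, sd = 0.3) # shared difficulty of each fold

rmse <- data.frame(
  id      = sprintf("Fold%02d", 1:10),
  model_a = 2.0 + fold_effect + rnorm(10, sd = 0.1),
  model_b = 2.1 + fold_effect + rnorm(10, sd = 0.1)
)

# Both models are scored on the same folds, so compare them fold by fold
rmse_diff <- rmse$model_a - rmse$model_b
mean(rmse_diff)

# The within-fold correlation is what makes the statistics "paired"
cor(rmse$model_a, rmse$model_b)
```

Working with the differences removes the shared fold effect, which is the kind of resample-to-resample variation a hierarchical model can account for.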

There is a rich literature on the analysis of model resampling results, such as McLachlan's [_Discriminant Analysis and Statistical Pattern Recognition_](https://books.google.com/books?id=O_qHDLaWpDUC&lpg=PR7&ots=6GJnIREXZM&dq=%22Discriminant%20Analysis%20and%20Statistical%20Pattern%20Recognition%22&lr&pg=PR7#v=onepage&q=%22Discriminant%20Analysis%20and%20Statistical%20Pattern%20Recognition%22&f=false) and the references therein. This package follows _the spirit_ of [Benavoli _et al._ (2017)](https://people.idsia.ch//~marco/papers/2017jmlr-tests.pdf).

tidyposterior uses Bayesian generalized linear models for this purpose and can be considered an upgraded version of the [`caret::resamples()`](https://topepo.github.io/caret/model-training-and-tuning.html#exploring-and-comparing-resampling-distributions) function. The package works with [rsample](https://rsample.tidymodels.org/) objects natively, but any results stored in a data frame can be used.

See [Chapter 11](https://www.tmwr.org/compare.html) of [_Tidy Modeling with R_](https://www.tmwr.org) for examples and more details.

## Installation

You can install the released version of tidyposterior from [CRAN](https://CRAN.R-project.org) with:

``` r
install.packages("tidyposterior")
```

And the development version from [GitHub](https://github.com/) with:

``` r
# install.packages("pak")
pak::pak("tidymodels/tidyposterior")
```

## Example

To illustrate, here are some example objects using 10-fold cross-validation for a simple two-class problem: 

```{r setup, results = "hide"}
library(tidymodels)
library(tidyposterior)

data(two_class_dat, package = "modeldata")

set.seed(100)
folds <- vfold_cv(two_class_dat)
```

We can define two different models (for simplicity, with no tuning parameters).

```{r model-specs}
logistic_reg_glm_spec <-
  logistic_reg() %>%
  set_engine('glm')

mars_earth_spec <-
  mars(prod_degree = 1) %>%
  set_engine('earth') %>%
  set_mode('classification')
```

With tidymodels, the `tune::fit_resamples()` function can be used to estimate performance for each model on each resample:

```{r tm-resamples}
rs_ctrl <- control_resamples(save_workflow = TRUE)

logistic_reg_glm_res <-
  logistic_reg_glm_spec %>%
  fit_resamples(Class ~ ., resamples = folds, control = rs_ctrl)

mars_earth_res <-
  mars_earth_spec %>%
  fit_resamples(Class ~ ., resamples = folds, control = rs_ctrl)
```

From these, there are several ways to pass the results to the `perf_mod()` function. The most general approach is to have a data frame with the resampling labels (i.e., one or more id columns) as well as columns for each model that you would like to compare. 
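To make that layout concrete, here is a minimal base R sketch of such a data frame for a hypothetical 2-repeat, 5-fold cross-validation. The model names (`model_a`, `model_b`) and the accuracy values are simulated purely for illustration, and the `id`/`id2` naming follows the rsample convention for repeated cross-validation; that `perf_mod()` picks up both columns as resampling labels is an assumption based on the description above.

``` r
# Simulated accuracy values for 2 repeats of 5-fold cross-validation;
# the numbers exist only to show the column layout
set.seed(1)
resamples_layout <- data.frame(
  id      = rep(c("Repeat1", "Repeat2"), each = 5),
  id2     = rep(sprintf("Fold%d", 1:5), times = 2),
  model_a = runif(10, min = 0.80, max = 0.90),
  model_b = runif(10, min = 0.78, max = 0.88)
)

# One row per resample; every non-id column holds one model's statistic
str(resamples_layout)
```

Each row is a single resample, so the values in a row are directly comparable across models.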

For the model results above, `tune::collect_metrics()` can be used along with some basic data manipulation steps:

```{r df-results}
logistic_roc <-
  collect_metrics(logistic_reg_glm_res, summarize = FALSE) %>%
  dplyr::filter(.metric == "roc_auc") %>%
  dplyr::select(id, logistic = .estimate)

mars_roc <-
  collect_metrics(mars_earth_res, summarize = FALSE) %>%
  dplyr::filter(.metric == "roc_auc") %>%
  dplyr::select(id, mars = .estimate)

resamples_df <- full_join(logistic_roc, mars_roc, by = "id")
resamples_df
```

We can then give this directly to `perf_mod()`: 

```{r df-mod}
set.seed(101)
roc_model_via_df <- perf_mod(resamples_df, iter = 2000)
```

From this, the posterior distributions for each model can be obtained from the `tidy()` method: 

```{r post}
#| fig-alt: "Faceted histogram chart. Area Under the ROC Curve along the x-axis, count along the y-axis. The two facets are logistic and mars. Both histograms look fairly normally distributed, with a mean of 0.89 for logistic and 0.88 for mars. The full range is 0.84 to 0.93."
roc_model_via_df %>%
  tidy() %>%
  ggplot(aes(x = posterior)) +
  geom_histogram(bins = 40, col = "blue", fill = "blue", alpha = .4) +
  facet_wrap(~ model, ncol = 1) +
  xlab("Area Under the ROC Curve")
```

See `contrast_models()` for how to analyze these distributions.
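A minimal, self-contained sketch of that step follows. It is not part of the original example: the data frame `sim_df` holds simulated ROC AUC values (not real results), and `refresh = 0` and `seed` are assumed to be passed through `perf_mod()` to the underlying Stan fit.

``` r
library(tidyposterior)

# Simulated per-fold ROC AUC values, for illustration only (not real results)
set.seed(102)
fold_effect <- rnorm(10, sd = 0.02)
sim_df <- data.frame(
  id       = sprintf("Fold%02d", 1:10),
  logistic = 0.89 + fold_effect + rnorm(10, sd = 0.01),
  mars     = 0.88 + fold_effect + rnorm(10, sd = 0.01)
)

# Fit the Bayesian model; `refresh = 0` is meant to silence Stan's progress output
sim_model <- perf_mod(sim_df, iter = 2000, seed = 103, refresh = 0)

# Posterior distribution of the difference in ROC AUC for each pair of models
sim_diff <- contrast_models(sim_model, seed = 104)

# `size` is the half-width of a region of practical equivalence around zero
summary(sim_diff, size = 0.02)
```

With the objects created earlier in this README, the equivalent call would be `contrast_models(roc_model_via_df)`. The choice of `size = 0.02` is arbitrary here; it should reflect what counts as a practically meaningful difference in the chosen metric.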

## Contributing

This project is released with a [Contributor Code of Conduct](https://contributor-covenant.org/version/2/1/CODE_OF_CONDUCT.html). By contributing to this project, you agree to abide by its terms.

- For questions and discussions about tidymodels packages, modeling, and machine learning, please [post on RStudio Community](https://community.rstudio.com/new-topic?category_id=15&tags=tidymodels,question).

- If you think you have encountered a bug, please [submit an issue](https://github.com/tidymodels/tidyposterior/issues).

- Either way, learn how to create and share a [reprex](https://reprex.tidyverse.org/articles/articles/learn-reprex.html) (a minimal, reproducible example), to clearly communicate about your code.

- Check out further details on [contributing guidelines for tidymodels packages](https://www.tidymodels.org/contribute/) and [how to get help](https://www.tidymodels.org/help/).