Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/tidymodels/themis
Extra recipes steps for dealing with unbalanced data
- Host: GitHub
- URL: https://github.com/tidymodels/themis
- Owner: tidymodels
- License: other
- Created: 2019-10-12T18:46:35.000Z (about 5 years ago)
- Default Branch: main
- Last Pushed: 2024-10-29T22:10:07.000Z (14 days ago)
- Last Synced: 2024-10-30T00:38:58.840Z (14 days ago)
- Language: R
- Homepage: https://themis.tidymodels.org/
- Size: 72.4 MB
- Stars: 141
- Watchers: 5
- Forks: 11
- Open Issues: 19
Metadata Files:
- Readme: README.Rmd
- Contributing: .github/CONTRIBUTING.md
- License: LICENSE
- Code of conduct: .github/CODE_OF_CONDUCT.md
Awesome Lists containing this project
- jimsghstars - tidymodels/themis - Extra recipes steps for dealing with unbalanced data (R)
README
---
output: github_document
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)
```

[![R-CMD-check](https://github.com/tidymodels/themis/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/tidymodels/themis/actions/workflows/R-CMD-check.yaml)
[![Codecov test coverage](https://codecov.io/gh/tidymodels/themis/branch/main/graph/badge.svg)](https://app.codecov.io/gh/tidymodels/themis?branch=main)
[![CRAN status](https://www.r-pkg.org/badges/version/themis)](https://CRAN.R-project.org/package=themis)
[![Downloads](http://cranlogs.r-pkg.org/badges/themis)](https://CRAN.R-project.org/package=themis)
[![Lifecycle: maturing](https://img.shields.io/badge/lifecycle-maturing-blue.svg)](https://lifecycle.r-lib.org/articles/stages.html)

**themis** contains extra steps for the [`recipes`](https://CRAN.R-project.org/package=recipes) package for dealing with unbalanced data. The name **themis** is that of the [ancient Greek goddess](https://thishollowearth.wordpress.com/2012/07/02/god-of-the-week-themis/) who is typically depicted with a balance.

## Installation

You can install the released version of themis from [CRAN](https://CRAN.R-project.org) with:
``` r
install.packages("themis")
```

Install the development version from GitHub with:

``` r
# install.packages("pak")
pak::pak("tidymodels/themis")
```

## Example

The following is an example of using the [SMOTE](https://jair.org/index.php/jair/article/view/10302/24590) algorithm to deal with unbalanced data:

```{r example, message=FALSE}
library(recipes)
library(modeldata)
library(themis)

data("credit_data", package = "modeldata")

# Drop rows with a missing outcome, then inspect the class imbalance
credit_data0 <- credit_data %>%
  filter(!is.na(Job))

count(credit_data0, Job)

# Impute missing predictors, then over-sample the minority classes with SMOTE
ds_rec <- recipe(Job ~ Time + Age + Expenses, data = credit_data0) %>%
  step_impute_mean(all_predictors()) %>%
  step_smote(Job, over_ratio = 0.25) %>%
  prep()

# Class counts after over-sampling
ds_rec %>%
  bake(new_data = NULL) %>%
  count(Job)
```

## Methods

Below is some unbalanced data, used in the examples that follow.

```{r}
#| fig-alt: "Bar chart with 5 columns. class on the x-axis and count on the y-axis. Class a has height 10, b has 20, c has 30, d has 40, and e has 50."
example_data <- data.frame(
  class = letters[rep(1:5, 1:5 * 10)],
  x = rnorm(150)
)

library(ggplot2)

example_data %>%
  ggplot(aes(class)) +
  geom_bar()
```

### Upsample / Over-sampling

The following methods all share the tuning parameter `over_ratio`, which is the ratio of the minority-to-majority frequencies.

| name | function | Multi-class |
|---|---|---|
| Random minority over-sampling with replacement | `step_upsample()` | :heavy_check_mark: |
| Synthetic Minority Over-sampling Technique | `step_smote()` | :heavy_check_mark: |
| Borderline SMOTE-1 | `step_bsmote(method = 1)` | :heavy_check_mark: |
| Borderline SMOTE-2 | `step_bsmote(method = 2)` | :heavy_check_mark: |
| Adaptive synthetic sampling approach for imbalanced learning | `step_adasyn()` | :heavy_check_mark: |
| Generation of synthetic data by Randomly Over Sampling Examples | `step_rose()` | |

By setting `over_ratio = 1`, you bring the number of samples in every minority class up to 100% of the majority class.

```{r}
#| fig-alt: "Bar chart with 5 columns. class on the x-axis and count on the y-axis. class a, b, c, d, and e all have a height of 50."
recipe(~., example_data) %>%
  step_upsample(class, over_ratio = 1) %>%
  prep() %>%
  bake(new_data = NULL) %>%
  ggplot(aes(class)) +
  geom_bar()
```

Setting `over_ratio = 0.5` upsamples any minority class with fewer samples than 50% of the majority class up to 50% of the majority class.

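The counts implied by both settings can be worked out by hand. The sketch below is only an illustration, not themis code: `expected_upsample()` is a hypothetical helper, and it assumes the target is the majority count times `over_ratio`, rounded down (the rounding rule is an assumption and does not matter for these counts).

```{r}
# Class counts of `example_data`: 10, 20, 30, 40, and 50 rows
class_counts <- c(a = 10, b = 20, c = 30, d = 40, e = 50)

# Hypothetical helper (not part of themis): expected class counts
# after over-sampling with a given `over_ratio`
expected_upsample <- function(counts, over_ratio) {
  # Assumed target: majority count times the ratio, rounded down
  target <- floor(max(counts) * over_ratio)
  # Classes already at or above the target are left untouched
  pmax(counts, target)
}

expected_upsample(class_counts, over_ratio = 1)   # every class reaches 50
expected_upsample(class_counts, over_ratio = 0.5) # a and b reach 25, the rest are unchanged
```

The chunk below shows the `over_ratio = 0.5` case with `step_upsample()` itself.
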
```{r}
#| fig-alt: "Bar chart with 5 columns. class on the x-axis and count on the y-axis. Class a has height 25, b has 25, c has 30, d has 40, and e has 50."
recipe(~., example_data) %>%
  step_upsample(class, over_ratio = 0.5) %>%
  prep() %>%
  bake(new_data = NULL) %>%
  ggplot(aes(class)) +
  geom_bar()
```

### Downsample / Under-sampling

Most of the following methods share the tuning parameter `under_ratio`, which is the ratio of the majority-to-minority frequencies.

| name | function | Multi-class | under_ratio |
|---|---|---|---|
| Random majority under-sampling with replacement | `step_downsample()` | :heavy_check_mark: | :heavy_check_mark: |
| NearMiss-1 | `step_nearmiss()` | :heavy_check_mark: | :heavy_check_mark: |
| Extraction of majority-minority Tomek links | `step_tomek()` | | |

By setting `under_ratio = 1`, you bring the number of samples in every majority class down to 100% of the minority class.

```{r}
#| fig-alt: "Bar chart with 5 columns. class on the x-axis and count on the y-axis. Class a, b, c, d, and e all have a height of 10."
recipe(~., example_data) %>%
  step_downsample(class, under_ratio = 1) %>%
  prep() %>%
  bake(new_data = NULL) %>%
  ggplot(aes(class)) +
  geom_bar()
```

Setting `under_ratio = 2` downsamples any majority class with more than 200% of the minority class's samples down to 200% of the minority class.

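The same arithmetic applies in the other direction. As a rough sketch only, `expected_downsample()` below is a hypothetical helper (not a themis function) that assumes the cap is the minority count times `under_ratio`, rounded down; the rounding rule is an assumption and does not affect these counts.

```{r}
# Class counts of `example_data`: 10, 20, 30, 40, and 50 rows
class_counts <- c(a = 10, b = 20, c = 30, d = 40, e = 50)

# Hypothetical helper (not part of themis): expected class counts
# after under-sampling with a given `under_ratio`
expected_downsample <- function(counts, under_ratio) {
  # Assumed cap: minority count times the ratio, rounded down
  cap <- floor(min(counts) * under_ratio)
  # Classes already at or below the cap are left untouched
  pmin(counts, cap)
}

expected_downsample(class_counts, under_ratio = 1) # every class drops to 10
expected_downsample(class_counts, under_ratio = 2) # a stays at 10, the rest drop to 20
```

The chunk below shows the `under_ratio = 2` case with `step_downsample()` itself.
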
```{r}
#| fig-alt: "Bar chart with 5 columns. class on the x-axis and count on the y-axis. Class a has height 10, and b, c, d, and e have a height of 20."
recipe(~., example_data) %>%
  step_downsample(class, under_ratio = 2) %>%
  prep() %>%
  bake(new_data = NULL) %>%
  ggplot(aes(class)) +
  geom_bar()
```

## Contributing

This project is released with a [Contributor Code of Conduct](https://contributor-covenant.org/version/2/0/CODE_OF_CONDUCT.html). By contributing to this project, you agree to abide by its terms.
- For questions and discussions about tidymodels packages, modeling, and machine learning, [join us on RStudio Community](https://community.rstudio.com/new-topic?category_id=15&tags=tidymodels,question).
- If you think you have encountered a bug, please [submit an issue](https://github.com/tidymodels/themis/issues).
- Either way, learn how to create and share a [reprex](https://reprex.tidyverse.org/articles/articles/learn-reprex.html) (a minimal, reproducible example), to clearly communicate about your code.
- Check out further details on [contributing guidelines for tidymodels packages](https://www.tidymodels.org/contribute/) and [how to get help](https://www.tidymodels.org/help/).