https://github.com/ludvigolsen/groupdata2
R-package: Methods for dividing data into groups. Create balanced partitions and cross-validation folds. Perform time series windowing and general grouping and splitting of data. Balance existing groups with up- and downsampling or collapse them to fewer groups.
https://github.com/ludvigolsen/groupdata2
balance cross-validation data data-frame fold group-factor groups participants partition rstats split staircase
Last synced: 3 months ago
JSON representation
R-package: Methods for dividing data into groups. Create balanced partitions and cross-validation folds. Perform time series windowing and general grouping and splitting of data. Balance existing groups with up- and downsampling or collapse them to fewer groups.
- Host: GitHub
- URL: https://github.com/ludvigolsen/groupdata2
- Owner: LudvigOlsen
- License: other
- Created: 2016-10-30T19:39:22.000Z (about 9 years ago)
- Default Branch: master
- Last Pushed: 2024-12-18T17:12:15.000Z (about 1 year ago)
- Last Synced: 2025-09-27T10:31:07.914Z (4 months ago)
- Topics: balance, cross-validation, data, data-frame, fold, group-factor, groups, participants, partition, rstats, split, staircase
- Language: R
- Homepage:
- Size: 1.84 MB
- Stars: 27
- Watchers: 2
- Forks: 3
- Open Issues: 1
-
Metadata Files:
- Readme: README.Rmd
- Changelog: NEWS.md
- License: LICENSE
Awesome Lists containing this project
README
---
output: github_document
---
```{r, echo = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%",
dpi = 92,
fig.retina = 2
)
# Digits to print
options("digits"=3)
set.seed(1)
# Get minimum R requirement
dep <- as.vector(read.dcf('DESCRIPTION')[, 'Depends'])
rvers <- substring(dep, 7, nchar(dep)-1)
# m <- regexpr('R *\\\\(>= \\\\d+.\\\\d+.\\\\d+\\\\)', dep)
# rm <- regmatches(dep, m)
# rvers <- gsub('.*(\\\\d+.\\\\d+.\\\\d+).*', '\\\\1', dep)
# Function for TOC
# https://gist.github.com/gadenbuie/c83e078bf8c81b035e32c3fc0cf04ee8
```
**Author:** [Ludvig R. Olsen](https://www.ludvigolsen.dk/) ( r-pkgs@ludvigolsen.dk )
**License:** [MIT](https://opensource.org/license/mit)
**Started:** October 2016
[](https://cran.r-project.org/package=groupdata2)
[](https://cran.r-project.org/package=groupdata2)
[](https://cran.r-project.org/)
[](https://app.codecov.io/gh/ludvigolsen/groupdata2?branch=master)
[](https://github.com/ludvigolsen/groupdata2/actions/workflows/R-check.yaml?branch=master)
[](https://ci.appveyor.com/project/LudvigOlsen/groupdata2)
[](https://zenodo.org/badge/latestdoi/72371128)
## Overview
R package for dividing data into groups.
* Create **balanced partitions** and cross-validation **folds**.
* Perform time series **windowing** and general **grouping** and **splitting** of data.
* **Balance** existing groups with **up- and downsampling**.
* **Collapse** existing groups to fewer, balanced groups.
* Finds values, or indices of values, that **differ** from the previous value by some threshold(s).
* Check if two grouping factors have the same groups, **memberwise**.
### Main functions
|Function |Description |
|:---------------------|:------------------------------------------------------------------|
|`group_factor()` |Divides data into groups by a wide range of methods. |
|`group()` |Creates grouping factor and adds to the given data frame. |
|`splt()` |Creates grouping factor and splits the data by these groups. |
|`partition()` |Splits data into partitions. Balances a given categorical variable and/or numerical variable between partitions and keeps all data points with a shared ID in the same partition. |
|`fold()` |Creates folds for (repeated) cross-validation. Balances a given categorical variable and/or numerical variable between folds and keeps all data points with a shared ID in the same fold. |
|`collapse_groups()` |Collapses existing groups into a smaller set of groups with categorical, numerical, ID, and size balancing. |
|`balance()` |Uses up- and/or downsampling to equalize group sizes. Can balance on ID level. See wrappers: `downsample()`, `upsample()`.|
### Other tools
|Function |Description |
|:-------------------------|:------------------------------------------------------------------|
|`all_groups_identical()` |Checks whether two grouping factors contain the same groups, *memberwise*.|
|`differs_from_previous()` |Finds values, or indices of values, that differ from the previous value by some threshold(s).|
|`find_starts()` |Finds values or indices of values that are not the same as the previous value.|
|`find_missing_starts()` |Finds missing starts for the `l_starts` method.|
|`summarize_group_cols()` |Calculates summary statistics about group columns (i.e. `factor`s).|
|`summarize_balances()` |Summarizes the balances of numeric, categorical, and ID columns in and between groups in one or more group columns. |
|`ranked_balances()` |Extracts the standard deviations from the `Summary` data frame from the output of `summarize_balances()` |
|`%primes%` |Finds remainder for the `primes` method. |
|`%staircase%` |Finds remainder for the `staircase` method.|
## Table of Contents
```{r toc, echo=FALSE}
groupdata2:::render_toc("README.Rmd")
```
## Installation
CRAN version:
> `install.packages("groupdata2")`
Development version:
> `install.packages("devtools")`
> `devtools::install_github("LudvigOlsen/groupdata2")`
## Vignettes
`groupdata2` contains a number of vignettes with relevant use cases and descriptions:
> `vignette(package = "groupdata2")` # for an overview
> `vignette("introduction_to_groupdata2")` # begin here
## Data for examples
```{r error=FALSE, warning=FALSE, message=FALSE}
# Attach packages
library(groupdata2)
library(dplyr) # %>% filter() arrange() summarize()
library(knitr) # kable()
```
```{r}
# Create small data frame
df_small <- data.frame(
"x" = c(1:12),
"species" = rep(c('cat', 'pig', 'human'), 4),
"age" = sample(c(1:100), 12),
stringsAsFactors = FALSE
)
```
```{r}
# Create medium data frame
df_medium <- data.frame(
"participant" = factor(rep(c('1', '2', '3', '4', '5', '6'), 3)),
"age" = rep(c(20, 33, 27, 21, 32, 25), 3),
"diagnosis" = factor(rep(c('a', 'b', 'a', 'b', 'b', 'a'), 3)),
"diagnosis2" = factor(sample(c('x','z','y'), 18, replace = TRUE)),
"score" = c(10, 24, 15, 35, 24, 14, 24, 40, 30,
50, 54, 25, 45, 67, 40, 78, 62, 30))
df_medium <- df_medium %>% arrange(participant)
df_medium$session <- rep(c('1','2', '3'), 6)
```
## Functions
### group_factor()
Returns a factor with group numbers, e.g. `factor(c(1,1,1,2,2,2,3,3,3))`.
This can be used to subset, aggregate, group_by, etc.
Create equally sized groups by setting `force_equal = TRUE`
Randomize grouping factor by setting `randomize = TRUE`
```{r}
# Create grouping factor
group_factor(
data = df_small,
n = 5,
method = "n_dist"
)
```
### group()
Creates a grouping factor and adds it to the given data frame. The data frame is grouped by the grouping factor for easy use in `magrittr` (`%>%`) pipelines.
```{r}
# Use group()
group(data = df_small, n = 5, method = 'n_dist') %>%
kable()
```
```{r}
# Use group() in a pipeline
# Get average age per group
df_small %>%
group(n = 5, method = 'n_dist') %>%
dplyr::summarise(mean_age = mean(age)) %>%
kable()
```
```{r}
# Using group() with 'l_starts' method
# Starts group at the first 'cat',
# then skips to the second appearance of "pig" after "cat",
# then starts at the following "cat".
df_small %>%
group(n = list("cat", c("pig", 2), "cat"),
method = 'l_starts',
starts_col = "species") %>%
kable()
```
### splt()
Creates the specified groups with `group_factor()` and splits the given data by the grouping factor with `base::split`. Returns the splits in a list.
```{r}
splt(data = df_small,
n = 3,
method = 'n_dist') %>%
kable()
```
### partition()
Creates (optionally) balanced partitions (e.g. training/test sets). Balance partitions on categorical variable(s) and/or a numerical variable. Make sure that all datapoints sharing an ID is in the same partition.
```{r}
# First set seed to ensure reproducibility
set.seed(1)
# Use partition() with categorical and numerical balancing,
# while ensuring all rows per ID are in the same partition
df_partitioned <- partition(
data = df_medium,
p = 0.7,
cat_col = 'diagnosis',
num_col = "age",
id_col = 'participant'
)
df_partitioned %>%
kable()
```
### fold()
Creates (optionally) balanced folds for use in cross-validation. Balance folds on categorical variable(s) and/or a numerical variable. Ensure that all datapoints sharing an ID is in the same fold. Create multiple unique fold columns at once, e.g. for repeated cross-validation.
```{r}
# First set seed to ensure reproducibility
set.seed(1)
# Use fold() with categorical and numerical balancing,
# while ensuring all rows per ID are in the same fold
df_folded <- fold(
data = df_medium,
k = 3,
cat_col = 'diagnosis',
num_col = "age",
id_col = 'participant'
)
# Show df_folded ordered by folds
df_folded %>%
arrange(.folds) %>%
kable()
```
```{r}
# Show distribution of diagnoses and participants
df_folded %>%
group_by(.folds) %>%
count(diagnosis, participant) %>%
kable()
```
```{r}
# Show age representation in folds
# Notice that we would get a more even distribution if we had more data.
# As age is fixed per ID, we only have 3 ages per category to balance with.
df_folded %>%
group_by(.folds) %>%
summarize(mean_age = mean(age),
sd_age = sd(age)) %>%
kable()
```
**Notice**, that the we now have the opportunity to include the *session* variable and/or use *participant* as a random effect in our model when doing cross-validation, as any participant will only appear in one fold.
We also have a balance in the representation of each diagnosis, which could give us better, more consistent results.
### collapse_groups()
Collapses a set of groups into a smaller set of groups while attempting to balance the new groups by specified numerical columns, categorical columns, level counts in ID columns, and/or the number of rows.
```{r}
# We consider each participant a group
# and collapse them into 3 new groups
# We balance the number of levels in diagnosis2 column,
# as this diagnosis is not constant within the participants
df_collapsed <- collapse_groups(
data = df_medium,
n = 3,
group_cols = 'participant',
cat_cols = 'diagnosis2',
num_cols = "score"
)
# Show df_collapsed ordered by new collapsed groups
df_collapsed %>%
arrange(.coll_groups) %>%
kable()
# Summarize the balances of the new groups
coll_summ <- df_collapsed %>%
summarize_balances(group_cols = '.coll_groups',
cat_cols = "diagnosis2",
num_cols = "score")
coll_summ$Groups %>%
kable()
coll_summ$Summary %>%
kable()
# Check the across-groups standard deviations
# This is a measure of how balanced the groups are (lower == more balanced)
# and is especially useful when comparing multiple group columns
coll_summ %>%
ranked_balances() %>%
kable()
```
**Recommended**: By enabling the `auto_tune` setting, we often get a much better balance.
### balance()
Uses up- and/or downsampling to fix the group sizes to the min, max, mean, or median group size or to a specific number of rows.
Balancing can also happen on the ID level, e.g. to ensure the same number of IDs in each category.
```{r}
# Lets first unbalance the dataset by removing some rows
df_b <- df_medium %>%
arrange(diagnosis) %>%
filter(!row_number() %in% c(5,7,8,13,14,16,17,18))
# Show distribution of diagnoses and participants
df_b %>%
count(diagnosis, participant) %>%
kable()
```
```{r}
# First set seed to ensure reproducibility
set.seed(1)
# Downsampling by diagnosis
balance(
data = df_b,
size = "min",
cat_col = "diagnosis"
) %>%
count(diagnosis, participant) %>%
kable()
```
```{r}
# Downsampling the IDs
balance(
data = df_b,
size = "min",
cat_col = "diagnosis",
id_col = "participant",
id_method = "n_ids"
) %>%
count(diagnosis, participant) %>%
kable()
```
## Grouping Methods
There are currently 10 methods available. They can be divided into 6 categories.
*Examples of group sizes are based on a vector with 57 elements.*
### Specify group size
##### Method: greedy
Divides up the data greedily given a specified group size.
E.g. group sizes: 10, 10, 10, 10, 10, 7
### Specify number of groups
##### Method: n_dist (Default)
Divides the data into a specified number of groups and
distributes excess data points across groups.
E.g. group sizes: 11, 11, 12, 11, 12
##### Method: n_fill
Divides the data into a specified number of groups and
fills up groups with excess data points from the beginning.
E.g. group sizes: 12, 12, 11, 11, 11
##### Method: n_last
Divides the data into a specified number of groups.
The algorithm finds the most equal group sizes possible,
using all data points. Only the last group is able to differ in size.
E.g. group sizes: 11, 11, 11, 11, 13
##### Method: n_rand
Divides the data into a specified number of groups.
Excess data points are placed randomly in groups (only 1 per group).
E.g. group sizes: 12, 11, 11, 11, 12
### Specify list
##### Method: l_sizes
Uses a list / vector of group sizes to divide up the data.
Excess data points are placed in an extra group.
E.g. `n = c(11, 11)` returns group sizes: 11, 11, 35
##### Method: l_starts
Uses a list of starting positions to divide up the data.
Starting positions are values in a vector (e.g. column in data frame).
Skip to a specific nth appearance of a value by using `c(value, skip_to)`.
E.g. `n = c(11, 15, 27, 43)` returns group sizes: 10, 4, 12, 16, 15
Identical to `n = list(11, 15, c(27, 1), 43` where `1` specifies that we
want the first appearance of 27 after the previous value 15.
If passing `n = "auto"` starting positions are automatically found with `find_starts()`.
### Specify distance between members
##### Method: every
Every `n`th data point is combined to a group.
E.g. group sizes: 12, 12, 11, 11, 11
### Specify step size
##### Method: staircase
Uses step_size to divide up the data.
Group size increases with 1 step for every group, until there is no more data.
E.g. group sizes: 5, 10, 15, 20, 7
### Specify start at
##### Method: primes
Creates groups with sizes corresponding to prime numbers.
Starts at `n` (prime number). Increases to the the next prime number until there is no more data.
E.g. group sizes: 5, 7, 11, 13, 17, 4
## Balancing ID Methods
There are currently 4 methods for balancing (up-/downsampling) on ID level in `balance()`.
##### ID method: n_ids
Balances on ID level only. It makes sure there are the same number of IDs in each category. This might lead to a different number of rows between categories.
##### ID method: n_rows_c
Attempts to level the number of rows per category, while only removing/adding entire IDs. This is done with repetition and by iteratively picking the ID with the number of rows closest to the lacking/excessive number of rows in the category.
##### ID method: distributed
Distributes the lacking/excess rows equally between the IDs. If the number to distribute cannot be equally divided, some IDs will have 1 row more/less than the others.
##### ID method: nested
Balances the IDs within their categories, meaning that all IDs in a category will have the same number of rows.
