https://github.com/ludvigolsen/groupdata2

R-package: Methods for dividing data into groups. Create balanced partitions and cross-validation folds. Perform time series windowing and general grouping and splitting of data. Balance existing groups with up- and downsampling or collapse them to fewer groups.
---
output: github_document
---

```{r, echo = FALSE}

knitr::opts_chunk$set(

  collapse = TRUE,

  comment = "#>",

  fig.path = "man/figures/README-",

  out.width = "100%",

  dpi = 92,

  fig.retina = 2

)

# Digits to print

options("digits"=3)

set.seed(1)

# Get minimum R requirement 

dep <- as.vector(read.dcf('DESCRIPTION')[, 'Depends'])

rvers <- substring(dep, 7, nchar(dep)-1)

# m <- regexpr('R *\\\\(>= \\\\d+.\\\\d+.\\\\d+\\\\)', dep)

# rm <- regmatches(dep, m)

# rvers <- gsub('.*(\\\\d+.\\\\d+.\\\\d+).*', '\\\\1', dep)

# Function for TOC

# https://gist.github.com/gadenbuie/c83e078bf8c81b035e32c3fc0cf04ee8

```

# groupdata2 

**Author:** [Ludvig R. Olsen](https://www.ludvigolsen.dk/) ( r-pkgs@ludvigolsen.dk ) 


**License:** [MIT](https://opensource.org/license/mit) 


**Started:** October 2016 

[![CRAN_Status_Badge](https://www.r-pkg.org/badges/version/groupdata2)](https://cran.r-project.org/package=groupdata2)

[![metacran downloads](https://cranlogs.r-pkg.org/badges/groupdata2)](https://cran.r-project.org/package=groupdata2)

[![minimal R version](https://img.shields.io/badge/R%3E%3D-`r rvers`-6666ff.svg)](https://cran.r-project.org/)

[![Codecov test coverage](https://codecov.io/gh/ludvigolsen/groupdata2/branch/master/graph/badge.svg)](https://app.codecov.io/gh/ludvigolsen/groupdata2?branch=master)

[![GitHub Actions CI status](https://github.com/ludvigolsen/groupdata2/actions/workflows/R-check.yaml/badge.svg?branch=master)](https://github.com/ludvigolsen/groupdata2/actions/workflows/R-check.yaml?branch=master)

[![AppVeyor build status](https://ci.appveyor.com/api/projects/status/github/LudvigOlsen/groupdata2?branch=master&svg=true)](https://ci.appveyor.com/project/LudvigOlsen/groupdata2)

[![DOI](https://zenodo.org/badge/72371128.svg)](https://zenodo.org/badge/latestdoi/72371128)

## Overview

R package for dividing data into groups.

* Create **balanced partitions** and cross-validation **folds**. 

* Perform time series **windowing** and general **grouping** and **splitting** of data. 

* **Balance** existing groups with **up- and downsampling**.

* **Collapse** existing groups to fewer, balanced groups.

* Find values, or indices of values, that **differ** from the previous value by some threshold(s).

* Check if two grouping factors have the same groups, **memberwise**.

### Main functions

|Function              |Description                                                        |
|:---------------------|:------------------------------------------------------------------|
|`group_factor()`      |Divides data into groups by a wide range of methods.               |
|`group()`             |Creates grouping factor and adds to the given data frame.          |
|`splt()`              |Creates grouping factor and splits the data by these groups.       |
|`partition()`         |Splits data into partitions. Balances a given categorical variable and/or numerical variable between partitions and keeps all data points with a shared ID in the same partition.  |
|`fold()`              |Creates folds for (repeated) cross-validation. Balances a given categorical variable and/or numerical variable between folds and keeps all data points with a shared ID in the same fold. |
|`collapse_groups()`   |Collapses existing groups into a smaller set of groups with categorical, numerical, ID, and size balancing. |
|`balance()`           |Uses up- and/or downsampling to equalize group sizes. Can balance on ID level. See wrappers: `downsample()`, `upsample()`.|

### Other tools

|Function                  |Description                                                        |
|:-------------------------|:------------------------------------------------------------------|
|`all_groups_identical()`  |Checks whether two grouping factors contain the same groups, *memberwise*.|
|`differs_from_previous()` |Finds values, or indices of values, that differ from the previous value by some threshold(s).|
|`find_starts()`           |Finds values or indices of values that are not the same as the previous value.|
|`find_missing_starts()`   |Finds missing starts for the `l_starts` method.|
|`summarize_group_cols()`  |Calculates summary statistics about group columns (i.e. `factor`s).|
|`summarize_balances()`    |Summarizes the balances of numeric, categorical, and ID columns in and between groups in one or more group columns. |
|`ranked_balances()`       |Extracts the standard deviations from the `Summary` data frame from the output of `summarize_balances()` |
|`%primes%`                |Finds remainder for the `primes` method.   |
|`%staircase%`             |Finds remainder for the `staircase` method.|

## Table of Contents

```{r toc, echo=FALSE}

groupdata2:::render_toc("README.Rmd")

```

## Installation

CRAN version:

> `install.packages("groupdata2")`  

Development version:  

> `install.packages("devtools")`  

> `devtools::install_github("LudvigOlsen/groupdata2")`  

## Vignettes

`groupdata2` contains a number of vignettes with relevant use cases and descriptions:  

  

> `vignette(package = "groupdata2")` # for an overview   

> `vignette("introduction_to_groupdata2")` # begin here   

## Data for examples

```{r error=FALSE, warning=FALSE, message=FALSE}

# Attach packages

library(groupdata2)

library(dplyr)       # %>% filter() arrange() summarize()

library(knitr)       # kable()

```

```{r}

# Create small data frame

df_small <- data.frame(

  "x" = c(1:12),

  "species" = rep(c('cat', 'pig', 'human'), 4),

  "age" = sample(c(1:100), 12),

  stringsAsFactors = FALSE

)

```

```{r}

# Create medium data frame

df_medium <- data.frame(

  "participant" = factor(rep(c('1', '2', '3', '4', '5', '6'), 3)),

  "age" = rep(c(20, 33, 27, 21, 32, 25), 3),

  "diagnosis" = factor(rep(c('a', 'b', 'a', 'b', 'b', 'a'), 3)),

  "diagnosis2" = factor(sample(c('x','z','y'), 18, replace = TRUE)),

  "score" = c(10, 24, 15, 35, 24, 14, 24, 40, 30, 

              50, 54, 25, 45, 67, 40, 78, 62, 30))

df_medium <- df_medium %>% arrange(participant)

df_medium$session <- rep(c('1','2', '3'), 6)

```

## Functions

### group_factor()

Returns a factor with group numbers, e.g. `factor(c(1,1,1,2,2,2,3,3,3))`.  

This can be used to subset, aggregate, group_by, etc.   

Create equally sized groups by setting `force_equal = TRUE`.  

Randomize the grouping factor by setting `randomize = TRUE`.  

```{r}

# Create grouping factor

group_factor(

  data = df_small, 

  n = 5, 

  method = "n_dist"

)

```

### group()

Creates a grouping factor and adds it to the given data frame. The data frame is grouped by the grouping factor for easy use in `magrittr` (`%>%`) pipelines.  

```{r}

# Use group()

group(data = df_small, n = 5, method = 'n_dist') %>%

  kable()

```

```{r}

# Use group() in a pipeline 

# Get average age per group

df_small %>%

  group(n = 5, method = 'n_dist') %>% 

  dplyr::summarise(mean_age = mean(age)) %>%

  kable()

```

```{r}

# Using group() with 'l_starts' method

# Starts group at the first 'cat', 

# then skips to the second appearance of "pig" after "cat",

# then starts at the following "cat".

df_small %>%

  group(n = list("cat", c("pig", 2), "cat"),

        method = 'l_starts',

        starts_col = "species") %>%

  kable()

```

### splt()

Creates the specified groups with `group_factor()` and splits the given data by the grouping factor with `base::split`. Returns the splits in a list.  

```{r}

splt(data = df_small,

     n = 3,

     method = 'n_dist') %>%

  kable()

```

### partition()

Creates (optionally) balanced partitions (e.g. training/test sets). Balance partitions on categorical variable(s) and/or a numerical variable. Make sure that all datapoints sharing an ID are in the same partition.

```{r}

# First set seed to ensure reproducibility

set.seed(1)

# Use partition() with categorical and numerical balancing,

# while ensuring all rows per ID are in the same partition

df_partitioned <- partition(

  data = df_medium, 

  p = 0.7,

  cat_col = 'diagnosis',

  num_col = "age",

  id_col = 'participant'

)

df_partitioned %>% 

  kable()

```

### fold()

Creates (optionally) balanced folds for use in cross-validation. Balance folds on categorical variable(s) and/or a numerical variable. Ensure that all datapoints sharing an ID are in the same fold. Create multiple unique fold columns at once, e.g. for repeated cross-validation.  

```{r}

# First set seed to ensure reproducibility

set.seed(1)

# Use fold() with categorical and numerical balancing,

# while ensuring all rows per ID are in the same fold

df_folded <- fold(

  data = df_medium, 

  k = 3,

  cat_col = 'diagnosis',

  num_col = "age",

  id_col = 'participant'

)

# Show df_folded ordered by folds

df_folded %>% 

  arrange(.folds) %>%

  kable()

```

```{r}

# Show distribution of diagnoses and participants

df_folded %>% 

  group_by(.folds) %>% 

  count(diagnosis, participant) %>% 

  kable()

```

```{r}

# Show age representation in folds

# Notice that we would get a more even distribution if we had more data.

# As age is fixed per ID, we only have 3 ages per category to balance with.

df_folded %>% 

  group_by(.folds) %>% 

  summarize(mean_age = mean(age),

            sd_age = sd(age)) %>% 

  kable()

```

**Notice** that we now have the opportunity to include the *session* variable and/or use *participant* as a random effect in our model when doing cross-validation, as any participant will only appear in one fold.  

We also have a balance in the representation of each diagnosis, which could give us better, more consistent results.  

### collapse_groups()

Collapses a set of groups into a smaller set of groups while attempting to balance the new groups by specified numerical columns, categorical columns, level counts in ID columns, and/or the number of rows.

```{r}

# We consider each participant a group

# and collapse them into 3 new groups

# We balance the number of levels in diagnosis2 column, 

# as this diagnosis is not constant within the participants

df_collapsed <- collapse_groups(

  data = df_medium,

  n = 3,

  group_cols = 'participant',

  cat_cols = 'diagnosis2',

  num_cols = "score"

) 

# Show df_collapsed ordered by new collapsed groups

df_collapsed %>% 

  arrange(.coll_groups) %>%

  kable()

# Summarize the balances of the new groups

coll_summ <- df_collapsed %>% 

  summarize_balances(group_cols = '.coll_groups',

                     cat_cols = "diagnosis2",

                     num_cols = "score")

coll_summ$Groups %>% 

  kable()

coll_summ$Summary %>% 

  kable()

# Check the across-groups standard deviations 

# This is a measure of how balanced the groups are (lower == more balanced)

# and is especially useful when comparing multiple group columns

coll_summ %>% 

  ranked_balances() %>%

  kable()

```

**Recommended**: By enabling the `auto_tune` setting, we often get a much better balance.

### balance()

Uses up- and/or downsampling to fix the group sizes to the min, max, mean, or median group size or to a specific number of rows.

Balancing can also happen on the ID level, e.g. to ensure the same number of IDs in each category.   

 

```{r}

# Let's first unbalance the dataset by removing some rows

df_b <- df_medium %>% 

  arrange(diagnosis) %>% 

  filter(!row_number() %in% c(5,7,8,13,14,16,17,18))

# Show distribution of diagnoses and participants

df_b %>% 

  count(diagnosis, participant) %>% 

  kable()

```

```{r}

# First set seed to ensure reproducibility

set.seed(1)

# Downsampling by diagnosis

balance(

  data = df_b, 

  size = "min", 

  cat_col = "diagnosis"

) %>% 

  count(diagnosis, participant) %>% 

  kable()

```

```{r}

# Downsampling the IDs

balance(

  data = df_b, 

  size = "min", 

  cat_col = "diagnosis", 

  id_col = "participant", 

  id_method = "n_ids"

) %>% 

  count(diagnosis, participant) %>% 

  kable()

  

```

 

## Grouping Methods

There are currently 10 methods available. They can be divided into 6 categories.  

*Examples of group sizes are based on a vector with 57 elements.*  

### Specify group size

##### Method: greedy

Divides up the data greedily given a specified group size.  

E.g. group sizes: 10, 10, 10, 10, 10, 7   
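
The group sizes in this example can be reproduced with a few lines of base R. This is a minimal sketch of the idea, not the package's internal code; `greedy_groups()` is a hypothetical helper name introduced only for illustration.

```r
# Hypothetical helper (not part of groupdata2):
# element i is assigned to group ceiling(i / group_size),
# so every group is filled completely before the next one starts.
greedy_groups <- function(n_elements, group_size) {
  factor(ceiling(seq_len(n_elements) / group_size))
}

# 57 elements with a group size of 10
greedy_sizes <- as.vector(table(greedy_groups(57, 10)))
print(greedy_sizes)
```

Only the last group can end up smaller than the requested size, since it holds whatever is left over.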

### Specify number of groups

##### Method: n_dist (Default)

Divides the data into a specified number of groups and distributes excess data points across groups.  

E.g. group sizes: 11, 11, 12, 11, 12  

##### Method: n_fill

Divides the data into a specified number of groups and fills up groups with excess data points from the beginning.   

E.g. group sizes: 12, 12, 11, 11, 11  
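
The `n_dist` and `n_fill` example sizes above can be reproduced in base R as follows. This is a sketch under assumptions: the two assignment rules below are chosen because they yield the listed sizes for 57 elements and 5 groups, and they are not taken from the package source.

```r
n_total <- 57
n_groups <- 5

# n_dist-style rule (assumed): element i goes to group
# ceiling(i * n_groups / n_total), which spreads the excess across groups.
n_dist_sizes <- as.vector(
  table(ceiling(seq_len(n_total) * n_groups / n_total))
)

# n_fill-style rule (assumed): every group gets the base size and
# the first (n_total %% n_groups) groups get one extra element.
base_size <- n_total %/% n_groups
n_fill_sizes <- rep(base_size, n_groups) +
  (seq_len(n_groups) <= n_total %% n_groups)

print(n_dist_sizes)
print(n_fill_sizes)
```

Both rules use all data points and never let two groups differ by more than one element; they only differ in which groups receive the excess.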

##### Method: n_last

Divides the data into a specified number of groups. 

The algorithm finds the most equal group sizes possible, using all data points. Only the last group is able to differ in size.  

E.g. group sizes: 11, 11, 11, 11, 13  

##### Method: n_rand

Divides the data into a specified number of groups. 

Excess data points are placed randomly in groups (only 1 per group).  

E.g. group sizes: 12, 11, 11, 11, 12  

### Specify list

##### Method: l_sizes

Uses a list / vector of group sizes to divide up the data.  

Excess data points are placed in an extra group.  

E.g. `n = c(11, 11)` returns group sizes: 11, 11, 35  
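
A minimal base-R sketch of this behaviour, assuming the sizes never exceed the number of elements; `l_sizes_groups()` is a hypothetical helper and not a function from the package.

```r
# Hypothetical helper (not part of groupdata2):
# build a grouping factor from a vector of group sizes and
# put any remaining elements in one extra group at the end.
l_sizes_groups <- function(n_elements, sizes) {
  leftover <- n_elements - sum(sizes)
  stopifnot(leftover >= 0)  # the sketch does not handle oversized requests
  if (leftover > 0) {
    sizes <- c(sizes, leftover)
  }
  factor(rep(seq_along(sizes), times = sizes))
}

# 57 elements with requested sizes 11 and 11
l_sizes_result <- as.vector(table(l_sizes_groups(57, c(11, 11))))
print(l_sizes_result)
```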

##### Method: l_starts

Uses a list of starting positions to divide up the data.  

Starting positions are values in a vector (e.g. column in data frame). 

Skip to a specific nth appearance of a value by using `c(value, skip_to)`.  

E.g. `n = c(11, 15, 27, 43)` returns group sizes: 10, 4, 12, 16, 15  

Identical to `n = list(11, 15, c(27, 1), 43)` where `1` specifies that we want the first appearance of 27 after the previous value 15.  

If passing `n = "auto"` starting positions are automatically found with `find_starts()`.  
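
The simple case (no `skip_to`) can be sketched in base R. The helper below is hypothetical and only illustrates the idea: each start value is searched for after the position of the previous start, and a new group begins at every match. The example assumes the vector holds the values 1 to 57.

```r
# Hypothetical helper (not part of groupdata2); ignores skip_to.
l_starts_groups <- function(values, starts) {
  start_idx <- integer(0)
  pos <- 0L
  for (s in starts) {
    # First appearance of the start value after the previous start
    hits <- which(values == s)
    hits <- hits[hits > pos]
    stopifnot(length(hits) > 0)
    pos <- hits[1]
    start_idx <- c(start_idx, pos)
  }
  is_start <- seq_along(values) %in% start_idx
  # Elements before the first start form their own group
  offset <- as.integer(start_idx[1] != 1L)
  factor(cumsum(is_start) + offset)
}

l_starts_sizes <- as.vector(
  table(l_starts_groups(values = 1:57, starts = c(11, 15, 27, 43)))
)
print(l_starts_sizes)
```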

### Specify distance between members

##### Method: every

Every `n`th data point is combined into a group.

E.g. group sizes: 12, 12, 11, 11, 11  
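
In base R, this kind of assignment can be sketched with the modulo operator. The example assumes `n = 5`, which matches the sizes listed above for 57 elements; `every_groups()` is a hypothetical helper name.

```r
# Hypothetical helper (not part of groupdata2):
# elements 1, n + 1, 2n + 1, ... land in group 1,
# elements 2, n + 2, 2n + 2, ... in group 2, and so on.
every_groups <- function(n_elements, n) {
  factor((seq_len(n_elements) - 1) %% n + 1)
}

every_factor <- every_groups(57, 5)
every_sizes <- as.vector(table(every_factor))
print(head(every_factor, 10))
print(every_sizes)
```

Unlike the other methods, members of a group are not adjacent in the data: the group number cycles through 1 to `n`.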

### Specify step size

##### Method: staircase

Uses step_size to divide up the data. 

Group size increases by 1 step for every group, until there is no more data.  

E.g. group sizes: 5, 10, 15, 20, 7  
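
The example sizes correspond to a step size of 5. A base-R sketch of the idea, using a hypothetical helper `staircase_groups()` that is not part of the package:

```r
# Hypothetical helper (not part of groupdata2):
# group k has size k * step_size; the last group takes what is left.
staircase_groups <- function(n_elements, step_size) {
  sizes <- integer(0)
  remaining <- n_elements
  step <- 1L
  while (remaining > 0) {
    size <- min(step * step_size, remaining)
    sizes <- c(sizes, size)
    remaining <- remaining - size
    step <- step + 1L
  }
  factor(rep(seq_along(sizes), times = sizes))
}

staircase_sizes <- as.vector(table(staircase_groups(57, 5)))
print(staircase_sizes)
```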

### Specify start at

##### Method: primes

Creates groups with sizes corresponding to prime numbers.  

Starts at `n` (prime number). Increases to the next prime number until there is no more data.

E.g. group sizes: 5, 7, 11, 13, 17, 4  
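
The example sizes correspond to starting at the prime 5. The following base-R sketch reproduces them; `is_prime()` and `primes_groups()` are hypothetical helpers written for this illustration and are not the package's implementation.

```r
# Hypothetical helper: trial division primality check
is_prime <- function(x) {
  if (x < 2) return(FALSE)
  divisors <- seq_len(floor(sqrt(x)))[-1]  # 2..floor(sqrt(x)); empty for x < 4
  all(x %% divisors != 0)
}

# Hypothetical helper: group sizes follow consecutive primes,
# the last group takes whatever is left.
primes_groups <- function(n_elements, start_at) {
  stopifnot(is_prime(start_at))
  sizes <- integer(0)
  remaining <- n_elements
  size <- start_at
  while (remaining > 0) {
    used <- min(size, remaining)
    sizes <- c(sizes, used)
    remaining <- remaining - used
    # Advance to the next prime number
    size <- size + 1
    while (!is_prime(size)) size <- size + 1
  }
  factor(rep(seq_along(sizes), times = sizes))
}

primes_sizes <- as.vector(table(primes_groups(57, 5)))
print(primes_sizes)
```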

## Balancing ID Methods

There are currently 4 methods for balancing (up-/downsampling) on ID level in `balance()`.

##### ID method: n_ids

Balances on ID level only. It makes sure there are the same number of IDs in each category. This might lead to a different number of rows between categories.

##### ID method: n_rows_c

Attempts to level the number of rows per category, while only removing/adding entire IDs. This is done with repetition and by iteratively picking the ID with the number of rows closest to the lacking/excessive number of rows in the category.  

##### ID method: distributed

Distributes the lacking/excess rows equally between the IDs. If the number to distribute cannot be equally divided, some IDs will have 1 row more/less than the others.  
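
The arithmetic behind this can be sketched in base R. `distribute_rows()` is a hypothetical helper, and giving the leftover rows to the first IDs is an assumption of this sketch; the package may select those IDs differently.

```r
# Hypothetical helper (not part of groupdata2):
# split `n_rows` rows to add (or remove) as evenly as possible across IDs.
distribute_rows <- function(ids, n_rows) {
  base <- n_rows %/% length(ids)
  leftover <- n_rows %% length(ids)
  # Assumption: the first `leftover` IDs get one row more than the rest
  per_id <- rep(base, length(ids)) + (seq_along(ids) <= leftover)
  setNames(per_id, ids)
}

# Distribute 7 rows across three participant IDs
adjustment <- distribute_rows(ids = c("1", "3", "6"), n_rows = 7)
print(adjustment)
```

The per-ID counts always sum to the number of rows to distribute and never differ by more than one.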

##### ID method: nested

Balances the IDs within their categories, meaning that all IDs in a category will have the same number of rows.