{"id":17000024,"url":"https://github.com/ludvigolsen/groupdata2","last_synced_at":"2025-10-12T13:13:48.437Z","repository":{"id":56934468,"uuid":"72371128","full_name":"LudvigOlsen/groupdata2","owner":"LudvigOlsen","description":"R-package: Methods for dividing data into groups. Create balanced partitions and cross-validation folds. Perform time series windowing and general grouping and splitting of data. Balance existing groups with up- and downsampling or collapse them to fewer groups.","archived":false,"fork":false,"pushed_at":"2024-12-18T17:12:15.000Z","size":1933,"stargazers_count":27,"open_issues_count":1,"forks_count":3,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-09-27T10:31:07.914Z","etag":null,"topics":["balance","cross-validation","data","data-frame","fold","group-factor","groups","participants","partition","rstats","split","staircase"],"latest_commit_sha":null,"homepage":"","language":"R","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/LudvigOlsen.png","metadata":{"files":{"readme":"README.Rmd","changelog":"NEWS.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2016-10-30T19:39:22.000Z","updated_at":"2025-02-01T18:54:54.000Z","dependencies_parsed_at":"2022-08-21T06:50:47.576Z","dependency_job_id":"e048d0df-f139-430a-9d36-28adda7979d2","html_url":"https://github.com/LudvigOlsen/groupdata2","commit_stats":{"total_commits":501,"total_committers":3,"mean_commits":167.0,"dds":"0.053892215568862256","last_synced_commit":"7aa295664c8e4eccf9020fae8446ccac72241a2f"},"previous_names":[],"tags_count":17,"template":false,"template_full_name":null,"purl":"pkg:github/
LudvigOlsen/groupdata2","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LudvigOlsen%2Fgroupdata2","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LudvigOlsen%2Fgroupdata2/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LudvigOlsen%2Fgroupdata2/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LudvigOlsen%2Fgroupdata2/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/LudvigOlsen","download_url":"https://codeload.github.com/LudvigOlsen/groupdata2/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LudvigOlsen%2Fgroupdata2/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":279011468,"owners_count":26084947,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-12T02:00:06.719Z","response_time":53,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["balance","cross-validation","data","data-frame","fold","group-factor","groups","participants","partition","rstats","split","staircase"],"created_at":"2024-10-14T04:10:49.202Z","updated_at":"2025-10-12T13:13:48.380Z","avatar_url":"https://github.com/LudvigOlsen.png","language":"R","readme":"---\noutput: github_document\n---\n\n\u003c!-- README.md is generated from README.Rmd. 
Please edit that file --\u003e\n\n```{r, echo = FALSE}\nknitr::opts_chunk$set(\n  collapse = TRUE,\n  comment = \"#\u003e\",\n  fig.path = \"man/figures/README-\",\n  out.width = \"100%\",\n  dpi = 92,\n  fig.retina = 2\n)\n\n# Digits to print\noptions(\"digits\"=3)\n\nset.seed(1)\n\n# Get minimum R requirement \ndep \u003c- as.vector(read.dcf('DESCRIPTION')[, 'Depends'])\nrvers \u003c- substring(dep, 7, nchar(dep)-1)\n# m \u003c- regexpr('R *\\\\\\\\(\u003e= \\\\\\\\d+.\\\\\\\\d+.\\\\\\\\d+\\\\\\\\)', dep)\n# rm \u003c- regmatches(dep, m)\n# rvers \u003c- gsub('.*(\\\\\\\\d+.\\\\\\\\d+.\\\\\\\\d+).*', '\\\\\\\\1', dep)\n\n# Function for TOC\n# https://gist.github.com/gadenbuie/c83e078bf8c81b035e32c3fc0cf04ee8\n```\n\n# groupdata2 \u003ca href='https://github.com/LudvigOlsen/groupdata2'\u003e\u003cimg src='man/figures/groupdata2_logo_242x280_250dpi.png' align=\"right\" height=\"140\" /\u003e\u003c/a\u003e\n\n**Author:** [Ludvig R. Olsen](https://www.ludvigolsen.dk/) ( r-pkgs@ludvigolsen.dk ) \u003cbr/\u003e\n**License:** [MIT](https://opensource.org/license/mit) \u003cbr/\u003e\n**Started:** October 2016 \n\n[![CRAN_Status_Badge](https://www.r-pkg.org/badges/version/groupdata2)](https://cran.r-project.org/package=groupdata2)\n[![metacran downloads](https://cranlogs.r-pkg.org/badges/groupdata2)](https://cran.r-project.org/package=groupdata2)\n[![minimal R version](https://img.shields.io/badge/R%3E%3D-`r rvers`-6666ff.svg)](https://cran.r-project.org/)\n[![Codecov test coverage](https://codecov.io/gh/ludvigolsen/groupdata2/branch/master/graph/badge.svg)](https://app.codecov.io/gh/ludvigolsen/groupdata2?branch=master)\n[![GitHub Actions CI status](https://github.com/ludvigolsen/groupdata2/actions/workflows/R-check.yaml/badge.svg?branch=master)](https://github.com/ludvigolsen/groupdata2/actions/workflows/R-check.yaml?branch=master)\n[![AppVeyor build 
status](https://ci.appveyor.com/api/projects/status/github/LudvigOlsen/groupdata2?branch=master\u0026svg=true)](https://ci.appveyor.com/project/LudvigOlsen/groupdata2)\n[![DOI](https://zenodo.org/badge/72371128.svg)](https://zenodo.org/badge/latestdoi/72371128)\n\n## Overview\n\nR package for dividing data into groups.\n\n* Create **balanced partitions** and cross-validation **folds**. \n* Perform time series **windowing** and general **grouping** and **splitting** of data. \n* **Balance** existing groups with **up- and downsampling**.\n* **Collapse** existing groups to fewer, balanced groups.\n* Finds values, or indices of values, that **differ** from the previous value by some threshold(s).\n* Check if two grouping factors have the same groups, **memberwise**.\n\n\n### Main functions\n\n|Function              |Description                                                        |\n|:---------------------|:------------------------------------------------------------------|\n|`group_factor()`      |Divides data into groups by a wide range of methods.               |\n|`group()`             |Creates grouping factor and adds to the given data frame.          |\n|`splt()`              |Creates grouping factor and splits the data by these groups.       |\n|`partition()`         |Splits data into partitions. Balances a given categorical variable and/or numerical variable between partitions and keeps all data points with a shared ID in the same partition.  |\n|`fold()`              |Creates folds for (repeated) cross-validation. Balances a given categorical variable and/or numerical variable between folds and keeps all data points with a shared ID in the same fold. |\n|`collapse_groups()`   |Collapses existing groups into a smaller set of groups with categorical, numerical, ID, and size balancing. |\n|`balance()`           |Uses up- and/or downsampling to equalize group sizes. Can balance on ID level. 
See wrappers: `downsample()`, `upsample()`.|\n\n### Other tools\n\n|Function                  |Description                                                        |\n|:-------------------------|:------------------------------------------------------------------|\n|`all_groups_identical()`  |Checks whether two grouping factors contain the same groups, *memberwise*.|\n|`differs_from_previous()` |Finds values, or indices of values, that differ from the previous value by some threshold(s).|\n|`find_starts()`           |Finds values or indices of values that are not the same as the previous value.|\n|`find_missing_starts()`   |Finds missing starts for the `l_starts` method.|\n|`summarize_group_cols()`  |Calculates summary statistics about group columns (i.e. `factor`s).|\n|`summarize_balances()`    |Summarizes the balances of numeric, categorical, and ID columns in and between groups in one or more group columns. |\n|`ranked_balances()`       |Extracts the standard deviations from the `Summary` data frame from the output of `summarize_balances()` |\n|`%primes%`                |Finds remainder for the `primes` method.   
|\n|`%staircase%`             |Finds remainder for the `staircase` method.|\n\n## Table of Contents\n\n```{r toc, echo=FALSE}\ngroupdata2:::render_toc(\"README.Rmd\")\n```\n\n\n## Installation\n\nCRAN version:\n\n\u003e `install.packages(\"groupdata2\")`  \n\nDevelopment version:  \n\n\u003e `install.packages(\"devtools\")`  \n\u003e `devtools::install_github(\"LudvigOlsen/groupdata2\")`  \n\n## Vignettes\n\n`groupdata2` contains a number of vignettes with relevant use cases and descriptions:  \n  \n\u003e `vignette(package = \"groupdata2\")` # for an overview   \n\u003e `vignette(\"introduction_to_groupdata2\")` # begin here   \n\n## Data for examples\n\n```{r error=FALSE, warning=FALSE, message=FALSE}\n# Attach packages\nlibrary(groupdata2)\nlibrary(dplyr)       # %\u003e% filter() arrange() summarize()\nlibrary(knitr)       # kable()\n```\n\n```{r}\n# Create small data frame\ndf_small \u003c- data.frame(\n  \"x\" = c(1:12),\n  \"species\" = rep(c('cat', 'pig', 'human'), 4),\n  \"age\" = sample(c(1:100), 12),\n  stringsAsFactors = FALSE\n)\n```\n\n```{r}\n# Create medium data frame\ndf_medium \u003c- data.frame(\n  \"participant\" = factor(rep(c('1', '2', '3', '4', '5', '6'), 3)),\n  \"age\" = rep(c(20, 33, 27, 21, 32, 25), 3),\n  \"diagnosis\" = factor(rep(c('a', 'b', 'a', 'b', 'b', 'a'), 3)),\n  \"diagnosis2\" = factor(sample(c('x','z','y'), 18, replace = TRUE)),\n  \"score\" = c(10, 24, 15, 35, 24, 14, 24, 40, 30, \n              50, 54, 25, 45, 67, 40, 78, 62, 30))\ndf_medium \u003c- df_medium %\u003e% arrange(participant)\ndf_medium$session \u003c- rep(c('1','2', '3'), 6)\n\n```\n\n\n## Functions\n\n### group_factor()\n\nReturns a factor with group numbers, e.g. `factor(c(1,1,1,2,2,2,3,3,3))`.  \n\nThis can be used to subset, aggregate, group_by, etc.   
\n\nCreate equally sized groups by setting `force_equal = TRUE`  \n\nRandomize grouping factor by setting `randomize = TRUE`  \n\n```{r}\n# Create grouping factor\ngroup_factor(\n  data = df_small, \n  n = 5, \n  method = \"n_dist\"\n)\n```\n\n\n### group()\n\nCreates a grouping factor and adds it to the given data frame. The data frame is grouped by the grouping factor for easy use in `magrittr` (`%\u003e%`) pipelines.  \n\n```{r}\n# Use group()\ngroup(data = df_small, n = 5, method = 'n_dist') %\u003e%\n  kable()\n```\n\n```{r}\n# Use group() in a pipeline \n# Get average age per group\ndf_small %\u003e%\n  group(n = 5, method = 'n_dist') %\u003e% \n  dplyr::summarise(mean_age = mean(age)) %\u003e%\n  kable()\n```\n\n```{r}\n# Using group() with 'l_starts' method\n# Starts group at the first 'cat', \n# then skips to the second appearance of \"pig\" after \"cat\",\n# then starts at the following \"cat\".\ndf_small %\u003e%\n  group(n = list(\"cat\", c(\"pig\", 2), \"cat\"),\n        method = 'l_starts',\n        starts_col = \"species\") %\u003e%\n  kable()\n\n```\n\n\n### splt()\n\nCreates the specified groups with `group_factor()` and splits the given data by the grouping factor with `base::split`. Returns the splits in a list.  \n\n```{r}\nsplt(data = df_small,\n     n = 3,\n     method = 'n_dist') %\u003e%\n  kable()\n```\n\n\n### partition()\n\nCreates (optionally) balanced partitions (e.g. training/test sets). Balance partitions on categorical variable(s) and/or a numerical variable. 
Make sure that all data points sharing an ID are in the same partition.\n\n```{r}\n# First set seed to ensure reproducibility\nset.seed(1)\n\n# Use partition() with categorical and numerical balancing,\n# while ensuring all rows per ID are in the same partition\ndf_partitioned \u003c- partition(\n  data = df_medium, \n  p = 0.7,\n  cat_col = 'diagnosis',\n  num_col = \"age\",\n  id_col = 'participant'\n)\n\ndf_partitioned %\u003e% \n  kable()\n```\n\n\n### fold()\n\nCreates (optionally) balanced folds for use in cross-validation. Balance folds on categorical variable(s) and/or a numerical variable. Ensure that all data points sharing an ID are in the same fold. Create multiple unique fold columns at once, e.g. for repeated cross-validation.  \n\n```{r}\n# First set seed to ensure reproducibility\nset.seed(1)\n\n# Use fold() with categorical and numerical balancing,\n# while ensuring all rows per ID are in the same fold\ndf_folded \u003c- fold(\n  data = df_medium, \n  k = 3,\n  cat_col = 'diagnosis',\n  num_col = \"age\",\n  id_col = 'participant'\n)\n\n# Show df_folded ordered by folds\ndf_folded %\u003e% \n  arrange(.folds) %\u003e%\n  kable()\n```\n\n```{r}\n# Show distribution of diagnoses and participants\ndf_folded %\u003e% \n  group_by(.folds) %\u003e% \n  count(diagnosis, participant) %\u003e% \n  kable()\n```\n\n```{r}\n# Show age representation in folds\n# Notice that we would get a more even distribution if we had more data.\n# As age is fixed per ID, we only have 3 ages per category to balance with.\ndf_folded %\u003e% \n  group_by(.folds) %\u003e% \n  summarize(mean_age = mean(age),\n            sd_age = sd(age)) %\u003e% \n  kable()\n\n```\n\n**Notice** that we now have the opportunity to include the *session* variable and/or use *participant* as a random effect in our model when doing cross-validation, as any participant will only appear in one fold.  
\n\nWe also have a balance in the representation of each diagnosis, which could give us better, more consistent results.  \n\n### collapse_groups()\n\nCollapses a set of groups into a smaller set of groups while attempting to balance the new groups by specified numerical columns, categorical columns, level counts in ID columns, and/or the number of rows.\n\n```{r}\n# We consider each participant a group\n# and collapse them into 3 new groups\n# We balance the number of levels in diagnosis2 column, \n# as this diagnosis is not constant within the participants\ndf_collapsed \u003c- collapse_groups(\n  data = df_medium,\n  n = 3,\n  group_cols = 'participant',\n  cat_cols = 'diagnosis2',\n  num_cols = \"score\"\n) \n\n# Show df_collapsed ordered by new collapsed groups\ndf_collapsed %\u003e% \n  arrange(.coll_groups) %\u003e%\n  kable()\n\n# Summarize the balances of the new groups\ncoll_summ \u003c- df_collapsed %\u003e% \n  summarize_balances(group_cols = '.coll_groups',\n                     cat_cols = \"diagnosis2\",\n                     num_cols = \"score\")\n\ncoll_summ$Groups %\u003e% \n  kable()\n\ncoll_summ$Summary %\u003e% \n  kable()\n\n# Check the across-groups standard deviations \n# This is a measure of how balanced the groups are (lower == more balanced)\n# and is especially useful when comparing multiple group columns\ncoll_summ %\u003e% \n  ranked_balances() %\u003e%\n  kable()\n\n```\n\n**Recommended**: By enabling the `auto_tune` setting, we often get a much better balance.\n\n### balance()\n\nUses up- and/or downsampling to fix the group sizes to the min, max, mean, or median group size or to a specific number of rows.\nBalancing can also happen on the ID level, e.g. to ensure the same number of IDs in each category.   
\n \n\n```{r}\n# Let's first unbalance the dataset by removing some rows\ndf_b \u003c- df_medium %\u003e% \n  arrange(diagnosis) %\u003e% \n  filter(!row_number() %in% c(5,7,8,13,14,16,17,18))\n\n# Show distribution of diagnoses and participants\ndf_b %\u003e% \n  count(diagnosis, participant) %\u003e% \n  kable()\n```\n\n```{r}\n# First set seed to ensure reproducibility\nset.seed(1)\n\n# Downsampling by diagnosis\nbalance(\n  data = df_b, \n  size = \"min\", \n  cat_col = \"diagnosis\"\n) %\u003e% \n  count(diagnosis, participant) %\u003e% \n  kable()\n```\n\n```{r}\n# Downsampling the IDs\nbalance(\n  data = df_b, \n  size = \"min\", \n  cat_col = \"diagnosis\", \n  id_col = \"participant\", \n  id_method = \"n_ids\"\n) %\u003e% \n  count(diagnosis, participant) %\u003e% \n  kable()\n  \n```\n \n## Grouping Methods\nThere are currently 10 methods available. They can be divided into 6 categories.  \n\n*Examples of group sizes are based on a vector with 57 elements.*  \n\n### Specify group size\n##### Method: greedy\nDivides up the data greedily given a specified group size.  \n\nE.g. group sizes: 10, 10, 10, 10, 10, 7   \n\n### Specify number of groups\n##### Method: n_dist (Default)\nDivides the data into a specified number of groups and \ndistributes excess data points across groups.  \n\nE.g. group sizes: 11, 11, 12, 11, 12  \n\n##### Method: n_fill\nDivides the data into a specified number of groups and \nfills up groups with excess data points from the beginning.   \n\nE.g. group sizes: 12, 12, 11, 11, 11  \n\n##### Method: n_last\nDivides the data into a specified number of groups. \nThe algorithm finds the most equal group sizes possible, \nusing all data points. Only the last group is able to differ in size.  \n\nE.g. group sizes: 11, 11, 11, 11, 13  \n\n##### Method: n_rand\nDivides the data into a specified number of groups. \nExcess data points are placed randomly in groups (only 1 per group).  \n\nE.g. 
group sizes: 12, 11, 11, 11, 12  \n\n### Specify list\n##### Method: l_sizes\nUses a list / vector of group sizes to divide up the data.  \nExcess data points are placed in an extra group.  \n\nE.g. `n = c(11, 11)` returns group sizes: 11, 11, 35  \n\n##### Method: l_starts\nUses a list of starting positions to divide up the data.  \nStarting positions are values in a vector (e.g. a column in a data frame). \nSkip to a specific nth appearance of a value by using `c(value, skip_to)`.  \n\nE.g. `n = c(11, 15, 27, 43)` returns group sizes: 10, 4, 12, 16, 15  \n\nIdentical to `n = list(11, 15, c(27, 1), 43)` where `1` specifies that we \nwant the first appearance of 27 after the previous value 15.  \n\nIf passing `n = \"auto\"`, starting positions are automatically found with `find_starts()`.  \n\n### Specify distance between members\n##### Method: every\nEvery `n`th data point is combined into a group.\n\nE.g. group sizes: 12, 12, 11, 11, 11  \n\n### Specify step size\n##### Method: staircase\nUses step_size to divide up the data. \nGroup size increases by 1 step for every group, until there is no more data.  \n\nE.g. group sizes: 5, 10, 15, 20, 7  \n\n### Specify start at\n##### Method: primes\nCreates groups with sizes corresponding to prime numbers.  \nStarts at `n` (prime number). Increases to the next prime number until there is no more data.\n\nE.g. group sizes: 5, 7, 11, 13, 17, 4  \n\n## Balancing ID Methods\nThere are currently 4 methods for balancing (up-/downsampling) on ID level in `balance()`.\n\n##### ID method: n_ids\nBalances on ID level only. It makes sure there are the same number of IDs in each category. This might lead to a different number of rows between categories.\n\n##### ID method: n_rows_c\nAttempts to level the number of rows per category, while only removing/adding entire IDs. This is done with repetition and by iteratively picking the ID with the number of rows closest to the lacking/excessive number of rows in the category.  
\n\n##### ID method: distributed\nDistributes the lacking/excess rows equally between the IDs. If the number to distribute cannot be equally divided, some IDs will have 1 row more/less than the others.  \n\n##### ID method: nested\nBalances the IDs within their categories, meaning that all IDs in a category will have the same number of rows.  \n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fludvigolsen%2Fgroupdata2","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fludvigolsen%2Fgroupdata2","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fludvigolsen%2Fgroupdata2/lists"}