{"id":13666028,"url":"https://github.com/trinker/wakefield","last_synced_at":"2025-04-04T22:08:45.124Z","repository":{"id":30376161,"uuid":"33928775","full_name":"trinker/wakefield","owner":"trinker","description":"Generate random data sets","archived":false,"fork":false,"pushed_at":"2022-10-03T16:32:15.000Z","size":3967,"stargazers_count":256,"open_issues_count":16,"forks_count":28,"subscribers_count":16,"default_branch":"master","last_synced_at":"2024-10-11T18:26:32.140Z","etag":null,"topics":["data-generation","r","wakefield"],"latest_commit_sha":null,"homepage":null,"language":"R","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/trinker.png","metadata":{"files":{"readme":"README.Rmd","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2015-04-14T11:52:47.000Z","updated_at":"2024-08-26T18:58:05.000Z","dependencies_parsed_at":"2022-09-08T08:50:41.495Z","dependency_job_id":null,"html_url":"https://github.com/trinker/wakefield","commit_stats":null,"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/trinker%2Fwakefield","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/trinker%2Fwakefield/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/trinker%2Fwakefield/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/trinker%2Fwakefield/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/trinker","download_url":"https://codeload.github.com/trinker/wakefield/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247256115,
"owners_count":20909240,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-generation","r","wakefield"],"created_at":"2024-08-02T06:00:55.878Z","updated_at":"2025-04-04T22:08:45.105Z","avatar_url":"https://github.com/trinker.png","language":"R","funding_links":[],"categories":["R"],"sub_categories":[],"readme":"---\ntitle: \"wakefield\"\ndate: \"`r format(Sys.time(), '%d %B, %Y')`\"\noutput:\n  md_document:\n    toc: true\n    toc_depth: 4\n---\n\n```{r, echo=FALSE}\ndesc \u003c- suppressWarnings(readLines(\"DESCRIPTION\"))\nregex \u003c- \"(^Version:\\\\s+)(\\\\d+\\\\.\\\\d+\\\\.\\\\d+)\"\nloc \u003c- grep(regex, desc)\nver \u003c- gsub(regex, \"\\\\2\", desc[loc])\nlibrary(pacman)\n# verbadge \u003c- sprintf('\u003ca href=\"https://img.shields.io/badge/Version-%s-orange.svg\"\u003e\u003cimg src=\"https://img.shields.io/badge/Version-%s-orange.svg\" alt=\"Version\"/\u003e\u003c/a\u003e\u003c/p\u003e', ver, ver)\nverbadge \u003c- ''\np_load(dplyr, wakefield, knitr, tidyr, ggplot2)\n````\n\n```{r, echo=FALSE}\nknit_hooks$set(htmlcap = function(before, options, envir) {\n  if(!before) {\n    paste('\u003cp class=\"caption\"\u003e\u003cb\u003e\u003cem\u003e',options$htmlcap,\"\u003c/em\u003e\u003c/b\u003e\u003c/p\u003e\",sep=\"\")\n    }\n    })\nknitr::opts_knit$set(self.contained = TRUE, cache = FALSE)\nknitr::opts_chunk$set(fig.path = \"tools/figure/\")\n```\n\n[![Project Status: Active - The project has reached a stable, usable state and is being actively developed.](https://www.repostatus.org/badges/0.1.0/active.svg)](https://www.repostatus.org/#active)\n[![Build 
Status](https://travis-ci.org/trinker/wakefield.svg?branch=master)](https://travis-ci.org/trinker/wakefield)\n[![Coverage Status](https://s3.amazonaws.com/assets.coveralls.io/badges/coveralls_0.svg)](https://coveralls.io/github/trinker/wakefield)\n[![DOI](https://zenodo.org/badge/5398/trinker/wakefield.svg)](https://dx.doi.org/10.5281/zenodo.17172)\n[![](https://cranlogs.r-pkg.org/badges/wakefield)](https://cran.r-project.org/package=wakefield)\n`r verbadge`\n\n\n**wakefield** is designed to quickly generate random data sets.  The user passes `n` (number of rows) and predefined vectors to the `r_data_frame` function to produce a `dplyr::tbl_df` object.\n\n![](tools/wakefield_logo/r_wakefield.png)  \n\n# Installation\n\nTo install the development version of **wakefield**:\n\nDownload the [zip ball](https://github.com/trinker/wakefield/zipball/master) or [tar ball](https://github.com/trinker/wakefield/tarball/master), decompress it, and run `R CMD INSTALL` on it, or use the **pacman** package to install the development version:\n\n```r\nif (!require(\"pacman\")) install.packages(\"pacman\")\npacman::p_load_gh(\"trinker/wakefield\")\npacman::p_load(dplyr, tidyr, ggplot2)\n```\n\n\n# Contact\n\nYou are welcome to:\n* submit suggestions and bug-reports at: \u003chttps://github.com/trinker/wakefield/issues\u003e\n* send a pull request on: \u003chttps://github.com/trinker/wakefield/\u003e\n* compose a friendly e-mail to: \u003ctyler.rinker@gmail.com\u003e\n\n# Demonstration\n## Getting Started\n\nThe `r_data_frame` function (random data frame) takes `n` (the number of rows) and any number of variables (columns).  These columns are typically produced from a **wakefield** variable function.  Each of these variable functions has a pre-set behavior that produces a named vector of length `n`, allowing the user to lazily pass unnamed functions (optionally, without call parentheses).  The column name is stored as a `varname` attribute.  
For example, here we see the `race` variable function:\n\n```{r}\nrace(n=10)\nattributes(race(n=10))\n```\n\nWhen this variable function is used inside `r_data_frame`, the `varname` is used as the column name.  Additionally, the `n` argument is not set within variable functions but is set once in `r_data_frame`:\n\n```{r}\nr_data_frame(\n    n = 500,\n    race\n)\n```\n\nThe power of `r_data_frame` is apparent when we use many modular variable functions:\n\n```{r}\nr_data_frame(\n    n = 500,\n    id,\n    race,\n    age,\n    sex,\n    hour,\n    iq,\n    height,\n    died\n)\n```\n\n\nThere are `r length(variables())` **wakefield**-based variable functions to choose from, spanning **R**'s various data types (see `?variables` for details).  \n\n```{r, results='asis', echo=FALSE, comment=NA, warning=FALSE, htmlcap=\"Available Variable Functions\"}\np_load(pander, xtable)\n\nvariables(\"matrix\", ncol=5) %\u003e%\n    xtable() %\u003e%\n    print(type = 'html', include.colnames = FALSE, include.rownames = FALSE,\n        html.table.attributes = '')\n\n#matrix(c(sprintf(\"`%s`\", vect), blanks), ncol=4) %\u003e%\n#    pandoc.table(format = \"markdown\", caption = \"Available variable functions.\")\n```\n\nHowever, the user may also pass their own vector-producing functions or vectors to `r_data_frame`.  For functions with an `n` argument, `n` is set by `r_data_frame`:\n\n```{r}\nr_data_frame(\n    n = 500,\n    id,\n    Scoring = rnorm,\n    Smoker = valid,\n    race,\n    age,\n    sex,\n    hour,\n    iq,\n    height,\n    died\n)\n```\n\n\n```{r}\nr_data_frame(\n    n = 500,\n    id,\n    age, age, age,\n    grade, grade, grade\n)\n```\n\n\nWhile passing variable functions to `r_data_frame` without call parentheses is handy, the user may wish to set arguments.  
This can be done through call parentheses, as we do with `data.frame` or `dplyr::data_frame`:\n\n```{r}\nr_data_frame(\n    n = 500,\n    id,\n    Scoring = rnorm,\n    Smoker = valid,\n    `Reading(mins)` = rpois(lambda=20),  \n    race,\n    age(x = 8:14),\n    sex,\n    hour,\n    iq,\n    height(mean=50, sd = 10),\n    died\n)\n```\n\n## Random Missing Observations\n\nOften data contains missing values.  **wakefield** allows the user to add a proportion of missing values per column/vector via the `r_na` (random `NA`) function.  This works nicely within a **dplyr**/**magrittr** `%\u003e%` *then* pipeline:\n\n```{r}\nr_data_frame(\n    n = 30,\n    id,\n    race,\n    age,\n    sex,\n    hour,\n    iq,\n    height,\n    died,\n    Scoring = rnorm,\n    Smoker = valid\n) %\u003e%\n    r_na(prob=.4)\n```\n\n## Repeated Measures \u0026 Time Series\n\nThe `r_series` function allows the user to pass a single **wakefield** function and dictate how many columns (`j`) to produce.  \n\n```{r}\nset.seed(10)\n\nr_series(likert, j = 3, n=10)\n```\n\nOften the user wants a numeric score for Likert-type columns and similar variables.  For series with multiple factors, the `as_integer` function converts all columns to integer values.  Additionally, we may want to specify column name prefixes. This can be accomplished via the variable function's `name` argument.  Both of these features are demonstrated here.\n\n```{r}\nset.seed(10)\n\nas_integer(r_series(likert, j = 5, n=10, name = \"Item\"))\n```\n\n`r_series` can be used within an `r_data_frame` call as well.  \n\n```{r}\nset.seed(10)\n\nr_data_frame(n=100,\n    id,\n    age,\n    sex,\n    r_series(likert, 3, name = \"Question\")\n)\n```\n\n\n```{r}\nset.seed(10)\n\nr_data_frame(n=100,\n    id,\n    age,\n    sex,\n    r_series(likert, 5, name = \"Item\", integer = TRUE)\n)\n```\n\n### Related Series\n\nThe user can also create related series via the `relate` argument in `r_series`.  It allows the user to specify the relationship between columns.  
`relate` may be a named list of `c(\"operation\", \"mean\", \"sd\")` or a shorthand string of the form `\"fM_sd\"` where:\n\n- `f` is one of (+, -, *, /)\n- `M` is a mean value\n- `sd` is a standard deviation of the mean value \n\nFor example, you may use `relate = \"*4_1\"`.  If `relate = NULL`, no relationship is generated between columns.  The examples here use the shorthand string form.\n\n#### Some Examples With Variation\n\n```{r}\nr_series(grade, j = 5, n = 100, relate = \"+1_6\")\nr_series(age, 5, 100, relate = \"+5_0\")\nr_series(likert, 5,  100, name =\"Item\", relate = \"-.5_.1\")\nr_series(grade, j = 5, n = 100, relate = \"*1.05_.1\")\n```\n\n#### Adjust Correlations\n\nUse the `sd` component of `relate` to adjust correlations.\n\n```{r}\nround(cor(r_series(grade, 8, 10, relate = \"+1_2\")), 2)\nround(cor(r_series(grade, 8, 10, relate = \"+1_0\")), 2)\nround(cor(r_series(grade, 8, 10, relate = \"+1_20\")), 2)\nround(cor(r_series(grade, 8, 10, relate = \"+15_20\")), 2)\n```\n\n#### Visualize the Relationship\n\n```{r, fig.height=7, fig.width=11}\ndat \u003c- r_data_frame(12,\n    name,\n    r_series(grade, 100, relate = \"+1_6\")\n) \n\ndat %\u003e%\n    gather(Time, Grade, -c(Name)) %\u003e%\n    mutate(Time = as.numeric(gsub(\"\\\\D\", \"\", Time))) %\u003e%\n    ggplot(aes(x = Time, y = Grade, color = Name, group = Name)) +\n        geom_line(size=.8) + \n        theme_bw()\n```\n\n\n## Expanded Dummy Coding\n\nThe user may wish to expand a `factor` into `j` dummy coded columns.  The `r_dummy` function expands a factor into `j` columns and works similarly to the `r_series` function.  The user may wish to use the original factor name as the prefix to the `j` columns.  Setting `prefix = TRUE` within `r_dummy` accomplishes this.\n\n\n```{r}\nset.seed(10)\nr_data_frame(n=100,\n    id,\n    age,\n    r_dummy(sex, prefix = TRUE),\n    r_dummy(political)\n)\n```\n\n\n## Visualizing Column Types\n\nIt is helpful to see the column types and `NA`s as a visualization.  
The `table_heat` function (also the `plot` method assigned to `tbl_df`) provides a visual glimpse of data types and missing cells.\n\n```{r, fig.height=7, fig.width=11}\nset.seed(10)\n\nr_data_frame(n=100,\n    id,\n    dob,\n    animal,\n    grade, grade,\n    death,\n    dummy,\n    grade_letter,\n    gender,\n    paragraph,\n    sentence\n) %\u003e%\n   r_na() %\u003e%\n   plot(palette = \"Set1\")\n```\n\n\n\n\n\n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftrinker%2Fwakefield","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftrinker%2Fwakefield","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftrinker%2Fwakefield/lists"}