Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/trinker/wakefield
Generate random data sets
data-generation r wakefield
Last synced: 7 days ago
- Host: GitHub
- URL: https://github.com/trinker/wakefield
- Owner: trinker
- Created: 2015-04-14T11:52:47.000Z (almost 10 years ago)
- Default Branch: master
- Last Pushed: 2022-10-03T16:32:15.000Z (over 2 years ago)
- Last Synced: 2024-10-11T18:26:32.140Z (4 months ago)
- Topics: data-generation, r, wakefield
- Language: R
- Size: 3.78 MB
- Stars: 256
- Watchers: 16
- Forks: 28
- Open Issues: 16
Metadata Files:
- Readme: README.Rmd
Awesome Lists containing this project
- jimsghstars - trinker/wakefield - Generate random data sets (R)
README
---
title: "wakefield"
date: "`r format(Sys.time(), '%d %B, %Y')`"
output:
  md_document:
    toc: true
    toc_depth: 4
---

```{r, echo=FALSE}
desc <- suppressWarnings(readLines("DESCRIPTION"))
regex <- "(^Version:\\s+)(\\d+\\.\\d+\\.\\d+)"
loc <- grep(regex, desc)
ver <- gsub(regex, "\\2", desc[loc])
library(pacman)
# verbadge <- sprintf('', ver, ver)
verbadge <- ''
p_load(dplyr, wakefield, knitr, tidyr, ggplot2)
```

```{r, echo=FALSE}
# Print the chunk's `htmlcap` option as a caption after the chunk output.
# The original caption markup was lost; the plain <p> wrapper is an assumption.
knit_hooks$set(htmlcap = function(before, options, envir) {
  if (!before) {
    paste('<p class="caption">', options$htmlcap, "</p>", sep = "")
  }
})
knitr::opts_knit$set(self.contained = TRUE, cache = FALSE)
knitr::opts_chunk$set(fig.path = "tools/figure/")
```

[![Project Status: Active - The project has reached a stable, usable state and is being actively developed.](https://www.repostatus.org/badges/0.1.0/active.svg)](https://www.repostatus.org/#active)
[![Build Status](https://travis-ci.org/trinker/wakefield.svg?branch=master)](https://travis-ci.org/trinker/wakefield)
[![Coverage Status](https://s3.amazonaws.com/assets.coveralls.io/badges/coveralls_0.svg)](https://coveralls.io/github/trinker/wakefield)
[![DOI](https://zenodo.org/badge/5398/trinker/wakefield.svg)](https://dx.doi.org/10.5281/zenodo.17172)
[![](https://cranlogs.r-pkg.org/badges/wakefield)](https://cran.r-project.org/package=wakefield)
`r verbadge`

**wakefield** is designed to quickly generate random data sets. The user passes `n` (number of rows) and predefined vectors to the `r_data_frame` function to produce a `dplyr::tbl_df` object.
![](tools/wakefield_logo/r_wakefield.png)
# Installation
To download the development version of **wakefield**:
Download the [zip ball](https://github.com/trinker/wakefield/zipball/master) or [tar ball](https://github.com/trinker/wakefield/tarball/master), decompress and run `R CMD INSTALL` on it, or use the **pacman** package to install the development version:
```r
if (!require("pacman")) install.packages("pacman")
pacman::p_load_gh("trinker/wakefield")
pacman::p_load(dplyr, tidyr, ggplot2)
```

# Contact
You are welcome to:
* submit suggestions and bug-reports at:
* send a pull request on:
* compose a friendly e-mail to:

# Demonstration

## Getting Started

The `r_data_frame` function (random data frame) takes `n` (the number of rows) and any number of variables (columns). These columns are typically produced from a **wakefield** variable function. Each of these variable functions has a pre-set behavior that produces a named vector of length `n`, allowing the user to lazily pass unnamed functions (optionally, without call parentheses). The column name is hidden as a `varname` attribute. For example, here we see the `race` variable function:
```{r}
race(n=10)
attributes(race(n=10))
```

When this variable is used inside of `r_data_frame`, the `varname` is used as a column name. Additionally, the `n` argument is not set within variable functions but is set once in `r_data_frame`:
```{r}
r_data_frame(
n = 500,
race
)
```

The power of `r_data_frame` is apparent when we use many modular variable functions:
```{r}
r_data_frame(
n = 500,
id,
race,
age,
sex,
hour,
iq,
height,
died
)
```

There are `r length(variables())` **wakefield** based variable functions to choose from, spanning **R**'s various data types (see `?variables` for details).
```{r, results='asis', echo=FALSE, comment=NA, warning=FALSE, htmlcap="Available Variable Functions"}
p_load(pander, xtable)

variables("matrix", ncol=5) %>%
    xtable() %>%
    print(type = 'html', include.colnames = FALSE, include.rownames = FALSE,
        html.table.attributes = '')

#matrix(c(sprintf("`%s`", vect), blanks), ncol=4) %>%
#    pandoc.table(format = "markdown", caption = "Available variable functions.")
```

However, the user may also pass their own vector-producing functions or vectors to `r_data_frame`. Those with an `n` argument can be set by `r_data_frame`:
```{r}
r_data_frame(
n = 500,
id,
Scoring = rnorm,
Smoker = valid,
race,
age,
sex,
hour,
iq,
height,
died
)
```

```{r}
r_data_frame(
n = 500,
id,
age, age, age,
grade, grade, grade
)
```

While passing variable functions to `r_data_frame` without call parentheses is handy, the user may wish to set arguments. This can be done through call parentheses, as we do with `data.frame` or `dplyr::data_frame`:
```{r}
r_data_frame(
n = 500,
id,
Scoring = rnorm,
Smoker = valid,
`Reading(mins)` = rpois(lambda=20),
race,
age(x = 8:14),
sex,
hour,
iq,
height(mean=50, sd = 10),
died
)
```

## Random Missing Observations

Data often contains missing values. **wakefield** allows the user to add a proportion of missing values per column/vector via the `r_na` (random `NA`) function. This works nicely within a **dplyr**/**magrittr** `%>%` *then* pipeline:
```{r}
r_data_frame(
n = 30,
id,
race,
age,
sex,
hour,
iq,
height,
died,
Scoring = rnorm,
Smoker = valid
) %>%
r_na(prob=.4)
```

## Repeated Measures & Time Series

The `r_series` function allows the user to pass a single **wakefield** function and dictate how many columns (`j`) to produce.
```{r}
set.seed(10)

r_series(likert, j = 3, n=10)
```

Often the user wants a numeric score for Likert-type columns and similar variables. For series with multiple factors, the `as_integer` function converts all columns to integer values. Additionally, we may want to specify column name prefixes. This can be accomplished via the variable function's `name` argument. Both of these features are demonstrated here.
```{r}
set.seed(10)

as_integer(r_series(likert, j = 5, n=10, name = "Item"))
```

`r_series` can be used within a `r_data_frame` as well.
```{r}
set.seed(10)

r_data_frame(n=100,
id,
age,
sex,
r_series(likert, 3, name = "Question")
)
```

```{r}
set.seed(10)

r_data_frame(n=100,
id,
age,
sex,
r_series(likert, 5, name = "Item", integer = TRUE)
)
```

### Related Series

The user can also create related series via the `relate` argument in `r_series`. It allows the user to specify the relationship between columns. `relate` may be a named list of `c("operation", "mean", "sd")` or a short-hand string of the form `"fM_sd"` where:
- `f` is one of (+, -, *, /)
- `M` is a mean value
- `sd` is a standard deviation of the mean value

For example you may use `relate = "*4_1"`. If `relate = NULL` no relationship is generated between columns. I will use the short hand string form here.
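
To make the short-hand format concrete, here is a minimal base R sketch of how a string such as `"*4_1"` could be split into its three parts and applied. It assumes that each new column is the previous column combined, through the operation `f`, with normal noise drawn from the given mean and standard deviation. That is only one plausible reading of the description above, and `parse_relate` and `related_columns` are hypothetical helpers, not **wakefield** functions.

```r
# Hypothetical helper: split a short-hand string such as "*4_1" into its parts.
parse_relate <- function(relate) {
  pattern <- "^([-+*/])(\\d*\\.?\\d+)_(\\d*\\.?\\d+)$"
  parts <- regmatches(relate, regexec(pattern, relate))[[1]]
  if (length(parts) != 4) stop("expected the form 'fM_sd', e.g. '*4_1'")
  list(
    operation = parts[2],
    mean = as.numeric(parts[3]),
    sd = as.numeric(parts[4])
  )
}

# Hypothetical helper (assumption, not wakefield's implementation):
# build j columns where each column is the previous one combined with
# normal noise via the parsed operation.
related_columns <- function(x, j, relate) {
  stopifnot(j >= 2)
  spec <- parse_relate(relate)
  step <- function(prev, i) {
    noise <- rnorm(length(prev), mean = spec$mean, sd = spec$sd)
    match.fun(spec$operation)(prev, noise)
  }
  cols <- Reduce(step, seq_len(j - 1), accumulate = TRUE, init = x)
  out <- do.call(cbind, cols)
  colnames(out) <- paste0("V", seq_len(j))
  out
}

parse_relate("*4_1")
related_columns(c(10, 20), j = 3, relate = "+5_0")
```

With a standard deviation of zero, as in `"+5_0"`, the noise is constant, so every column in this sketch is exactly 5 larger than the one before it.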
#### Some Examples With Variation
```{r}
r_series(grade, j = 5, n = 100, relate = "+1_6")
r_series(age, 5, 100, relate = "+5_0")
r_series(likert, 5, 100, name ="Item", relate = "-.5_.1")
r_series(grade, j = 5, n = 100, relate = "*1.05_.1")
```

#### Adjust Correlations

Use the `sd` component of `relate` to adjust the correlations between columns.
```{r}
round(cor(r_series(grade, 8, 10, relate = "+1_2")), 2)
round(cor(r_series(grade, 8, 10, relate = "+1_0")), 2)
round(cor(r_series(grade, 8, 10, relate = "+1_20")), 2)
round(cor(r_series(grade, 8, 10, relate = "+15_20")), 2)
```

#### Visualize the Relationship

```{r, fig.height=7, fig.width=11}
dat <- r_data_frame(12,
name,
r_series(grade, 100, relate = "+1_6")
)

dat %>%
gather(Time, Grade, -c(Name)) %>%
mutate(Time = as.numeric(gsub("\\D", "", Time))) %>%
ggplot(aes(x = Time, y = Grade, color = Name, group = Name)) +
geom_line(size=.8) +
theme_bw()
```

## Expanded Dummy Coding

The user may wish to expand a `factor` into `j` dummy coded columns. The `r_dummy` function expands a factor into `j` columns and works similarly to the `r_series` function. The user may wish to use the original factor name as the prefix to the `j` columns. Setting `prefix = TRUE` within `r_dummy` accomplishes this.
```{r}
set.seed(10)
r_data_frame(n=100,
id,
age,
r_dummy(sex, prefix = TRUE),
r_dummy(political)
)
```

## Visualizing Column Types

It is helpful to see the column types and `NA`s as a visualization. The `table_heat` function (also available as the `plot` method for `tbl_df` objects) provides a visual glimpse of data types and missing cells.

```{r, fig.height=7, fig.width=11}
set.seed(10)

r_data_frame(n=100,
id,
dob,
animal,
grade, grade,
death,
dummy,
grade_letter,
gender,
paragraph,
sentence
) %>%
r_na() %>%
plot(palette = "Set1")
```
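
As a non-graphical companion to the plot, the same two pieces of information (column type and share of missing cells) can be tabulated with base R. The sketch below runs on a small hand-made data frame rather than on **wakefield** output, and `column_summary` is a hypothetical helper introduced only for illustration.

```r
# Hypothetical helper: report each column's class and proportion of NA cells.
column_summary <- function(df) {
  data.frame(
    column = names(df),
    class = vapply(df, function(col) class(col)[1], character(1)),
    prop_na = vapply(df, function(col) mean(is.na(col)), numeric(1)),
    row.names = NULL,
    stringsAsFactors = FALSE
  )
}

# Toy data standing in for a generated data set with missing values.
toy <- data.frame(
  id = c("1", "2", "3", "4"),
  grade = c(88.5, NA, 91.2, NA),
  died = c(TRUE, FALSE, NA, TRUE),
  stringsAsFactors = FALSE
)

column_summary(toy)
```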