Generate random data sets
https://github.com/trinker/wakefield


---
title: "wakefield"
date: "`r format(Sys.time(), '%d %B, %Y')`"
output:
  md_document:
    toc: true
    toc_depth: 4
---

```{r, echo=FALSE}
desc <- suppressWarnings(readLines("DESCRIPTION"))
regex <- "(^Version:\\s+)(\\d+\\.\\d+\\.\\d+)"
loc <- grep(regex, desc)
ver <- gsub(regex, "\\2", desc[loc])
library(pacman)
# verbadge <- sprintf('Version', ver, ver)
verbadge <- ''
p_load(dplyr, wakefield, knitr, tidyr, ggplot2)
```

```{r, echo=FALSE}
knit_hooks$set(htmlcap = function(before, options, envir) {
    if (!before) {
        paste("\n\n", options$htmlcap, "\n\n", sep = "")
    }
})
knitr::opts_knit$set(self.contained = TRUE, cache = FALSE)
knitr::opts_chunk$set(fig.path = "tools/figure/")
```

[![Project Status: Active - The project has reached a stable, usable state and is being actively developed.](https://www.repostatus.org/badges/0.1.0/active.svg)](https://www.repostatus.org/#active)
[![Build Status](https://travis-ci.org/trinker/wakefield.svg?branch=master)](https://travis-ci.org/trinker/wakefield)
[![Coverage Status](https://s3.amazonaws.com/assets.coveralls.io/badges/coveralls_0.svg)](https://coveralls.io/github/trinker/wakefield)
[![DOI](https://zenodo.org/badge/5398/trinker/wakefield.svg)](https://dx.doi.org/10.5281/zenodo.17172)
[![](https://cranlogs.r-pkg.org/badges/wakefield)](https://cran.r-project.org/package=wakefield)
`r verbadge`

**wakefield** is designed to quickly generate random data sets. The user passes `n` (number of rows) and predefined vectors to the `r_data_frame` function to produce a `dplyr::tbl_df` object.

![](tools/wakefield_logo/r_wakefield.png)

# Installation

To install the development version of **wakefield**, download the [zip ball](https://github.com/trinker/wakefield/zipball/master) or [tar ball](https://github.com/trinker/wakefield/tarball/master), decompress it, and run `R CMD INSTALL` on it. Alternatively, use the **pacman** package to install the development version:

```r
if (!require("pacman")) install.packages("pacman")
pacman::p_load_gh("trinker/wakefield")
pacman::p_load(dplyr, tidyr, ggplot2)
```

# Contact

You are welcome to:
* submit suggestions and bug-reports at:
* send a pull request on:
* compose a friendly e-mail to:

# Demonstration
## Getting Started

The `r_data_frame` function (random data frame) takes `n` (the number of rows) and any number of variables (columns). These columns are typically produced from a **wakefield** variable function. Each variable function has a pre-set behavior that produces a named vector of length `n`, allowing the user to lazily pass unnamed functions (optionally, without call parentheses). The column name is stored as a `varname` attribute. For example, here is the `race` variable function:

```{r}
race(n=10)
attributes(race(n=10))
```

When this variable is used inside of `r_data_frame` the `varname` is used as a column name. Additionally, the `n` argument is not set within variable functions but is set once in `r_data_frame`:

```{r}
r_data_frame(
    n = 500,
    race
)
```

The power of `r_data_frame` is apparent when we use many modular variable functions:

```{r}
r_data_frame(
    n = 500,
    id,
    race,
    age,
    sex,
    hour,
    iq,
    height,
    died
)
```

There are `r length(variables())` **wakefield** based variable functions to choose from, spanning **R**'s various data types (see `?variables` for details).

```{r, results='asis', echo=FALSE, comment=NA, warning=FALSE, htmlcap="Available Variable Functions"}
p_load(pander, xtable)

variables("matrix", ncol=5) %>%
    xtable() %>%
    print(type = 'html', include.colnames = FALSE, include.rownames = FALSE,
        html.table.attributes = '')

#matrix(c(sprintf("`%s`", vect), blanks), ncol=4) %>%
# pandoc.table(format = "markdown", caption = "Available variable functions.")
```

However, the user may also pass their own vector-producing functions or vectors to `r_data_frame`. For functions that have an `n` argument, `r_data_frame` sets `n` automatically:

```{r}
r_data_frame(
    n = 500,
    id,
    Scoring = rnorm,
    Smoker = valid,
    race,
    age,
    sex,
    hour,
    iq,
    height,
    died
)
```
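
As a further sketch of the same mechanism, any generator that exposes an `n` argument should behave like `rnorm` above. The column names `Weight` and `Relapsed` are made up for illustration, and the example assumes **wakefield** is installed:

```r
library(wakefield)

set.seed(10)

# `runif` and `rbinom` both take an `n` argument, so `r_data_frame` can
# supply it; the remaining arguments are set in the call itself.
dat <- r_data_frame(
    n = 20,
    id,
    Weight = runif(min = 50, max = 90),
    Relapsed = rbinom(size = 1, prob = 0.3)
)

dat
```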

```{r}
r_data_frame(
    n = 500,
    id,
    age, age, age,
    grade, grade, grade
)
```

While passing variable functions to `r_data_frame` without call parentheses is handy, the user may wish to set arguments. This can be done through call parentheses, as we do with `data.frame` or `dplyr::data_frame`:

```{r}
r_data_frame(
    n = 500,
    id,
    Scoring = rnorm,
    Smoker = valid,
    `Reading(mins)` = rpois(lambda=20),
    race,
    age(x = 8:14),
    sex,
    hour,
    iq,
    height(mean=50, sd = 10),
    died
)
```

## Random Missing Observations

Data often contain missing values. **wakefield** allows the user to add a proportion of missing values per column/vector via the `r_na` (random `NA`) function. This works nicely within a **dplyr**/**magrittr** `%>%` *then* pipeline:

```{r}
r_data_frame(
    n = 30,
    id,
    race,
    age,
    sex,
    hour,
    iq,
    height,
    died,
    Scoring = rnorm,
    Smoker = valid
) %>%
    r_na(prob=.4)
```
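
One way to check the effect of `prob` is to measure the share of `NA` cells in each column afterwards. This is a sketch that assumes **wakefield** is installed; the object names `complete` and `with_na` are illustrative:

```r
library(wakefield)

set.seed(10)

complete <- r_data_frame(n = 200, id, age, sex, iq)
with_na  <- r_na(complete, prob = 0.4)

# Share of missing values per column
round(colMeans(is.na(with_na)), 2)
```

Columns that `r_na` touched should show a share close to the requested `prob`.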

## Repeated Measures & Time Series

The `r_series` function allows the user to pass a single **wakefield** function and dictate how many columns (`j`) to produce.

```{r}
set.seed(10)

r_series(likert, j = 3, n=10)
```

Often the user wants a numeric score for Likert-type columns and similar variables. For series with multiple factors, the `as_integer` function converts all columns to integer values. Additionally, we may want to specify column name prefixes. This can be accomplished via the variable function's `name` argument. Both of these features are demonstrated here.

```{r}
set.seed(10)

as_integer(r_series(likert, j = 5, n=10, name = "Item"))
```
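
A quick way to confirm the conversion is to compare column types before and after calling `as_integer`. This is a minimal sketch, assuming **wakefield** is installed; `items` and `scores` are illustrative names:

```r
library(wakefield)

set.seed(10)

items  <- r_series(likert, j = 3, n = 10, name = "Item")
scores <- as_integer(items)

sapply(items, is.factor)    # are the columns factors before conversion?
sapply(scores, is.numeric)  # are the columns numeric afterwards?
```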

`r_series` can be used within `r_data_frame` as well.

```{r}
set.seed(10)

r_data_frame(n=100,
    id,
    age,
    sex,
    r_series(likert, 3, name = "Question")
)
```

```{r}
set.seed(10)

r_data_frame(n=100,
    id,
    age,
    sex,
    r_series(likert, 5, name = "Item", integer = TRUE)
)
```

### Related Series

The user can also create related series via the `relate` argument in `r_series`, which specifies the relationship between columns. `relate` may be a named list of `c("operation", "mean", "sd")` or a shorthand string of the form `"fM_sd"` where:

- `f` is one of (+, -, *, /)
- `M` is a mean value
- `sd` is a standard deviation of the mean value

For example, you may use `relate = "*4_1"`. If `relate = NULL`, no relationship is generated between columns. I will use the shorthand string form here.
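
To make the shorthand concrete, the sketch below splits a `"fM_sd"` string into its three parts using base R only. `parse_relate` is a hypothetical helper written for this illustration; it is not part of **wakefield**, and the package may parse the string differently:

```r
# Hypothetical helper (not part of wakefield): split "fM_sd" into its parts
parse_relate <- function(relate) {
    pattern <- "^([-+*/])([0-9.]+)_([0-9.]+)$"
    m <- regmatches(relate, regexec(pattern, relate))[[1]]
    if (length(m) == 0) stop("expected a string of the form 'fM_sd'")
    list(operation = m[2], mean = as.numeric(m[3]), sd = as.numeric(m[4]))
}

str(parse_relate("*4_1"))    # multiply, mean 4, sd 1
str(parse_relate("-.5_.1"))  # subtract, mean 0.5, sd 0.1
```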

#### Some Examples With Variation

```{r}
r_series(grade, j = 5, n = 100, relate = "+1_6")
r_series(age, 5, 100, relate = "+5_0")
r_series(likert, 5, 100, name ="Item", relate = "-.5_.1")
r_series(grade, j = 5, n = 100, relate = "*1.05_.1")
```

#### Adjust Correlations

Use the `sd` component of `relate` to adjust the correlations between columns.

```{r}
round(cor(r_series(grade, 8, 10, relate = "+1_2")), 2)
round(cor(r_series(grade, 8, 10, relate = "+1_0")), 2)
round(cor(r_series(grade, 8, 10, relate = "+1_20")), 2)
round(cor(r_series(grade, 8, 10, relate = "+15_20")), 2)
```
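
The link between `sd` and correlation can be illustrated in base R. The sketch below assumes that each column is the previous column plus a normal draw with the given mean and standard deviation. `related_cols` is a hypothetical stand-in written for this illustration, and **wakefield**'s own implementation may differ in its details:

```r
# Hypothetical stand-in for `relate = "+M_sd"`: each column is the previous
# one plus a normal draw with the given mean and sd.
related_cols <- function(n, j, mean, sd, start = rnorm(n, 80, 10)) {
    cols <- Reduce(
        function(prev, i) prev + rnorm(n, mean, sd),
        seq_len(j - 1),
        accumulate = TRUE,
        init = start
    )
    out <- do.call(cbind, cols)
    colnames(out) <- paste0("V", seq_len(j))
    out
}

set.seed(10)
round(cor(related_cols(1000, 4, mean = 1, sd = 0)), 2)   # no noise
round(cor(related_cols(1000, 4, mean = 1, sd = 2)), 2)   # little noise
round(cor(related_cols(1000, 4, mean = 1, sd = 20)), 2)  # heavy noise
```

With `sd = 0` every column is an exact shift of the first, so all correlations equal 1; as `sd` grows, the added noise pulls the correlations down.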

#### Visualize the Relationship

```{r, fig.height=7, fig.width=11}
dat <- r_data_frame(12,
    name,
    r_series(grade, 100, relate = "+1_6")
)

dat %>%
    gather(Time, Grade, -c(Name)) %>%
    mutate(Time = as.numeric(gsub("\\D", "", Time))) %>%
    ggplot(aes(x = Time, y = Grade, color = Name, group = Name)) +
        geom_line(size=.8) +
        theme_bw()
```

## Expanded Dummy Coding

The user may wish to expand a `factor` into `j` dummy coded columns. The `r_dummy` function expands a factor into `j` columns and works similarly to the `r_series` function. The user may wish to use the original factor name as the prefix to the `j` columns. Setting `prefix = TRUE` within `r_dummy` accomplishes this.

```{r}
set.seed(10)
r_data_frame(n=100,
    id,
    age,
    r_dummy(sex, prefix = TRUE),
    r_dummy(political)
)
```
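
For readers unfamiliar with dummy coding, the idea can be shown in base R on a tiny factor. This is a conceptual illustration only, not how `r_dummy` is implemented, and the `_` separator used for the prefix is an assumption:

```r
# Base-R illustration only; not how wakefield implements `r_dummy`.
sex <- factor(c("Male", "Female", "Female", "Male"))

# One 0/1 column per factor level
dummies <- sapply(levels(sex), function(lv) as.integer(sex == lv))

# Prefix the column names with the factor name (separator is an assumption)
colnames(dummies) <- paste("Sex", colnames(dummies), sep = "_")

dummies
```

Each row has exactly one `1`, marking the level that observation belongs to.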

## Visualizing Column Types

It is helpful to see the column types and `NA`s as a visualization. The `table_heat` function (also available as the `plot` method for `tbl_df` objects) provides a visual glimpse of data types and missing cells.

```{r, fig.height=7, fig.width=11}
set.seed(10)

r_data_frame(n=100,
    id,
    dob,
    animal,
    grade, grade,
    death,
    dummy,
    grade_letter,
    gender,
    paragraph,
    sentence
) %>%
    r_na() %>%
    plot(palette = "Set1")
```