Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

https://github.com/WinVector/cdata

Higher order fluid or coordinatized data transforms in R. Distributed under choice of GPL-2 or GPL-3 license.
https://github.com/WinVector/cdata

Last synced: about 1 month ago
JSON representation

Higher order fluid or coordinatized data transforms in R. Distributed under choice of GPL-2 or GPL-3 license.

Awesome Lists containing this project

README

        

---
output: github_document
---

[![CRAN_Status_Badge](https://www.r-pkg.org/badges/version/cdata)](https://cran.r-project.org/package=cdata)
[![status](https://tinyverse.netlify.com/badge/cdata)](https://CRAN.R-project.org/package=cdata)

```{r, echo = FALSE, warning=FALSE, message=FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = " # ",
fig.path = "tools/README-"
)
```

[`cdata`](https://CRAN.R-project.org/package=cdata) is a general data re-shaper that has the great virtue of adhering to Raymond's "Rule of Representation", and using Codd's "Guaranteed Access Rule".

> Fold knowledge into data, so program logic can be stupid and robust.
>
> [*The Art of Unix Programming*, Erick S. Raymond, Addison-Wesley, 2003](http://www.catb.org/esr/writings/taoup/html/ch01s06.html#id2878263)

> Rule 2: The guaranteed access rule.
>
> Each and every datum (atomic value) in a relational data base is guaranteed to be logically accessible by resorting to a combination of table name, primary key value and column name.
>
> [Edgar F. Codd](https://en.wikipedia.org/wiki/Codd%27s_12_rules)

The point being: it is much easier to reason about data than to try to reason about code, so using data to control your code is often a very good trade-off. `cdata` also has a `Python` implementation that it can inter-operate with in the [`data_algebra` package](https://github.com/WinVector/data_algebra) (example [here](https://github.com/WinVector/data_algebra/blob/master/Examples/cdata/cdata.ipynb)).

![](https://raw.githubusercontent.com/WinVector/cdata/master/tools/cdata.png)

Briefly: `cdata` supplies data transform operators that:

* Work on local data or with any `DBI` data source.
* Are powerful generalizations of the operations commonly called `pivot` and `un-pivot`.
* Allow for example-driven graphical specification of data transforms or data layout control.
* Work in-memory or with `SQL` databases.

A quick example: plot iris petal and sepal dimensions in a faceted graph.

```{r ex0}
iris <- data.frame(iris)
iris$iris_id <- seq_len(nrow(iris))

# show the data
head(iris)

library("ggplot2")
library("cdata")

#
# build a control table with a "key column" flower_part
# and "value columns" Length and Width
#
controlTable <- wrapr::qchar_frame(
"flower_part", "Length" , "Width" |
"Petal" , Petal.Length , Petal.Width |
"Sepal" , Sepal.Length , Sepal.Width )

transform <- rowrecs_to_blocks_spec(
controlTable,
recordKeys = c("iris_id", "Species"))

# do the unpivot to convert the row records to block records
iris_aug <- iris %.>% transform

# show the tranformed data
head(iris_aug)

# plot the graph
ggplot(iris_aug, aes(x=Length, y=Width)) +
geom_point(aes(color=Species, shape=Species)) +
facet_wrap(~flower_part, labeller = label_both, scale = "free") +
ggtitle("Iris dimensions") + scale_color_brewer(palette = "Dark2")

# show the transform
print(transform)

# show the representation of the transform
unclass(transform)
```

More details on the above example can be found [here](https://win-vector.com/2018/10/21/faceted-graphs-with-cdata-and-ggplot2/). A tutorial on how to design a `controlTable` can be found [here](https://winvector.github.io/cdata/articles/design.html). And some discussion of the nature of records in `cdata` can be found [here](https://winvector.github.io/cdata/articles/blocksrecs.html).

----

A more detailed video tutorial is available [here](https://github.com/WinVector/cdata/blob/master/Examples/OrderedGrouping/OrderedGrouping.md).

----

We can also exhibit a larger example of using `cdata` to create a scatter-plot matrix, or pair plot:

```{r ex0_1}

iris <- data.frame(iris)
iris$iris_id <- seq_len(nrow(iris))

library("ggplot2")
library("cdata")

# declare our columns of interest
meas_vars <- qc(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width)
category_variable <- "Species"

# build a control with all pairs of variables as value columns
# and pair_key as the key column
controlTable <- data.frame(expand.grid(meas_vars, meas_vars,
stringsAsFactors = FALSE))
# one copy of columns is coordinate names second copy is values
controlTable <- cbind(controlTable, controlTable)
# name the value columns value1 and value2
colnames(controlTable) <- qc(v1, v2, value1, value2)
transform <- rowrecs_to_blocks_spec(
controlTable,
recordKeys = c("iris_id", "Species"),
controlTableKeys = qc(v1, v2),
checkKeys = FALSE)

# do the unpivot to convert the row records to multiple block records
iris_aug <- iris %.>% transform
# alternate notation: layout_by(transform, iris)

ggplot(iris_aug, aes(x=value1, y=value2)) +
geom_point(aes_string(color=category_variable, shape=category_variable)) +
facet_grid(v2~v1, labeller = label_both, scale = "free") +
ggtitle("Iris dimensions") +
scale_color_brewer(palette = "Dark2") +
ylab(NULL) +
xlab(NULL)

# show transform
print(transform)
```

The above is now wrapped into a [one-line command in `WVPlots`](https://winvector.github.io/WVPlots/reference/PairPlot.html).

----

The `cdata` package develops the idea of the ["coordinatized data" theory](https://winvector.github.io/FluidData/RowsAndColumns.html) and includes an implementation of the ["fluid data" methodology](https://winvector.github.io/FluidData/FluidData.html).

The main `cdata` interfaces are given by the following set of methods:

* [`rowrecs_to_blocks_spec()`](https://winvector.github.io/cdata/reference/rowrecs_to_blocks_spec.html), for specifying how single row records map to general multi-row (or block) records.
* [`blocks_to_rowrecs_spec()`](https://winvector.github.io/cdata/reference/blocks_to_rowrecs_spec.html), for specifying how multi-row block records map to single-row records.
* [`layout_specification()`](https://winvector.github.io/cdata/reference/layout_specification.html), for specifying transforms from multi-row records to other multi-row records.
* [`layout_by()`](https://winvector.github.io/cdata/reference/layout_by.html) or the [wrapr dot arrow pipe](https://winvector.github.io/wrapr/reference/dot_arrow.html) for applying a layout to re-arrange data.
* `t()` (transpose/adjoint) to invert or reverse layout specifications.

Some convenience functions include:

* [`pivot_to_rowrecs()`](https://winvector.github.io/cdata/reference/pivot_to_rowrecs.html), for moving data from multi-row block records with one value per row (a single column of values) to single-row records `spread` or `dcast`.
* [`pivot_to_blocks()`/`unpivot_to_blocks()`](https://winvector.github.io/cdata/reference/unpivot_to_blocks.html), for moving data from single-row records to possibly multi row block records with one row per value (a single column of values) `gather` or `melt`.
* [`wrapr::qchar_frame()`](https://winvector.github.io/wrapr/reference/qchar_frame.html) a helper function for specifying record control table layout specifications.
* [`wrapr::build_frame()`](https://winvector.github.io/wrapr/reference/build_frame.html) a helper function for specifying data frames.

The package vignettes can be found in the "Articles" tab of [the `cdata` documentation site](https://winvector.github.io/cdata/).

The (older) recommended tutorial is: [Fluid data reshaping with cdata](https://winvector.github.io/FluidData/FluidDataReshapingWithCdata.html). We also have a (older) [short free cdata screencast](https://youtu.be/4cYbP3kbc0k) (and another example can be found [here](https://winvector.github.io/FluidData/DataWranglingAtScale.html)). These concepts were later adapted from `cdata` by the `tidyr` package.

----

Install via CRAN:

```{r, eval=FALSE}
install.packages("cdata")
```

----

Note: `cdata` is targeted at data with "tame column names" (column names that are valid both in databases, and as `R` unquoted variable names) and basic types (column values that are simple `R` types such as `character`, `numeric`, `logical`, and so on).