https://github.com/WinVector/cdata

Higher order fluid or coordinatized data transforms in R. Distributed under choice of GPL-2 or GPL-3 license.

Host: GitHub
URL: https://github.com/WinVector/cdata
Owner: WinVector
License: other
Created: 2017-03-29T06:04:37.000Z (over 9 years ago)
Default Branch: main
Last Pushed: 2024-09-26T15:47:00.000Z (almost 2 years ago)
Last Synced: 2025-06-14T04:19:36.187Z (about 1 year ago)
Language: R
Homepage: https://winvector.github.io/cdata/
Size: 8.69 MB
Stars: 45
Watchers: 7
Forks: 8
Open Issues: 0
Metadata Files:
- Readme: README.Rmd
- Changelog: NEWS.md
- License: LICENSE

README

---
output: github_document
---

[![CRAN_Status_Badge](https://www.r-pkg.org/badges/version/cdata)](https://cran.r-project.org/package=cdata)

[![status](https://tinyverse.netlify.com/badge/cdata)](https://CRAN.R-project.org/package=cdata)

```{r, echo = FALSE, warning=FALSE, message=FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = " # ",
  fig.path = "tools/README-"
)
```

[`cdata`](https://CRAN.R-project.org/package=cdata) is a general data re-shaper that has the great virtue of adhering to Raymond's "Rule of Representation", and using Codd's "Guaranteed Access Rule".

> Fold knowledge into data, so program logic can be stupid and robust.
>
> [*The Art of Unix Programming*, Eric S. Raymond, Addison-Wesley, 2003](http://www.catb.org/esr/writings/taoup/html/ch01s06.html#id2878263)

> Rule 2: The guaranteed access rule.
>
> Each and every datum (atomic value) in a relational data base is guaranteed to be logically accessible by resorting to a combination of table name, primary key value and column name.
>
> [Edgar F. Codd](https://en.wikipedia.org/wiki/Codd%27s_12_rules)

The point being: it is much easier to reason about data than to try to reason about code, so using data to control your code is often a very good trade-off.  `cdata` also has a `Python` implementation that it can inter-operate with in the [`data_algebra` package](https://github.com/WinVector/data_algebra) (example [here](https://github.com/WinVector/data_algebra/blob/master/Examples/cdata/cdata.ipynb)).

![](https://raw.githubusercontent.com/WinVector/cdata/master/tools/cdata.png)

Briefly: `cdata` supplies data transform operators that:

 * Work on local data or with any `DBI` data source.

 * Are powerful generalizations of the operations commonly called `pivot` and `un-pivot`.

 * Allow for example-driven graphical specification of data transforms or data layout control.

 * Work in-memory or with `SQL` databases.

A quick example: plot iris petal and sepal dimensions in a faceted graph.

```{r ex0}
iris <- data.frame(iris)
iris$iris_id <- seq_len(nrow(iris))

# show the data
head(iris)

library("ggplot2")
library("cdata")

#
# build a control table with a "key column" flower_part
# and "value columns" Length and Width
#
controlTable <- wrapr::qchar_frame(
  "flower_part", "Length"     , "Width"     |
    "Petal"    , Petal.Length , Petal.Width |
    "Sepal"    , Sepal.Length , Sepal.Width )

transform <- rowrecs_to_blocks_spec(
  controlTable,
  recordKeys = c("iris_id", "Species"))

# do the unpivot to convert the row records to block records
iris_aug <- iris %.>% transform

# show the transformed data
head(iris_aug)

# plot the graph
ggplot(iris_aug, aes(x=Length, y=Width)) +
  geom_point(aes(color=Species, shape=Species)) + 
  facet_wrap(~flower_part, labeller = label_both, scales = "free") +
  ggtitle("Iris dimensions") +
  scale_color_brewer(palette = "Dark2")

# show the transform
print(transform)

# show the representation of the transform
unclass(transform)
```

More details on the above example can be found [here](https://win-vector.com/2018/10/21/faceted-graphs-with-cdata-and-ggplot2/). A tutorial on how to design a `controlTable` can be found [here](https://winvector.github.io/cdata/articles/design.html).  And some discussion of the nature of records in `cdata` can be found [here](https://winvector.github.io/cdata/articles/blocksrecs.html).

----

A more detailed video tutorial is available [here](https://github.com/WinVector/cdata/blob/master/Examples/OrderedGrouping/OrderedGrouping.md).

----

We can also exhibit a larger example of using `cdata` to create a scatter-plot matrix, or pair plot:

```{r ex0_1}
iris <- data.frame(iris)
iris$iris_id <- seq_len(nrow(iris))

library("ggplot2")
library("cdata")

# declare our columns of interest
meas_vars <- qc(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width)
category_variable <- "Species"

# build a control table with all pairs of variables as value columns
# and the pair (v1, v2) as the key columns
controlTable <- data.frame(expand.grid(meas_vars, meas_vars, 
                                       stringsAsFactors = FALSE))

# one copy of the columns holds the coordinate names, the second copy the values
controlTable <- cbind(controlTable, controlTable)

# name the key columns v1 and v2, and the value columns value1 and value2
colnames(controlTable) <- qc(v1, v2, value1, value2)

transform <- rowrecs_to_blocks_spec(
  controlTable,
  recordKeys = c("iris_id", "Species"),
  controlTableKeys = qc(v1, v2),
  checkKeys = FALSE)

# do the unpivot to convert the row records to multiple block records
iris_aug <- iris %.>% transform
# alternate notation: layout_by(transform, iris)

ggplot(iris_aug, aes(x=value1, y=value2)) +
  geom_point(aes_string(color=category_variable, shape=category_variable)) + 
  facet_grid(v2~v1, labeller = label_both, scales = "free") +
  ggtitle("Iris dimensions") +
  scale_color_brewer(palette = "Dark2") +
  ylab(NULL) + 
  xlab(NULL)

# show transform
print(transform)
```

The above is now wrapped into a [one-line command in `WVPlots`](https://winvector.github.io/WVPlots/reference/PairPlot.html).

----

The `cdata` package develops the idea of the ["coordinatized data" theory](https://winvector.github.io/FluidData/RowsAndColumns.html) and includes an implementation of the ["fluid data" methodology](https://winvector.github.io/FluidData/FluidData.html).   

The main `cdata` interfaces are given by the following set of methods:

  * [`rowrecs_to_blocks_spec()`](https://winvector.github.io/cdata/reference/rowrecs_to_blocks_spec.html), for specifying how single row records map to general multi-row (or block) records.

  * [`blocks_to_rowrecs_spec()`](https://winvector.github.io/cdata/reference/blocks_to_rowrecs_spec.html), for specifying how multi-row block records map to single-row records.

  * [`layout_specification()`](https://winvector.github.io/cdata/reference/layout_specification.html), for specifying transforms from multi-row records to other multi-row records.

  * [`layout_by()`](https://winvector.github.io/cdata/reference/layout_by.html) or the [wrapr dot arrow pipe](https://winvector.github.io/wrapr/reference/dot_arrow.html) for applying a layout to re-arrange data.

  * `t()` (transpose/adjoint) to invert or reverse layout specifications.
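The interfaces listed above can be combined as in the following minimal sketch. The data frame `d` and the column names `measure` and `value` are made up for illustration; only `rowrecs_to_blocks_spec()`, `layout_by()`, `t()`, and `wrapr::qchar_frame()` come from the text above.

```r
library("cdata")

# hypothetical example data: one row record per id
d <- data.frame(
  id  = c(1, 2),
  AUC = c(0.7, 0.8),
  R2  = c(0.4, 0.5))

# control table: key column "measure", value column "value"
controlTable <- wrapr::qchar_frame(
  "measure", "value" |
  "AUC"    , AUC     |
  "R2"     , R2      )

# specify how each row record maps to a two-row block record
transform <- rowrecs_to_blocks_spec(
  controlTable,
  recordKeys = "id")

# row records -> block records (one row per id/measure pair)
blocks <- layout_by(transform, d)

# t() reverses the specification: block records -> row records
inverse <- t(transform)
back <- layout_by(inverse, blocks)

print(blocks)
print(back)
```

Because the layout is held in the `controlTable` data frame, the same specification drives both directions, so no separate inverse has to be written by hand.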

Some convenience functions include:

  * [`pivot_to_rowrecs()`](https://winvector.github.io/cdata/reference/pivot_to_rowrecs.html), for moving data from multi-row block records with one value per row (a single column of values) to single-row records (comparable to `spread` or `dcast`).

  * [`pivot_to_blocks()`/`unpivot_to_blocks()`](https://winvector.github.io/cdata/reference/unpivot_to_blocks.html), for moving data from single-row records to possibly multi-row block records with one row per value (a single column of values) (comparable to `gather` or `melt`).

  * [`wrapr::qchar_frame()`](https://winvector.github.io/wrapr/reference/qchar_frame.html), a helper function for specifying control tables.

  * [`wrapr::build_frame()`](https://winvector.github.io/wrapr/reference/build_frame.html), a helper function for specifying data frames.
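As a rough illustration of the two pivot helpers above, the sketch below round-trips a small table. The data frame and the column names `model_id`, `meas`, and `val` are hypothetical; the argument names follow the package reference pages as best understood here and should be checked against the installed version.

```r
library("cdata")

# hypothetical wide data: one row per model
d <- data.frame(
  model_id = c("m1", "m2"),
  AUC = c(0.6, 0.7),
  R2  = c(0.2, 0.3),
  stringsAsFactors = FALSE)

# wide -> long: one row per (model_id, measurement) pair
long <- unpivot_to_blocks(
  d,
  nameForNewKeyColumn = "meas",
  nameForNewValueColumn = "val",
  columnsToTakeFrom = c("AUC", "R2"))

# long -> wide: back to one row per model_id
wide <- pivot_to_rowrecs(
  long,
  columnToTakeKeysFrom = "meas",
  columnToTakeValuesFrom = "val",
  rowKeyColumns = "model_id")

print(long)
print(wide)
```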

  

The package vignettes can be found in the "Articles" tab of [the `cdata` documentation site](https://winvector.github.io/cdata/).

The (older) recommended tutorial is: [Fluid data reshaping with cdata](https://winvector.github.io/FluidData/FluidDataReshapingWithCdata.html). We also have an (older) [short free cdata screencast](https://youtu.be/4cYbP3kbc0k) (and another example can be found [here](https://winvector.github.io/FluidData/DataWranglingAtScale.html)).  These concepts were later adapted from `cdata` by the `tidyr` package.

----

Install via CRAN:

```{r, eval=FALSE}
install.packages("cdata")
```

----

Note: `cdata` is targeted at data with "tame column names" (column names that are valid both in databases, and as `R` unquoted variable names) and basic types (column values that are simple `R` types such as `character`, `numeric`, `logical`, and so on).
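One way to screen column names before reshaping is sketched below in base `R`. This is an assumption about what a useful pre-check might look like, not a check that `cdata` itself performs, and it covers only the `R` side (whether a name is a syntactically valid `R` name); database naming rules vary by engine and are not tested here. The example names are made up.

```r
# column names to check (illustrative values)
nms <- c("iris_id", "Sepal.Length", "bad name", "2x")

# a name is syntactically valid in R if make.names() leaves it unchanged
is_valid_r_name <- nms == make.names(nms)

print(data.frame(name = nms, valid_r_name = is_valid_r_name))
```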
