https://github.com/echasnovski/ruler

Tidy Data Validation Reports
Host: GitHub
URL: https://github.com/echasnovski/ruler
Owner: echasnovski
License: other
Created: 2017-06-18T14:16:14.000Z (almost 8 years ago)
Default Branch: master
Last Pushed: 2023-03-30T07:20:29.000Z (about 2 years ago)
Last Synced: 2025-03-27T20:40:40.510Z (3 months ago)
Language: R
Homepage: https://echasnovski.github.io/ruler/
Size: 499 KB
Stars: 31
Watchers: 5
Forks: 2
Open Issues: 0
Metadata Files:
- Readme: README.Rmd
- Changelog: NEWS.md
- License: LICENSE
Awesome Lists containing this project

jimsghstars - echasnovski/ruler - Tidy Data Validation Reports (R)
README

---

output: github_document

---

```{r setup, include = FALSE}

knitr::opts_chunk$set(

  collapse = TRUE,

  comment = "#>",

  fig.path = "README-"

)

library(ruler, quietly = TRUE, warn.conflicts = FALSE)

library(dplyr, quietly = TRUE, warn.conflicts = FALSE)

options(tibble.print_min = 6, tibble.print_max = 6)

```

# ruler: Rule Your Data

[![Travis-CI Build Status](https://travis-ci.org/echasnovski/ruler.svg?branch=master)](https://travis-ci.org/echasnovski/ruler)

[![R-CMD-check](https://github.com/echasnovski/ruler/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/echasnovski/ruler/actions/workflows/R-CMD-check.yaml)

[![Coverage Status](https://codecov.io/gh/echasnovski/ruler/graph/badge.svg)](https://app.codecov.io/github/echasnovski/ruler?branch=master)

[![CRAN](https://www.r-pkg.org/badges/version/ruler?color=blue)](https://cran.r-project.org/package=ruler)

[![Dependencies](https://tinyverse.netlify.com/badge/ruler)](https://CRAN.R-project.org/package=ruler)

[![Downloads](http://cranlogs.r-pkg.org/badges/ruler)](https://cran.r-project.org/package=ruler)

`ruler` offers a set of tools for creating tidy data validation reports using the [dplyr](https://dplyr.tidyverse.org) grammar of data manipulation. It is structured to be flexible and extendable in terms of creating rules and using their output.

To fully use this package a solid knowledge of `dplyr` is required. The key idea behind `ruler`'s design is to validate data by modifying regular `dplyr` code with as little overhead as possible.

Some functionality is powered by the [keyholder](https://echasnovski.github.io/keyholder/) package. It is highly recommended to use its supported functions during rule construction. All one- and two-table `dplyr` verbs applied to local data frames are supported and considered the most appropriate way to create rules.

This README is structured as follows:

- __Installation__ shows ways to install the package.

- __Example__ shows the basic usage of `ruler`: exploring whether data obeys user-defined rules and validating it automatically.

- __Overview__ explains the basic data and function types and the design behind them.

- __Usage__ describes `ruler`'s capabilities in more detail.

- __Other packages for validation and assertions__ lists alternatives for described tasks.

## Installation

You can install the current stable version from CRAN with:

```{r cran-installation, eval = FALSE}

install.packages("ruler")

```

You can also install the development version from GitHub with:

```{r gh-installation, eval = FALSE}

# install.packages("devtools")

devtools::install_github("echasnovski/ruler")

```

## Example

```{r Example, error = TRUE, purl = FALSE}

# Utility functions

is_integerish <- function(x) {

  all(x == as.integer(x))

}

z_score <- function(x) {

  abs(x - mean(x)) / sd(x)

}

# Define rule packs

my_packs <- list(

  data_packs(

    dims = . %>% summarise(nrow_low = nrow(.) >= 10, nrow_high = nrow(.) <= 15,

      ncol_low = ncol(.) >= 20, ncol_high = ncol(.) <= 30)

  ),

  group_packs(

    vs_am_num = . %>% group_by(vs, am) %>% summarise(vs_am_low = n() >= 7),

    .group_vars = c("vs", "am")

  ),

  col_packs(

    enough_col_sum = . %>%

      summarise_if(is_integerish, rules(is_enough = sum(.) >= 14))

  ),

  row_packs(

    enough_row_sum = . %>%

      filter(vs == 1) %>%

      transmute(is_enough = rowSums(.) >= 200)

  ),

  cell_packs(

    dbl_not_outlier = . %>%

      transmute_if(is.numeric, rules(is_not_out = z_score(.) < 1)) %>%

      slice(-(1:5))

  )

)

# Expose data to rules

mtcars_exposed <- mtcars %>% as_tibble() %>%

  expose(my_packs)

# View exposure

mtcars_exposed %>% get_exposure()

# Assert any breaker

invisible(mtcars_exposed %>% assert_any_breaker())

```

## Overview

__Rule__ is a function which converts a data unit of interest (data, group, column, row, cell) to a logical value indicating whether this unit satisfies a certain condition.

__Rule pack__ is a function which combines several rules into one functional block. The recommended way of creating rules is to create packs right away using `dplyr` and [magrittr](https://magrittr.tidyverse.org/)'s pipe operator.
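
To make the two definitions concrete, here is a minimal sketch that needs only `dplyr`. The names `has_enough_rows` and `dims_pack` are hypothetical and chosen for illustration; the pack is called directly on the data here, which is not how `ruler` applies it during exposure.

```r
library(dplyr, quietly = TRUE, warn.conflicts = FALSE)

# A rule: maps a data unit (here the whole data frame) to one logical value
has_enough_rows <- function(tbl) nrow(tbl) >= 10

# A rule pack: a functional sequence built with the pipe; each named
# summary column plays the role of one rule
dims_pack <- . %>% summarise(
  nrow_low = nrow(.) >= 10,
  ncol_high = ncol(.) <= 30
)

rule_result <- has_enough_rows(mtcars)
pack_result <- dims_pack(mtcars)
```

Because a pack is an ordinary function of a data frame, it can be tried out on the data before being wrapped in `data_packs()` or a similar constructor.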

__Exposing__ data to rules means applying rules to data, collecting the results in a common format and attaching them to the data as an `exposure` attribute. This way the actual exposure can be done in multiple steps and can also be part of a general data preparation pipeline.
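
A minimal sketch of exposing in several steps, assuming that a second `expose()` call adds to the exposure already attached to the data. The pack names `dims` and `vs_1` are illustrative.

```r
library(ruler, quietly = TRUE, warn.conflicts = FALSE)
library(dplyr, quietly = TRUE, warn.conflicts = FALSE)

# Two small sets of data packs, applied at different points of a pipeline
dims_packs <- data_packs(
  dims = . %>% summarise(nrow_low = nrow(.) >= 10)
)
vs_packs <- data_packs(
  vs_1 = . %>% filter(vs == 1) %>% summarise(nrow_high = nrow(.) < 30)
)

exposed_twice <- mtcars %>%
  expose(dims_packs) %>%
  expose(vs_packs)

# Packs info should now describe both packs
packs_info <- get_packs_info(exposed_twice)
```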

__Exposure__ is a format designed to contain uniform information about validation of different data units. For reproducibility it also saves information about the applied packs. Basically, an exposure is a list with two elements:

1. __Packs info__: a [tibble](https://tibble.tidyverse.org/) with the following structure:

    - _name_: Name of the pack. If not set manually it will be imputed during exposure.

    - _type_: Name of pack type. Indicates which data unit the pack checks.

    - _fun_: List of rule pack functions.

    - _remove_obeyers_: Whether rows about obeyers (data units that obey a certain rule) were removed from the report after applying the pack.

2. __Tidy data validation report__: a `tibble` with the following structure:

    - _pack_: Name of the rule pack from column 'name' in packs info.

    - _rule_: Name of the rule defined in the rule pack.

    - _var_: Name of the variable whose validation result is reported. Value '.all' is reserved and interpreted as 'all columns as a whole'. __Note__ that _var_ doesn't always represent an actual column in the data frame: for group packs it represents the created group name.

    - _id_: Index of the row in the tested data frame whose validation result is reported. Value 0 is reserved and interpreted as 'all rows as a whole'.

    - _value_: Whether the described data unit obeys the rule.

    

There are four basic combinations of `var` and `id` values which define five basic data units:

- `var == '.all'` and `id == 0`: Data as a whole.

- `var != '.all'` and `id == 0`: Group (`var` shouldn't be an actual column name) or column (`var` should be an actual column name) as a whole.

- `var == '.all'` and `id != 0`: Row as a whole.

- `var != '.all'` and `id != 0`: Described cell.
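
The combinations above can be sketched as a small classifier. This uses only `dplyr` and a hand-made table with illustrative `var` and `id` values (not the output of `expose()`); `classify_unit` is a hypothetical helper, not part of `ruler`.

```r
library(dplyr, quietly = TRUE, warn.conflicts = FALSE)

# Hand-made rows mimicking the 'var' and 'id' columns of a report
toy_report <- tibble(
  var = c(".all", "0.1", "cyl", ".all", "mpg"),
  id = c(0L, 0L, 0L, 3L, 7L)
)

# Label each row with the data unit it describes;
# col_names are the column names of the validated data frame
classify_unit <- function(report, col_names) {
  report %>%
    mutate(unit = case_when(
      var == ".all" & id == 0 ~ "data",
      var != ".all" & id == 0 & var %in% col_names ~ "column",
      var != ".all" & id == 0 ~ "group",
      var == ".all" & id != 0 ~ "row",
      TRUE ~ "cell"
    ))
}

classified <- classify_unit(toy_report, names(mtcars))
```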

With exposure attached to data one can perform different kinds of actions: exploration, assertion, imputation and so on.
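
As a hedged illustration of exploration, the sketch below extracts the report with `get_report()` and counts breakers per pack and rule with regular `dplyr` verbs. The pack and rule names are made up for this example.

```r
library(ruler, quietly = TRUE, warn.conflicts = FALSE)
library(dplyr, quietly = TRUE, warn.conflicts = FALSE)

exposed <- mtcars %>%
  expose(
    data_packs(
      dims = . %>% summarise(
        nrow_is_10 = nrow(.) == 10,
        ncol_is_11 = ncol(.) == 11
      )
    )
  )

# The report is a regular tibble, so dplyr verbs apply directly
report <- get_report(exposed)

breakers_per_rule <- report %>%
  filter(!value) %>%
  count(pack, rule)
```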

## Usage

### Creating packs

#### Data packs

```{r Data pack}

# List of two rule packs for checking data properties

my_data_packs <- data_packs(

  # data_dims is a pack name

  data_dims = . %>% summarise(

    # ncol and nrow are rule names

    ncol = ncol(.) == 12,

    nrow = nrow(.) == 32

  ),

  # Data after subsetting should have number of rows in between 10 and 30

  # Rules are applied separately

  vs_1 = . %>% filter(vs == 1) %>%

    summarise(

      nrow_low = nrow(.) > 10,

      nrow_high = nrow(.) < 30

    )

)

```

#### Group packs

```{r Group pack}

# List of one nameless rule pack for checking group property

my_group_packs <- group_packs(

  # Name will be imputed during exposure

  . %>% group_by(vs, am) %>%

    summarise(any_cyl_6 = any(cyl == 6)),

  # One should supply grouping variables for correct interpretation of output

  .group_vars = c("vs", "am")

)

```

#### Column packs

```{r Column pack}

# rules() defines predicate functions with necessary name imputations

# List of two rule packs for checking certain columns' properties

my_col_packs <- col_packs(

  sum_bounds = . %>% summarise_at(

    # Check only columns with names starting with 'c'

    vars(starts_with("c")),

    rules(sum_low = sum(.) > 300, sum_high = sum(.) < 400)

  ),

  # In the edge case of checking one column with one rule there is a need

  # for forcing inclusion of names in the output of summarise_at().

  # This is done with naming argument in vars()

  vs_mean = . %>% summarise_at(vars(vs = vs), rules(mean(.) > 0.5))

)

```

#### Row packs

```{r Row packs}

z_score <- function(x) {

  (x - mean(x)) / sd(x)

}

# List of one rule pack checking certain rows' property

my_row_packs <- row_packs(

  row_mean = . %>% mutate(rowMean = rowMeans(.)) %>%

    transmute(is_common_row_mean = abs(z_score(rowMean)) < 1) %>%

    # Check only rows 10-15

    # Values in 'id' column of report will be based on input data (i.e. 10-15)

    # and not on output data (1-6)

    slice(10:15)

)

```

#### Cell packs

```{r Cell packs}

is_integerish <- function(x) {

  all(x == as.integer(x))

}

# List of two cell packs checking certain cells' properties

my_cell_packs <- cell_packs(

  my_cell_pack_1 = . %>% transmute_if(

    # Check only integer-like columns

    is_integerish,

    rules(is_common = abs(z_score(.)) < 1)

  ) %>%

    # Check only rows 20-30

    slice(20:30),

  # The same edge case as in column rule pack

  vs_side = . %>% transmute_at(vars(vs = "vs"), rules(. > mean(.)))

)

```

### Exposing

By default exposing removes obeyers.

```{r Expose removes obeyers by default}

mtcars %>%

  expose(my_data_packs, my_group_packs) %>%

  get_exposure()

```

One can leave obeyers by setting `.remove_obeyers` to `FALSE`.

```{r Expose can not remove obeyers}

mtcars %>%

  expose(my_data_packs, my_group_packs, .remove_obeyers = FALSE) %>%

  get_exposure()

```

By default `expose()` guesses the pack type if a function that is not already a pack is supplied. This behaviour has some edge cases but is useful for interactive use.

```{r Expose can guess}

mtcars %>%

  expose(

    some_data_pack = . %>% summarise(nrow = nrow(.) == 10),

    some_col_pack = . %>% summarise_at(vars(vs = "vs"), rules(is.character(.)))

  ) %>%

  get_exposure()

```

To write strict and robust code one can set `.guess` to `FALSE`.

```{r Expose can not guess, error = TRUE, purl = FALSE}

mtcars %>%

  expose(

    some_data_pack = . %>% summarise(nrow = nrow(.) == 10),

    some_col_pack = . %>% summarise_at(vars(vs = "vs"), rules(is.character(.))),

    .guess = FALSE

  ) %>%

  get_exposure()

```

### Acting after exposure

The recommended way to perform general actions is `act_after_exposure()`. Besides the data, it takes two arguments:

- `.trigger` - a function which takes the data with attached exposure and returns `TRUE` if some action should be made.

- `.actor` - a function which takes the same argument as `.trigger` and performs some action.

If `.trigger` returns `FALSE`, the input data is returned untouched. Otherwise the output of `.actor()` is returned. __Note__ that `act_after_exposure()` is often used for creating side effects (printing, throwing an error, etc.); in that case `.actor` should invisibly return its input so that the call can be used inside a pipe.

```{r Acting after exposure}

trigger_one_pack <- function(.tbl) {

  packs_number <- .tbl %>%

    get_packs_info() %>%

    nrow()

  packs_number > 1

}

actor_one_pack <- function(.tbl) {

  cat("More than one pack was applied.\n")

  invisible(.tbl)

}

mtcars %>%

  expose(my_col_packs, my_row_packs) %>%

  act_after_exposure(

    .trigger = trigger_one_pack,

    .actor = actor_one_pack

  ) %>%

  invisible()

```

`ruler` provides the function `assert_any_breaker()`, which notifies about the presence of any breaker in the exposure.

```{r Assert any breaker, error = TRUE, purl = FALSE}

mtcars %>%

  expose(my_col_packs, my_row_packs) %>%

  assert_any_breaker()

```
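
When a logical answer is preferable to an assertion, `any_breaker()` can be called directly. The sketch below is illustrative and assumes it returns `TRUE` when the attached exposure contains at least one breaker; the pack and rule names are hypothetical.

```r
library(ruler, quietly = TRUE, warn.conflicts = FALSE)
library(dplyr, quietly = TRUE, warn.conflicts = FALSE)

exposed <- mtcars %>%
  expose(
    data_packs(dims = . %>% summarise(nrow_is_10 = nrow(.) == 10))
  )

# mtcars has 32 rows, so the rule 'nrow_is_10' is broken
has_breaker <- any_breaker(exposed)

if (has_breaker) {
  message("Some rules are broken; inspect get_report() for details.")
}
```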

## Other packages for validation and assertions

Leaning more towards assertions:

- [assertr](https://github.com/ropensci/assertr)

- [assertthat](https://github.com/hadley/assertthat)

- [checkmate](https://github.com/mllg/checkmate)

- [ensurer](https://github.com/smbache/ensurer)

- [tester](https://github.com/gastonstat/tester)

- [sealr](https://github.com/uribo/sealr)

Leaning more towards validation:

- [naniar](https://github.com/njtierney/naniar)

- [skimr](https://github.com/ropensci/skimr)

- [validate](https://github.com/data-cleaning/validate)