https://github.com/echasnovski/ruler
Tidy Data Validation Reports
- Host: GitHub
- URL: https://github.com/echasnovski/ruler
- Owner: echasnovski
- License: other
- Created: 2017-06-18T14:16:14.000Z (over 7 years ago)
- Default Branch: master
- Last Pushed: 2023-03-30T07:20:29.000Z (over 1 year ago)
- Last Synced: 2024-08-13T07:15:08.751Z (4 months ago)
- Language: R
- Homepage: https://echasnovski.github.io/ruler/
- Size: 499 KB
- Stars: 31
- Watchers: 5
- Forks: 2
- Open Issues: 0
Metadata Files:
- Readme: README.Rmd
- Changelog: NEWS.md
- License: LICENSE
Awesome Lists containing this project
- jimsghstars - echasnovski/ruler - Tidy Data Validation Reports (R)
README
---
output: github_document
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "README-"
)
library(ruler, quietly = TRUE, warn.conflicts = FALSE)
library(dplyr, quietly = TRUE, warn.conflicts = FALSE)

options(tibble.print_min = 6, tibble.print_max = 6)
```

# ruler: Rule Your Data
[![Travis-CI Build Status](https://travis-ci.org/echasnovski/ruler.svg?branch=master)](https://travis-ci.org/echasnovski/ruler)
[![R-CMD-check](https://github.com/echasnovski/ruler/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/echasnovski/ruler/actions/workflows/R-CMD-check.yaml)
[![Coverage Status](https://codecov.io/gh/echasnovski/ruler/graph/badge.svg)](https://app.codecov.io/github/echasnovski/ruler?branch=master)
[![CRAN](https://www.r-pkg.org/badges/version/ruler?color=blue)](https://cran.r-project.org/package=ruler)
[![Dependencies](https://tinyverse.netlify.com/badge/ruler)](https://CRAN.R-project.org/package=ruler)
[![Downloads](http://cranlogs.r-pkg.org/badges/ruler)](https://cran.r-project.org/package=ruler)

`ruler` offers a set of tools for creating tidy data validation reports using the [dplyr](https://dplyr.tidyverse.org) grammar of data manipulation. It is designed to be flexible and extendable in terms of creating rules and using their output.

To fully use this package, a solid knowledge of `dplyr` is required. The key idea behind `ruler`'s design is to validate data by modifying regular `dplyr` code with as little overhead as possible.

Some functionality is powered by the [keyholder](https://echasnovski.github.io/keyholder/) package. It is highly recommended to use its supported functions during rule construction. All one- and two-table `dplyr` verbs applied to local data frames are supported and considered the most appropriate way to create rules.

This README is structured as follows:
- __Installation__ shows ways to install the package.
- __Example__ shows the basic usage of `ruler` for exploring whether data obeys user-defined rules and for validating it automatically.
- __Overview__ explains basic data and function types and the design behind them.
- __Usage__ describes `ruler`'s capabilities in more detail.
- __Other packages for validation and assertions__ lists alternatives for the described tasks.

## Installation

You can install the current stable version from CRAN with:

```{r cran-installation, eval = FALSE}
install.packages("ruler")
```

You can also install the development version from GitHub with:

```{r gh-installation, eval = FALSE}
# install.packages("devtools")
devtools::install_github("echasnovski/ruler")
```

## Example

```{r Example, error = TRUE, purl = FALSE}
# Utility functions
is_integerish <- function(x) {
  all(x == as.integer(x))
}
z_score <- function(x) {
  abs(x - mean(x)) / sd(x)
}

# Define rule packs
my_packs <- list(
  data_packs(
    dims = . %>% summarise(nrow_low = nrow(.) >= 10, nrow_high = nrow(.) <= 15,
                           ncol_low = ncol(.) >= 20, ncol_high = ncol(.) <= 30)
  ),
  group_packs(
    vs_am_num = . %>% group_by(vs, am) %>% summarise(vs_am_low = n() >= 7),
    .group_vars = c("vs", "am")
  ),
  col_packs(
    enough_col_sum = . %>%
      summarise_if(is_integerish, rules(is_enough = sum(.) >= 14))
  ),
  row_packs(
    enough_row_sum = . %>%
      filter(vs == 1) %>%
      transmute(is_enough = rowSums(.) >= 200)
  ),
  cell_packs(
    dbl_not_outlier = . %>%
      transmute_if(is.numeric, rules(is_not_out = z_score(.) < 1)) %>%
      slice(-(1:5))
  )
)

# Expose data to rules
mtcars_exposed <- mtcars %>% as_tibble() %>%
  expose(my_packs)

# View exposure
mtcars_exposed %>% get_exposure()

# Assert any breaker
invisible(mtcars_exposed %>% assert_any_breaker())
```

## Overview

__Rule__ is a function which converts a data unit of interest (data, group, column, row, cell) to a logical value indicating whether this object satisfies a certain condition.

__Rule pack__ is a function which combines several rules into one functional block. The recommended way of creating rules is by creating packs right away with the use of `dplyr` and [magrittr](https://magrittr.tidyverse.org/)'s pipe operator.

__Exposing__ data to rules means applying rules to data, collecting results in a common format and attaching them to the data as an `exposure` attribute. In this way actual exposure can be done in multiple steps and also be a part of a general data preparation pipeline.

__Exposure__ is a format designed to contain uniform information about validation of different data units. For reproducibility it also saves information about applied packs. Basically, an exposure is a list with two elements:

1. __Packs info__: a [tibble](https://tibble.tidyverse.org/) with the following structure:
    - _name_: Name of the pack. If not set manually it will be imputed during exposure.
    - _type_: Name of the pack type. Indicates which data unit the pack checks.
    - _fun_: List of rule pack functions.
    - _remove_obeyers_: Whether rows about obeyers (data units that obey a certain rule) were removed from the report after applying the pack.
2. __Tidy data validation report__: a `tibble` with the following structure:
    - _pack_: Name of the rule pack from column 'name' in packs info.
    - _rule_: Name of the rule defined in the rule pack.
    - _var_: Name of the variable whose validation result is reported. Value '.all' is reserved and interpreted as 'all columns as a whole'. __Note__ that _var_ doesn't always represent an actual column in the data frame: for group packs it represents the created group name.
    - _id_: Index of the row in the tested data frame whose validation result is reported. Value 0 is reserved and interpreted as 'all rows as a whole'.
    - _value_: Whether the described data unit obeys the rule.

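Both elements can be pulled out of exposed data separately. The following is a minimal sketch, assuming the accessor `get_report()` is available next to the `get_packs_info()` helper used later in this README; the pack name `dims` and the object names are chosen only for illustration.

```r
library(dplyr)
library(ruler)

# Hypothetical data pack: mtcars has 32 rows and 11 columns,
# so the `nrow` rule is obeyed and the `ncol` rule is broken
dims_pack <- data_packs(
  dims = . %>% summarise(
    ncol = ncol(.) == 12,
    nrow = nrow(.) == 32
  )
)

# Keep obeyers so that both rules show up in the report
exposed <- mtcars %>%
  expose(dims_pack, .remove_obeyers = FALSE)

# Element 1: packs info (one row per applied pack)
packs_info <- get_packs_info(exposed)

# Element 2: tidy data validation report (one row per checked data unit)
report <- get_report(exposed)

colnames(packs_info)
colnames(report)
```

Because a data pack checks the data as a whole, every row of this report should carry `var == ".all"` and `id == 0`.
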
There are four basic combinations of `var` and `id` values which define five basic data units:

- `var == '.all'` and `id == 0`: Data as a whole.
- `var != '.all'` and `id == 0`: Group (`var` shouldn't be an actual column name) or column (`var` should be an actual column name) as a whole.
- `var == '.all'` and `id != 0`: Row as a whole.
- `var != '.all'` and `id != 0`: Described cell.

With exposure attached to data one can perform different kinds of actions: exploration, assertion, imputation and so on.

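As an illustration of the `var`/`id` combinations, the sketch below labels each report row with the data unit it describes. It uses only `dplyr`; the `report` tibble is built by hand with made-up pack and rule names, and the `unit` column is a hypothetical addition, not something `ruler` produces.

```r
library(dplyr)

# Hand-made rows mirroring the documented report columns (values are invented)
report <- tibble(
  pack  = c("some_data_pack", "some_col_pack", "some_row_pack", "some_cell_pack"),
  rule  = c("nrow", "sum_low", "is_common", "is_not_out"),
  var   = c(".all", "cyl", ".all", "vs"),
  id    = c(0L, 0L, 15L, 3L),
  value = c(FALSE, FALSE, FALSE, FALSE)
)

# Map the four var/id combinations to data units.
# Groups and columns share one combination, so they cannot be told apart
# without knowing whether `var` is an actual column name.
report_units <- report %>%
  mutate(
    unit = case_when(
      var == ".all" & id == 0 ~ "data",
      var != ".all" & id == 0 ~ "group or column",
      var == ".all" & id != 0 ~ "row",
      TRUE ~ "cell"
    )
  )

report_units$unit
```
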
## Usage
### Creating packs
#### Data packs
```{r Data pack}
# List of two rule packs for checking data properties
my_data_packs <- data_packs(
  # data_dims is a pack name
  data_dims = . %>% summarise(
    # ncol and nrow are rule names
    ncol = ncol(.) == 12,
    nrow = nrow(.) == 32
  ),

  # Data after subsetting should have number of rows in between 10 and 30
  # Rules are applied separately
  vs_1 = . %>% filter(vs == 1) %>%
    summarise(
      nrow_low = nrow(.) > 10,
      nrow_high = nrow(.) < 30
    )
)
```

#### Group packs

```{r Group pack}
# List of one nameless rule pack for checking group property
my_group_packs <- group_packs(
  # Name will be imputed during exposure
  . %>% group_by(vs, am) %>%
    summarise(any_cyl_6 = any(cyl == 6)),

  # One should supply grouping variables for correct interpretation of output
  .group_vars = c("vs", "am")
)
```

#### Column packs

```{r Column pack}
# rules() defines predicate functions with necessary name imputations

# List of two rule packs for checking certain columns' properties
my_col_packs <- col_packs(
  sum_bounds = . %>% summarise_at(
    # Check only columns with names starting with 'c'
    vars(starts_with("c")),
    rules(sum_low = sum(.) > 300, sum_high = sum(.) < 400)
  ),

  # In the edge case of checking one column with one rule there is a need
  # for forcing inclusion of names in the output of summarise_at().
  # This is done with naming argument in vars()
  vs_mean = . %>% summarise_at(vars(vs = vs), rules(mean(.) > 0.5))
)
```

#### Row packs

```{r Row packs}
z_score <- function(x) {
  (x - mean(x)) / sd(x)
}

# List of one rule pack checking certain rows' property
my_row_packs <- row_packs(
  row_mean = . %>% mutate(rowMean = rowMeans(.)) %>%
    transmute(is_common_row_mean = abs(z_score(rowMean)) < 1) %>%
    # Check only rows 10-15
    # Values in 'id' column of report will be based on input data (i.e. 10-15)
    # and not on output data (1-6)
    slice(10:15)
)
```

#### Cell packs

```{r Cell packs}
is_integerish <- function(x) {
  all(x == as.integer(x))
}

# List of two cell packs checking certain cells' property
my_cell_packs <- cell_packs(
  my_cell_pack_1 = . %>% transmute_if(
    # Check only integer-like columns
    is_integerish,
    rules(is_common = abs(z_score(.)) < 1)
  ) %>%
    # Check only rows 20-30
    slice(20:30),

  # The same edge case as in column rule pack
  vs_side = . %>% transmute_at(vars(vs = "vs"), rules(. > mean(.)))
)
```

### Exposing

By default exposing removes obeyers.
```{r Expose removes obeyers by default}
mtcars %>%
  expose(my_data_packs, my_group_packs) %>%
  get_exposure()
```

One can keep obeyers by setting `.remove_obeyers` to `FALSE`.

```{r Expose can not remove obeyers}
mtcars %>%
  expose(my_data_packs, my_group_packs, .remove_obeyers = FALSE) %>%
  get_exposure()
```

By default `expose()` guesses the pack type if a 'not-pack' function is supplied. This behaviour has some edge cases but is useful for interactive use.

```{r Expose can guess}
mtcars %>%
  expose(
    some_data_pack = . %>% summarise(nrow = nrow(.) == 10),
    some_col_pack = . %>% summarise_at(vars(vs = "vs"), rules(is.character(.)))
  ) %>%
  get_exposure()
```

To write strict and robust code one can set `.guess` to `FALSE`.

```{r Expose can not guess, error = TRUE, purl = FALSE}
mtcars %>%
  expose(
    some_data_pack = . %>% summarise(nrow = nrow(.) == 10),
    some_col_pack = . %>% summarise_at(vars(vs = "vs"), rules(is.character(.))),
    .guess = FALSE
  ) %>%
  get_exposure()
```

### Acting after exposure

General actions are recommended to be done with `act_after_exposure()`. It takes two arguments:
- `.trigger` - a function which takes the data with attached exposure and returns `TRUE` if some action should be made.
- `.actor` - a function which takes the same argument as `.trigger` and performs some action.

If the trigger didn't notify then the input data is returned untouched. Otherwise the output of `.actor()` is returned. __Note__ that `act_after_exposure()` is often used for creating side effects (printing, throwing an error etc.) and in that case should invisibly return its input (to be able to use it with the pipe).

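The control flow described above can be restated in a few lines of base R. This is only a sketch of the documented behaviour, not the package's source code: `act_sketch()`, `has_many_rows()` and `first_rows()` are hypothetical names, and the toy trigger and actor work on a plain data frame rather than on data with an attached exposure.

```r
# Hypothetical restatement of the trigger/actor logic described above
act_sketch <- function(.tbl, .trigger, .actor) {
  if (isTRUE(.trigger(.tbl))) {
    # Trigger notified: return whatever the actor returns
    .actor(.tbl)
  } else {
    # Trigger did not notify: return the input untouched
    .tbl
  }
}

# Toy trigger and actor (assumptions made for this illustration only)
has_many_rows <- function(.tbl) nrow(.tbl) > 10
first_rows <- function(.tbl) head(.tbl, 3)

# Trigger fires for the full data, so the actor's output is returned
acted <- act_sketch(mtcars, has_many_rows, first_rows)

# Trigger does not fire for five rows, so the input comes back unchanged
untouched <- act_sketch(head(mtcars, 5), has_many_rows, first_rows)
```

The chunk that follows applies the real `act_after_exposure()` to exposed data.
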
```{r Acting after exposure}
trigger_one_pack <- function(.tbl) {
  packs_number <- .tbl %>%
    get_packs_info() %>%
    nrow()

  packs_number > 1
}

actor_one_pack <- function(.tbl) {
  cat("More than one pack was applied.\n")

  invisible(.tbl)
}

mtcars %>%
  expose(my_col_packs, my_row_packs) %>%
  act_after_exposure(
    .trigger = trigger_one_pack,
    .actor = actor_one_pack
  ) %>%
  invisible()
```

`ruler` has a function `assert_any_breaker()` which can notify about the presence of any breaker in the exposure.

```{r Assert any breaker, error = TRUE, purl = FALSE}
mtcars %>%
  expose(my_col_packs, my_row_packs) %>%
  assert_any_breaker()
```

## Other packages for validation and assertions

Leaning more towards assertions:

- [assertr](https://github.com/ropensci/assertr)
- [assertthat](https://github.com/hadley/assertthat)
- [checkmate](https://github.com/mllg/checkmate)
- [ensurer](https://github.com/smbache/ensurer)
- [tester](https://github.com/gastonstat/tester)
- [sealr](https://github.com/uribo/sealr)

Leaning more towards validation:

- [naniar](https://github.com/njtierney/naniar)
- [skimr](https://github.com/ropensci/skimr)
- [validate](https://github.com/data-cleaning/validate)