Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

https://github.com/data-cleaning/validatedb

Validate on a table in a DB, using dbplyr
https://github.com/data-cleaning/validatedb

database datacleaning validation

Last synced: 3 months ago
JSON representation

Validate on a table in a DB, using dbplyr

Awesome Lists containing this project

README

        

---
output: github_document
---

```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%"
)
```

# validatedb

[![CRAN status](https://www.r-pkg.org/badges/version/validatedb)](https://CRAN.R-project.org/package=validatedb)
[![R build status](https://github.com/data-cleaning/validatedb/workflows/R-CMD-check/badge.svg)](https://github.com/data-cleaning/validatedb/actions)
[![Codecov test
coverage](https://codecov.io/gh/data-cleaning/validatedb/branch/master/graph/badge.svg)](https://codecov.io/gh/data-cleaning/validatedb?branch=master)
[![Mentioned in Awesome Official Statistics ](https://awesome.re/mentioned-badge.svg)](http://www.awesomeofficialstatistics.org)

`validatedb` executes validation checks written with R package `validate` on a
database. This allows for checking the validity of records in a database.

## Installation

You can install a development version with

``` r
remotes::install_github("data-cleaning/validatedb")
```

## Example

```{r example}
library(validatedb)
```

First we setup a table in a database (for demo purpose)
```{r}
# create a table in a database
income <- data.frame(id=1:2, age=c(12,35), salary = c(1000,NA))
con <- DBI::dbConnect(RSQLite::SQLite())
DBI::dbWriteTable(con, "income", income)
```

We retrieve a reference/handle to the table in the DB with `dplyr`

```{r}
tbl_income <- tbl(con, "income")
print(tbl_income)
```

Let's define a rule set and confront the table with it:
```{r}
rules <- validator( is_adult = age >= 18
, has_income = salary > 0
, mean_age = mean(age,na.rm=TRUE) > 24
, has_values = is_complete(age, salary)
)

# and confront!
cf <- confront(tbl_income, rules, key = "id")

print(cf)

summary(cf)
```

Values (i.e. validations on the table) can be retrieved like in `validate` with
`type="matrix"` or `type="list"`

```{r}
values(cf, type = "matrix")
```

But often this seems more handy:

```{r}
values(cf, type = "tbl")
```

or

```{r}
values(cf, type = "tbl", sparse=TRUE)
```

We can see the sql code by using `show_query`:

```{r}
show_query(cf)
```

Or write the sql to a file for documentation (and inspiration)

```{r, eval = FALSE}
dump_sql(cf, "validation.sql")
```

```sql
```{r, echo = FALSE, results="asis"}
dump_sql(cf)
```
```

### Aggregate example

```{r aggregate, code = readLines("./example/aggregate.R")}

```

## validate specific functions

### Added:

- [x] `is_complete`, `all_complete`
- [x] `is_unique`, `all_unique`
- [x] `exists_any`, `exists_one`
- [x] `do_by`, `sum_by`, `mean_by`, `min_by`, `max_by`

### Todo
Some newly added `validate` utility functions are (still) missing from `validatedb`.

- [ ] `contains_exactly`
- [ ] `is_linear_sequence`
- [ ] `hierachy`