Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/data-cleaning/errorlocate
Find and replace erroneous fields in data using validation rules
- Host: GitHub
- URL: https://github.com/data-cleaning/errorlocate
- Owner: data-cleaning
- Created: 2015-07-10T15:06:17.000Z (over 9 years ago)
- Default Branch: master
- Last Pushed: 2024-06-21T07:57:12.000Z (5 months ago)
- Last Synced: 2024-06-22T01:18:50.710Z (5 months ago)
- Topics: data-cleaning, errors, invalidation, r
- Language: R
- Homepage: http://data-cleaning.github.io/errorlocate/
- Size: 6.98 MB
- Stars: 21
- Watchers: 4
- Forks: 3
- Open Issues: 14
Metadata Files:
- Readme: README.Rmd
Awesome Lists containing this project
- jimsghstars - data-cleaning/errorlocate - Find and replace erroneous fields in data using validation rules (R)
README
---
output: github_document
---

```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%"
)
```

[![R build status](https://github.com/data-cleaning/errorlocate/workflows/R-CMD-check/badge.svg)](https://github.com/data-cleaning/errorlocate/actions)
[![CRAN](http://www.r-pkg.org/badges/version/errorlocate)](https://CRAN.R-project.org/package=errorlocate)
[![Downloads](http://cranlogs.r-pkg.org/badges/errorlocate)](http://www.r-pkg.org/pkg/errorlocate)
[![status](https://tinyverse.netlify.com/badge/errorlocate)](https://CRAN.R-project.org/package=errorlocate)
[![Codecov test coverage](https://codecov.io/gh/data-cleaning/errorlocate/branch/master/graph/badge.svg)](https://codecov.io/gh/data-cleaning/errorlocate?branch=master)
[![Mentioned in Awesome Official Statistics ](https://awesome.re/mentioned-badge.svg)](http://www.awesomeofficialstatistics.org)

# Error localization

Find errors in data given a set of validation rules.
The `errorlocate` package helps to identify obvious errors in raw datasets. It works in tandem with the package `validate`.
With `validate` you formulate data validation rules with which the data must comply. For example:

- "age cannot be negative": `age >= 0`.
- "if a person is married, he must be older than 16 years": `if (married == TRUE) age > 16`.
- "Profit is turnover minus cost": `profit == turnover - cost`.

While `validate` can check whether a record is valid, it does not identify
which of the variables are responsible for the invalidation. This may seem a simple task,
but it is actually quite tricky: a set of validation rules forms a web
of dependent variables, so changing a value in an invalid record to repair it for rule 1 may invalidate
the record for rule 2.

`errorlocate` provides a small framework for record-based error detection and implements the Fellegi-Holt
algorithm. This algorithm assumes there is no other information available than the values of a record
and a set of validation rules. The algorithm minimizes the (weighted) number of values that need
to be adjusted to remove the invalidation.

# Installation

`errorlocate` can be installed from CRAN:
```r
install.packages("errorlocate")
```

Beta versions can be installed with `drat`:
```r
drat::addRepo("data-cleaning")
install.packages("errorlocate")
```

The latest development version of `errorlocate` can be installed from GitHub with `devtools`:
```r
devtools::install_github("data-cleaning/errorlocate")
```

# Usage

```{r}
library(errorlocate)
rules <- validator( profit == turnover - cost
                  , cost >= 0.6 * turnover
                  , turnover >= 0
                  , cost >= 0 # is implied
                  )

data <- data.frame(profit=750, cost=125, turnover=200)

data_no_error <- replace_errors(data, rules)

# faulty data was replaced with NA
print(data_no_error)

er <- errors_removed(data_no_error)
print(er)
summary(er)
er$errors
```
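
To illustrate why locating the faulty field is harder than it looks, the record from the example above can be checked by hand in base R. This is only a sketch: `check_rules()` is a hypothetical helper written for this illustration, not part of `errorlocate` or `validate`, and it hard-codes the four rules instead of reading them from a `validator` object.

```r
# Record and rules copied from the usage example above.
record <- list(profit = 750, cost = 125, turnover = 200)

# Hypothetical helper (not part of errorlocate): evaluate each rule by hand.
check_rules <- function(r) {
  c(
    balance         = isTRUE(all.equal(r$profit, r$turnover - r$cost)),
    cost_share      = r$cost >= 0.6 * r$turnover,
    turnover_nonneg = r$turnover >= 0,
    cost_nonneg     = r$cost >= 0
  )
}

# The original record violates only the balance rule (750 != 200 - 125).
check_rules(record)

# Naive repair 1: change turnover so that the balance rule holds.
fix_turnover <- modifyList(record, list(turnover = record$profit + record$cost))
# The balance rule now holds, but cost >= 0.6 * turnover fails (125 < 525).
check_rules(fix_turnover)

# Naive repair 2: change profit instead.
fix_profit <- modifyList(record, list(profit = record$turnover - record$cost))
# All four rules hold, so adjusting profit alone is sufficient.
check_rules(fix_profit)
```

Repairing the balance rule through `turnover` breaks another rule, while adjusting `profit` satisfies all of them. Searching for the smallest such set of fields by hand does not scale beyond a few rules, which is the task `errorlocate` automates.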