https://github.com/mdlincoln/salty

Turn Clean Data Into Messy Data

---
output: github_document
---

```{r, echo = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "README-"
)
```

# salty

[![CRAN_Status_Badge](http://www.r-pkg.org/badges/version/salty)](https://cran.r-project.org/package=salty)
[![Downloads, grand total](http://cranlogs.r-pkg.org/badges/grand-total/salty)](https://cranlogs.r-pkg.org/)
[![lifecycle](https://img.shields.io/badge/lifecycle-experimental-orange.svg)](https://lifecycle.r-lib.org/articles/stages.html#experimental)
[![R-CMD-check](https://github.com/mdlincoln/salty/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/mdlincoln/salty/actions/workflows/R-CMD-check.yaml)
[![Codecov test coverage](https://codecov.io/gh/mdlincoln/salty/branch/master/graph/badge.svg)](https://app.codecov.io/gh/mdlincoln/salty?branch=master)

When teaching students how to clean data, it helps to have data that isn't _too_ clean already.
**salty** offers functions for "salting" clean data with problems often found in datasets in the wild, such as:

- pseudo-OCR errors
- inconsistent capitalization and spelling
- invalid dates
- unpredictable punctuation in numeric fields
- missing values or empty strings

## Installation

Install salty from CRAN with:

```{r cran-installation, eval = FALSE}
install.packages("salty")
```

You can install the development version of salty from GitHub with:

```{r gh-installation, eval = FALSE}
# install.packages("devtools")
devtools::install_github("mdlincoln/salty")
```

## Basic usage

```{r example}
library(salty)
set.seed(10)

# We'll use charlatan to create some sample data

sample_names <- charlatan::ch_name(10)
sample_names

sample_numbers <- charlatan::ch_double(10)
sample_numbers
```

salty offers several easy-to-use functions for adding common problems to your data.

```{r}
# Add in erroneous letters or punctuation
salt_letters(sample_names)
salt_punctuation(sample_names)

# Flip capitals
salt_capitalization(sample_names)

# Introduce OCR errors. You can specify the proportion of values in the vector
# that should be salted, and the proportion of possible matches within a single
# value that should be changed.
salt_ocr(sample_names, p = 1, rep_p = 1)
```

`salt_delete` drops characters from randomly selected values in a vector, while `salt_empty` and `salt_na` replace entire values with empty strings and `NA`, respectively.

```{r}
salt_delete(sample_names, p = 0.5, n = 6)

salt_empty(sample_names, p = 0.5)

salt_na(sample_names, p = 0.5)
```
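
To make the role of `p` concrete, here is a minimal base-R sketch of replacing a proportion of values in a vector. It is not salty's implementation: the helper name `sketch_salt_values` and the choice to salt exactly `round(p * length(x))` values are assumptions made for illustration.

```r
# Conceptual sketch only: NOT salty's implementation.
# `sketch_salt_values` is a hypothetical helper name.
sketch_salt_values <- function(x, p = 0.2, replacement = NA_character_) {
  stopifnot(is.character(x), p >= 0, p <= 1)
  # Assumption: salt exactly round(p * length(x)) values, chosen at random
  n_salted <- round(p * length(x))
  idx <- sample.int(length(x), size = n_salted)
  # Overwrite the selected values entirely; the rest stay untouched
  x[idx] <- replacement
  x
}

set.seed(10)
clean <- c("Ada", "Grace", "Alan", "Edsger", "Barbara", "Donald")

# Half of the values become NA
with_na <- sketch_salt_values(clean, p = 0.5)

# Half of the values become empty strings
with_empty <- sketch_salt_values(clean, p = 0.5, replacement = "")
```

Which positions are selected depends on the seed, but the number of salted values is fixed by `p` in this sketch.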

## Advanced usage

For more fine-grained control over the salting process, and for access to a wider range of salting types, you can use the underlying functions for inserting, substituting, and replacing: `salt_insert`, `salt_substitute`, and `salt_replace`.

The sets of insertions and replacements are specified via shakers: pre-filled character sets (`shaker`) and pattern/replacement pairs (`replacement_shaker`) that the `salt` verbs draw from.

```{r}
available_shakers()
```

`salt_insert` keeps all the characters in the original and adds new ones, whereas `salt_substitute` overwrites existing characters.

```{r}
# Use p to specify the proportion of values that you would like to salt
salt_insert(sample_names, shaker$punctuation, p = 0.5)

# Use n to specify how many new insertions/substitutions you want to make to selected values
salt_substitute(sample_names, shaker$punctuation, p = 0.5, n = 3)
```
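
The difference between inserting and substituting can be illustrated on a single string with base R. This is a sketch of the idea only, not how salty implements it; `sketch_insert_at` and `sketch_substitute_at` are hypothetical helper names, and the position is fixed here rather than chosen at random so that the result is easy to check.

```r
# Conceptual sketch only: NOT salty's implementation.
# Both helper names are hypothetical.

# Insert `new` before position `pos`, keeping every original character
sketch_insert_at <- function(string, new, pos) {
  paste0(
    substr(string, 1, pos - 1),
    new,
    substr(string, pos, nchar(string))
  )
}

# Overwrite the single character at position `pos` with `new`
sketch_substitute_at <- function(string, new, pos) {
  paste0(
    substr(string, 1, pos - 1),
    new,
    substr(string, pos + 1, nchar(string))
  )
}

inserted <- sketch_insert_at("Lincoln", "!", 4)
inserted
#> [1] "Lin!coln"

substituted <- sketch_substitute_at("Lincoln", "!", 4)
substituted
#> [1] "Lin!oln"
```

Insertion lengthens the string by the length of the new text, while substituting a single character leaves the length unchanged.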

Different flavors of salt are available in `shaker`, but you can also supply your own character vector of possible insertions.

```{r}
salt_insert(sample_names, shaker$mixed_letters, p = 0.5)

salt_insert(sample_numbers, shaker$digits, p = 0.5)

salt_insert(sample_names, c("foo", "bar", "baz"), p = 0.5)
```

`salt_replace` is more targeted: it works with pairs of patterns and replacements, either contained in `replacement_shaker` or user-specified. Use `rep_p` to set the proportion of possible replacements within any given value that should actually be swapped in.

```{r}
salt_replace(sample_names, replacement_shaker$ocr_errors, p = 1, rep_p = 1)

salt_replace(sample_names, replacement_shaker$capitalization, p = 0.5, rep_p = 0.2)

salt_replace(sample_numbers, replacement_shaker$decimal_commas, p = 0.5, rep_p = 1)
```
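
The idea behind `rep_p` can be sketched in base R for one string and one literal pattern: find every match, then decide for each match whether to replace it. This is an illustration under stated assumptions, not salty's implementation. `sketch_replace_some` is a hypothetical helper, and treating `rep_p` as an independent per-match probability is an assumption.

```r
# Conceptual sketch only: NOT salty's implementation.
# `sketch_replace_some` is a hypothetical helper that handles a single string
# and a single literal (non-regex) pattern.
sketch_replace_some <- function(string, pattern, replacement, rep_p = 0.5) {
  # Locate every literal occurrence of `pattern`
  matches <- gregexpr(pattern, string, fixed = TRUE)
  found <- regmatches(string, matches)[[1]]
  # Assumption: each match is swapped independently with probability rep_p
  swap <- runif(length(found)) < rep_p
  found[swap] <- replacement
  # Write the (partly replaced) matches back into the string
  regmatches(string, matches) <- list(found)
  string
}

# rep_p = 1 replaces every match, e.g. decimal points become decimal commas
all_swapped <- sketch_replace_some("3.14 and 2.71", ".", ",", rep_p = 1)
all_swapped
#> [1] "3,14 and 2,71"

# rep_p = 0 leaves the string untouched
none_swapped <- sketch_replace_some("3.14 and 2.71", ".", ",", rep_p = 0)
none_swapped
#> [1] "3.14 and 2.71"
```

With values of `rep_p` between 0 and 1, only some of the matches are replaced, which produces the inconsistent formatting often seen in real datasets.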

You may also specify your own arbitrary character vector of possible insertions.

```{r}
salt_insert(sample_names, insertions = c("X", "Z"))
```

## Possible future work

- Modifying date strings to introduce subtle errors like invalid dates (e.g. February 30th)
- Simulating character encoding problems

## Related work

**salty** should _not_ be used for anonymizing data; that's not its purpose.
However, it does draw some inspiration from [anonymizer](https://github.com/paulhendricks/anonymizer).

To _create_ sample data for salting, take a look at [charlatan](https://github.com/ropensci/charlatan).

## Acknowledgements

The common OCR replacement errors are partially derived from the `sed` replacements specified in the [Royal Society Corpus project](http://fedora.clarin-d.uni-saarland.de/rsc/access.html): Knappen, Jörg, Fischer, Stefan, Kermes, Hannah, Teich, Elke, and Fankhauser, Peter. 2017. "The Making of the Royal Society Corpus." In _Proceedings of the NoDaLiDa 2017 Workshop on Processing Historical Language_. Göteborg, Sweden. Linköping University Electronic Press.