https://github.com/mdlincoln/salty

Turn Clean Data Into Messy Data
https://github.com/mdlincoln/salty

Last synced: about 1 month ago
JSON representation

Turn Clean Data Into Messy Data

Host: GitHub
URL: https://github.com/mdlincoln/salty
Owner: mdlincoln
License: other
Created: 2018-07-01T03:04:34.000Z (over 7 years ago)
Default Branch: master
Last Pushed: 2024-08-31T03:51:19.000Z (over 1 year ago)
Last Synced: 2025-10-22T04:49:27.876Z (about 1 month ago)
Language: R
Homepage: https://mdlincoln.shinyapps.io/seasoning
Size: 131 KB
Stars: 64
Watchers: 1
Forks: 4
Open Issues: 6
Metadata Files:
- Readme: README.Rmd
- License: LICENSE

Awesome Lists containing this project

jimsghstars - mdlincoln/salty - Turn Clean Data Into Messy Data (R)

README

          ---

output: github_document

---

```{r, echo = FALSE}

knitr::opts_chunk$set(

  collapse = TRUE,

  comment = "#>",

  fig.path = "README-"

)

```

# salty

[![CRAN_Status_Badge](http://www.r-pkg.org/badges/version/salty)](https://cran.r-project.org/package=salty)

[![Downloads, grand total](http://cranlogs.r-pkg.org/badges/grand-total/salty)](https://cranlogs.r-pkg.org/)

[![lifecycle](https://img.shields.io/badge/lifecycle-experimental-orange.svg)](https://lifecycle.r-lib.org/articles/stages.html#experimental)

[![R-CMD-check](https://github.com/mdlincoln/salty/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/mdlincoln/salty/actions/workflows/R-CMD-check.yaml)

[![Codecov test coverage](https://codecov.io/gh/mdlincoln/salty/branch/master/graph/badge.svg)](https://app.codecov.io/gh/mdlincoln/salty?branch=master)

When teaching students how to clean data, it helps to have data that isn't _too_ clean already.

**salty** offers functions for "salting" clean data with problems often found in datasets in the wild, such as:

- pseudo-OCR errors

- inconsistent capitalization and spelling

- invalid dates

- unpredictable punctuation in numeric fields

- missing values or empty strings

## Installation

Install salty from CRAN with:

```{r cran-installation, eval = FALSE}

install.packages("salty")

```

You may install the development version of salty from github with:

```{r gh-installation, eval = FALSE}

# install.packages("devtools")

devtools::install_github("mdlincoln/salty")

```

## Basic usage

```{r example}

library(salty)

set.seed(10)

# We'll use charlatan to create some sample data

sample_names <- charlatan::ch_name(10)

sample_names

sample_numbers <- charlatan::ch_double(10)

sample_numbers

```

salty offers several easy-to-use functions for adding common problems to your data.

```{r}

# Add in erroneous letters or punctuation

salt_letters(sample_names)

salt_punctuation(sample_names)

# Flip capitals

salt_capitalization(sample_names)

# Introduce OCR errors. You can specify the proportion of values in the vector

# that should be salted, and the proportion of possible matches within a single

# value that should be changed.

salt_ocr(sample_names, p = 1, rep_p = 1)

```

`salt_delete` will simply drop characters from randomly selected values in a vector, while `salt_empty` and `salt_na` will replace entire values.

```{r}

salt_delete(sample_names, p = 0.5, n = 6)

salt_empty(sample_names, p = 0.5)

salt_na(sample_names, p = 0.5)

```

## Advanced usage

For more fine-grained control over the salting process, and for access to a wider range of salting types, you can use the underlying functions provided for: inserting, substituting, replacing.

The set of insertions and replacements are specified via `shakers`, pre-filled character sets and pattern/replacement pairs that the `salt` verbs then call.

```{r}

available_shakers()

```

`salt_insert` keeps all the characters in the original while adding new ones, while `salt_substitute` overwrites those characters.

```{r}

# Use p to specify the percent of values that you would like to salt

salt_insert(sample_names, shaker$punctuation, p = 0.5)

# Use n to specify how many new insertions/substitutions you want to make to selected values

salt_substitute(sample_names, shaker$punctuation, p = 0.5, n = 3)

```

Different flavors of salt are available using `shaker`, however you can also supply your own character vector of possible replacements if you like.

```{r}

salt_insert(sample_names, shaker$mixed_letters, p = 0.5)

salt_insert(sample_numbers, shaker$digits, p = 0.5)

salt_insert(sample_names, c("foo", "bar", "baz"), p = 0.5)

```

`salt_replace` is a bit more targeted: it works with pairs of patterns and replacements, either contained in `replacement_shaker` or user-specified. Use `rep_p` to set a probability of how many possible replacements should actually get swapped out for any given value.

```{r}

salt_replace(sample_names, replacement_shaker$ocr_errors, p = 1, rep_p = 1)

salt_replace(sample_names, replacement_shaker$capitalization, p = 0.5, rep_p = 0.2)

salt_replace(sample_numbers, replacement_shaker$decimal_commas, p = 0.5, rep_p = 1)

```

You may also specify your own arbitrary character vector of possible insertions.

```{r}

salt_insert(sample_names, insertions = c("X", "Z"))

```

## Possible future work

- Modifying date strings to introduce subtle errors like invalid dates (e.g. February 30th)

- Simulating character encoding problems

## Related work

**salty** should _not_ be used for anonymizing data; that's not its purpose.

However, it does draw some inspiration from [anonymizer](https://github.com/paulhendricks/anonymizer).

To _create_ sample data for salting, take a look at [charlatan](https://github.com/ropensci/charlatan).

## Acknowledgements

The common OCR replacement errors are partially derived from the `sed` replacements specified in the [Royal Society Corpus project](http://fedora.clarin-d.uni-saarland.de/rsc/access.html): Knappen, Jörg, Fischer, Stefan, Kermes, Hannah, Teich, Elke, and Fankhauser, Peter. 2017. "The Making of the Royal Society Corpus." In _Proceedings of the NoDaLiDa 2017 Workshop on Processing Historical Language_. Göteborg, Sweden. Linköping University Electronic Press. .

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/mdlincoln/salty

Awesome Lists containing this project

README