https://github.com/mdlincoln/salty
Turn Clean Data Into Messy Data
- Host: GitHub
- URL: https://github.com/mdlincoln/salty
- Owner: mdlincoln
- License: other
- Created: 2018-07-01T03:04:34.000Z (over 7 years ago)
- Default Branch: master
- Last Pushed: 2024-08-31T03:51:19.000Z (about 1 year ago)
- Last Synced: 2025-10-22T04:49:27.876Z (3 days ago)
- Language: R
- Homepage: https://mdlincoln.shinyapps.io/seasoning
- Size: 131 KB
- Stars: 64
- Watchers: 1
- Forks: 4
- Open Issues: 6
Metadata Files:
- Readme: README.Rmd
- License: LICENSE
Awesome Lists containing this project
- jimsghstars - mdlincoln/salty - Turn Clean Data Into Messy Data (R)
README
---
output: github_document
---
```{r, echo = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "README-"
)
```
# salty
[CRAN package](https://cran.r-project.org/package=salty)
[CRAN download logs](https://cranlogs.r-pkg.org/)
[Lifecycle: experimental](https://lifecycle.r-lib.org/articles/stages.html#experimental)
[R-CMD-check](https://github.com/mdlincoln/salty/actions/workflows/R-CMD-check.yaml)
[Code coverage](https://app.codecov.io/gh/mdlincoln/salty?branch=master)
When teaching students how to clean data, it helps to have data that isn't _too_ clean already.
**salty** offers functions for "salting" clean data with problems often found in datasets in the wild, such as:
- pseudo-OCR errors
- inconsistent capitalization and spelling
- invalid dates
- unpredictable punctuation in numeric fields
- missing values or empty strings
## Installation
Install salty from CRAN with:
```{r cran-installation, eval = FALSE}
install.packages("salty")
```
You may install the development version of salty from GitHub with:
```{r gh-installation, eval = FALSE}
# install.packages("devtools")
devtools::install_github("mdlincoln/salty")
```
## Basic usage
```{r example}
library(salty)
set.seed(10)
# We'll use charlatan to create some sample data
sample_names <- charlatan::ch_name(10)
sample_names
sample_numbers <- charlatan::ch_double(10)
sample_numbers
```
salty offers several easy-to-use functions for adding common problems to your data.
```{r}
# Add in erroneous letters or punctuation
salt_letters(sample_names)
salt_punctuation(sample_names)
# Flip capitals
salt_capitalization(sample_names)
# Introduce OCR errors. You can specify the proportion of values in the vector
# that should be salted, and the proportion of possible matches within a single
# value that should be changed.
salt_ocr(sample_names, p = 1, rep_p = 1)
```
`salt_delete` drops characters from randomly selected values in a vector, while `salt_empty` and `salt_na` replace entire values with empty strings and `NA`, respectively.
```{r}
salt_delete(sample_names, p = 0.5, n = 6)
salt_empty(sample_names, p = 0.5)
salt_na(sample_names, p = 0.5)
```
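To make whole-value replacement concrete, here is a minimal base R sketch. `replace_values` is a hypothetical helper written for illustration; it is not part of salty and may not match how `salt_empty` and `salt_na` choose which values to replace.

```{r}
# Hypothetical helper, NOT part of salty: replace a proportion p of the
# values in x with a fixed replacement such as NA or "".
replace_values <- function(x, p, replacement) {
  stopifnot(p >= 0, p <= 1)
  # Number of values to replace, rounded to a whole number
  k <- round(p * length(x))
  # sample.int() draws k distinct positions without replacement
  idx <- sample.int(length(x), size = k)
  x[idx] <- replacement
  x
}

set.seed(10)
fruit <- c("apple", "banana", "cherry", "damson", "elderberry", "fig")
replace_values(fruit, p = 0.5, replacement = NA_character_)
replace_values(fruit, p = 0.5, replacement = "")
```

With six values and `p = 0.5`, three values are replaced in each call; which three depends on the seed.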
## Advanced usage
For more fine-grained control over the salting process, and for access to a wider range of salting types, you can use the underlying functions for inserting, substituting, and replacing: `salt_insert`, `salt_substitute`, and `salt_replace`.
The available insertions and replacements are stored in "shakers": pre-filled character sets (`shaker`) and pattern/replacement pairs (`replacement_shaker`) that the `salt` verbs draw from.
```{r}
available_shakers()
```
`salt_insert` keeps every character of the original value and adds new ones, while `salt_substitute` overwrites existing characters with new ones.
```{r}
# Use p to specify the proportion of values that you would like to salt
salt_insert(sample_names, shaker$punctuation, p = 0.5)
# Use n to specify how many new insertions/substitutions you want to make to selected values
salt_substitute(sample_names, shaker$punctuation, p = 0.5, n = 3)
```
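The difference between inserting and substituting can be sketched in base R for a single string. `insert_char` and `substitute_char` are hypothetical helpers for illustration only; they are not salty's implementation, and they assume a non-empty input string.

```{r}
# Hypothetical helpers, NOT salty's implementation.

# Insert one character from `chars` at a random position; nothing is lost.
insert_char <- function(s, chars) {
  # keep is the number of original characters left before the insertion (0..nchar(s))
  keep <- sample.int(nchar(s) + 1, 1) - 1
  new_char <- chars[sample.int(length(chars), 1)]
  paste0(substr(s, 1, keep), new_char, substr(s, keep + 1, nchar(s)))
}

# Overwrite one randomly chosen character with a character from `chars`.
substitute_char <- function(s, chars) {
  pos <- sample.int(nchar(s), 1)
  substr(s, pos, pos) <- chars[sample.int(length(chars), 1)]
  s
}

set.seed(10)
insert_char("Ada Lovelace", c("!", "?", ";"))
substitute_char("Ada Lovelace", c("!", "?", ";"))
```

The inserted version is one character longer than the input and still contains every original character; the substituted version has the same length, with one original character overwritten.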
Different flavors of salt are available in `shaker`, but you can also supply your own character vector of possible replacements.
```{r}
salt_insert(sample_names, shaker$mixed_letters, p = 0.5)
salt_insert(sample_numbers, shaker$digits, p = 0.5)
salt_insert(sample_names, c("foo", "bar", "baz"), p = 0.5)
```
`salt_replace` is more targeted: it works with pairs of patterns and replacements, either contained in `replacement_shaker` or user-specified. Use `rep_p` to set the proportion of possible replacements within any given value that should actually be swapped.
```{r}
salt_replace(sample_names, replacement_shaker$ocr_errors, p = 1, rep_p = 1)
salt_replace(sample_names, replacement_shaker$capitalization, p = 0.5, rep_p = 0.2)
salt_replace(sample_numbers, replacement_shaker$decimal_commas, p = 0.5, rep_p = 1)
```
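The effect of `rep_p` can be illustrated with a base R sketch for a single pattern/replacement pair. `replace_some` is a hypothetical helper, not salty's implementation; it assumes a fixed (non-regex) pattern and decides independently for each match whether to swap it.

```{r}
# Hypothetical helper, NOT salty's implementation: replace each match of a
# fixed pattern in s with probability rep_p.
replace_some <- function(s, pattern, replacement, rep_p) {
  # Locate every occurrence of the pattern in the single string s
  m <- gregexpr(pattern, s, fixed = TRUE)
  hits <- regmatches(s, m)[[1]]
  # Draw one uniform number per match; swap the match when it falls below rep_p
  swap <- runif(length(hits)) < rep_p
  hits[swap] <- replacement
  # Write the (partly) replaced matches back into the string
  regmatches(s, m) <- list(hits)
  s
}

set.seed(10)
# rep_p = 1 swaps every match, giving "bum the com in the bam"
replace_some("burn the corn in the barn", "rn", "m", rep_p = 1)
# rep_p = 0.5 swaps each match with probability one half
replace_some("burn the corn in the barn", "rn", "m", rep_p = 0.5)
```

Setting `rep_p = 0` leaves the string untouched, which is a quick way to check that only the matches are ever modified.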
You may also specify your own arbitrary character vector of possible insertions.
```{r}
salt_insert(sample_names, insertions = c("X", "Z"))
```
## Possible future work
- Modifying date strings to introduce subtle errors like invalid dates (e.g. February 30th)
- Simulating character encoding problems
## Related work
**salty** should _not_ be used for anonymizing data; that's not its purpose.
However, it does draw some inspiration from [anonymizer](https://github.com/paulhendricks/anonymizer).
To _create_ sample data for salting, take a look at [charlatan](https://github.com/ropensci/charlatan).
## Acknowledgements
The common OCR replacement errors are partially derived from the `sed` replacements specified in the [Royal Society Corpus project](http://fedora.clarin-d.uni-saarland.de/rsc/access.html): Knappen, Jörg, Fischer, Stefan, Kermes, Hannah, Teich, Elke, and Fankhauser, Peter. 2017. "The Making of the Royal Society Corpus." In _Proceedings of the NoDaLiDa 2017 Workshop on Processing Historical Language_. Göteborg, Sweden. Linköping University Electronic Press.