https://github.com/mdlincoln/salty
Turn Clean Data Into Messy Data
https://github.com/mdlincoln/salty
Last synced: about 1 month ago
JSON representation
Turn Clean Data Into Messy Data
- Host: GitHub
- URL: https://github.com/mdlincoln/salty
- Owner: mdlincoln
- License: other
- Created: 2018-07-01T03:04:34.000Z (almost 7 years ago)
- Default Branch: master
- Last Pushed: 2023-10-21T17:39:15.000Z (over 1 year ago)
- Last Synced: 2024-07-31T19:28:27.540Z (9 months ago)
- Language: R
- Homepage: https://mdlincoln.shinyapps.io/seasoning
- Size: 94.7 KB
- Stars: 65
- Watchers: 1
- Forks: 3
- Open Issues: 4
-
Metadata Files:
- Readme: README.Rmd
- License: LICENSE
Awesome Lists containing this project
- jimsghstars - mdlincoln/salty - Turn Clean Data Into Messy Data (R)
README
---
output: github_document
---```{r, echo = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "README-"
)
```# salty
[](https://cran.r-project.org/package=salty)
[](https://cranlogs.r-pkg.org/)
[](https://lifecycle.r-lib.org/articles/stages.html#experimental)
[](https://github.com/mdlincoln/salty/actions/workflows/R-CMD-check.yaml)
[](https://app.codecov.io/gh/mdlincoln/salty?branch=master)When teaching students how to clean data, it helps to have data that isn't _too_ clean already.
**salty** offers functions for "salting" clean data with problems often found in datasets in the wild, such as:- pseudo-OCR errors
- inconsistent capitalization and spelling
- invalid dates
- unpredictable punctuation in numeric fields
- missing values or empty strings## Installation
Install salty from CRAN with:
```{r cran-installation, eval = FALSE}
install.packages("salty")
```You may install the development version of salty from github with:
```{r gh-installation, eval = FALSE}
# install.packages("devtools")
devtools::install_github("mdlincoln/salty")
```## Basic usage
```{r example}
library(salty)
set.seed(10)# We'll use charlatan to create some sample data
sample_names <- charlatan::ch_name(10)
sample_namessample_numbers <- charlatan::ch_double(10)
sample_numbers
```salty offers several easy-to-use functions for adding common problems to your data.
```{r}
# Add in erroneous letters or punctuation
salt_letters(sample_names)
salt_punctuation(sample_names)# Flip capitals
salt_capitalization(sample_names)# Introduce OCR errors. You can specify the proportion of values in the vector
# that should be salted, and the proportion of possible matches within a single
# value that should be changed.
salt_ocr(sample_names, p = 1, rep_p = 1)
````salt_delete` will simply drop characters from randomly selected values in a vector, while `salt_empty` and `salt_na` will replace entire values.
```{r}
salt_delete(sample_names, p = 0.5, n = 6)salt_empty(sample_names, p = 0.5)
salt_na(sample_names, p = 0.5)
```## Advanced usage
For more fine-grained control over the salting process, and for access to a wider range of salting types, you can use the underlying functions provided for: inserting, substituting, replacing.
The set of insertions and replacements are specified via `shakers`, pre-filled character sets and pattern/replacement pairs that the `salt` verbs then call.
```{r}
available_shakers()
````salt_insert` keeps all the characters in the original while adding new ones, while `salt_substitute` overwrites those characters.
```{r}
# Use p to specify the percent of values that you would like to salt
salt_insert(sample_names, shaker$punctuation, p = 0.5)# Use n to specify how many new insertions/substitutions you want to make to selected values
salt_substitute(sample_names, shaker$punctuation, p = 0.5, n = 3)
```Different flavors of salt are available using `shaker`, however you can also supply your own character vector of possible replacements if you like.
```{r}
salt_insert(sample_names, shaker$mixed_letters, p = 0.5)salt_insert(sample_numbers, shaker$digits, p = 0.5)
salt_insert(sample_names, c("foo", "bar", "baz"), p = 0.5)
````salt_replace` is a bit more targeted: it works with pairs of patterns and replacements, either contained in `replacement_shaker` or user-specified. Use `rep_p` to set a probability of how many possible replacements should actually get swapped out for any given value.
```{r}
salt_replace(sample_names, replacement_shaker$ocr_errors, p = 1, rep_p = 1)salt_replace(sample_names, replacement_shaker$capitalization, p = 0.5, rep_p = 0.2)
salt_replace(sample_numbers, replacement_shaker$decimal_commas, p = 0.5, rep_p = 1)
```You may also specify your own arbitrary character vector of possible insertions.
```{r}
salt_insert(sample_names, insertions = c("X", "Z"))
```## Possible future work
- Modifying date strings to introduce subtle errors like invalid dates (e.g. February 30th)
- Simulating character encoding problems## Related work
**salty** should _not_ be used for anonymizing data; that's not its purpose.
However, it does draw some inspiration from [anonymizer](https://github.com/paulhendricks/anonymizer).To _create_ sample data for salting, take a look at [charlatan](https://github.com/ropensci/charlatan).
## Acknowledgements
The common OCR replacement errors are partially derived from the `sed` replacements specified in the [Royal Society Corpus project](http://fedora.clarin-d.uni-saarland.de/rsc/access.html): Knappen, Jörg, Fischer, Stefan, Kermes, Hannah, Teich, Elke, and Fankhauser, Peter. 2017. "The Making of the Royal Society Corpus." In _Proceedings of the NoDaLiDa 2017 Workshop on Processing Historical Language_. Göteborg, Sweden. Linköping University Electronic Press. .