https://github.com/epicentre-msf/dbc
Dictionary-based cleaning
- Host: GitHub
- URL: https://github.com/epicentre-msf/dbc
- Owner: epicentre-msf
- License: other
- Created: 2021-09-24T03:22:20.000Z (about 3 years ago)
- Default Branch: main
- Last Pushed: 2023-10-19T16:41:46.000Z (about 1 year ago)
- Last Synced: 2024-08-13T07:12:03.195Z (4 months ago)
- Language: R
- Size: 108 KB
- Stars: 2
- Watchers: 5
- Forks: 0
- Open Issues: 2
Metadata Files:
- Readme: README.Rmd
- License: LICENSE
README
---
output: github_document
---

```{r, echo = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/"
)
options(digits = 4, width = 120)
```

# dbc: Dictionary-based cleaning
[![Lifecycle: experimental](https://img.shields.io/badge/lifecycle-experimental-orange.svg)](https://www.tidyverse.org/lifecycle/#experimental)
[![R-CMD-check](https://github.com/epicentre-msf/dbc/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/epicentre-msf/dbc/actions/workflows/R-CMD-check.yaml)

Tools for creating and applying dictionaries of value-replacement pairs, to
clean non-valid values of numeric, categorical, or date-type variables within a
dataset.

### Installation
Install from GitHub with:
```{r, eval=FALSE}
# install.packages("remotes")
remotes::install_github("epicentre-msf/dbc")
```

### Example usage
#### Example dataset and dictionary
```{r}
library(dbc)
data(ll1) # example messy dataset
data(dict_categ1) # example dictionary of categorical vars and allowed values

ll1
```

#### Check and clean numeric variables
##### 1. Produce an initial dictionary of non-valid numeric values
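To make "non-valid" concrete, the sketch below uses only base R (not dbc itself) to flag entries that a coercion function cannot convert. The vector `age_raw` and the helper `find_non_coercible()` are hypothetical names made up for this illustration, and the package's own logic may differ in detail.

```r
# Hypothetical raw column, as it might arrive from data entry
age_raw <- c("23", "4O", "17", NA, "unknown", "8")

# Illustrative helper (not part of dbc): return the distinct non-missing
# values that `fn` turns into NA, i.e. values it cannot coerce
find_non_coercible <- function(x, fn = as.integer) {
  coerced <- suppressWarnings(fn(x))
  unique(x[!is.na(x) & is.na(coerced)])
}

non_valid <- find_non_coercible(age_raw)
print(non_valid)
```

The package call on the example dataset follows.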
```{r}
dict_clean_numeric <- check_numeric(
  ll1,
  vars = c("age", "contacts"), # cols that should be numeric
  fn = as.integer # values not coercible by `fn` are non-valid
)

dict_clean_numeric
```

##### 2. Manually review non-valid values and give appropriate replacements, or use keyword ".na" to indicate that the value has been reviewed and cannot be corrected.
Normally one would do this step in a spreadsheet but we'll do it in R here for simplicity.
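For reference, the spreadsheet route could be sketched as a CSV round trip in base R. The dictionary below is a hypothetical stand-in with assumed column names (`variable`, `value`, `replacement`), and the manual editing step is simulated in code, so treat this as an outline rather than the package's prescribed workflow.

```r
# Hypothetical cleaning dictionary; column names are assumed for illustration
dict <- data.frame(
  variable    = c("age", "age", "contacts"),
  value       = c("4O", "unknown", "1O"),
  replacement = NA_character_,
  stringsAsFactors = FALSE
)

# 1. Export for manual review in a spreadsheet program
path <- tempfile(fileext = ".csv")
write.csv(dict, path, row.names = FALSE, na = "")

# 2. The reviewer fills in the `replacement` column and saves the file;
#    simulated here by editing a copy and re-writing the same file
reviewed <- dict
reviewed$replacement <- c("40", ".na", "10")
write.csv(reviewed, path, row.names = FALSE, na = "")

# 3. Re-import, reading every column as character so that replacements
#    such as "40" are not converted to numbers on the way back in
dict_reviewed <- read.csv(
  path,
  colClasses = "character",
  na.strings = "",
  stringsAsFactors = FALSE
)
print(dict_reviewed)
```

Reading all columns as character keeps the ".na" keyword and numeric-looking replacements in the same column type. The in-R shortcut used in this README follows.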
```{r}
dict_clean_numeric$replacement <- c(".na", "39", "10", ".na")
```

##### 3. Use the cleaning dictionary to clean numeric variables
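Conceptually, applying a cleaning dictionary means swapping each reviewed value for its replacement, treating the ".na" keyword as missing, and then coercing the result. The base-R sketch below illustrates that idea on a single vector. The helper `apply_dictionary()`, the data, and the column names are all hypothetical and are not taken from the package.

```r
# Hypothetical raw column and reviewed dictionary (names are illustrative)
age_raw <- c("23", "4O", "17", "unknown", "8")
dict <- data.frame(
  value       = c("4O", "unknown"),
  replacement = c("40", ".na"),
  stringsAsFactors = FALSE
)

# Illustrative helper (not part of dbc): swap in replacements, treat the
# ".na" keyword as missing, then coerce with `fn`
apply_dictionary <- function(x, dict, fn = as.integer) {
  idx <- match(x, dict$value)           # position of each value in the dictionary
  hit <- !is.na(idx)
  x[hit] <- dict$replacement[idx[hit]]  # replace only values found in the dictionary
  x[x %in% ".na"] <- NA                 # reviewed but uncorrectable -> missing
  fn(x)
}

age_clean <- apply_dictionary(age_raw, dict)
print(age_clean)
```

The package call on the example dataset follows.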
```{r}
clean_numeric(
  ll1,
  vars = c("age", "contacts"),
  dict_clean = dict_clean_numeric,
  fn = as.integer
)
```

##### 4. If the original dataset is updated, we can repeat the cleaning steps while retaining the corrections already made in the previous cleaning dictionary.
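The idea of retaining earlier corrections can be sketched in base R as follows: keep the reviewed dictionary as-is, and append only those variable/value pairs it does not already cover. The data frames and column names here are hypothetical, and this is an illustration of the concept rather than the package's implementation.

```r
# Previously reviewed dictionary (hypothetical columns and values)
dict_old <- data.frame(
  variable    = c("age", "age"),
  value       = c("4O", "unknown"),
  replacement = c("40", ".na"),
  stringsAsFactors = FALSE
)

# Non-valid values found in the updated dataset (hypothetical)
found_new <- data.frame(
  variable = c("age", "age", "contacts"),
  value    = c("4O", "six", "1O"),
  stringsAsFactors = FALSE
)

# Keep only variable/value pairs that the old dictionary does not cover
key_old <- paste(dict_old$variable, dict_old$value, sep = "|")
key_new <- paste(found_new$variable, found_new$value, sep = "|")
to_review <- found_new[!key_new %in% key_old, , drop = FALSE]
to_review$replacement <- NA_character_

# Old corrections first, new entries awaiting review appended at the end
dict_updated <- rbind(dict_old, to_review)
rownames(dict_updated) <- NULL
print(dict_updated)
```

Only the rows with a missing `replacement` then need manual review.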
*Check for new non-valid numeric values, after incorporating previous cleaning*
```{r}
dict_clean_numeric_update <- check_numeric(
  ll2, # same as ll1 but with 3 additional entries
  vars = c("age", "contacts"),
  dict_clean = dict_clean_numeric, # incorporate previous cleaning before checking
  fn = as.integer,
  return_all = TRUE # return original cleaning dict + new entries
)

dict_clean_numeric_update
```

*Manually specify replacement for new non-valid entry*
```{r}
dict_clean_numeric_update$replacement[5] <- "6"
```

*Apply updated cleaning dictionary to updated dataset*
```{r}
clean_numeric(
  ll2,
  vars = c("age", "contacts"),
  dict_clean = dict_clean_numeric_update,
  fn = as.integer
)
```

#### Check and clean categorical variables
##### 1. Produce an initial dictionary of non-valid categorical values
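As a rough picture of what a categorical check involves, the base-R sketch below compares each variable's observed values against its allowed values using exact matching. The data, the column names of `dict_allowed`, and the matching rule are assumptions for illustration only, and the package may apply different matching rules.

```r
# Hypothetical dictionary of allowed values (column names are assumed)
dict_allowed <- data.frame(
  variable = c("sex", "sex", "status", "status"),
  value    = c("M", "F", "Suspected", "Confirmed"),
  stringsAsFactors = FALSE
)

# Hypothetical messy data
dat <- data.frame(
  sex    = c("M", "f", "F", NA),
  status = c("Suspected", "confirmed", "Confirmed", "Suspected"),
  stringsAsFactors = FALSE
)

# For each variable in the dictionary, collect the distinct non-missing
# values that are not among its allowed values
non_valid <- do.call(rbind, lapply(unique(dict_allowed$variable), function(v) {
  allowed <- dict_allowed$value[dict_allowed$variable == v]
  bad <- setdiff(dat[[v]][!is.na(dat[[v]])], allowed)
  data.frame(variable = rep(v, length(bad)), value = bad, stringsAsFactors = FALSE)
}))
print(non_valid)
```

The package call on the example dataset follows.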
```{r}
dict_clean_categ <- check_categorical(
  ll1,
  dict_allowed = dict_categ1 # dictionary of categorical vars and their allowed values
)

dict_clean_categ
```

##### 2. Manually review non-valid values and give appropriate replacements, or use keyword ".na" to indicate that the value has been reviewed and cannot be corrected.
Again, we would normally do this step in a spreadsheet but we do it in R here for simplicity.
```{r}
dict_clean_categ$replacement <- c(
  "Years",
  "Years",
  "Cured",
  ".na",
  "M",
  ".na",
  "Suspected"
)
```

##### 3. Use the cleaning dictionary to clean categorical variables
```{r}
clean_categorical(
  ll1,
  dict_allowed = dict_categ1,
  dict_clean = dict_clean_categ
)
```

##### 4. If the original dataset is updated, we can repeat the cleaning steps while retaining the corrections already made in the previous cleaning dictionary, as in the example with numeric variables above.