https://github.com/epicentre-msf/dbc

Dictionary-based cleaning
https://github.com/epicentre-msf/dbc

Last synced: 6 months ago
JSON representation

Dictionary-based cleaning

Host: GitHub
URL: https://github.com/epicentre-msf/dbc
Owner: epicentre-msf
License: other
Created: 2021-09-24T03:22:20.000Z (over 3 years ago)
Default Branch: main
Last Pushed: 2023-10-19T16:41:46.000Z (over 1 year ago)
Last Synced: 2024-08-13T07:12:03.195Z (10 months ago)
Language: R
Size: 108 KB
Stars: 2
Watchers: 5
Forks: 0
Open Issues: 2
Metadata Files:
- Readme: README.Rmd
- License: LICENSE

Awesome Lists containing this project

jimsghstars - epicentre-msf/dbc - Dictionary-based cleaning (R)

README

        ---

output: github_document

---

```{r, echo = FALSE}

knitr::opts_chunk$set(

  collapse = TRUE,

  comment = "#>",

  fig.path = "man/figures/"

)

options(digits = 4, width = 120)

```

# dbc: Dictionary-based cleaning

[![Lifecycle: experimental](https://img.shields.io/badge/lifecycle-experimental-orange.svg)](https://www.tidyverse.org/lifecycle/#experimental)

[![R-CMD-check](https://github.com/epicentre-msf/dbc/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/epicentre-msf/dbc/actions/workflows/R-CMD-check.yaml)

Tools for creating and applying dictionaries of value-replacement pairs, to

clean non-valid values of numeric, categorical, or date-type variables within a

dataset.

### Installation

Install from GitHub with:

```{r, eval=FALSE}

# install.packages("remotes")

remotes::install_github("epicentre-msf/dbc")

```

### Example usage

#### Example dataset and dictionary

```{r}

library(dbc)

data(ll1)          # example messy dataset

data(dict_categ1)  # example dictionary of categorical vars and allowed values

ll1

```

#### Check and clean numeric variables

##### 1. Produce an initial dictionary of non-valid numeric values

```{r}

dict_clean_numeric <- check_numeric(

  ll1,

  vars = c("age", "contacts"), # cols that should be numeric

  fn = as.integer              # values not coercible by `fn` are non-valid

)

dict_clean_numeric

```

##### 2. Manually review non-valid values and give appropriate replacements, or use keyword ".na" to indicate that the value has been reviewed and cannot be corrected.

Normally one would do this step in a spreadsheet but we'll do it in R here for simplicity.

```{r}

dict_clean_numeric$replacement <- c(".na", "39", "10", ".na")

```

##### 3. Use the cleaning dictionary to clean numeric variables

```{r}

clean_numeric(

  ll1,

  vars = c("age", "contacts"),

  dict_clean = dict_clean_numeric,

  fn = as.integer

)

```

##### 4. If the original dataset is updated, we can repeat the cleaning steps while retaining the corrections already made in the previous cleaning dictionary.

*Check for new non-valid numeric values, after incorporating previous cleaning*

```{r}

dict_clean_numeric_update <- check_numeric(

  ll2,                             # same as ll1 but with 3 additional entries

  vars = c("age", "contacts"),

  dict_clean = dict_clean_numeric, # incorporate previous cleaning before checking

  fn = as.integer,

  return_all = TRUE                # return original cleaning dict + new entries

)

dict_clean_numeric_update

```

*Manually specify replacement for new non-valid entry*

```{r}

dict_clean_numeric_update$replacement[5] <- "6"

```

*Apply updated cleaning dictionary to updated dataset*

```{r}

clean_numeric(

  ll2,

  vars = c("age", "contacts"),

  dict_clean = dict_clean_numeric_update,

  fn = as.integer

)

```

#### Check and clean categorical variables

##### 1. Produce an initial dictionary of non-valid categorical values

```{r}

dict_clean_categ <- check_categorical(

  ll1,

  dict_allowed = dict_categ1 # dictionary of categorical vars and their allowed values

)

dict_clean_categ

```

##### 2. Manually review non-valid values and give appropriate replacements, or use keyword ".na" to indicate that the value has been reviewed and cannot be corrected.

Again, we would normally do this step in a spreadsheet but we do it in R here for simplicity.

```{r}

dict_clean_categ$replacement <- c(

  "Years",

  "Years",

  "Cured",

  ".na",

  "M",

  ".na",

  "Suspected"

)

```

##### 3. Use the cleaning dictionary to clean categorical variables

```{r}

clean_categorical(

  ll1,

  dict_allowed = dict_categ1,

  dict_clean = dict_clean_categ

)

```

##### 4. If the original dataset is updated, we can repeat the cleaning steps while retaining the corrections already made in the previous cleaning dictionary, as in the example with numeric variables above.

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/epicentre-msf/dbc

Awesome Lists containing this project

README