Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

https://github.com/evamaerey/ind2cat

Concisely recode indicator variables to categories
https://github.com/evamaerey/ind2cat

Last synced: 11 days ago
JSON representation

Concisely recode indicator variables to categories

Awesome Lists containing this project

README

        

---
output: github_document
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
message = F,
warning = F,
fig.path = "man/figures/README-",
out.width = "100%"
)

kabel_format <- "html"

```

## Ever use an indicator variable *directly* and made a sad plot?

```{r sad, echo = F}
library(tidyverse)
theme_set(theme_grey(base_size = 20))
p_sad <- tidytitanic::passengers %>%
ggplot() +
aes(x = survived) +
geom_bar()

ggjudge::judge_plot(p_sad, judgement = "sad")

```

## Or a bad plot?

```{r bad, echo = F}
p_bad <- tidytitanic::passengers %>%
ggplot() +
aes(x = age) +
geom_histogram() +
facet_grid(~ survived)

ggjudge::judge_plot(p_bad, judgement = "bad")

```

## ... because the repetitiveness of recoding felt too tedious?

```{r sqrecode, echo = F, eval = F}
tidytitanic::passengers %>%
ggplot() +
aes(x = age) +
geom_histogram() +
facet_grid(~ ifelse(survived,
"survived",
"not survived"))
```

```{r tedious, echo = F}
ggjudge::judge_chunk_code("sqrecode", "tedious")
```

## Introducing ind2cat::ind_recode()!

ind2cat::ind_recode is a concise, sensible function for making human-readable data summary products with indicator variables!

```{r introduce_ind2cat, echo = F, eval = F}
library(ind2cat)
tidytitanic::passengers %>%
ggplot() +
aes(x = age) +
geom_histogram() +
facet_grid(~ ind_recode(survived))
```

```{r concise, echo = F}
ggjudge::judge_chunk_code("introduce_ind2cat", "concise code")
```

```{r nice, echo = F}
ggjudge::judge_chunk_output_plot("introduce_ind2cat", judgement = "nice plot")
```

# Issues up front

## disclosures

0) ind2cat is experimental
1) I'm not sure if there there is already a solution

## to do to move ind2cat to more robust

- change to Rlang for grabbing function name (Claus Wilke)
- left join instead of ifelse to make code more performant (Emily Rederer)
- make "Y" "N" a lot stricter - right now we're assuming a ton! Danger.
- test!

# Thanks up front

- Emily Rederer
- Kyle McDermott
- Claus Wilke
- Yihui Xie

# Abstract

Indicator variables are easy to create, store, and interpret [@10.1177/1536867X19830921]. They concisely encode information about the presence of a condition for observational units. The variable name encapsulates the information about the condition of interest, and the variable's values (TRUE and FALSE, 1 or 0, "Yes" or "No") indicate if the condition is met for the observational unit. When using indicator variables to use in summary products, analysts often make a choice between using an indicator variable as-is or crafting categorical variables where values can be directly interpreted. Using the indicator variable as-is may be motivated by time savings, but yields poor results in summary products. {{ind2cat}} can help analysts concisely translate indicator variables to categorical variables for reporting products, yielding more polished outputs. By default, ind2cat creates the categorical variable from the indicator variable name, resulting in a light-weight syntax.

# Introduction

Using current analytic tools, analysts make a choice between directly using indicator variables or recoding the variable first to categorical. Current procedures for recoding indicator variables to a categorical variable is repetitive, but forgoing a recode and using indicator variables directly yields hard-to-interpret summary products.

The data below inspired by email training data, demonstrates how an analyst might current recode an indicator variable. This method is repetitive; in the recoding line, 'spam' is typed four times.

```{r manipulation_status_quo}
library(tidyverse)
data.frame(ind_spam = c(TRUE, TRUE, FALSE, FALSE, TRUE)) %>%
mutate(cat_spam = ifelse(ind_spam, "spam", "not spam"))
```

Likewise, in data visualization products where recoding can be done on the fly, we see that the process can be repetative.

```{r visual_status_quo}
tidytitanic::passengers %>%
ggplot() +
aes(x = age) +
geom_histogram() +
facet_grid(~ ifelse(survived,
"survived",
"not survived"))
```

Furthermore, the `ifelse()` approach to recoding indicator variables also has the disadvantage of not consistently ordering the resultant categories; ordering in products will be alphabetical and not reflect the F/T order of the source variable. An additional step to reflect the source variable, using a function like forcats::fct_rev, may be required for consistent reporting. We show this with another visualization example, and see that the specification of the x axis variable becomes more difficult to reason about.

```{r visual_status_quo_order}
data.frame(ind_grad = c(T, F, T, T)) %>%
ggplot() +
aes(x = fct_rev(ifelse(ind_grad, "grad", "not grad"))) +
geom_bar()

```

Given how verbose recoding an indicator variable can be, analysts may choose to forego a recoding the variable, especially in exploratory analysis. However, when indicator variables are used directly in data summary products like tables and visuals, information is often awkwardly displayed and is sometimes lost. Below, the table that is created by using the indicator variable directly is awkward to interpret. The indicator variable name persists in the output allowing savvy readers to interpret the output, but communication is strained.

```{r direct_table_awkward}
tidytitanic::passengers %>%
count(survived)
```

In the following two-way table produced using an indicator variable directly with the popular janitor package, information is completely lost:

```{r direct_table_loss}
tidytitanic::passengers %>%
janitor::tabyl(sex, survived)
```

Likewise, in the following visual summary of the data, where an indicator variable is directly used, interpretation is awkward.

```{r direct_visual_awkward, fig.cap="A. Bar labels + axis label preserves information but is awkward"}
library(tidyverse)

tidytitanic::passengers %>%
ggplot() +
aes(x = survived) +
geom_bar()
```

Moreover, when indicator variables are used directly as faceting variable for plots produced by the popular ggplot2 library, information is lost and the plot is not directly interpretable.

```{r direct_visual_loss, fig.cap = "D. Facetting directly on an indicator variable with popular ggplot2 results in information loss"}
tidytitanic::passengers %>%
ggplot() +
aes(x = age) +
geom_histogram() +
facet_grid(~ survived)
```

# Introducing ind2cat::ind_recode

The ind2cat::ind_recode() function uses indicator variable names to automatically derive human-readable, and appropriately ordered categories.

To clearly compare the new method, we reiterate the status quo with a toy example:

```{r manipulation_status_quo_reprise, eval = T, message=F, warning=F}
library(tidyverse)

data.frame(ind_graduated =
c(TRUE, TRUE, FALSE)) %>%
mutate(cat_graduated =
ifelse(ind_graduated,
"graduated",
"not graduated")) %>%
mutate(cat_graduated =
fct_rev(cat_graduated)
)
```

Below we contrast this with the use of ind2cat's ind_recode function which avoids repetition by creating categories based on the indicator variable name. Using the the function ind_recode(), we can accomplish the same task shown above more succinctly:

```{r manipulation_ind2cat, eval = T, message=F, warning=F}
library(ind2cat)

data.frame(ind_graduated =
c(TRUE, TRUE, FALSE)) %>%
mutate(cat_graduated =
ind_recode(ind_graduated)
)
```

The function ind_recode is flexible, and can recode from variable populated with TRUE/FALSE values as well as 1/0 or "Yes"/"No" (and variants 'y/n' for example).

Furthermore, while ind_recode default functionality allows analysts to move from its first-cut human-readable recode, it also allows fully customized categories via adjustment of the functions parameters.

If the category associated with 'TRUE' should be modified (default is based on the variable name), the `cat_true` may be used as follows. Note that the false category is generated from the TRUE category by default.

```{r manipulation_ind2cat_custom}
data.frame(ind_graduated = c(T,T,F)) %>%
mutate(cat_graduated = ind_recode(ind_graduated,
cat_false = "current"))
```

Also, the default negator 'not' can be changed by setting the `negator` argument.

```{r manipulation_ind2cat_negator}
tibble(ind_grad = c(T,T,F)) %>%
mutate(cat_grad = ind_recode(ind_grad, negator = "~"))
```

If the negative category should be independently specified, the `cat_false` argument can be set:

```{r manipulation_ind2cat_false_cat}
tibble(ind_grad = c("Y", "N")) %>%
mutate(cat_grad = ind_recode(ind_grad, cat_false = "enrolled"))
```

Also, if the derived category's levels should be reversed, i.e. [1,0] instead of the default [0,1], rev can be set to TRUE.

```{r manipulation_ind2cat_rev}
tibble(ind_grad = c("yes", "no")) %>%
mutate(cat_grad = ind_recode(ind_grad, rev = TRUE)) %>%
mutate(cat_grad_num = as.numeric(cat_grad))
```

Finally, several indicator variable prefixes are automatically removed with the default setting, includeing `ind_` and `IND_`. This behavior can be modified using the `var_prefix` argument.

```{r manipulation_ind2cat_prefix}
tibble(dummy_grad = c(0, 0, 1, 1, 1 ,0 ,0)) %>%
mutate(cat_grad = ind_recode(dummy_grad,
var_prefix = "dummy_"))
```

## Use in data products like figures and tables

In the summary figure, we show the values that result from using ind_recode on the fly in ggplot2. In a true-to-life analytic reporting space, the analyst could then use `labs(x = NULL)` to remove the variable recoding specification.

```{r visual_ind2cat_customization_in_visualizations, fig.width=12, fig.height=15}
data.frame(ind_spam = c(TRUE, TRUE, FALSE, FALSE, FALSE)) %>%
ggplot() +
aes(x = ind_recode(ind_spam)) +
geom_bar() +
theme_gray(base_size = 15)->
p1

p1 +
aes(x = ind_recode(ind_spam, cat_true = "suspicious")) ->
p2

p1 +
aes(x = ind_recode(ind_spam, negator = "~")) ->
p3

p1 +
aes(x = ind_recode(ind_spam, cat_false = "trustworthy")) ->
p4

p1 +
aes(x = ind_recode(ind_spam, rev = TRUE)) ->
p5

library(patchwork)

(p1 + p2) /
(p3 + p4) /
(p5 + patchwork::plot_spacer())
```

```{r table_ind2cat_preserves}
tidytitanic::passengers %>%
mutate(cat_survived = ind_recode(survived,
cat_false = "perished")) %>%
janitor::tabyl(sex, cat_survived) %>%
janitor::adorn_percentages() %>%
janitor::adorn_pct_formatting() %>%
janitor::adorn_ns(position = "rear")
```

# Conclusion

# Implementation details

```{r read_in_function}
readLines("R/ind_recode.R") -> implementation
```

```{r display_function, code = implementation, eval= F}
```

---

# README.Rmd chunks names

```{r get_chunk_names}
knitr::knit_code$get() |> names()
```

# Github Social

```{r githubsocial, fig.width=30, fig.height=15}
library(patchwork)
(ggjudge::judge_plot(p_sad, judgement = "using indicators directly can be sad") +
ggjudge::judge_plot(p_bad, judgement = "and is some of the time it's just bad") +
ggjudge::judge_chunk_code("sqrecode", "but current recodes are tedious"))/

(ggjudge::judge_chunk_code("introduce_ind2cat", "so introducing ind2cat::ind_recode()") +
ggjudge::judge_chunk_output_plot("introduce_ind2cat", judgement = "code's concise and plot's also nice")+ patchwork::plot_spacer())
```