Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

https://github.com/batpigandme/pivot-to-video

📺 OK, not *really* this is more about `tidyr::pivot_*()` functions…
https://github.com/batpigandme/pivot-to-video

Last synced: 7 days ago
JSON representation

📺 OK, not *really* this is more about `tidyr::pivot_*()` functions…

Awesome Lists containing this project

README

        

---
author: Mara Averick
date: '`r format(Sys.Date())`'
title: '`r emo::ji("tv")` Pivot to Video'
output:
html_document:
keep_md: true
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(echo = TRUE, fig.path = 'fig/', dev = 'png', dpi = 144, fig.retina = 2)
```

`r emo::ji("television")` OK, not _really_!

This is actually about `tidyr::pivot_*()` functions, which you can learn all
about in the new tidyr version 1.0.0 [Pivoting
vignette](https://tidyr.tidyverse.org/articles/pivot.html)!

But, I *will* be using some nifty TV-related data. So, I'm sticking
with the name.

```{r message=FALSE, warning=FALSE}
library(tidyverse)
```

## `r emo::ji("burger")` Bob's Burgers: A Belcher Family survey

```{r message=FALSE, warning=FALSE}
sheet <- googlesheets::gs_title("bobs_burgers_survey_results")

bobs_ws <- googlesheets::gs_ws_ls(sheet)
```

```{r}
raw_dat <- sheet %>%
googlesheets::gs_read(ws = glue::glue("{bobs_ws}"))
```

```{r prelim-results}
belcher_results <- tibble::rowid_to_column(raw_dat, "resp_id") %>%
dplyr::rename("response" = `Members of the Belcher family with whom I identify (select all that apply)`) %>%
dplyr::select(-Timestamp)

belcher_results
```

Because I used a [Google Form](https://docs.google.com/forms/d/e/1FAIpQLScYLAkzDzEOWvXrn601Vt79S6DsKLkv12NUsqbo72j1fmapWA/viewform) to collect this data, I don't have to worry about
order of names, since they come out the same every time.

```{r agg-results}
agg_results <- belcher_results %>%
dplyr::group_by(response) %>%
dplyr::summarise(total = n()) %>%
dplyr::arrange(desc(total))

agg_results
```

### `tidyr::separate_rows()`

Let's make them long with
[`tidyr::separate_rows()`](https://tidyr.tidyverse.org/dev/reference/separate_rows.html).
I'm also adding a numeric variable, `identify` (short for "character(s) with
whom I identify"), and ensuring that respondent IDs don't get erroneously
treated as numeric, by converting them to characters.

```{r results-prep}
belcher_results <- belcher_results %>%
tidyr::separate_rows(response) %>%
dplyr::mutate(identify = 1,
resp_id = as.character(resp_id))

head(belcher_results)
```

### `tidyr::pivot_wider()`

Now we'll take one of the "new" tidyr verbs for a spin, [`pivot_wider()`](https://tidyr.tidyverse.org/dev/reference/pivot_wider.html).
Rather than fill things out with a bunch of `NA`s, we'll prepare our data
to use with the `UpSetR` package by turning it into binaries, and ditch the
respondent ID in the end.

```{r}
binary_tib <- belcher_results %>%
tidyr::pivot_wider(
names_from = response,
values_from = identify,
values_fill = list(identify = 0)
) %>%
dplyr::select(-resp_id)

head(binary_tib)
```

### `UpSetR::upset()`

I *highly* recommend Paul Campbell's
[code-through](https://www.cultureofinsight.com/blog/2018/01/25/2018-01-25-visualising-twitter-follower-overlap/)
using [UpSetR](https://github.com/hms-dbmi/UpSetR), which gave me (among other things) the pro tip that `upset()` does
*not* like tibbles (hence the `as.data.frame()` at the end).

```{r}
binary_df <- as.data.frame(binary_tib)
```

```{r basic-upset}
UpSetR::upset(binary_df, nsets = 5, order.by = "freq")
```

## `r emo::ji("movie")` I have seen this movie...

Let's take a look at another dataset I collected with a quick survey, this one asking people whether they had or had not seen a given movie.

```{r sheet-chunk-2}
sheet <- googlesheets::gs_title("seen_this_movie")

movie_ws <- googlesheets::gs_ws_ls(sheet)

raw_dat <- sheet %>%
googlesheets::gs_read(ws = glue::glue("{movie_ws}"))
```

First I'll use [`tibble::rowid_to_column()`](https://tibble.tidyverse.org/reference/rownames.html) to keep track of which user said what (and, again, make sure those are stored as characters rather than numeric, so I don't accidentally analyze it the wrong way).

I still don't care about the timestamp, so I'll get everything _but_ that using `dplyr::select(-Timestamp)`.

The variable names here have a whole bunch of annoying characters (e.g. spaces, exclamation points). So, I'm going to use Sam Firke's [janitor](https://sfirke.github.io/janitor/) package — specifically the [`janitor::clean_names()`](https://sfirke.github.io/janitor/reference/clean_names.html) function — to take care of some of the grunt work for me there.

I'm also going to rename `my_age_is` to `age`, just because it's annoying.

```{r wide-movie-results}
movie_results <- tibble::rowid_to_column(raw_dat, "resp_id") %>%
dplyr::mutate(resp_id = as.character(resp_id)) %>%
dplyr::select(-Timestamp) %>%
janitor::clean_names() %>%
dplyr::rename("age" = my_age_is)

head(movie_results)
```

Since this dataset is wide, let's also take a gander at it using `glimpse()`:

```{r glimpse-wide-movies}
glimpse(movie_results)
```

### [`tidyr::pivot_longer()`](https://tidyr.tidyverse.org/reference/pivot_longer.html)

Since I only want to elongate the movies, and all of those columns are logical, I'll first select which columns I want by using `select_if()` and `is.logical()` as the predicate function.

```{r}
logicols <- select_if(movie_results, is.logical) %>%
colnames()

movie_long <- movie_results %>%
pivot_longer(
cols = one_of(logicols),
names_to = "movie",
values_to = "seen"
)

head(movie_long)
```
Aside: A nice little trick for recoding `TRUE`/`FALSE` as `0` and `1`, just use `as.numeric()`.

```{r}
movie_bin <- movie_long %>%
mutate(seen_num = as.numeric(seen))

movie_bin
```

```{r}
lil_seen <- movie_bin %>%
select(resp_id, movie, seen_num)

movie_bin %>%
group_by(movie) %>%
summarise(pct_seen = sum(seen_num) / n()) %>%
arrange(desc(pct_seen))
```

```{r}
library(corrr)

lil_seen_wide <- lil_seen %>%
pivot_wider(names_from = movie, values_from = seen_num) %>%
select(-resp_id)

correlate(lil_seen_wide) %>%
fashion()
```

```{r movies-rplot}
correlate(lil_seen_wide) %>%
rplot()
```

```{r movies-upset}
lil_seen_wide <- as.data.frame(lil_seen_wide)
UpSetR::upset(lil_seen_wide, nsets = 13, order.by = "freq")
```

Let's briefly pretend we're looking at just three movies: Bridesmaids, Anchorman, and Airplane.

```{r}
three_movies <- c("airplane", "anchorman", "bridesmaids")
```

```{r}
movie_long %>%
filter(movie %in% three_movies)
```

For the wide version, we'll add the respondent id and age to the list of variables we want to look at.
```{r}
vars_wanted <- c("resp_id", "age", three_movies)

movie_results %>%
select(vars_wanted)
```

Note the change in "shape" of our data, though the contents remain the same:

![](https://i.imgur.com/JcIlr51.png)

---