Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

https://github.com/njtierney/brolgar

BRowse Over Longitudinal Data Graphically and Analytically in R
https://github.com/njtierney/brolgar

Last synced: about 2 months ago
JSON representation

BRowse Over Longitudinal Data Graphically and Analytically in R

Awesome Lists containing this project

README

        

---
output: github_document
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
fig.align = "center",
out.width = "75%"
)
```

# brolgar

[![Lifecycle: stable](https://img.shields.io/badge/lifecycle-stable-brightgreen.svg)](https://lifecycle.r-lib.org/articles/stages.html#stable)
[![R-CMD-check](https://github.com/njtierney/brolgar/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/njtierney/brolgar/actions/workflows/R-CMD-check.yaml)
[![CRAN status](https://www.r-pkg.org/badges/version/brolgar)](https://CRAN.R-project.org/package=brolgar)
[![Codecov test coverage](https://codecov.io/gh/njtierney/brolgar/branch/main/graph/badge.svg)](https://app.codecov.io/gh/njtierney/brolgar?branch=main)

`brolgar` helps you **br**owse **o**ver **l**ongitudinal **d**ata **g**raphically and **a**nalytically in **R**, by providing tools to:

- Efficiently explore raw longitudinal data
- Calculate features (summaries) for individuals
- Evaluate diagnostics of statistical models

This helps you go from the "plate of spaghetti" plot on the left, to "interesting observations" plot on the right.

```{r show-spaghetti, echo = FALSE, warning = FALSE, message = FALSE, fig.height = 3, fig.width = 6}
library(brolgar)
library(ggplot2)
library(patchwork)
suppressPackageStartupMessages(library(gghighlight))
suppressPackageStartupMessages(library(ggplot2))
suppressPackageStartupMessages(library(dplyr))

p1 <- ggplot(wages,
aes(x = xp,
y = ln_wages,
group = id)) +
geom_line()

p2 <-
wages %>%
features(ln_wages, feat_monotonic) %>%
left_join(wages, by = "id") %>%
ggplot(aes(x = xp,
y = ln_wages,
group = id)) +
geom_line() +
gghighlight(increase)

p1 + p2

```

## Installation

Install from [GitHub](https://github.com/) with:

```r
# install.packages("remotes")
remotes::install_github("njtierney/brolgar")
```

Or from the [R Universe](https://njtierney.r-universe.dev) with:

```r
# Enable this universe
options(repos = c(
njtierney = 'https://njtierney.r-universe.dev',
CRAN = 'https://cloud.r-project.org')
)

# Install some packages
install.packages('brolgar')
```

# Using `brolgar`: We need to talk about data

There are many ways to describe longitudinal data - from panel data, cross-sectional data, and time series. We define longitudinal data as:

> individuals repeatedly measured through time.

The tools and workflows in `brolgar` are designed to work with a special tidy time series data frame called a `tsibble`. We can define our longitudinal data in terms of a time series to gain access to some really useful tools. To do so, we need to identify three components:

1. The **key** variable in your data is the **identifier** of your individual.
2. The **index** variable is the **time** component of your data.
3. The **regularity** of the time interval (index). Longitudinal data typically has irregular time periods between measurements, but can have regular measurements.

Together, time **index** and **key** uniquely identify an observation.

The term `key` is used a lot in brolgar, so it is an important idea to internalise:

> **The key is the identifier of your individuals or series**

Identifying the key, index, and regularity of the data can be a challenge. You can learn more about specifying this in the vignette, ["Longitudinal Data Structures"](https://brolgar.njtierney.com/articles/longitudinal-data-structures.html).

## The wages data

The `wages` data is an example dataset provided with brolgar. It looks like this:

```{r print-wages}
wages
```

And under the hood, it was created with the following setup:

```{r setup-wages-ts, eval = FALSE}
wages <- as_tsibble(x = wages,
key = id,
index = xp,
regular = FALSE)
```

Here `as_tsibble()` takes wages, and a `key`, and `index`, and we state the `regular = FALSE` (since there are not regular time periods between measurements). This turns the data into a `tsibble` object - a powerful data abstraction made available in the [`tsibble`](https://tsibble.tidyverts.org/) package by [Earo Wang](https://earo.me/), if you would like to learn more about `tsibble`, see the [official package documentation](https://tsibble.tidyverts.org/) or read [the paper](https://pdf.earo.me/tsibble.pdf).

# Efficiently exploring longitudinal data

Exploring longitudinal data can be challenging when there are many individuals. It is difficult to look at all of them!

You often get a "plate of spaghetti" plot, with many lines plotted on top of each other. You can avoid the spaghetti by looking at a random subset of the data using tools in `brolgar`.

## `sample_n_keys()`

In `dplyr`, you can use `sample_n()` to sample `n` observations, or `sample_frac()` to look at a `frac`tion of observations.

`brolgar` builds on this providing `sample_n_keys()` and `sample_frac_keys()`. This allows you to take a random sample of `n` keys using `sample_n_keys()`. For example:

```{r plot-sample-n-keys}
set.seed(2019-7-15-1300)
wages %>%
sample_n_keys(size = 5) %>%
ggplot(aes(x = xp,
y = ln_wages,
group = id)) +
geom_line()
```

And what if you want to create many of these plots?

## Clever facets: `facet_sample()`

`facet_sample()` allows you to specify the number of keys per facet, and the number of facets with `n_per_facet` and `n_facets`.

By default, it splits the data into 12 facets with 5 per facet:

```{r facet-sample}
set.seed(2019-07-23-1937)
ggplot(wages,
aes(x = xp,
y = ln_wages,
group = id)) +
geom_line() +
facet_sample()

```

Under the hood, `facet_sample()` is powered by `sample_n_keys()` and `stratify_keys()`.

You can see more facets (e.g., `facet_strata()`) and data visualisations you can make in brolgar in the [Visualisation Gallery](https://brolgar.njtierney.com/articles/visualisation-gallery.html).

## Finding features in longitudinal data

Sometimes you want to know what the range or a summary of a variable for each individual. We call these summaries `features` of the data, and they can be extracted using the `features` function, from [`fabletools`](https://fabletools.tidyverts.org/).

For example, if you want to answer the question "What is the summary of wages for each individual?". You can use `features()` to find the five number summary (min, max, q1, q3, and median) of `ln_wages` with `feat_five_num`:

```{r features-five-num}
wages %>%
features(ln_wages,
feat_five_num)

```

This returns the id, and then the features.

There are many features in brolgar - these features all begin with `feat_`. You can, for example, find those whose `ln_wages` values only increase or decrease with `feat_monotonic`:

```{r features-monotonic}
wages %>%
features(ln_wages, feat_monotonic)
```

You can read more about creating and using features in the [Finding Features](https://brolgar.njtierney.com/articles/finding-features.html) vignette. You can also see other features for time series in the [`feasts` package](https://feasts.tidyverts.org).

## Linking individuals back to the data

Once you have created these features, you can join them back to the data with a `left_join`, like so:

```{r features-left-join}
wages %>%
features(ln_wages, feat_monotonic) %>%
left_join(wages, by = "id") %>%
ggplot(aes(x = xp,
y = ln_wages,
group = id)) +
geom_line() +
gghighlight(increase)
```

# Other helper functions

## `n_obs()`

Return the number of observations total with `n_obs()`:

```{r example-n-obs}
n_obs(wages)
```

## `n_keys()`

And the number of keys in the data using `n_keys()`:

```{r example-n-keys}
n_keys(wages)
```

## Finding the number of observations per `key`.

You can also use `n_obs()` inside features to return the number of observations for each key:

```{r n-obs-features}
wages %>%
features(ln_wages, n_obs)
```

This returns a dataframe, with one row per key, and the number of observations for each key.

This could be further summarised to get a sense of the patterns of the number of observations:

```{r summarise-n-obs}
library(ggplot2)
wages %>%
features(ln_wages, n_obs) %>%
ggplot(aes(x = n_obs)) +
geom_bar()

wages %>%
features(ln_wages, n_obs) %>%
summary()
```

# Further Reading

`brolgar` provides other useful functions to explore your data, which you can read about in the [exploratory modelling](https://brolgar.njtierney.com/articles/exploratory-modelling.html) and [Identify Interesting Observations](https://brolgar.njtierney.com/articles/id-interesting-obs.html) vignettes. As a taster, here are some of the figures you can produce:

```{r show-wages-lg, echo = FALSE, fig.height = 3, fig.width = 6}
suppressMessages(library(dplyr))
suppressMessages(library(gghighlight))
wages_slope <- key_slope(wages,ln_wages ~ xp) %>%
left_join(wages, by = "id")

p3 <- wages_slope %>%
as_tibble() %>% # workaround for gghighlight + tsibble
ggplot(aes(x = xp,
y = ln_wages,
group = id)) +
geom_line() +
gghighlight(.slope_xp < 0)

p4 <- wages_slope %>%
keys_near(key = id,
var = .slope_xp) %>%
left_join(wages, by = "id") %>%
ggplot(aes(x = xp,
y = ln_wages,
group = id,
colour = stat)) +
geom_line()

gridExtra::grid.arrange(p3, p4, ncol = 2)
```

# Related work

One of the sources of inspiration for this work was the [`lasangar` R package by Bryan Swihart](https://github.com/swihart/lasagnar) (and [paper](https://doi.org/10.1097%2FEDE.0b013e3181e5b06a)).

For even more expansive time series summarisation, make sure you check out the [`feasts` package](https://github.com/tidyverts/feasts) (and [talk!](https://slides.mitchelloharawild.com/user2019/#1)).

# Contributing

Please note that the `brolgar` project is released with a [Contributor Code of Conduct](https://github.com/njtierney/brolgar/blob/main/.github/CODE_OF_CONDUCT.md). By contributing to this project, you agree to abide by its terms.

# A Note on the API

This version of brolgar was been forked from [tprvan/brolgar](https://github.com/tprvan/brolgar), and has undergone breaking changes to the API.

# Acknowledgements

Thank you to [Mitchell O'Hara-Wild](https://mitchelloharawild.com/blog.html) and [Earo Wang](https://earo.me/) for many useful discussions on the implementation of brolgar, as it was heavily inspired by the [`feasts`](https://github.com/tidyverts/feasts) package from the [`tidyverts`](https://tidyverts.org/). I would also like to thank [Tania Prvan](https://researchers.mq.edu.au/en/persons/tania-prvan) for her valuable early contributions to the project, as well as [Stuart Lee](https://stuartlee.org/) for helpful discussions. Thanks also to [Ursula Laa](https://uschilaa.github.io/) for her feedback on the package structure and documentation.