R package for handling linelist data
https://github.com/epiverse-trace/linelist


---
output: github_document
---

```{r readmesetup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)
```

# **linelist**: Tagging and Validating Epidemiological Data

[![Digital Public Good](https://raw.githubusercontent.com/epiverse-trace/linelist/main/man/figures/dpg_badge.png)](https://digitalpublicgoods.net/registry/linelist.html)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/license/mit)
[![cran-check](https://badges.cranchecks.info/summary/linelist.svg)](https://cran.r-project.org/web/checks/check_results_linelist.html)
[![R-CMD-check](https://github.com/epiverse-trace/linelist/workflows/R-CMD-check/badge.svg)](https://github.com/epiverse-trace/linelist/actions)
[![codecov](https://codecov.io/gh/epiverse-trace/linelist/branch/main/graph/badge.svg?token=JGTCEY0W02)](https://app.codecov.io/gh/epiverse-trace/linelist)
[![lifecycle-maturing](https://raw.githubusercontent.com/reconverse/reconverse.github.io/master/images/badge-maturing.svg)](https://www.reconverse.org/lifecycle.html#maturing)
[![month-download](https://cranlogs.r-pkg.org/badges/linelist)](https://cran.r-project.org/package=linelist)
[![total-download](https://cranlogs.r-pkg.org/badges/grand-total/linelist)](https://cran.r-project.org/package=linelist)
[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.6532786.svg)](https://doi.org/10.5281/zenodo.6532786)

*linelist* provides a safe entry point to the *Epiverse* software ecosystem,
adding a foundational layer through *tagging*, *validation*, and *safeguarding*
epidemiological data, to help make data pipelines more straightforward and
robust.

## Installation

### Stable version

Our stable versions are released on CRAN, and can be installed using:

```{r, eval=FALSE}
install.packages("linelist")
```

::: {.pkgdown-devel}

### Development version

The development version of linelist can be installed from
[GitHub](https://github.com/) with:

```{r, eval=FALSE}
if (!require(pak)) {
  install.packages("pak")
}
pak::pak("epiverse-trace/linelist")
```

:::

## Usage

```{r}
#| fig.alt: "Graphical summary of the linelist R package, with emphasis on these 4 key features: 1. Tag key epi variables, 2. Validate tagged data, 3. Safeguards vs accidental loss / alteration, 4. Robust data for stronger pipelines"
#| out.width: "60%"
knitr::include_graphics("man/figures/linelist_infographics.png")
```

linelist works by tagging key epidemiological data in a `data.frame` or a
`tibble` to facilitate and strengthen data pipelines. The resulting object is a
`linelist` object, which extends `data.frame` (or `tibble`) by providing three
types of features:

1. a **tagging system** to identify key data, enabling access to these data using
their tags rather than actual names, which may change over time and across
datasets

2. **validation** of the tagged variables (making sure they are present and of the
right type/class)

3. **safeguards** against accidental losses of tagged variables in common data
handling operations

The short example below illustrates these different features. See the
[Documentation](#documentation) section for more in-depth examples and details
about `linelist` objects.

```{r}
# load packages and a dataset for the example
# -------------------------------------------
library(linelist)
library(dplyr)

dataset <- outbreaks::mers_korea_2015$linelist
head(dataset)

# check known tagged variables
# ----------------------------
tags_names()

# build a linelist
# ----------------
x <- dataset %>%
  tibble() %>%
  make_linelist(
    date_onset = "dt_onset", # date of onset
    date_reporting = "dt_report", # date of reporting
    occupation = "age" # mistake
  )
x
tags(x) # check available tags
```

`validate_linelist()` will error if one of your tagged columns doesn't have the
correct type:

```{r, error = TRUE}
# validation of tagged variables
# ------------------------------
## (this flags a likely mistake: occupation should not be an integer)
validate_linelist(x)
```
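
In a scripted pipeline you may prefer to catch the validation error rather than let it stop the session. The block below is a minimal sketch, not taken from the package documentation: the `cases` data frame and its column names are made up for illustration, and it relies only on `make_linelist()`, `set_tags()` and `validate_linelist()` as used in this README. It is written as a plain `r` block, so it is not evaluated when the README is knitted.

```r
library(linelist)

# Toy data (hypothetical column names, not from the outbreaks package)
cases <- data.frame(
  case_id = 1:3,
  onset = as.Date(c("2015-05-11", "2015-05-18", "2015-05-20")),
  age = c(68L, 63L, 76L)
)

# Deliberately tag an integer column as `occupation`
bad <- make_linelist(cases, date_onset = "onset", occupation = "age")

# Catch the validation error and carry on with a NULL result
result <- tryCatch(
  validate_linelist(bad),
  error = function(e) {
    message("Validation failed: ", conditionMessage(e))
    NULL
  }
)

# Fix the tags: drop `occupation` and tag the column as `age` instead
good <- set_tags(bad, occupation = NULL, age = "age")
checked <- validate_linelist(good)
```

Here `result` is `NULL` because validation failed, while `checked` holds the validated `linelist` object once the tags are corrected.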

```{r}
# change tags: fix mistakes, add new ones
# ---------------------------------------
x <- x %>%
  set_tags(
    occupation = NULL, # tag removal
    gender = "sex", # new tag
    outcome = "outcome"
  )

# safeguards against actions losing tags
# --------------------------------------
## attempting to remove geographical info but removing dates by mistake
x_no_geo <- x %>%
  select(-(5:8))
```

For stronger pipelines, you can even trigger errors upon loss:

```{r error = TRUE}
lost_tags_action("error")

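## errors: columns 5:8 include a tagged date column
x_no_geo <- x %>%
  select(-(5:8))

## works: columns 5:7 contain no tagged variable
x_no_geo <- x %>%
  select(-(5:7))

## to revert to default behaviour (warning upon loss)
lost_tags_action()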
```
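
Because `lost_tags_action()` changes a session-wide setting, a script that switches to strict mode may want to restore the previous setting afterwards. The following is a rough sketch under stated assumptions: the `cases` data are invented, and it assumes that `get_lost_tags_action()` returns the current setting and that `lost_tags_action()` accepts `quiet = TRUE` to silence its message.

```r
library(linelist)

# Toy data (hypothetical column names)
cases <- data.frame(
  case_id = 1:3,
  onset = as.Date(c("2015-05-11", "2015-05-18", "2015-05-20")),
  age = c(68L, 63L, 76L)
)
x <- make_linelist(cases, date_onset = "onset")

# Remember the current setting, then switch to strict mode
previous <- get_lost_tags_action()
lost_tags_action("error", quiet = TRUE)

# Dropping the tagged `onset` column now raises an error, caught here
status <- tryCatch(
  {
    x[, c("case_id", "age")]
    "no tag lost"
  },
  error = function(e) "tag loss detected"
)

# Restore whatever setting was active before
lost_tags_action(previous, quiet = TRUE)
```

The same pattern works inside a function, where `on.exit()` can take care of restoring the setting.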

Alternatively, content can be accessed by tags:

```{r}
x_no_geo %>%
  select(has_tag(c("date_onset", "outcome")))

x_no_geo %>%
  tags_df()
```
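
Access by tag is most useful when the same variable is named differently across datasets. As an illustrative sketch (the two `site_*` data frames and their column names are invented for this example), `tags_df()` can be used to bring both datasets to a common set of column names before combining them:

```r
library(linelist)

# Two hypothetical datasets storing the same information under different names
site_a <- data.frame(
  onset = as.Date(c("2015-05-11", "2015-05-18")),
  sex = c("M", "F")
)
site_b <- data.frame(
  symptom_date = as.Date(c("2015-06-01", "2015-06-02")),
  gender_code = c("F", "F")
)

# Tag the equivalent columns in each dataset
a <- make_linelist(site_a, date_onset = "onset", gender = "sex")
b <- make_linelist(site_b, date_onset = "symptom_date", gender = "gender_code")

# tags_df() names columns after their tags, so the two can be stacked
combined <- rbind(tags_df(a), tags_df(b))
names(combined)
```

Downstream code can then refer to `date_onset` and `gender` regardless of how each source named its columns.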

linelist can also be connected to the incidence2 package for pipelines focused
on aggregated count data:

```{r, fig.width=8, fig.height=6, fig.alt="Epicurves (daily incidence) by sex and outcome via the incidence2 R package."}
library(incidence2)

x_no_geo %>%
  tags_df() %>%
  incidence("date_onset", groups = c("gender", "outcome")) %>%
  plot(
    fill = "outcome",
    angle = 45,
    nrow = 2,
    border_colour = "white",
    legend = "bottom"
  )
```

## Documentation

More detailed documentation can be found at:
https://epiverse-trace.github.io/linelist/

In particular:

* [A general introduction to linelist](https://epiverse-trace.github.io/linelist/articles/linelist.html)

* [The reference manual](https://epiverse-trace.github.io/linelist/reference/index.html)

## Getting help

To ask questions or give us some feedback, please use the GitHub
[issues](https://github.com/epiverse-trace/linelist/issues) system.

## Data privacy

Case line lists may contain personally identifiable information (PII). While
linelist provides a way to store this data in R, it does not currently provide
tools for data anonymization. The user is responsible for respecting individual
privacy and ensuring PII is handled with the required level of confidentiality,
in compliance with applicable laws and regulations for storing and sharing PII.

Note that PII is rarely needed for common analytics tasks, so it is often
advisable to remove PII from the data before sharing them with analytics
teams.
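
One possible way to do this, assuming none of the tagged variables is itself PII, is to share only the tagged columns. The sketch below uses invented data and column names; it is not an anonymization tool, and the remaining columns (for example exact dates) may still need review before sharing.

```r
library(linelist)

# Toy data with made-up PII columns (`full_name`, `phone`)
cases <- data.frame(
  full_name = c("A. Example", "B. Example"),
  phone = c("000-0000", "000-0001"),
  onset = as.Date(c("2015-05-11", "2015-05-18")),
  result = c("Alive", "Dead")
)

# Tag only the variables needed for analysis
x <- make_linelist(cases, date_onset = "onset", outcome = "result")

# Keep the tagged columns only; untagged columns are left out
shareable <- tags_df(x)
names(shareable)
```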

## Development

### Lifecycle

This package is currently *maturing*, as defined by the [RECON software
lifecycle](https://www.reconverse.org/lifecycle.html). This means that essential
features and mechanisms are present and stable, but minor breaking changes or
function renames may still occur sporadically.

### Contributions

Contributions are welcome via [pull requests](https://github.com/epiverse-trace/linelist/pulls).

### Code of Conduct

Please note that the linelist project is released with a
[Code of Conduct](https://github.com/epiverse-trace/.github/blob/main/CODE_OF_CONDUCT.md).
By contributing to this project, you agree to abide by its terms.

### Notes

This package is a reboot of the RECON package
[linelist](https://github.com/reconhub/linelist). Unlike its predecessor, the
new package focuses on the implementation of a `linelist` class. The data
cleaning features of the original package will eventually be re-implemented for
`linelist` objects, albeit likely in a separate package.