An open API service indexing awesome lists of open source software.

https://github.com/const-ae/treelabel

Store and work with labels that exist in hierarchical relationships
https://github.com/const-ae/treelabel

Last synced: about 1 year ago
JSON representation

Store and work with labels that exist in hierarchical relationships

Awesome Lists containing this project

README

          

---
output: github_document
---

```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%"
)
options(width = 100)
```

# treelabel

The goal of treelabel is to store and work with labels that exist in a hierarchical relationship.

This is an alpha software release: feel free to play around with the code, and please provide feedback, but expect breaking changes to the API!

## Motivation

![](man/figures/celltype_tree.png)

I work on single-cell RNA-seq data with gene expression profiles for thousands of cells. A common first step is to annotate each cell's *cell type*. The granularity of these cell type annotations can vary; one can classify cells broadly into *immune cells* or *epithelial cells* or one can be very detailed and distinguish within the immune cells *CD4 positive T regulatory cells* from *CD4 positive T follicular helper cells*. Choosing the best annotation level is difficult because one analysis may need broad cell types, whereas others require the highest possible resolution. The `treelabel` package provides an intuitive interface to store and work with these hierarchically related labels.

Depending on the reference data and annotation method used for the cell typing, you often have multiple (partially) conflicting annotations. `treelabel` provides functions to build a consensus across different annotations and can integrate annotations at different resolutions. Furthermore, `treelabel` supports uncertainty scores associated with a label. For example, most automatic cell type scoring tools (like [Azimuth](https://azimuth.hubmapconsortium.org/) or [celltypist](https://www.celltypist.org/)), return a confidence score in addition to the cell type label. These scores enable a more precise selection of cells where you have sufficient confidence in the cell type label.

## What this package is. And what it is not,

This package is purposefully kept generic and only makes the following assumptions:

- Your labels have a tree-like relationship: the edges between the labels are directed, and there are no cycles.
- The relation between a parent and a child can phrased as *is a*. For example, a *`T cell` is a `Immune cell`*.
- The scores can be logical or non-negative numbers.

This package does not provide any functionality to:

- Assign cell types to cells based on the expression profile. Use any of the many available automatic cell type scoring tools (see for [this](https://github.com/seandavi/awesome-single-cell?tab=readme-ov-file#cell-type-identification-and-classification) list for some suggestions) or do it manually using clustering plus marker gene expression.
- Automatically harmonize cell type labels from different references (e.g., figure out that the `NK cells` from one dataset correspond to the `Natural killer cells` from another). You have to do this manually. There is an example further down in the README.
- Provide the optimal cell type tree. You will probably want to define the tree for your analysis depending on the annotations available to you. As a reference, look at the [cell ontology project](https://cell-ontology.github.io/), which provides a large database of cell type label relationships and is used by the Human Cell Atlas.
- Plot trees. For demonstration purposes, we will use the `igraph` plots (which are not very pretty), and for the plot on the top, I used the [D3](https://d3js.org/) library from Javascript (which is cumbersome to use from R). See the end of the README for an example how to make pretty plots of trees with ggplot2.

## Installation

You can install the development version of `treelabel` like this:

``` r
devtools::install_github("const-ae/treelabel")
```

# Documentation

`treelabel` is build to be directly compatible with the `tidyverse`.

```{r, eval=FALSE}
library(treelabel)
library(tidyverse)
```
```{r, include=FALSE}
devtools::load_all(".")
library(tidyverse)
options(
pillar.print_max = 5,
pillar.print_min = 5
)
```

## Motivating example

I will demonstrate a typical single cell analysis workflow that takes a hierarchical set of labels, stores them in a `treelabel` vector, and analyzes the abundance changes of cell types.

I will illustrate the process using the "pbmcsca" dataset from Seurat and score each cell using Azimuth. Note that `treelabel` is compatible with any data storage format (e.g., `SingleCellExperiment` or `Seurat`) and can handle both manual cell type labels based on clustering or automated cell scores produced by, for example, Azimuth or Celltypist.

```{r message=FALSE, warning=FALSE, paged.print=FALSE, results='hide'}
# This can take a minute to run through
library(Seurat)
# Might need to call `SeuratData::InstallData("pbmcsca")` first
pbmcsca <- SeuratData::LoadData("pbmcsca")
azimuth_res <- Azimuth::RunAzimuth(pbmcsca, reference = "pbmcref")
# Select most important columns to make the output easier to read.
azimuth_res@meta.data <- select(azimuth_res@meta.data, c("orig.ident", "Experiment", "Method", starts_with("predicted.celltype.")))
```

Take a look at the Azimuth output. It contains six new columns that all start with "predicted.celltype". These columns contain the labels and confidence scores from the automated mapping. These will serve as the input to `treelabel` which turns the six columns into one!

```{r, paged.print=FALSE}
azimuth_res@meta.data |>
as_tibble(rownames = "cell_id")
```

To create the `treelabel` vector, we need to define our cell type hierarchy. We use the `igraph` packages for this.

```{r, paged.print=FALSE}
# Define the cell type label hierarchy
pbmcsca_tree <- igraph::graph_from_literal(
root - NK : `T cell` : Mono : DC : B,
`T cell` - `CD8 T` : `CD4 T`,
`CD4 T` - CTL : `CD4 Naive` : `CD4 TCM` : `CD4 TEM` : Treg,
`CD8 T` - `CD8 Naive` : `CD8 TCM` : `CD8 TEM`,
DC - cDC1 : cDC2 : pDC,
Mono - `CD14 Mono` : `CD16 Mono`
)

# Convert the Azimuth result into a treelabel vector:
# * We first append '.label' to the Azimuth column to make the `pivot_longer` simpler.
# * We then filter the cell types to the ones we list in the `pbmcsca_tree`.
# * Lastly, we convert the label and score column into a treelabel vector.
celltype_annotations <- azimuth_res@meta.data |>
as_tibble(rownames = "cell_id") |>
dplyr::rename_with(.cols = matches("^predicted.celltype.l\\d$"), \(x) paste0(x, ".label")) |>
pivot_longer(starts_with("predicted.celltype"), names_prefix = "predicted\\.celltype\\.", names_sep = "\\." ,names_to = c("level", ".value")) |>
filter(label %in% igraph::V(pbmcsca_tree)$name) |>
treelabel_from_dataframe(pbmcsca_tree, id = "cell_id", label = "label", score = "score", name = "azimuth_celltypes")
```

The `azimuth_celltypes` column is an S3 vector (build ontop of the `vctrs` package) which works a bit like a `factor`. Each entry contains the full information about the hierarchical cell type labels. By default, `treelabel` prints the most precise cell type label that is available, but the information about the other levels is still accessible

```{r, paged.print=FALSE}
vec <- head(celltype_annotations$azimuth_celltypes, n = 8)
vec

# The fourth element is a T cell, a CD8 T, and a CD8 TEM!
vec[4]
tl_eval(vec[4], `T cell`)
tl_eval(vec[4], `CD8 T`)
tl_eval(vec[4], `CD8 TEM`)
```

We can join the `celltype_annotations` with the full meta data to replace the old `predicted.celltype` column with our new `treelabel`.

```{r, paged.print=FALSE}
meta_data <- azimuth_res@meta.data |>
as_tibble(rownames = "cell_id") |>
dplyr::select(- starts_with("predicted.celltype")) |>
left_join(celltype_annotations, by = "cell_id")

meta_data
```

The `pbmcsca` contains results from ten different sequencing experiments. We can for example count how often each cell type was seen per method

```{r, paged.print=FALSE}
# Summing the confidence scores does not exactly give you the counts
meta_data |>
summarize(as_tibble(tl_score_matrix(sum(azimuth_celltypes, na.rm=TRUE))), .by = c(Method))

# Instead, you can say that only labels where the score exceeds a thresholds count.
meta_data |>
mutate(azimuth_celltypes = tl_modify(azimuth_celltypes, .scores > 0.8)) |>
summarize(as_tibble(tl_score_matrix(sum(azimuth_celltypes, na.rm=TRUE))), .by = c(Method))
```

The `test_abundance_changes` function makes it easy to test if the number of cells of a cell type changes between conditions. Importantly, you need to have multiple independent replicates. The `pbmcsca` unfortunately does not have that, so I will just simulate random patient IDs to demonstrate how the function works.

```{r, paged.print=FALSE}
# Apply threshold and make Experiment a factor
input_dat <- meta_data |>
mutate(Experiment = as.factor(Experiment)) |>
mutate(patient_id = sample(paste0("sample_", 1:5), size = n(), replace = TRUE)) |>
mutate(azimuth_celltypes = tl_modify(azimuth_celltypes, .scores > 0.8))

# The function takes many arguments. See `?test_abundance_changes` for all details
test_abundance_changes(input_dat, design = ~ Experiment, aggregate_by = patient_id)

# We can also run `test_abundance_changes` inside dplyr::reframe (an alternative to `summarize`)
# and calculate the abundance changes for each Method separately.
# Setting `reference = `T cell` will test if the number of T cell subtypes changes as a proportion
# of all T cells.
input_dat |>
reframe(test_abundance_changes(data = across(everything()), design = ~ Experiment, aggregate_by = patient_id,
reference = `T cell`, contrast = cond(Experiment = 'pbmc2') - cond(Experiment = 'pbmc1')),
.by = Method)
```

### Compatibility with Bioconductor and Seurat

`treelabel` works directly with the BioConductor data structures `SingleCellExperiment`, `SummarizedExperiment`, and `DataFrame`.

```{r}
# Load an example SingleCellExperiment object
suppressMessages({
sce <- ExperimentHub::ExperimentHub()[["EH2259"]]
})
# Make a simple tree with only one level
kang_tree <- igraph::graph_from_edgelist(cbind("root", levels(sce$cell)))
# Add treelabel column to colData
colData(sce)$treelabel <- treelabel(sce$cell, kang_tree)
colData(sce)
```

The `treelabel` vectors can also be used with Seurat data. Here, we match the provided annotations to the names from the `pbmcsca_tree`.

```{r eval=FALSE, include=TRUE}
# Load pbmcsca again
library(Seurat)
pbmcsca <- SeuratData::LoadData("pbmcsca")
```

```{r}
# We will re-use the `pbmcsca_tree` from above. The provided annotations in pbmcsca$CellType
# are in a slightly different format, so we manually convert them.
rename_pbmcsca_celltypes <- c(
"B cell" = "B", "CD14+ monocyte" = "CD14 Mono", "CD16+ monocyte" = "CD16 Mono",
"CD4+ T cell" = "CD4 T", "Cytotoxic T cell" = "CD8 T", "Dendritic cell" = "DC",
"Megakaryocyte" = "Mono", "Natural killer cell" = "NK",
"Plasmacytoid dendritic cell" = "pDC", "Unassigned" = "root"
)

pbmcsca@meta.data$tl_manual <- treelabel(rename_pbmcsca_celltypes[pbmcsca$CellType], pbmcsca_tree)
pbmcsca@meta.data[1:5,c("orig.ident", "CellType", "tl_manual")]
```

## Technical documentation

We define our label hierarchy using [`igraph`](https://r.igraph.org/articles/igraph.html).

```{r tree_plot}
tree <- igraph::graph_from_literal(
root - ImmuneCell : EndothelialCell : EpithelialCell,
ImmuneCell - TCell : BCell,
TCell - CD4_TCell : CD8_TCell
)
plot(tree, layout = igraph::layout_as_tree(tree, root = "root"),
vertex.size = 40, vertex.label.cex = 0.6)
```

### Constructors

The easiest way to make a `treelabel` vector is to make one from a character vector. You call the `treelabel` constructor and provide the labels and the reference tree

```{r}
char_vec <- c("BCell", "EndothelialCell", "CD4_TCell", NA, "BCell", "EpithelialCell", "ImmuneCell")
vec <- treelabel(char_vec, tree = tree)
vec
```

If you have some uncertainty associated with each label, you can also use a named `numeric` vector to make a `treelabel` vector.

```{r}
num_vec <- c("BCell" = 0.99, "EndothelialCell" = 0.6, "CD4_TCell" = 0.8, NA, "BCell" = 0.78, "EpithelialCell" = 0.9, "ImmuneCell" = 0.4)
vec <- treelabel(num_vec, tree = tree)
vec
```

Some tools provide the confidence scores for each vertex in the tree. In this case, you can provide the annotations as a `list ` or a `data.frame`

```{r}
lst <- list(
c(BCell = 0.99, ImmuneCell = 1),
c(root = 1, EndothelialCell = 0.65),
c(CD4_TCell = 0.8, TCell = 0.95, ImmuneCell = 0.95),
NULL, # will be treated as NA
c(ImmuneCell = 0.4)
)

vec <- treelabel(lst, tree)
vec
```

Lastly, you can convert a "tidy" data frame to a treelabel. The `treelabel_from_dataframe` works differently from the other constructors, as it returns a `data.frame` with an ID column and a `treelabel` column. The function cannot directly return a `treelabel` vector because the order of the rows in the data.frame could be scrambled, in which case it is unclear how cells and elements in the treelabel relate.

```{r}
df <- data.frame(
cell_id = c("cell 1", "cell 1", "cell 2", "cell 3", "cell 3", "cell 3"),
annot = c("BCell", "ImmuneCell", NA, "TCell", "CD4_TCell", "ImmuneCell"),
confidence = c(0.99, 1, NA, 0.95, 0.8, 0.95)
)
df <- treelabel_from_dataframe(df, tree, id = "cell_id", label = "annot", score = "confidence")
df
```

### Working with the `treelabel` vector

The `treelabel` vectors can be indexed or concatenated like any regular R vector:

```{r}
vec
length(vec)
vec[2]
vec[1:4]
c(vec, vec[1:3])
```

You can extract the tree from a `treelabel` and the name of the tree root.

```{r}
tl_tree(vec)
tl_tree_root(vec)
```

The easiest way to get the score for a particular label inside a `treelabel` vector is to use `$`

```{r}
vec$ImmuneCell
vec$CD4_TCell
```

### Testing the identity

The printing function builds on the `tl_name`, which returns the vertex furthest from the root that is not `NA`. We can change this threshold. For example, for the third cell the *CD4_TCell* label does not pass the `0.9` threshold, but the *TCell* label does.

```{r, paged.print=FALSE}
tibble(vec, tl_name(vec), tl_name(vec, threshold = 0.9))
```

You can also evaluate arbitrary expressions using `tl_eval`.

```{r, paged.print=FALSE}
tibble(vec) |> mutate(is_tcell = tl_eval(vec, TCell > 0.9))
```

`treelabel` is clever about evaluating these expressions. If, for example, we ask if the cell might be a T cell (i.e., `TCell > 0.2`), the second and fifth entries switch from `FALSE` to `NA`.

```{r, paged.print=FALSE}
tibble(vec) |> mutate(maybe_tcell = tl_eval(vec, TCell > 0.2))
```

To understand why, let's look at how `treelabel` internally stores the data. Internally, the scores are stored as a matrix with one column for each label, and the scores that were not specified are stored as `NA`.

```{r}
tl_score_matrix(vec)
```

For each missing element, we can give a lower and upper bound for the value. For the fifth element the confidence that it is an `ImmuneCell` is `tl_get(vec[5], "ImmuneCell")` = `r tl_get(vec[5], "ImmuneCell")`. This means that each child can also be at most `0.4`.

The general formula is that the score for a vertex `v` that is `NA` can be at most (in pseudocode): `max(0, score(parent(v)) - sum(children(parent(v)), na.rm=TRUE))`.

```{r}
# tl_atmost is clever
tl_atmost(vec) |> tl_score_matrix()
# tl_atleast simply replaces `NA`'s with zeros
tl_atleast(vec) |> tl_score_matrix()
```

The `tl_eval` function evaluates its arguments for `tl_atmost(x)` and `tl_atleast(x)`. If the results agree, that value is returned; if not, `tl_eval` returns `NA`. A word of caution: this function can give surprising results if multiple label references occur in the expression.

```{r}
t1 <- treelabel(list(c("TCell" = 0.8)), tree)
# Ideally both function calls would return `NA`
tl_eval(t1, CD4_TCell > CD8_TCell)
tl_eval(t1, CD4_TCell < CD8_TCell)
```

### Arithmetic

You can combine two vectors or summarize across elements. You can do whatever calculations you want, and `treelabel` will try to make the right thing happen. You can also do problematic things like produce negative values. `treelabel` currently does not stop you, but this breaks one of the assumptions of `treelabel`.

```{r, paged.print=FALSE}
vec2 <- treelabel(c("BCell" = 0.8, "EpithelialCell" = 0.3, "TCell" = 0.9, "CD8_TCell" = 0.2, "TCell" = 0.8), tree)
tibble(vec, vec2) |>
mutate(arithmetic_mean = (vec + vec2) / 2,
geometric_mean = (vec * vec2)^(1/2),
rounding = round(vec))
```

### Modification

You can modify the elements of a `treelabel` vector. The easiest is way is to use `if_else` (note `ifelse` does not work!!) and mix the content of two vectors. Alternatively, you can set elements to `NA`.

```{r}
high_quality_res <- c(TRUE, FALSE, FALSE, FALSE, TRUE)
# Combine two vectors or set one to 'NA'
if_else(high_quality_res, vec, vec2)
if_else(high_quality_res, vec, NA)
```

If you want to modify the content within a tree, that is change the value of individual vertices, you can use the `tl_modify` function.

```{r}
# The effect of tl_modify is best understood by considering the underlying score matrix
tl_score_matrix(vec)[,1:3]
tl_score_matrix(tl_modify(vec, ImmuneCell = 0.3))[,1:3]
tl_score_matrix(tl_modify(vec, ImmuneCell = ImmuneCell / 3))[,1:3]
tl_score_matrix(tl_modify(vec, ImmuneCell = root - ImmuneCell/3))[,1:3]
tl_score_matrix(tl_modify(vec, ImmuneCell = NA, .propagate_NAs_down = TRUE))
tl_score_matrix(tl_modify(vec, ImmuneCell = NA, .propagate_NAs_down = FALSE))
```

#### Tree modifications

Sometimes you don't want to change the values within the tree, but change the tree structure or only work on a selected branch. The `tl_tree_modify` allows you to set a completely new tree structure and only retain values for vertices that occurr in both the new and old tree.

```{r}
subtree <- igraph::graph_from_literal(
root - CD4_TCell : CD8_TCell : EndothelialCell : EpithelialCell
)

tl_score_matrix(vec)
tl_tree_modify(vec, subtree) |> tl_score_matrix()
```

Sometimes, you only want to work on a single branch of the tree. You can do this using the `tl_tree_filter` and `tl_tree_cut` functions.

```{r}
# Select all T cells
tl_tree_cut(vec, new_root = "TCell")

# This does the same, but leaves the old root
tl_tree_filter(vec, \(names) grepl("TCell", names))
```

### Consensus construction

`treelabel` provides functions to make it easy to apply expression across `treelabel` columns. These functions are built on top of [`dplyr::across`](https://dplyr.tidyverse.org/reference/across.html). They take as the first argument a specification of columns (e.g., `where(is_treelabel)` or `starts_with("label_")`). The second argument is evaluated internally with `tl_eval`.

```{r, paged.print=FALSE}
dat <- tibble(cell_id = paste0("cell_", 1:5), vec, vec2)

dat |> mutate(is_immune = tl_across(where(is_treelabel), ImmuneCell > 0.7))
dat |> mutate(immune_counts = tl_sum_across(where(is_treelabel), ImmuneCell > 0.7))
dat |> mutate(mean_immune_score = tl_mean_across(where(is_treelabel), ImmuneCell))
dat |> filter(tl_if_all(where(is_treelabel), ImmuneCell > 0.7))
```

In addition, we can also work on the whole matrix of values per tree label and combine them.

```{r, paged.print=FALSE}
dat |> mutate(consensus = tl_mean_across(c(vec,vec2)))

dat |>
mutate(across(c(vec, vec2), \(x) tl_modify(x, .scores > 0.5))) |>
mutate(consensus = tl_sum_across(c(vec,vec2)))
```

### Pretty plotting

The following visualization is inspired by the default tree visualization in D3.

```{r}
#' Calculate layout of tree using igraph and return results as two tibbles.
prepare_tree_for_plotting <- function(tree, tree_root = "root"){
tree <- .make_tree(tree, root = tree_root)

layout <- igraph::layout_as_tree(tree, root = tree_root)

children <- lapply(igraph::V(tree), \(v){
igraph::neighbors(tree, v, mode = "out")$name
})

vertices <- igraph::V(tree)$name
nodes <- tibble(node = vertices,
distance_to_root = max(layout[,2]) - layout[,2],
position = layout[,1],
is_leaf = vapply(children, \(x) length(x) == 0, FUN.VALUE = logical(1L)))

edges <- edges <- tibble(node = vertices,
child = unname(children)) |>
unnest(child) |>
left_join(nodes, by = c("node" = "node")) |>
left_join(nodes, by = c("child" = "node"), suffix = c(".node", ".child"))

list(nodes = nodes, edges = edges)
}
```

Make the plot. The [ggbezier](https://github.com/const-ae/ggbezier) is not on CRAN yet.

```{r ggplot_code, fig.height=2}
pl_tree <- prepare_tree_for_plotting(tree)

ggplot(data = pl_tree$nodes, aes(x = distance_to_root, y = position)) +
ggbezier::geom_bezier(data = pl_tree$edges |> pivot_longer(c(ends_with(".node"), ends_with(".child")), names_sep = "\\.", names_to = c(".value", "side")),
aes(x = distance_to_root, y = position, x_handle1 = distance_to_root - 0.4,
x_handle2 = distance_to_root + 0.4, y_handle1 = position, y_handle2 = position, group = paste0(node, "-", child)),
show_handles = FALSE, color = "lightgrey", linewidth = 0.3) +
geom_point(aes(color = I(ifelse(is_leaf, "lightgrey", "#4e4e4e")))) +
shadowtext::geom_shadowtext(aes(label = node, hjust = ifelse(is_leaf, 0, 1), x = distance_to_root + ifelse(is_leaf, 0.03, -0.03)),
color = "black", bg.colour = "white") +
scale_x_continuous(expand = expansion(add = c(0.5, 0.9))) +
theme_void()
```

## Session Info

```{r}
sessionInfo()
```