Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

https://github.com/heike/gglogo

R package for creating sequence logo plots
https://github.com/heike/gglogo

Last synced: 16 days ago
JSON representation

R package for creating sequence logo plots

Awesome Lists containing this project

README

        

---
title: "gglogo"
author: "Eric Hare, Heike Hofmann"
date: "`r format(Sys.time(), '%B %d, %Y')`"
output:
html_document:
keep_md: true
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, fig.path = "man/figures/")
```

R package for creating sequence logo plots

[![CRAN Status](http://www.r-pkg.org/badges/version/gglogo)](https://cran.r-project.org/package=gglogo) [![CRAN RStudio mirror downloads](https://cranlogs.r-pkg.org/badges/last-month/gglogo?color=blue)](https://r-pkg.org/pkg/gglogo)
[![Last-changedate](https://img.shields.io/badge/last%20change-`r gsub('-', '--', Sys.Date())`-yellowgreen.svg)](https://github.com/heike/gglogo/commits/main)
[![R-CMD-check](https://github.com/heike/gglogo/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/heike/gglogo/actions/workflows/R-CMD-check.yaml)
[![Codecov test coverage](https://codecov.io/gh/heike/gglogo/branch/main/graph/badge.svg)](https://app.codecov.io/gh/heike/gglogo?branch=main)

# Installation

`gglogo` is available from CRAN (version 0.1.4):
```{r, eval = FALSE}
install.packages("gglogo")
```

The development version is available from Github (0.1.9000):

```{r, eval = FALSE}
# install.packages("devtools")
devtools::install_github("heike/gglogo", build_vignettes = TRUE)
```

# Getting Started

Load the library

```{r}
library(gglogo)
```

Load a dataset containing a set of peptide sequences

```{r, message = FALSE}
data(sequences)
head(sequences)
```

Now plot the sequences in a(n almost) traditional sequence plot, the Shannon information is shown on the y axis.

```{r warning = FALSE, message=FALSE}
library(ggplot2)
ggplot(data = ggfortify(sequences, peptide, method="shannon")) +
geom_logo(aes(x = position, y = info, group = element,
label = element, fill = interaction(Polarity, Water)),
alpha = 0.6, position = "classic") +
scale_fill_brewer("Amino Acid\nproperties", palette = "Paired") +
theme(legend.position = "bottom")
```

(Sequence) Logo plots ([Schneider & Stephens 1990](https://academic.oup.com/nar/article-abstract/18/20/6097/1141316)) are typically used in bioinformatics as a way to visually demonstrate how well a sequence of nucleotides or amino acids are preserved in a certain region.

A cognitively better version of the plot is the default, i.e. without specifying the `position` parameter, the plot defaults to aligning the largest contributor in each position along the y axis and showing all other variants in each position by a tail hanging below the axis. Longer tails indicate more variability in a position.

```{r}
ggplot(data = ggfortify(sequences, peptide, method="shannon")) +
geom_logo(aes(x = position, y = info, group = element,
label = element, fill = interaction(Polarity, Water)),
alpha = 0.6) +
scale_fill_brewer("Amino Acid\nproperties", palette = "Paired") +
theme(legend.position = "bottom")
```

## Other variants

Besides the Shannon information, we could also visualize the frequencies of peptides in each position. We can either set `method = frequency`, or calculate the (relative) frequency information ourselves as:

```{r}
ggplot(data = ggfortify(sequences, peptide, method="shannon")) +
geom_logo(aes(x = position, y = freq/total, group = element,
label = element, fill = interaction(Polarity, Water)),
alpha = 0.6) +
scale_fill_brewer("Amino Acid\nproperties", palette = "Paired") +
theme(legend.position = "bottom")
```

Using the classic variant of alignment results in a stacked barchart of amino acids by position:

```{r}
ggplot(data = ggfortify(sequences, peptide, method="shannon")) +
geom_logo(aes(x = position, y = info, group = element,
label = element, fill = interaction(Polarity, Water)),
alpha = 0.6, position="classic") +
scale_fill_brewer("Amino Acid\nproperties", palette = "Paired") +
theme(legend.position = "bottom")
```

## Implementation details

This implementation of sequence logos is a two-step process of data prepping/wrangling followed by the visualization.
The data prepping happens in the function `ggfortify`:

```{r message = FALSE}
library(dplyr)

seq_info <- sequences %>% # data pipeline for processing
ggfortify(
peptide, # variable in which the sequences are stored
treatment = class,
method = "shannon",
missing_encode = c(".", "*", NA)
)
```

`sequences` specifies the variable of the sequences in the data set, `treatment` is a (list) of grouping variables for which the (Shannon) information will be calculated in each position. For peptide sequences, the data set `aacids` is used to provide additional information on properties.

```{r}
head(seq_info)
```

By specifying the `treatment` parameter, the corresponding information methods are now calculated for treatments as well, and we can assess the variability/conservation of the sequence by the treatment:

```{r}
seq_info %>%
ggplot() +
geom_logo(aes(x = class, y = info, group = element,
label = element, fill = interaction(Polarity, Water)),
alpha = 0.6) +
scale_fill_brewer("Amino Acid\nproperties", palette = "Paired") +
theme(legend.position = "bottom") +
facet_wrap(~position, ncol = 12)

```

## Available alphabets

By default, the font used for logo plots is Helvetica, available as dataset `alphabet`. Each letter is implemented in form of a polygon with `x` and `y` coordinates. The variable `group` contains the corresponding letter.

```{r message = FALSE}
alphabet %>%
filter(group == "B") %>%
ggplot(aes(x = x, y = y)) + geom_polygon() +
theme(aspect.ratio = 1)
```

Besides the default alphabet, the fonts Comic Sans, xkcd, and braille (for 3d printing) are implemented:

```{r}
alphabet_comic %>%
filter(group %in% c(LETTERS, 0:9)) %>%
ggplot(aes(x = x, y = y)) + geom_polygon() +
theme(aspect.ratio = 1) + facet_wrap(~group, ncol = 11) +
ggtitle("Comic Sans")

alphabet_xkcd %>%
dplyr::filter(group %in% c(LETTERS, 0:9)) %>%
ggplot(aes(x = x, y = y)) + geom_polygon() +
theme(aspect.ratio = 1) + facet_wrap(~group, ncol = 11) +
ggtitle("xkcd font")

alphabet_braille %>%
dplyr::filter(group %in% c(LETTERS, 0:9)) %>%
ggplot(aes(x = x, y = y)) + geom_polygon() +
theme(aspect.ratio = 1) + facet_wrap(~group, ncol = 11) +
ggtitle("Braille (use in 3d prints)")

```

## References

Schneider, TD, Stephens, RM (1990). [Sequence logos: a new way to display consensus sequences](https://academic.oup.com/nar/article-abstract/18/20/6097/1141316). Nucleic Acids Res, 18, 20:6097-100.