Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

https://github.com/heike/gglogo

R package for creating sequence logo plots
https://github.com/heike/gglogo

Last synced: 4 months ago
JSON representation

R package for creating sequence logo plots

Host: GitHub
URL: https://github.com/heike/gglogo
Owner: heike
Created: 2013-06-29T19:04:43.000Z (over 11 years ago)
Default Branch: main
Last Pushed: 2024-02-01T16:19:48.000Z (about 1 year ago)
Last Synced: 2024-10-12T21:19:25.297Z (4 months ago)
Language: R
Homepage: http://heike.github.io/gglogo/
Size: 24.1 MB
Stars: 21
Watchers: 2
Forks: 3
Open Issues: 7
Metadata Files:
- Readme: README.Rmd

Awesome Lists containing this project

README

        ---

title: "gglogo"

author: "Eric Hare, Heike Hofmann"

date: "`r format(Sys.time(), '%B %d, %Y')`"

output: 

  html_document:

    keep_md: true

---

```{r setup, include=FALSE}

knitr::opts_chunk$set(echo = TRUE, fig.path = "man/figures/")

```

R package for creating sequence logo plots

[![CRAN Status](http://www.r-pkg.org/badges/version/gglogo)](https://cran.r-project.org/package=gglogo) [![CRAN RStudio mirror downloads](https://cranlogs.r-pkg.org/badges/last-month/gglogo?color=blue)](https://r-pkg.org/pkg/gglogo)

[![Last-changedate](https://img.shields.io/badge/last%20change-`r gsub('-', '--', Sys.Date())`-yellowgreen.svg)](https://github.com/heike/gglogo/commits/main)

[![R-CMD-check](https://github.com/heike/gglogo/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/heike/gglogo/actions/workflows/R-CMD-check.yaml)

[![Codecov test coverage](https://codecov.io/gh/heike/gglogo/branch/main/graph/badge.svg)](https://app.codecov.io/gh/heike/gglogo?branch=main)

# Installation

`gglogo` is available from CRAN (version 0.1.4):

```{r, eval = FALSE}

install.packages("gglogo")

```

The development version is available from Github (0.1.9000):

```{r, eval = FALSE}

# install.packages("devtools")

devtools::install_github("heike/gglogo", build_vignettes = TRUE)

```

# Getting Started

Load the library

```{r}

library(gglogo)

```

Load a dataset containing a set of peptide sequences

```{r, message = FALSE}

data(sequences)

head(sequences)

```

Now plot the sequences in a(n almost) traditional sequence plot, the Shannon information is shown on the y axis. 

```{r warning = FALSE, message=FALSE}

library(ggplot2)

ggplot(data = ggfortify(sequences, peptide, method="shannon")) +      

  geom_logo(aes(x = position, y = info, group = element, 

     label = element, fill = interaction(Polarity, Water)),

     alpha = 0.6, position = "classic")  +

  scale_fill_brewer("Amino Acid\nproperties", palette = "Paired") +

  theme(legend.position = "bottom") 

```

(Sequence) Logo plots ([Schneider & Stephens 1990](https://academic.oup.com/nar/article-abstract/18/20/6097/1141316)) are typically used in bioinformatics as a way to visually demonstrate how well a sequence of nucleotides or amino acids are preserved in a certain region.

A cognitively better version of the plot is the default, i.e. without specifying the `position` parameter, the plot defaults to aligning the largest contributor in each position along the y axis and showing all other variants in each position by a tail hanging below the axis. Longer tails indicate more variability in a position. 

```{r}

ggplot(data = ggfortify(sequences, peptide, method="shannon")) +      

  geom_logo(aes(x = position, y = info, group = element, 

     label = element, fill = interaction(Polarity, Water)),

     alpha = 0.6)  +

  scale_fill_brewer("Amino Acid\nproperties", palette = "Paired") +

  theme(legend.position = "bottom") 

```

## Other variants

Besides the Shannon information, we could also visualize the frequencies of peptides in each position. We can either set `method = frequency`, or calculate the (relative) frequency information ourselves as:

```{r}

ggplot(data = ggfortify(sequences, peptide, method="shannon")) +      

  geom_logo(aes(x = position, y = freq/total, group = element, 

     label = element, fill = interaction(Polarity, Water)),

     alpha = 0.6)  +

  scale_fill_brewer("Amino Acid\nproperties", palette = "Paired") +

  theme(legend.position = "bottom") 

```

Using the classic variant of alignment results in a stacked barchart of amino acids by position:

```{r}

ggplot(data = ggfortify(sequences, peptide, method="shannon")) +      

  geom_logo(aes(x = position, y = info, group = element, 

     label = element, fill = interaction(Polarity, Water)),

     alpha = 0.6, position="classic")  +

  scale_fill_brewer("Amino Acid\nproperties", palette = "Paired") +

  theme(legend.position = "bottom") 

```

## Implementation details 

This implementation of sequence logos is a two-step process of data prepping/wrangling followed by the visualization. 

The data prepping happens in the function `ggfortify`:

```{r message = FALSE}

library(dplyr)

seq_info <- sequences %>%     # data pipeline for processing

  ggfortify(

    peptide, # variable in which the sequences are stored

    treatment = class,

    method = "shannon",

    missing_encode = c(".", "*", NA)

  )

```

`sequences` specifies the variable of the sequences in the data set, `treatment` is a (list) of grouping variables for which the (Shannon) information will be calculated in each position. For peptide sequences, the data set `aacids` is used to provide additional information on properties. 

```{r}

head(seq_info)

```

By specifying the `treatment` parameter, the corresponding information methods are now calculated for treatments as well, and we can assess the variability/conservation of the sequence by the treatment:

```{r}

seq_info %>%

  ggplot() + 

  geom_logo(aes(x = class, y = info, group = element, 

     label = element, fill = interaction(Polarity, Water)),

     alpha = 0.6)  +

  scale_fill_brewer("Amino Acid\nproperties", palette = "Paired") +

  theme(legend.position = "bottom") +

  facet_wrap(~position, ncol = 12)

```

## Available alphabets

By default, the font used for logo plots is Helvetica, available as dataset `alphabet`. Each letter is implemented in form of a polygon with `x` and `y` coordinates. The variable `group` contains the corresponding letter.

```{r message = FALSE}

alphabet %>%

  filter(group == "B") %>% 

  ggplot(aes(x = x, y = y)) + geom_polygon() + 

  theme(aspect.ratio = 1)

```

Besides the default alphabet, the fonts  Comic Sans, xkcd, and braille (for 3d printing) are implemented:

```{r}

alphabet_comic %>% 

  filter(group %in% c(LETTERS, 0:9)) %>%

  ggplot(aes(x = x, y = y)) + geom_polygon() + 

  theme(aspect.ratio = 1) + facet_wrap(~group, ncol = 11) + 

  ggtitle("Comic Sans")

alphabet_xkcd %>% 

  dplyr::filter(group %in% c(LETTERS, 0:9)) %>%

  ggplot(aes(x = x, y = y)) + geom_polygon() + 

  theme(aspect.ratio = 1) + facet_wrap(~group, ncol = 11) + 

  ggtitle("xkcd font")

alphabet_braille %>% 

  dplyr::filter(group %in% c(LETTERS, 0:9)) %>%

  ggplot(aes(x = x, y = y)) + geom_polygon() + 

  theme(aspect.ratio = 1) + facet_wrap(~group, ncol = 11) + 

  ggtitle("Braille (use in 3d prints)")

```

## References

Schneider, TD, Stephens, RM (1990). [Sequence logos: a new way to display consensus sequences](https://academic.oup.com/nar/article-abstract/18/20/6097/1141316). Nucleic Acids Res, 18, 20:6097-100.