Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/heike/gglogo
R package for creating sequence logo plots
https://github.com/heike/gglogo
Last synced: 16 days ago
JSON representation
R package for creating sequence logo plots
- Host: GitHub
- URL: https://github.com/heike/gglogo
- Owner: heike
- Created: 2013-06-29T19:04:43.000Z (over 11 years ago)
- Default Branch: main
- Last Pushed: 2024-02-01T16:19:48.000Z (10 months ago)
- Last Synced: 2024-10-12T21:19:25.297Z (about 1 month ago)
- Language: R
- Homepage: http://heike.github.io/gglogo/
- Size: 24.1 MB
- Stars: 21
- Watchers: 2
- Forks: 3
- Open Issues: 7
-
Metadata Files:
- Readme: README.Rmd
Awesome Lists containing this project
README
---
title: "gglogo"
author: "Eric Hare, Heike Hofmann"
date: "`r format(Sys.time(), '%B %d, %Y')`"
output:
html_document:
keep_md: true
---```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, fig.path = "man/figures/")
```R package for creating sequence logo plots
[![CRAN Status](http://www.r-pkg.org/badges/version/gglogo)](https://cran.r-project.org/package=gglogo) [![CRAN RStudio mirror downloads](https://cranlogs.r-pkg.org/badges/last-month/gglogo?color=blue)](https://r-pkg.org/pkg/gglogo)
[![Last-changedate](https://img.shields.io/badge/last%20change-`r gsub('-', '--', Sys.Date())`-yellowgreen.svg)](https://github.com/heike/gglogo/commits/main)
[![R-CMD-check](https://github.com/heike/gglogo/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/heike/gglogo/actions/workflows/R-CMD-check.yaml)
[![Codecov test coverage](https://codecov.io/gh/heike/gglogo/branch/main/graph/badge.svg)](https://app.codecov.io/gh/heike/gglogo?branch=main)# Installation
`gglogo` is available from CRAN (version 0.1.4):
```{r, eval = FALSE}
install.packages("gglogo")
```The development version is available from Github (0.1.9000):
```{r, eval = FALSE}
# install.packages("devtools")
devtools::install_github("heike/gglogo", build_vignettes = TRUE)
```# Getting Started
Load the library
```{r}
library(gglogo)
```Load a dataset containing a set of peptide sequences
```{r, message = FALSE}
data(sequences)
head(sequences)
```Now plot the sequences in a(n almost) traditional sequence plot, the Shannon information is shown on the y axis.
```{r warning = FALSE, message=FALSE}
library(ggplot2)
ggplot(data = ggfortify(sequences, peptide, method="shannon")) +
geom_logo(aes(x = position, y = info, group = element,
label = element, fill = interaction(Polarity, Water)),
alpha = 0.6, position = "classic") +
scale_fill_brewer("Amino Acid\nproperties", palette = "Paired") +
theme(legend.position = "bottom")
```(Sequence) Logo plots ([Schneider & Stephens 1990](https://academic.oup.com/nar/article-abstract/18/20/6097/1141316)) are typically used in bioinformatics as a way to visually demonstrate how well a sequence of nucleotides or amino acids are preserved in a certain region.
A cognitively better version of the plot is the default, i.e. without specifying the `position` parameter, the plot defaults to aligning the largest contributor in each position along the y axis and showing all other variants in each position by a tail hanging below the axis. Longer tails indicate more variability in a position.
```{r}
ggplot(data = ggfortify(sequences, peptide, method="shannon")) +
geom_logo(aes(x = position, y = info, group = element,
label = element, fill = interaction(Polarity, Water)),
alpha = 0.6) +
scale_fill_brewer("Amino Acid\nproperties", palette = "Paired") +
theme(legend.position = "bottom")
```## Other variants
Besides the Shannon information, we could also visualize the frequencies of peptides in each position. We can either set `method = frequency`, or calculate the (relative) frequency information ourselves as:
```{r}
ggplot(data = ggfortify(sequences, peptide, method="shannon")) +
geom_logo(aes(x = position, y = freq/total, group = element,
label = element, fill = interaction(Polarity, Water)),
alpha = 0.6) +
scale_fill_brewer("Amino Acid\nproperties", palette = "Paired") +
theme(legend.position = "bottom")
```Using the classic variant of alignment results in a stacked barchart of amino acids by position:
```{r}
ggplot(data = ggfortify(sequences, peptide, method="shannon")) +
geom_logo(aes(x = position, y = info, group = element,
label = element, fill = interaction(Polarity, Water)),
alpha = 0.6, position="classic") +
scale_fill_brewer("Amino Acid\nproperties", palette = "Paired") +
theme(legend.position = "bottom")
```## Implementation details
This implementation of sequence logos is a two-step process of data prepping/wrangling followed by the visualization.
The data prepping happens in the function `ggfortify`:```{r message = FALSE}
library(dplyr)seq_info <- sequences %>% # data pipeline for processing
ggfortify(
peptide, # variable in which the sequences are stored
treatment = class,
method = "shannon",
missing_encode = c(".", "*", NA)
)
````sequences` specifies the variable of the sequences in the data set, `treatment` is a (list) of grouping variables for which the (Shannon) information will be calculated in each position. For peptide sequences, the data set `aacids` is used to provide additional information on properties.
```{r}
head(seq_info)
```By specifying the `treatment` parameter, the corresponding information methods are now calculated for treatments as well, and we can assess the variability/conservation of the sequence by the treatment:
```{r}
seq_info %>%
ggplot() +
geom_logo(aes(x = class, y = info, group = element,
label = element, fill = interaction(Polarity, Water)),
alpha = 0.6) +
scale_fill_brewer("Amino Acid\nproperties", palette = "Paired") +
theme(legend.position = "bottom") +
facet_wrap(~position, ncol = 12)```
## Available alphabets
By default, the font used for logo plots is Helvetica, available as dataset `alphabet`. Each letter is implemented in form of a polygon with `x` and `y` coordinates. The variable `group` contains the corresponding letter.
```{r message = FALSE}
alphabet %>%
filter(group == "B") %>%
ggplot(aes(x = x, y = y)) + geom_polygon() +
theme(aspect.ratio = 1)
```Besides the default alphabet, the fonts Comic Sans, xkcd, and braille (for 3d printing) are implemented:
```{r}
alphabet_comic %>%
filter(group %in% c(LETTERS, 0:9)) %>%
ggplot(aes(x = x, y = y)) + geom_polygon() +
theme(aspect.ratio = 1) + facet_wrap(~group, ncol = 11) +
ggtitle("Comic Sans")alphabet_xkcd %>%
dplyr::filter(group %in% c(LETTERS, 0:9)) %>%
ggplot(aes(x = x, y = y)) + geom_polygon() +
theme(aspect.ratio = 1) + facet_wrap(~group, ncol = 11) +
ggtitle("xkcd font")alphabet_braille %>%
dplyr::filter(group %in% c(LETTERS, 0:9)) %>%
ggplot(aes(x = x, y = y)) + geom_polygon() +
theme(aspect.ratio = 1) + facet_wrap(~group, ncol = 11) +
ggtitle("Braille (use in 3d prints)")```
## References
Schneider, TD, Stephens, RM (1990). [Sequence logos: a new way to display consensus sequences](https://academic.oup.com/nar/article-abstract/18/20/6097/1141316). Nucleic Acids Res, 18, 20:6097-100.