https://github.com/tidyomics/plyranges
A grammar of genomic data transformation
bioconductor data-analysis dplyr genomic-ranges genomics tidy-data
- Host: GitHub
- URL: https://github.com/tidyomics/plyranges
- Owner: tidyomics
- Created: 2017-08-28T05:12:39.000Z (almost 9 years ago)
- Default Branch: devel
- Last Pushed: 2025-11-20T14:44:53.000Z (6 months ago)
- Last Synced: 2025-11-20T16:21:14.246Z (6 months ago)
- Topics: bioconductor, data-analysis, dplyr, genomic-ranges, genomics, tidy-data
- Language: R
- Homepage: https://tidyomics.github.io/plyranges/
- Size: 3.34 MB
- Stars: 148
- Watchers: 11
- Forks: 18
- Open Issues: 39
Metadata Files:
- Readme: README.Rmd
- Changelog: NEWS.md
- Contributing: .github/CONTRIBUTING.md
- Code of conduct: .github/CODE_OF_CONDUCT.md
---
output: github_document
---
```{r, echo = FALSE, message=FALSE, warning=FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "README-"
)
```
# plyranges: fluent genomic data analysis 
[plyranges](https://www.bioconductor.org/packages/release/bioc/html/plyranges.html) provides a consistent interface for importing and wrangling
genomics data from a variety of sources. The package defines a grammar of
genomic data transformation based on `dplyr` and the Bioconductor packages
`IRanges`, `GenomicRanges`, and `rtracklayer`. It does this by providing a set
of verbs for developing analysis pipelines based on _Ranges_ objects that
represent genomic regions:
* Modify genomic regions with the `mutate()` and `stretch()` functions.
* Modify genomic regions while fixing the start/end/center coordinates with the `anchor_` family of functions.
* Sort genomic ranges with `arrange()`.
* Modify, subset, and aggregate genomic data with the `mutate()`,
`filter()`, and `summarise()` functions.
* Any of the above operations can be performed on partitions of the
data with `group_by()`.
* Find nearest neighbour genomic regions with the `join_nearest_` family
of functions.
* Find overlaps between ranges with the `join_overlap_` family of functions.
* Merge all overlapping and adjacent genomic regions with `reduce_ranges()`.
* Split genomic regions into non-overlapping pieces at the end points of all ranges with `disjoin_ranges()`.
* Import and write common genomic data formats with the `read_/write_` family
of functions.
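
As a rough illustration of how a few of these verbs compose, here is a minimal sketch on a toy `GRanges` object. The coordinates and the `score` column are made up for the example and are not part of any plyranges dataset.

```{r}
library(plyranges)

# Toy data: four ranges on two chromosomes (all values are invented)
toy_gr <- as_granges(data.frame(
  seqnames = c("chr1", "chr1", "chr1", "chr2"),
  start    = c(1L, 5L, 20L, 3L),
  width    = c(10L, 10L, 5L, 8L),
  strand   = c("+", "+", "-", "+"),
  score    = c(0.2, 0.9, 0.5, 0.7)
))

# Keep the higher-scoring ranges, then add a logical metadata column
kept <- toy_gr %>%
  filter(score > 0.4) %>%
  mutate(long = width > 6)

# Lengthen each range while keeping its start coordinate fixed
stretched <- toy_gr %>%
  anchor_start() %>%
  stretch(4L)

# Merge overlapping and adjacent ranges
merged <- toy_gr %>% reduce_ranges()

# Summarise per chromosome
per_chr <- toy_gr %>%
  group_by(seqnames) %>%
  summarise(n = n(), mean_score = mean(score))
```

Each step returns a new object, so the verbs can be chained with `%>%` in whatever order the analysis needs.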
For more details on the features of plyranges, read the
[vignette](https://tidyomics.github.io/plyranges/articles/an-introduction.html).
For a complete case study on using plyranges to combine ATAC-seq and RNA-seq
results, read the [*fluentGenomics*
workflow](https://tidyomics.github.io/fluentGenomics).
plyranges is part of the [tidyomics](https://github.com/tidyomics)
project, providing a `dplyr`-based interface for many types of
genomics datasets represented in Bioconductor.
# Installation
[plyranges](https://www.bioconductor.org/packages/release/bioc/html/plyranges.html) can be installed from the latest Bioconductor
release:
```{r, eval=FALSE}
# install.packages("BiocManager")
BiocManager::install("plyranges")
```
To install the development version from GitHub:
```{r, eval=FALSE}
BiocManager::install("tidyomics/plyranges")
```
# Quick overview
## About `Ranges`
`Ranges` objects can either represent sets of integers as `IRanges` (which have
start, end and width attributes) or represent genomic intervals as `GRanges`
(which have two additional attributes: sequence name and strand). In addition,
both types of `Ranges` can store information about their intervals as metadata
columns (for example, GC content over a genomic interval).
`Ranges` objects follow the tidy data principle: each row of a `Ranges` object
corresponds to an interval, while each column will represent a variable about
that interval, and generally each object will represent a single unit of
observation (like gene annotations).
We can construct an `IRanges` object from a `data.frame` with a `start` column
and either a `width` or an `end` column using the `as_iranges()` function.
```{r, message=FALSE}
library(plyranges)
df <- data.frame(start = 1:5, width = 5)
as_iranges(df)
# alternatively with end
df <- data.frame(start = 1:5, end = 5:9)
as_iranges(df)
```
We can also construct a `GRanges` object in a similar manner. Note that a
`GRanges` object requires at least a seqnames column to be present in the
data.frame (but not necessarily a strand column).
```{r}
df <- data.frame(seqnames = c("chr1", "chr2", "chr2", "chr1", "chr2"),
                 start = 1:5,
                 width = 5)
as_granges(df)
# strand can be specified with `+`, `*` (missing) and `-`
df$strand <- c("+", "+", "-", "-", "*")
as_granges(df)
```
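
Extra columns in the `data.frame` are carried along as metadata columns, which is what makes the tidy-data view described above practical. The following is a small sketch under that assumption; the `gc` values are invented for illustration.

```{r}
library(plyranges)

# A data.frame with one extra variable per interval (made-up GC content)
gc_df <- data.frame(
  seqnames = c("chr1", "chr1", "chr2"),
  start    = c(100L, 200L, 300L),
  width    = 50L,
  gc       = c(0.41, 0.55, 0.38)
)

# `gc` becomes a metadata column on the resulting GRanges
gr_gc <- as_granges(gc_df)
gr_gc

# Metadata columns can be referred to by name inside plyranges verbs
high_gc <- gr_gc %>% filter(gc > 0.4)

# Converting back gives one row per interval and one column per variable
back <- as.data.frame(gr_gc)
```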
# Example: finding GWAS hits that overlap known exons
Let's look at a more realistic example (taken from the HelloRanges vignette).
```{r, include=FALSE}
dir <- system.file(package = "HelloRangesData", "extdata/")
genome <- as_granges(read.delim(file.path(dir, "hg19.genome"),
                                header = FALSE),
                     seqnames = V1, start = 1L, width = V2)
gwas <- read_bed(file.path(dir, "gwas.bed"), genome_info = genome)
exons <- read_bed(file.path(dir, "exons.bed"), genome_info = genome)
```
Suppose we have two _GRanges_ objects: one containing coordinates of known
exons and another containing SNPs from a GWAS.
The first and last 5 exons are printed below. There are two additional columns
corresponding to the exon name and a score.
We could check the number of exons per chromosome using `group_by` and
`summarise`.
```{r}
exons
exons %>%
  group_by(seqnames) %>%
  summarise(n = n())
```
Next we create a column representing the transcript_id with `mutate`:
```{r}
exons <- exons %>%
  mutate(tx_id = sub("_exon.*", "", name))
```
To find all GWAS SNPs that overlap exons, we use `join_overlap_inner`. This
will create a new _GRanges_ with the coordinates of SNPs that overlap exons, as
well as metadata from both objects.
```{r}
olap <- join_overlap_inner(gwas, exons)
olap
```
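
The same join can be tried without the HelloRangesData files. The sketch below uses two tiny, hypothetical objects (`toy_snps` and `toy_exons`, with invented coordinates and names) to show which ranges are kept and how same-named metadata columns are disambiguated.

```{r}
library(plyranges)

# Hypothetical SNP positions (width 1) and exon intervals
toy_snps <- as_granges(data.frame(
  seqnames = "chr1",
  start    = c(5L, 50L, 120L),
  width    = 1L,
  name     = c("rs_a", "rs_b", "rs_c")
))
toy_exons <- as_granges(data.frame(
  seqnames = "chr1",
  start    = c(1L, 100L),
  end      = c(10L, 150L),
  name     = c("tx1_exon1", "tx2_exon1")
))

# Only SNPs that fall inside an exon are returned; the shared column
# `name` appears as `name.x` (SNP) and `name.y` (exon)
toy_olap <- join_overlap_inner(toy_snps, toy_exons)
toy_olap
```

Here the SNP at position 50 lies outside both exons, so it does not appear in the result.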
For each SNP we can count the number of times it overlaps a transcript.
```{r}
olap %>%
  group_by(name.x, tx_id) %>%
  summarise(n = n())
```
We can also generate 2 bp splice sites on either side of each exon using
`flank_left` and `flank_right`. We add a column indicating the side of flanking
for illustrative purposes. The `interweave` function pairs the left and right
ranges objects.
```{r}
left_ss <- flank_left(exons, 2L)
right_ss <- flank_right(exons, 2L)
all_ss <- interweave(left_ss, right_ss, .id = "side")
all_ss
```
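
To make the flanking coordinates easy to check by hand, here is a minimal sketch on two made-up exons (`toy_ex` is a hypothetical object, not part of the example data).

```{r}
library(plyranges)

# Two invented exons on chr1
toy_ex <- as_granges(data.frame(
  seqnames = "chr1",
  start    = c(100L, 200L),
  end      = c(150L, 260L)
))

# 2 bp immediately to the left of each exon start
toy_left <- flank_left(toy_ex, 2L)

# 2 bp immediately to the right of each exon end
toy_right <- flank_right(toy_ex, 2L)

# Pair left and right flanks, recording the side in a new column
toy_ss <- interweave(toy_left, toy_right, .id = "side")
toy_ss
```

For an exon spanning 100 to 150, the left flank covers 98 to 99 and the right flank covers 151 to 152.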
# Learning more
- The [*fluentGenomics* workflow](https://sa-lee.github.io/fluentGenomics) package shows you how to combine differentially expressed genes and differential chromatin accessibility peaks using plyranges. It extends the [case study](https://github.com/mikelove/plyrangesTximetaCaseStudy) by Michael Love for using plyranges with [tximeta](https://bioconductor.org/packages/release/bioc/html/tximeta.html).
- The [extended vignette in the plyrangesWorkshops package](https://github.com/sa-lee/plyrangesWorkshops) has a detailed
walkthrough of using plyranges for coverage analysis.
- The [Bioc 2018 Workshop book](https://bioconductor.github.io/BiocWorkshops/fluent-genomic-data-analysis-with-plyranges.html) has worked examples of using `plyranges` to analyse publicly available genomics data.
# Citation
If you found `plyranges` useful for your work, please cite our
[paper](http://dx.doi.org/10.1186/s13059-018-1597-8):
```
@ARTICLE{Lee2019,
title = "plyranges: a grammar of genomic data transformation",
author = "Lee, Stuart and Cook, Dianne and Lawrence, Michael",
journal = "Genome Biol.",
volume = 20,
number = 1,
pages = "4",
month = jan,
year = 2019,
url = "http://dx.doi.org/10.1186/s13059-018-1597-8",
doi = "10.1186/s13059-018-1597-8",
pmc = "PMC6320618"
}
```
# Contributing
We welcome contributions from the R/Bioconductor community. We ask that
contributors follow the [code of conduct](.github/CODE_OF_CONDUCT.md) and the guide
outlined [here](.github/CONTRIBUTING.md).