https://github.com/tidyomics/plyranges

A grammar of genomic data transformation

Host: GitHub
URL: https://github.com/tidyomics/plyranges
Owner: tidyomics
Created: 2017-08-28T05:12:39.000Z (almost 9 years ago)
Default Branch: devel
Last Pushed: 2025-11-20T14:44:53.000Z (8 months ago)
Last Synced: 2025-11-20T16:21:14.246Z (8 months ago)
Topics: bioconductor, data-analysis, dplyr, genomic-ranges, genomics, tidy-data
Language: R
Homepage: https://tidyomics.github.io/plyranges/
Size: 3.34 MB
Stars: 148
Watchers: 11
Forks: 18
Open Issues: 39
Metadata Files:
- Readme: README.Rmd
- Changelog: NEWS.md
- Contributing: .github/CONTRIBUTING.md
- Code of conduct: .github/CODE_OF_CONDUCT.md

---
output: github_document
---

```{r, echo = FALSE, message=FALSE, warning=FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "README-"
)
```

# plyranges: fluent genomic data analysis 

[![R-CMD-check-bioc](https://github.com/tidyomics/plyranges/workflows/R-CMD-check-bioc/badge.svg)](https://github.com/tidyomics/plyranges/actions?query=workflow%3AR-CMD-check-bioc)

[![BioC status](http://www.bioconductor.org/shields/build/release/bioc/plyranges.svg)](https://bioconductor.org/checkResults/release/bioc-LATEST/plyranges)

[plyranges](https://www.bioconductor.org/packages/release/bioc/html/plyranges.html) provides a consistent interface for importing and wrangling
genomics data from a variety of sources. The package defines a grammar of
genomic data transformation based on `dplyr` and the Bioconductor packages
`IRanges`, `GenomicRanges`, and `rtracklayer`. It does this by providing a set
of verbs for developing analysis pipelines based on _Ranges_ objects that
represent genomic regions:

* Modify genomic regions with the `mutate()` and `stretch()` functions.
* Modify genomic regions while fixing the start/end/center coordinates with the `anchor_` family of functions.
* Sort genomic ranges with `arrange()`.
* Modify, subset, and aggregate genomic data with the `mutate()`,
  `filter()`, and `summarise()` functions.
* Any of the above operations can be performed on partitions of the
  data with `group_by()`.
* Find nearest neighbour genomic regions with the `join_nearest_` family
  of functions.
* Find overlaps between ranges with the `join_overlap_` family of functions.
* Merge all overlapping and adjacent genomic regions with `reduce_ranges()`.
* Merge the end points of all genomic regions with `disjoin_ranges()`.
* Import and write common genomic data formats with the `read_/write_` family
  of functions.

For more details on the features of plyranges, read the
[vignette](https://tidyomics.github.io/plyranges/articles/an-introduction.html).

For a complete case study on using plyranges to combine ATAC-seq and RNA-seq
results, read the [*fluentGenomics*
workflow](https://tidyomics.github.io/fluentGenomics).

plyranges is part of the [tidyomics](https://github.com/tidyomics)
project, providing a `dplyr`-based interface for many types of
genomics datasets represented in Bioconductor.

# Installation

[plyranges](https://www.bioconductor.org/packages/release/bioc/html/plyranges.html) can be installed from the latest Bioconductor
release:

```{r, eval=FALSE}

# install.packages("BiocManager")

BiocManager::install("plyranges")

```

To install the development version from GitHub:

```{r, eval=FALSE}

BiocManager::install("tidyomics/plyranges")

```

# Quick overview

## About `Ranges`

`Ranges` objects can either represent sets of integers as `IRanges` (which have
start, end and width attributes) or represent genomic intervals as `GRanges`
(which have two additional attributes: sequence name and strand). In addition,
both types of `Ranges` can store information about their intervals as metadata
columns (for example, GC content over a genomic interval).

`Ranges` objects follow the tidy data principle: each row of a `Ranges` object
corresponds to an interval, each column represents a variable about
that interval, and generally each object represents a single unit of
observation (like gene annotations).

We can construct an `IRanges` object from a `data.frame` that has a `start`
column and either a `width` or an `end` column using the `as_iranges()` method.

```{r, message=FALSE}

library(plyranges)

df <- data.frame(start = 1:5, width = 5)

as_iranges(df)

# alternatively with end

df <- data.frame(start = 1:5, end = 5:9)

as_iranges(df)

```
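
As a rough sketch of how the range-modifying verbs behave, the toy example below changes widths with and without an anchor. The data and the `score` column are invented for illustration, and the comments describe the behaviour as we understand it from the package vignette rather than a full specification.

```{r}
library(plyranges)

# Toy ranges; the `score` column is made up for this illustration
rng <- as_iranges(data.frame(start = c(11, 20, 30),
                             width = 5,
                             score = c(0.1, 0.8, 0.5)))

# mutate() can add a metadata column...
rng %>% mutate(high = score > 0.4)

# ...or modify the range itself; without an anchor the start stays fixed
wide_start <- rng %>% mutate(width = 10)

# an anchor changes which coordinate is held fixed while the width changes
wide_end <- rng %>% anchor_end() %>% mutate(width = 10)

# stretch() extends a range; anchoring the start leaves the start untouched
stretched <- rng %>% anchor_start() %>% stretch(4)
```

Anchoring is a separate step in the pipeline, so the same `mutate()` or `stretch()` call can be reused with a different fixed coordinate.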

We can also construct a `GRanges` object in a similar manner. Note that a
`GRanges` object requires at least a seqnames column to be present in the
data.frame (but not necessarily a strand column).

```{r}

df <- data.frame(seqnames = c("chr1", "chr2", "chr2", "chr1", "chr2"),

                 start = 1:5,

                 width = 5)

as_granges(df)

# strand can be specified with `+`, `*` (missing) and `-`

df$strand <- c("+", "+", "-", "-", "*")

as_granges(df)

```
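
Once a `GRanges` object exists, the sorting and merging verbs can be applied to it directly. The following is a minimal sketch on invented data (the object name `gr` and the `score` column are assumptions for illustration), showing `arrange()`, `reduce_ranges()` and `disjoin_ranges()`.

```{r}
library(plyranges)

# Three toy intervals on one chromosome; the first two overlap
gr <- as_granges(data.frame(seqnames = "chr1",
                            start = c(1, 4, 20),
                            end = c(5, 10, 25),
                            score = c(3, 1, 2)))

# sort the ranges by a metadata column
sorted <- gr %>% arrange(score)

# merge overlapping ranges, aggregating metadata over each merged group
merged <- gr %>% reduce_ranges(total = sum(score))
merged

# split the ranges at every start and end point
pieces <- gr %>% disjoin_ranges()
pieces
```

Here the intervals 1-5 and 4-10 are merged into 1-10 by `reduce_ranges()`, while `disjoin_ranges()` breaks them into the non-overlapping pieces 1-3, 4-5 and 6-10.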

# Example: finding GWAS hits that overlap known exons

Let's look at a more realistic example (taken from the HelloRanges vignette).

```{r, include=FALSE}

dir <- system.file(package = "HelloRangesData", "extdata/")

genome <- as_granges(read.delim(file.path(dir, "hg19.genome"),

                     header = FALSE),

                     seqnames = V1, start = 1L, width = V2)

gwas <- read_bed(file.path(dir, "gwas.bed"), genome_info = genome)

exons <- read_bed(file.path(dir, "exons.bed"), genome_info = genome)

```

Suppose we have two _GRanges_ objects: one containing coordinates of known
exons and another containing SNPs from a GWAS.

The first and last 5 exons are printed below. There are two additional columns
corresponding to the exon name and a score.

We could check the number of exons per chromosome using `group_by` and
`summarise`.

```{r}

exons

exons %>%

  group_by(seqnames) %>%

  summarise(n = n())

```

Next we create a column representing the transcript_id with `mutate`:

```{r}

exons <- exons %>%

  mutate(tx_id = sub("_exon.*", "", name))

```

To find all GWAS SNPs that overlap exons, we use `join_overlap_inner`. This
will create a new _GRanges_ with the coordinates of SNPs that overlap exons, as
well as metadata from both objects.

```{r}

olap <- join_overlap_inner(gwas, exons)

olap

```
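
The join can be easier to follow on a small input. This sketch uses invented SNP and exon coordinates (the names `toy_snps`, `toy_exons` and `toy_hits` are hypothetical) to show which ranges are kept and how metadata columns that share a name are disambiguated with `.x` and `.y` suffixes.

```{r}
library(plyranges)

# Three made-up SNPs: two fall inside an exon, one does not
toy_snps <- as_granges(data.frame(seqnames = "chr1",
                                  start = c(3, 50, 120),
                                  width = 1,
                                  name = c("rs1", "rs2", "rs3")))

# Two made-up exons
toy_exons <- as_granges(data.frame(seqnames = "chr1",
                                   start = c(1, 100),
                                   end = c(10, 150),
                                   name = c("txA_exon1", "txB_exon1")))

# keep only SNPs that overlap an exon; both `name` columns are carried
# over and renamed to `name.x` (from toy_snps) and `name.y` (from toy_exons)
toy_hits <- join_overlap_inner(toy_snps, toy_exons)
toy_hits
```

This is why the summary of `olap` groups by `name.x`: it is the SNP name from the left-hand object of the join.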

For each SNP we can count the number of times it overlaps a transcript.

```{r}

olap %>%

  group_by(name.x, tx_id) %>%

  summarise(n = n())

```

We can also generate 2bp splice sites on either side of the exon using
`flank_left` and `flank_right`. We add a column indicating the side of flanking
for illustrative purposes. The `interweave` function pairs the left and right
ranges objects.

```{r}

left_ss <- flank_left(exons, 2L)

right_ss <- flank_right(exons, 2L)

all_ss <- interweave(left_ss, right_ss, .id = "side")

all_ss

```
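
To make the flanking coordinates concrete, here is a small sketch on a single invented exon on the minus strand (the object name `toy_exon` is hypothetical). It also contrasts the positional `flank_left`/`flank_right` verbs with the strand-aware `flank_upstream`, under the assumption that the latter follows the usual `GenomicRanges` convention for the minus strand.

```{r}
library(plyranges)

# One made-up exon spanning positions 10-20 on the minus strand
toy_exon <- as_granges(data.frame(seqnames = "chr1",
                                  start = 10,
                                  end = 20,
                                  strand = "-"))

# positional flanks ignore strand: left is before the start, right is after the end
toy_left <- flank_left(toy_exon, 2L)
toy_right <- flank_right(toy_exon, 2L)

# the strand-aware variant treats the end of a minus-strand range as upstream
toy_up <- flank_upstream(toy_exon, 2L)
```

For splice sites, where orientation relative to the transcript matters, the strand-aware variants may be the more natural choice; the positional ones are convenient when only genomic order is of interest.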

# Learning more

- The [*fluentGenomics* workflow](https://sa-lee.github.io/fluentGenomics) package shows you how to combine differentially expressed genes and differential chromatin accessibility peaks using plyranges. It extends the [case study](https://github.com/mikelove/plyrangesTximetaCaseStudy) by Michael Love for using plyranges with [tximeta](https://bioconductor.org/packages/release/bioc/html/tximeta.html).
- The [extended vignette in the plyrangesWorkshops package](https://github.com/sa-lee/plyrangesWorkshops) has a detailed
  walkthrough of using plyranges for coverage analysis.
- The [Bioc 2018 Workshop book](https://bioconductor.github.io/BiocWorkshops/fluent-genomic-data-analysis-with-plyranges.html) has worked examples of using `plyranges` to analyse publicly available genomics data.

# Citation

If you found `plyranges` useful for your work, please cite our
[paper](http://dx.doi.org/10.1186/s13059-018-1597-8):

```
@ARTICLE{Lee2019,
  title    = "plyranges: a grammar of genomic data transformation",
  author   = "Lee, Stuart and Cook, Dianne and Lawrence, Michael",
  journal  = "Genome Biol.",
  volume   =  20,
  number   =  1,
  pages    = "4",
  month    =  jan,
  year     =  2019,
  url      = "http://dx.doi.org/10.1186/s13059-018-1597-8",
  doi      = "10.1186/s13059-018-1597-8",
  pmc      = "PMC6320618"
}
```

# Contributing

We welcome contributions from the R/Bioconductor community. We ask that
contributors follow the [code of conduct](.github/CODE_OF_CONDUCT.md) and the guide
outlined [here](.github/CONTRIBUTING.md).
