https://github.com/const-ae/tidygenomics

Tidy Verbs for Dealing with Genomic Data Frames https://const-ae.github.io/tidygenomics/

genomics intervals r-package r-stats tidy

Last synced: 3 months ago

Host: GitHub
URL: https://github.com/const-ae/tidygenomics
Owner: const-ae
Created: 2017-05-07T20:45:40.000Z (about 8 years ago)
Default Branch: master
Last Pushed: 2021-04-15T07:00:02.000Z (about 4 years ago)
Last Synced: 2025-03-23T20:37:02.891Z (4 months ago)
Topics: genomics, intervals, r-package, r-stats, tidy
Language: R
Homepage:
Size: 206 KB
Stars: 103
Watchers: 3
Forks: 5
Open Issues: 1
Metadata Files:
- Readme: README.md

README

# tidygenomics

[![CRAN_Status_Badge](https://www.r-pkg.org/badges/version/tidygenomics)](https://cran.r-project.org/package=tidygenomics)

Tidy Verbs for Dealing with Genomic Data Frames

## Description

Handle genomic data within data frames just as you would with `GRanges`.

This package provides methods to deal with genomic intervals the "tidy way", which makes them simpler to integrate into the general data munging process. The API is inspired by the popular bedtools and the `genome_join()` method from the fuzzyjoin package.

## Installation

```
install.packages("tidygenomics")
```

Or, to get the latest development version:

```
devtools::install_github("const-ae/tidygenomics")
```

## Documentation

#### genome_intersect

Joins two data frames based on their genomic overlap. Unlike the `genome_join` function, it updates the boundaries to reflect the overlap of the regions.



```{r}
library(tidygenomics)

x1 <- data.frame(id = 1:4,
                 chromosome = c("chr1", "chr1", "chr2", "chr2"),
                 start = c(100, 200, 300, 400),
                 end = c(150, 250, 350, 450))

x2 <- data.frame(id = 1:4,
                 chromosome = c("chr1", "chr2", "chr2", "chr1"),
                 start = c(140, 210, 400, 300),
                 end = c(160, 240, 415, 320))

genome_intersect(x1, x2, by=c("chromosome", "start", "end"), mode="both")
```

| id.x|chromosome | id.y| start| end|
|----:|:----------|----:|-----:|---:|
|    1|chr1       |    1|   140| 150|
|    4|chr2       |    3|   400| 415|
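The boundary update can be illustrated without the package. The following base R sketch is not how `genome_intersect` is implemented; it only reproduces the coordinates in the table above by pairing rows on the same chromosome and keeping the pairs whose clipped interval is non-empty. The names `pairs` and `overlap` are chosen for this illustration.

```r
x1 <- data.frame(id = 1:4,
                 chromosome = c("chr1", "chr1", "chr2", "chr2"),
                 start = c(100, 200, 300, 400),
                 end = c(150, 250, 350, 450))

x2 <- data.frame(id = 1:4,
                 chromosome = c("chr1", "chr2", "chr2", "chr1"),
                 start = c(140, 210, 400, 300),
                 end = c(160, 240, 415, 320))

# Pair every x1 row with every x2 row on the same chromosome
pairs <- merge(x1, x2, by = "chromosome", suffixes = c(".x", ".y"))

# Clip each pair to the region shared by both intervals
pairs$start <- pmax(pairs$start.x, pairs$start.y)
pairs$end <- pmin(pairs$end.x, pairs$end.y)

# A pair overlaps only if the clipped interval is non-empty
overlap <- pairs[pairs$start <= pairs$end,
                 c("id.x", "chromosome", "id.y", "start", "end")]
overlap <- overlap[order(overlap$id.x), ]
rownames(overlap) <- NULL
overlap
```

Pairing all rows per chromosome is quadratic in the number of intervals, so this is only meant to make the semantics concrete for small inputs.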

#### genome_subtract

Subtracts the intervals of one data frame from those of the other. This can be used to split the intervals of the `x` data frame into smaller pieces.



```{r}
x1 <- data.frame(id = 1:4,
                 chromosome = c("chr1", "chr1", "chr2", "chr1"),
                 start = c(100, 200, 300, 400),
                 end = c(150, 250, 350, 450))

x2 <- data.frame(id = 1:4,
                 chromosome = c("chr1", "chr2", "chr1", "chr1"),
                 start = c(120, 210, 300, 400),
                 end = c(125, 240, 320, 415))

genome_subtract(x1, x2, by=c("chromosome", "start", "end"))
```

| id|chromosome | start| end|
|--:|:----------|-----:|---:|
|  1|chr1       |   100| 119|
|  1|chr1       |   126| 150|
|  2|chr1       |   200| 250|
|  3|chr2       |   300| 350|
|  4|chr1       |   416| 450|
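To make the splitting behaviour concrete, here is a minimal base R sketch that subtracts a set of intervals from a single interval on the same chromosome. `subtract_one` is a hypothetical helper written for this illustration and is not part of tidygenomics; it assumes closed, integer coordinates as in the table above.

```r
# Hypothetical helper: remove the intervals [sub_start, sub_end] from [start, end].
# All intervals are assumed to lie on the same chromosome.
subtract_one <- function(start, end, sub_start, sub_end) {
  pos <- start                       # first position not yet handled
  out_start <- numeric(0)
  out_end <- numeric(0)
  for (i in order(sub_start)) {
    # Skip intervals that lie entirely outside the remaining region
    if (sub_end[i] < pos || sub_start[i] > end) next
    # Keep the piece in front of the subtracted interval, if any
    if (sub_start[i] > pos) {
      out_start <- c(out_start, pos)
      out_end <- c(out_end, sub_start[i] - 1)
    }
    pos <- max(pos, sub_end[i] + 1)
  }
  # Keep whatever remains after the last subtracted interval
  if (pos <= end) {
    out_start <- c(out_start, pos)
    out_end <- c(out_end, end)
  }
  data.frame(start = out_start, end = out_end)
}

# The chr1 intervals of x2 from the example above
chr1_start <- c(120, 300, 400)
chr1_end <- c(125, 320, 415)

first_piece <- subtract_one(100, 150, chr1_start, chr1_end)  # rows 100-119 and 126-150
last_piece <- subtract_one(400, 450, chr1_start, chr1_end)   # row 416-450
```

The two calls reproduce the rows for `id` 1 and `id` 4 in the table above.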

#### genome_join_closest

Joins two data frames based on their genomic location. If no overlapping interval is found, the closest interval on the same chromosome is used instead.



```{r}
# tibble::tibble() replaces the deprecated data_frame()
x1 <- tibble::tibble(id = 1:4,
                     chr = c("chr1", "chr1", "chr2", "chr3"),
                     start = c(100, 200, 300, 400),
                     end = c(150, 250, 350, 450))

x2 <- tibble::tibble(id = 1:4,
                     chr = c("chr1", "chr1", "chr1", "chr2"),
                     start = c(220, 210, 300, 400),
                     end = c(225, 240, 320, 415))

genome_join_closest(x1, x2, by=c("chr", "start", "end"), distance_column_name="distance", mode="left")
```

| id.x|chr.x | start.x| end.x| id.y|chr.y | start.y| end.y| distance|
|----:|:-----|-------:|-----:|----:|:-----|-------:|-----:|--------:|
|    1|chr1  |     100|   150|    2|chr1  |     210|   240|       59|
|    2|chr1  |     200|   250|    1|chr1  |     220|   225|        0|
|    2|chr1  |     200|   250|    2|chr1  |     210|   240|        0|
|    3|chr2  |     300|   350|    4|chr2  |     400|   415|       49|
|    4|chr3  |     400|   450|   NA|NA    |      NA|    NA|       NA|
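The `distance` values in the table above are consistent with counting the bases that lie strictly between two intervals, with overlapping intervals getting a distance of 0. The sketch below encodes that reading; `interval_gap` is a hypothetical helper, and the formula is inferred from the example output rather than taken from the package source, so edge cases such as directly adjacent intervals may be handled differently by `genome_join_closest`.

```r
# Hypothetical helper: number of bases strictly between two closed intervals,
# or 0 if they overlap. Inferred from the example output, not from the package.
interval_gap <- function(start_x, end_x, start_y, end_y) {
  pmax(0, start_y - end_x - 1, start_x - end_y - 1)
}

interval_gap(100, 150, 210, 240)  # 59, as in the first row of the table
interval_gap(200, 250, 220, 225)  # 0, the intervals overlap
interval_gap(300, 350, 400, 415)  # 49
```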

#### genome_cluster

Adds a new column `cluster_id` to the data frame. Two intervals get the same cluster ID if they overlap or lie within `max_distance` of each other.



```{r}
x1 <- data.frame(id = 1:4, bla=letters[1:4],
                 chromosome = c("chr1", "chr1", "chr2", "chr1"),
                 start = c(100, 120, 300, 260),
                 end = c(150, 250, 350, 450))

genome_cluster(x1, by=c("chromosome", "start", "end"))
```

| id|bla |chromosome | start| end| cluster_id|
|--:|:---|:----------|-----:|---:|----------:|
|  1|a   |chr1       |   100| 150|          0|
|  2|b   |chr1       |   120| 250|          0|
|  3|c   |chr2       |   300| 350|          2|
|  4|d   |chr1       |   260| 450|          1|

```{r}
genome_cluster(x1, by=c("chromosome", "start", "end"), max_distance=10)
```

| id|bla |chromosome | start| end| cluster_id|
|--:|:---|:----------|-----:|---:|----------:|
|  1|a   |chr1       |   100| 150|          0|
|  2|b   |chr1       |   120| 250|          0|
|  3|c   |chr2       |   300| 350|          1|
|  4|d   |chr1       |   260| 450|          0|
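The clustering rule can be sketched in base R as a single sweep over the intervals sorted by chromosome and start position. `cluster_intervals` is a hypothetical function written for this illustration, not the package's implementation. It assumes that a new cluster starts whenever the chromosome changes or the gap to the furthest end seen so far exceeds `max_distance`; this reproduces both tables above, but the exact boundary rule used by `genome_cluster` may differ.

```r
# Hypothetical re-implementation of the clustering rule for illustration only
cluster_intervals <- function(df, max_distance = 0) {
  chr <- as.character(df$chromosome)
  ord <- order(chr, df$start)        # sweep order: by chromosome, then start
  cluster <- integer(length(ord))
  current <- -1L                     # cluster IDs start at 0, as in the tables
  running_end <- -Inf                # furthest end seen in the current cluster
  prev_chr <- NA_character_
  for (k in seq_along(ord)) {
    i <- ord[k]
    new_chr <- is.na(prev_chr) || chr[i] != prev_chr
    if (new_chr || df$start[i] - running_end > max_distance) {
      current <- current + 1L        # open a new cluster
      running_end <- df$end[i]
    } else {
      running_end <- max(running_end, df$end[i])
    }
    cluster[k] <- current
    prev_chr <- chr[i]
  }
  df$cluster_id <- integer(nrow(df))
  df$cluster_id[ord] <- cluster      # map back to the original row order
  df
}

x1 <- data.frame(id = 1:4, bla = letters[1:4],
                 chromosome = c("chr1", "chr1", "chr2", "chr1"),
                 start = c(100, 120, 300, 260),
                 end = c(150, 250, 350, 450))

cluster_intervals(x1)$cluster_id                     # 0 0 2 1
cluster_intervals(x1, max_distance = 10)$cluster_id  # 0 0 1 0
```

Tracking the furthest end seen so far, rather than only the previous interval's end, keeps an interval in the cluster when it is nested inside an earlier, longer one.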

#### genome_complement

Calculates the complement of a genomic region.



```{r}
x1 <- data.frame(id = 1:4,
                 chromosome = c("chr1", "chr1", "chr2", "chr1"),
                 start = c(100, 200, 300, 400),
                 end = c(150, 250, 350, 450))

genome_complement(x1, by=c("chromosome", "start", "end"))
```

|chromosome | start| end|
|:----------|-----:|---:|
|chr1       |     1|  99|
|chr1       |   151| 199|
|chr1       |   251| 399|
|chr2       |     1| 299|
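In the output above, the complement starts at position 1 of each chromosome and contains no interval after the last input interval. A base R sketch that produces the same result is shown below; `complement_intervals` is a hypothetical function for illustration and not the package's implementation.

```r
# Hypothetical re-implementation of the complement for illustration only
complement_intervals <- function(df) {
  pieces <- lapply(split(df, as.character(df$chromosome)), function(d) {
    d <- d[order(d$start), ]
    # Furthest covered position before each interval (0 before the first one)
    covered_until <- c(0, cummax(d$end))[seq_len(nrow(d))]
    gap_start <- covered_until + 1
    gap_end <- d$start - 1
    keep <- gap_start <= gap_end     # drop empty gaps between touching intervals
    data.frame(chromosome = as.character(d$chromosome[keep]),
               start = gap_start[keep],
               end = gap_end[keep])
  })
  out <- do.call(rbind, pieces)
  rownames(out) <- NULL
  out
}

x1 <- data.frame(id = 1:4,
                 chromosome = c("chr1", "chr1", "chr2", "chr1"),
                 start = c(100, 200, 300, 400),
                 end = c(150, 250, 350, 450))

gaps <- complement_intervals(x1)
gaps
```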

#### genome_join

Classical join function based on the overlap of the intervals. Implemented and maintained in the [fuzzyjoin](https://github.com/dgrtwo/fuzzyjoin) package and documented here only for completeness.



```{r}
# tibble::tibble() replaces the deprecated data_frame()
x1 <- tibble::tibble(id = 1:4,
                     chr = c("chr1", "chr1", "chr2", "chr3"),
                     start = c(100, 200, 300, 400),
                     end = c(150, 250, 350, 450))

x2 <- tibble::tibble(id = 1:4,
                     chr = c("chr1", "chr1", "chr1", "chr2"),
                     start = c(220, 210, 300, 400),
                     end = c(225, 240, 320, 415))

fuzzyjoin::genome_join(x1, x2, by=c("chr", "start", "end"), mode="inner")
```

| id.x|chr.x | start.x| end.x| id.y|chr.y | start.y| end.y|
|----:|:-----|-------:|-----:|----:|:-----|-------:|-----:|
|    2|chr1  |     200|   250|    1|chr1  |     220|   225|
|    2|chr1  |     200|   250|    2|chr1  |     210|   240|

```{r}
fuzzyjoin::genome_join(x1, x2, by=c("chr", "start", "end"), mode="left")
```

| id.x|chr.x | start.x| end.x| id.y|chr.y | start.y| end.y|
|----:|:-----|-------:|-----:|----:|:-----|-------:|-----:|
|    1|chr1  |     100|   150|   NA|NA    |      NA|    NA|
|    2|chr1  |     200|   250|    1|chr1  |     220|   225|
|    2|chr1  |     200|   250|    2|chr1  |     210|   240|
|    3|chr2  |     300|   350|   NA|NA    |      NA|    NA|
|    4|chr3  |     400|   450|   NA|NA    |      NA|    NA|

```{r}
fuzzyjoin::genome_join(x1, x2, by=c("chr", "start", "end"), mode="anti")
```

| id|chr  | start| end|
|--:|:----|-----:|---:|
|  1|chr1 |   100| 150|
|  3|chr2 |   300| 350|
|  4|chr3 |   400| 450|

## Inspiration

- [tidyverse](http://tidyverse.org/)

- [fuzzyjoin](https://github.com/dgrtwo/fuzzyjoin)

- [GenomicRanges](http://bioconductor.org/packages/release/bioc/html/GenomicRanges.html)

- [bedtools](http://bedtools.readthedocs.io)

If you have any additional questions or encounter issues, please raise them on the [GitHub page](https://github.com/Artjom-Metro/tidygenomics).
