Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

https://github.com/jonocarroll/dfplyr

A `DataFrame` (`S4Vectors`) backend for `dplyr`
https://github.com/jonocarroll/dfplyr

Last synced: 3 months ago
JSON representation

A `DataFrame` (`S4Vectors`) backend for `dplyr`

Host: GitHub
URL: https://github.com/jonocarroll/dfplyr
Owner: jonocarroll
License: gpl-3.0
Created: 2019-12-05T12:56:14.000Z (about 5 years ago)
Default Branch: master
Last Pushed: 2024-10-16T04:35:32.000Z (3 months ago)
Last Synced: 2024-10-17T19:26:00.581Z (3 months ago)
Language: R
Size: 193 KB
Stars: 20
Watchers: 6
Forks: 0
Open Issues: 4
Metadata Files:
- Readme: README.Rmd
- Changelog: NEWS.md
- License: LICENSE.md

Awesome Lists containing this project

README

        ---

output: github_document

---

```{r, include = FALSE}

knitr::opts_chunk$set(

    collapse = TRUE,

    comment = "#>",

    fig.path = "man/figures/README-",

    out.width = "100%"

)

```

# DFplyr

The goal of DFplyr is to enable `dplyr` and `ggplot2` support for

`S4Vectors::DataFrame` by providing the appropriate extension methods. As row

names are an important feature of many Bioconductor structures, these are

preserved where possible.

## Installation

You can install the development version from [GitHub](https://github.com/) with:

``` r

# install.packages("devtools")

devtools::install_github("jonocarroll/DFplyr")

```

You can install from [Bioconductor](https://bioconductor.org) with:

``` r

if (!require("BiocManager", quietly =TRUE))

    install.packages("BiocManager")

# The following initializes usage of Bioc devel

BiocManager::install(version='devel')

BiocManager::install("DFplyr")

```

## Examples

First create an S4Vectors `DataFrame`, including S4 columns if desired

```{r}

library(S4Vectors)

m <- mtcars[, c("cyl", "hp", "am", "gear", "disp")]

d <- as(m, "DataFrame")

d$grX <- GenomicRanges::GRanges("chrX", IRanges::IRanges(1:32, width = 10))

d$grY <- GenomicRanges::GRanges("chrY", IRanges::IRanges(1:32, width = 10))

d$nl <- IRanges::NumericList(lapply(d$gear, function(n) round(rnorm(n), 2)))

d

```

This will appear in RStudio's environment pane as a 

```

Formal class DataFrame (dplyr-compatible)

``` 

when using `DFplyr`. No interference with the actual object is required, but

this helps identify that `dplyr`-compatibility is available.

`DataFrame`s can then be used in `dplyr` calls the same as `data.frame` or

`tibble` objects. Support for working with S4 columns is enabled provided they

have appropriate functions. Adding multiple columns will result in the new

columns being created in alphabetical order

```{r}

library(DFplyr)

mutate(d, newvar = cyl + hp)

mutate(d, nl2 = nl * 2)

mutate(d, length_nl = lengths(nl))

mutate(d,

    chr = GenomeInfoDb::seqnames(grX),

    strand_X = BiocGenerics::strand(grX),

    end_X = BiocGenerics::end(grX)

)

```

the object returned remains a standard `DataFrame`, and further calls can be 

piped with `%>%`

```{r}

mutate(d, newvar = cyl + hp) %>%

    pull(newvar)

```

Some of the variants of the `dplyr` verbs also work

```{r}

mutate_if(d, is.numeric, ~ .^2)

mutate_if(d, ~ inherits(., "GRanges"), BiocGenerics::start)

```

Use of `tidyselect` helpers is limited to within `dplyr::vars()` calls and using 

the `_at` variants

```{r}

mutate_at(d, vars(starts_with("c")), ~ .^2)

select_at(d, vars(starts_with("gr")))

```

Importantly, grouped operations are supported. `DataFrame` does not 

natively support groups (the same way that `data.frame` does not) so these

are implemented specifically for `DFplyr`

```{r}

group_by(d, cyl, am)

```

Other verbs are similarly implemented, and preserve row names where possible

```{r}

select(d, am, cyl)

arrange(d, desc(hp))

filter(d, am == 0)

slice(d, 3:6)

group_by(d, gear) %>%

    slice(1:2)

```

`rename` is itself renamed to `rename2` due to conflicts between {dplyr} and 

{S4Vectors}, but works in the {dplyr} sense of taking `new = old` replacements 

with NSE syntax

```{r}

select(d, am, cyl) %>%

    rename2(foo = am)

```

Row names are not preserved when there may be duplicates or they don't make

sense, otherwise the first label (according to the current de-duplication

method, in the case of `distinct`, this is via `BiocGenerics::duplicated`). This

may have complications for S4 columns.

```{r}

distinct(d)

group_by(d, cyl, am) %>%

    tally(gear)

count(d, gear, am, cyl)

```

## Coverage

Most `dplyr` functions are implemented with the exception of `join`s. 

If you find any which are not, please [file an issue](https://github.com/jonocarroll/DFplyr/issues/new).