https://github.com/jonocarroll/DFplyr

A `DataFrame` (`S4Vectors`) backend for `dplyr`

---
output: github_document
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)
```

# DFplyr

The goal of DFplyr is to enable `dplyr` and `ggplot2` support for `S4Vectors::DataFrame` by
providing the appropriate extension methods. As row names are an important feature of many
Bioconductor structures, these are preserved where possible.

## Installation

You can install the development version from [GitHub](https://github.com/) with:

``` r
# install.packages("devtools")
devtools::install_github("jonocarroll/DFplyr")
```

## Examples

Most `dplyr` functions are implemented. If you find any which are not, please [file an issue](https://github.com/jonocarroll/DFplyr/issues/new).

```{r}
suppressPackageStartupMessages(
  suppressWarnings({
    library(S4Vectors)
    library(dplyr)
    library(DFplyr)
  })
)
```

First, create an `S4Vectors` `DataFrame`, including S4 columns if desired:

```{r}
library(S4Vectors)
m <- mtcars[, c("cyl", "hp", "am", "gear", "disp")]
d <- as(m, "DataFrame")
d$grX <- GenomicRanges::GRanges("chrX", IRanges::IRanges(1:32, width = 10))
d$grY <- GenomicRanges::GRanges("chrY", IRanges::IRanges(1:32, width = 10))
d$nl <- IRanges::NumericList(lapply(d$gear, function(n) round(rnorm(n), 2)))
d
```

This will appear in RStudio's environment pane as a `Formal class DataFrame (dplyr-compatible)` when
using `DFplyr`. No interference with the actual object is required, but this helps
identify that `dplyr`-compatibility is available.

`DataFrame`s can then be used in `dplyr` calls in the same way as `data.frame` or `tibble`
objects. Working with S4 columns is supported, provided appropriate functions exist for
them. Adding multiple columns will result in the new columns being created in
alphabetical order.

```{r}
mutate(d, newvar = cyl + hp)

mutate(d, nl2 = nl * 2)

mutate(d, length_nl = lengths(nl))

mutate(d,
  chr = GenomeInfoDb::seqnames(grX),
  strand_X = BiocGenerics::strand(grX),
  end_X = BiocGenerics::end(grX)
)
```

The object returned remains a standard `DataFrame`, and further calls can be
piped with `%>%`:

```{r}
mutate(d, newvar = cyl + hp) %>%
  pull(newvar)
```

Some of the variants of the `dplyr` verbs also work:

```{r}
mutate_if(d, is.numeric, ~.^2)

mutate_if(d, ~inherits(., "GRanges"), BiocGenerics::start)
```

`tidyselect` helpers can only be used inside `dplyr::vars()` calls, together with
the `_at` variants:

```{r}
mutate_at(d, vars(starts_with("c")), ~.^2)

select_at(d, vars(starts_with("gr")))
```

Importantly, grouped operations are supported. `DataFrame` does not
natively support groups (just as `data.frame` does not), so these
are implemented specifically for `DFplyr`:

```{r}
group_by(d, cyl, am)

# group_by(d, cyl) %>%
# top_n(1, disp)
```
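
To make the idea of grouping on an object without native group support more concrete, here is a minimal base R sketch. It is an illustration only, not necessarily how `DFplyr` stores or applies groups: it simply records the row indices belonging to each observed `(cyl, am)` combination and applies a function per group. The names `d_small` and `group_rows` are made up for this example.

```{r}
library(S4Vectors)

# A small DataFrame with atomic columns only (illustrative)
d_small <- as(mtcars[, c("cyl", "am", "disp")], "DataFrame")

# Row indices for each observed combination of the grouping columns.
# This representation is an assumption for illustration, not DFplyr's internals.
group_rows <- split(
  seq_len(nrow(d_small)),
  list(d_small$cyl, d_small$am),
  drop = TRUE
)

# Apply a per-group computation using those indices
vapply(group_rows, function(i) max(d_small$disp[i]), numeric(1))
```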

Other verbs are similarly implemented, and preserve row names where possible:

```{r}
select(d, am, cyl)

select(d, am, cyl) %>%
  DFplyr::rename(foo = am)

arrange(d, desc(hp))

filter(d, am == 0)

slice(d, 3:6)

# dd <- rbind(data.frame(d[1, ], row.names = "MyCar"), d)
# dd
```

Row names are not preserved when there may be duplicates or when they would not make
sense. Otherwise, the first label is kept, according to the current de-duplication
method (in the case of `distinct`, this is `BiocGenerics::duplicated`). This may cause
complications for S4 columns.

```{r}
distinct(d)

group_by(d, cyl, am) %>%
  tally(gear)

count(d, gear, am, cyl)
```
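
The "first label" rule can be illustrated with a plain `data.frame` and `duplicated()`. This is a sketch of the general idea on a toy object (`df_dup` is made up for this example), not the `DFplyr` code path itself: when rows are identical, only the first occurrence survives, so its row name is the one retained.

```{r}
# Toy data: rows "r1" and "r2" are identical
df_dup <- data.frame(
  x = c(1, 1, 2),
  y = c("a", "a", "b"),
  row.names = c("r1", "r2", "r3")
)

# duplicated() flags every repeat after the first occurrence
keep <- !duplicated(df_dup)

# The surviving rows carry the row name of the first occurrence
df_dup[keep, ]
```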

`ggplot2` support is also enabled:

```{r}
library(ggplot2)
ggplot(d, aes(disp, cyl)) + geom_point()
```

## Implementation

Most of the `dplyr` verbs for `DataFrame`s are implemented by first converting to `tibble`,
performing the verb operation, then converting back to `DataFrame`. Care has been taken to
retain groups and row names through these operations, but this may introduce some
complications. If you spot any, please [file an issue](https://github.com/jonocarroll/DFplyr/issues/new).
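
The round trip described above can be sketched roughly as follows. The helper `via_tibble()` is hypothetical and not part of `DFplyr`. It assumes atomic columns only and ignores groups and S4 columns, which the real implementation has to handle. Row names are carried through in a temporary column (named `.rn` here, an arbitrary choice), since a `tibble` does not keep them.

```{r}
library(S4Vectors)

# Hypothetical helper, for illustration only: apply a dplyr verb to a
# DataFrame by converting to tibble and back. Assumes atomic columns.
via_tibble <- function(.data, verb, ...) {
  # Store row names in a temporary column so the tibble keeps them
  tbl <- tibble::as_tibble(as.data.frame(.data), rownames = ".rn")
  out <- verb(tbl, ...)
  # Recover the row names (NULL if the verb dropped the column)
  rn <- out[[".rn"]]
  out[[".rn"]] <- NULL
  res <- as(as.data.frame(out), "DataFrame")
  rownames(res) <- rn
  res
}

d_atomic <- as(mtcars[, c("cyl", "hp", "am")], "DataFrame")
res <- via_tibble(d_atomic, dplyr::filter, am == 0)
res
```

A verb that drops the temporary column (for example a `select()` that does not include it) returns a result without row names in this sketch, which is one reason a real implementation needs per-verb care.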