https://github.com/jonocarroll/dfplyr
A `DataFrame` (`S4Vectors`) backend for `dplyr`
https://github.com/jonocarroll/dfplyr
Last synced: about 1 year ago
JSON representation
A `DataFrame` (`S4Vectors`) backend for `dplyr`
- Host: GitHub
- URL: https://github.com/jonocarroll/dfplyr
- Owner: jonocarroll
- License: gpl-3.0
- Created: 2019-12-05T12:56:14.000Z (over 6 years ago)
- Default Branch: master
- Last Pushed: 2024-10-16T04:35:32.000Z (over 1 year ago)
- Last Synced: 2025-03-12T14:50:44.105Z (over 1 year ago)
- Language: R
- Size: 193 KB
- Stars: 21
- Watchers: 5
- Forks: 0
- Open Issues: 4
-
Metadata Files:
- Readme: README.Rmd
- Changelog: NEWS.md
- License: LICENSE.md
Awesome Lists containing this project
README
---
output: github_document
---
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%"
)
```
# DFplyr
The goal of DFplyr is to enable `dplyr` and `ggplot2` support for
`S4Vectors::DataFrame` by providing the appropriate extension methods. As row
names are an important feature of many Bioconductor structures, these are
preserved where possible.
## Installation
You can install the development version from [GitHub](https://github.com/) with:
``` r
# install.packages("devtools")
devtools::install_github("jonocarroll/DFplyr")
```
You can install from [Bioconductor](https://bioconductor.org) with:
``` r
if (!require("BiocManager", quietly =TRUE))
install.packages("BiocManager")
# The following initializes usage of Bioc devel
BiocManager::install(version='devel')
BiocManager::install("DFplyr")
```
## Examples
First create an S4Vectors `DataFrame`, including S4 columns if desired
```{r}
library(S4Vectors)
m <- mtcars[, c("cyl", "hp", "am", "gear", "disp")]
d <- as(m, "DataFrame")
d$grX <- GenomicRanges::GRanges("chrX", IRanges::IRanges(1:32, width = 10))
d$grY <- GenomicRanges::GRanges("chrY", IRanges::IRanges(1:32, width = 10))
d$nl <- IRanges::NumericList(lapply(d$gear, function(n) round(rnorm(n), 2)))
d
```
This will appear in RStudio's environment pane as a
```
Formal class DataFrame (dplyr-compatible)
```
when using `DFplyr`. No interference with the actual object is required, but
this helps identify that `dplyr`-compatibility is available.
`DataFrame`s can then be used in `dplyr` calls the same as `data.frame` or
`tibble` objects. Support for working with S4 columns is enabled provided they
have appropriate functions. Adding multiple columns will result in the new
columns being created in alphabetical order
```{r}
library(DFplyr)
mutate(d, newvar = cyl + hp)
mutate(d, nl2 = nl * 2)
mutate(d, length_nl = lengths(nl))
mutate(d,
chr = GenomeInfoDb::seqnames(grX),
strand_X = BiocGenerics::strand(grX),
end_X = BiocGenerics::end(grX)
)
```
the object returned remains a standard `DataFrame`, and further calls can be
piped with `%>%`
```{r}
mutate(d, newvar = cyl + hp) %>%
pull(newvar)
```
Some of the variants of the `dplyr` verbs also work
```{r}
mutate_if(d, is.numeric, ~ .^2)
mutate_if(d, ~ inherits(., "GRanges"), BiocGenerics::start)
```
Use of `tidyselect` helpers is limited to within `dplyr::vars()` calls and using
the `_at` variants
```{r}
mutate_at(d, vars(starts_with("c")), ~ .^2)
select_at(d, vars(starts_with("gr")))
```
Importantly, grouped operations are supported. `DataFrame` does not
natively support groups (the same way that `data.frame` does not) so these
are implemented specifically for `DFplyr`
```{r}
group_by(d, cyl, am)
```
Other verbs are similarly implemented, and preserve row names where possible
```{r}
select(d, am, cyl)
arrange(d, desc(hp))
filter(d, am == 0)
slice(d, 3:6)
group_by(d, gear) %>%
slice(1:2)
```
`rename` is itself renamed to `rename2` due to conflicts between {dplyr} and
{S4Vectors}, but works in the {dplyr} sense of taking `new = old` replacements
with NSE syntax
```{r}
select(d, am, cyl) %>%
rename2(foo = am)
```
Row names are not preserved when there may be duplicates or they don't make
sense, otherwise the first label (according to the current de-duplication
method, in the case of `distinct`, this is via `BiocGenerics::duplicated`). This
may have complications for S4 columns.
```{r}
distinct(d)
group_by(d, cyl, am) %>%
tally(gear)
count(d, gear, am, cyl)
```
## Coverage
Most `dplyr` functions are implemented with the exception of `join`s.
If you find any which are not, please [file an issue](https://github.com/jonocarroll/DFplyr/issues/new).