https://github.com/talegari/tidier
dplyr friendly spark style window aggregation for R dataframes and remote dbplyr tbls
dbplyr dplyr mutate rstats rstats-package spark-sql tidyverse
- Host: GitHub
- URL: https://github.com/talegari/tidier
- Owner: talegari
- License: gpl-3.0
- Created: 2023-04-25T10:50:47.000Z (almost 3 years ago)
- Default Branch: master
- Last Pushed: 2023-09-10T17:41:15.000Z (over 2 years ago)
- Last Synced: 2025-10-12T03:54:34.570Z (4 months ago)
- Topics: dbplyr, dplyr, mutate, rstats, rstats-package, spark-sql, tidyverse
- Language: R
- Homepage: https://talegari.github.io/tidier/
- Size: 438 KB
- Stars: 3
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.Rmd
- Changelog: NEWS.md
- License: LICENSE.md
README
---
output: github_document
---
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%"
)
library("purrr") # prevent `purrr` load message by `furrr`
devtools::load_all()
```
# tidier
[CRAN status](https://CRAN.R-project.org/package=tidier) [R-CMD-check](https://github.com/talegari/tidier/actions/workflows/R-CMD-check.yaml) `r badger::badge_devel(color = "blue")`
The `tidier` package provides '[Apache Spark](https://spark.apache.org/)' style window aggregation for R dataframes and remote `dbplyr` tbls via '[mutate](https://dplyr.tidyverse.org/reference/mutate.html)' in the '[dplyr](https://dplyr.tidyverse.org/index.html)' flavour.
## Example
**Create a new column with the average temperature over the last seven days in the same month**.
```{r}
set.seed(101)
airquality |>
  # create date column
  dplyr::mutate(date_col = lubridate::make_date(1973, Month, Day)) |>
  # create gaps by removing some days
  dplyr::slice_sample(prop = 0.8) |>
  # compute mean temperature over last seven days in the same month
  tidier::mutate(avg_temp_over_last_week = mean(Temp, na.rm = TRUE),
                 .order_by = Day,
                 .by = Month,
                 .frame = c(lubridate::days(7), # 7 days before current row
                            lubridate::days(-1) # do not include current row
                            ),
                 .index = date_col
                 )
```
## Features
- `mutate` supports
  - `.by` (group by),
  - `.order_by` (order by),
  - `.frame` (endpoints of the window frame),
  - `.index` (identifies an index column such as a date column; dataframe version only),
  - `.complete` (whether to compute over incomplete windows; dataframe version only).
- `mutate` automatically uses a future backend (via [`furrr`](https://furrr.futureverse.org/); dataframe version only).
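
To make the idea of a row-based window frame concrete, here is a minimal base R sketch. It is not `tidier`'s implementation: `slide_rows` is a hypothetical helper written only for illustration, and its `complete` argument assumes semantics similar to `slider`'s `.complete` (positions with an incomplete window yield `NA`).

```r
# Hypothetical helper (not part of tidier): applies `fn` over a row-based
# window of `before` rows preceding and `after` rows following each position.
slide_rows <- function(x, fn, before = 0L, after = 0L, complete = FALSE) {
  n <- length(x)
  vapply(seq_len(n), function(i) {
    lo <- i - before
    hi <- i + after
    # Assumed behaviour: with complete = TRUE, a window that spills past
    # either end of the vector is not evaluated and yields NA.
    if (complete && (lo < 1L || hi > n)) {
      return(NA_real_)
    }
    # Otherwise the window is truncated to the available rows.
    fn(x[max(lo, 1L):min(hi, n)])
  }, numeric(1))
}

temps <- c(10, 20, 30, 40)

# Mean over the current row and the one before it; partial windows allowed.
partial_means <- slide_rows(temps, mean, before = 1L)

# Same frame, but only complete windows are evaluated.
complete_means <- slide_rows(temps, mean, before = 1L, complete = TRUE)
```

Judging from the example above, `.frame` appears to take its endpoints as `c(before, after)`, so this sketch roughly corresponds to a frame of one row before and zero rows after, applied within a single group that is already ordered.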
## Motivation
This implementation is inspired by Apache Spark's [`WindowSpec`](https://spark.apache.org/docs/3.2.1/api/python/reference/api/pyspark.sql.Column.over.html?highlight=windowspec) class with [`rangeBetween`](https://spark.apache.org/docs/3.2.1/api/python/reference/api/pyspark.sql.WindowSpec.rangeBetween.html) and [`rowsBetween`](https://spark.apache.org/docs/3.2.1/api/python/reference/api/pyspark.sql.WindowSpec.rowsBetween.html).
## Ecosystem
1. [`dbplyr`](https://dbplyr.tidyverse.org/) implements this via [`dbplyr::win_over`](https://dbplyr.tidyverse.org/reference/win_over.html?q=win_over#null), enabling [`sparklyr`](https://spark.rstudio.com/) users to write window computations. Also see [`dbplyr::window_order`/`dbplyr::window_frame`](https://dbplyr.tidyverse.org/reference/window_order.html?q=window_fr#ref-usage). `tidier`'s `mutate` wraps this functionality in a uniform syntax across dataframes and remote tbls.
2. The [`tidypyspark`](https://talegari.github.io/tidypyspark/_build/html/index.html) Python package implements a `mutate` style window computation API for pyspark.
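
For comparison, the following sketch shows how a similar window computation might be written directly with `dbplyr::window_order` and `dbplyr::window_frame`, without `tidier`. It uses a simulated Postgres connection so no database is needed. The one-row table and the column name `avg_temp` are made up for illustration.

```r
# A lazy table on a simulated backend: nothing is executed, only SQL is built.
lf <- dbplyr::lazy_frame(
  Month = 5L, Day = 1L, Temp = 67,
  con = dbplyr::simulate_postgres()
)

query <- lf |>
  dplyr::group_by(Month) |>        # becomes PARTITION BY
  dbplyr::window_order(Day) |>     # becomes ORDER BY inside the window
  dbplyr::window_frame(-3, 0) |>   # 3 preceding rows up to the current row
  dplyr::mutate(avg_temp = mean(Temp, na.rm = TRUE))

# Render the generated SQL as a character string for inspection.
sql <- as.character(dbplyr::sql_render(query))
cat(sql, "\n")
```

Here the grouping, ordering, and frame are set by three separate verbs before `mutate` is called, whereas `tidier`'s `mutate` takes them as `.by`, `.order_by`, and `.frame` arguments in a single call.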
## Installation
- dev: `remotes::install_github("talegari/tidier")`
- cran: `install.packages("tidier")`
## Acknowledgements
The `tidier` package is deeply indebted to three amazing packages and the people behind them.
1. [`dplyr`](https://cran.r-project.org/package=dplyr):
```
Wickham H, François R, Henry L, Müller K, Vaughan D (2023). _dplyr: A
Grammar of Data Manipulation_. R package version 1.1.0.
```
2. [`slider`](https://cran.r-project.org/package=slider):
```
Vaughan D (2021). _slider: Sliding Window Functions_. R package
version 0.2.2.
```
3. [`dbplyr`](https://cran.r-project.org/package=dbplyr):
```
Wickham H, Girlich M, Ruiz E (2023). _dbplyr: A 'dplyr' Back End
for Databases_. R package version 2.3.2.
```