https://github.com/winvector/rqdatatable
Implement the rquery piped query algebra in R using data.table. Distributed under choice of GPL-2 or GPL-3 license.
- Host: GitHub
- URL: https://github.com/winvector/rqdatatable
- Owner: WinVector
- License: other
- Created: 2018-05-30T00:49:26.000Z (over 6 years ago)
- Default Branch: main
- Last Pushed: 2023-08-20T05:23:41.000Z (over 1 year ago)
- Last Synced: 2024-06-21T14:12:39.766Z (7 months ago)
- Language: R
- Homepage: https://winvector.github.io/rqdatatable/
- Size: 40.4 MB
- Stars: 37
- Watchers: 9
- Forks: 3
- Open Issues: 0
Metadata Files:
- Readme: README.Rmd
- License: LICENSE
README
---
output: github_document
---

[![CRAN_Status_Badge](https://www.r-pkg.org/badges/version/rqdatatable)](https://cran.r-project.org/package=rqdatatable)
[![status](https://tinyverse.netlify.com/badge/rqdatatable)](https://CRAN.R-project.org/package=rqdatatable)

![](https://github.com/WinVector/rqdatatable/raw/master/tools/rqdatatable.png)

[`rqdatatable`](https://github.com/WinVector/rqdatatable) is an implementation of
the [`rquery`](https://github.com/WinVector/rquery) piped Codd-style relational algebra
hosted on [`data.table`](https://rdatatable.gitlab.io/data.table/). `rquery` allows the expression
of complex transformations as a series of relational operators and
`rqdatatable` implements the operators using `data.table`.

A `Python` version of `rquery`/`rqdatatable` is under initial development as [`data_algebra`](https://github.com/WinVector/data_algebra).

For example,
scoring a logistic regression model (which requires grouping, ordering, and ranking)
is organized as follows. For more on this example please see
["Let’s Have Some Sympathy For The Part-time R User"](https://win-vector.com/2017/08/04/lets-have-some-sympathy-for-the-part-time-r-user/).

```{r}
library("rqdatatable")
```

```{r}
# data example
dL <- build_frame(
   "subjectID", "surveyCategory"     , "assessmentTotal" |
   1          , "withdrawal behavior", 5                 |
   1          , "positive re-framing", 2                 |
   2          , "withdrawal behavior", 3                 |
   2          , "positive re-framing", 4                 )
```

```{r}
scale <- 0.237

# example rquery pipeline
rquery_pipeline <- local_td(dL) %.>%
  extend_nse(.,
             probability :=
               exp(assessmentTotal * scale)) %.>%
  normalize_cols(.,
                 "probability",
                 partitionby = 'subjectID') %.>%
  pick_top_k(.,
             k = 1,
             partitionby = 'subjectID',
             orderby = c('probability', 'surveyCategory'),
             reverse = c('probability', 'surveyCategory')) %.>%
  rename_columns(., c('diagnosis' = 'surveyCategory')) %.>%
  select_columns(., c('subjectID',
                      'diagnosis',
                      'probability')) %.>%
  orderby(., cols = 'subjectID')
```

We can show the expanded form of the query tree.

```{r, comment=""}
cat(format(rquery_pipeline))
```

And execute it using `data.table`.

```{r}
ex_data_table(rquery_pipeline)
```

One can also apply the pipeline to new tables.

```{r}
build_frame(
   "subjectID", "surveyCategory"     , "assessmentTotal" |
   7          , "withdrawal behavior", 5                 |
   7          , "positive re-framing", 20                ) %.>%
  rquery_pipeline
```

Initial benchmarking of `rqdatatable` is very favorable (notes [here](https://win-vector.com/2018/06/03/rqdatatable-rquery-powered-by-data-table/)).

To install `rqdatatable` please use `install.packages("rqdatatable")`.
Some related work includes:
* [`data.table`](https://rdatatable.gitlab.io/data.table/)
* [`Polars`](https://www.pola.rs)
* [`data algebra`](https://github.com/WinVector/data_algebra)
* [`disk.frame`](https://github.com/DiskFrame/disk.frame)
* [`dbplyr`](https://dbplyr.tidyverse.org)
* [`dplyr`](https://dplyr.tidyverse.org)
* [`dtplyr`](https://github.com/tidyverse/dtplyr)
* [`maditr`](https://github.com/gdemin/maditr)
* [`nc`](https://github.com/tdhock/nc)
* [`poorman`](https://github.com/nathaneastwood/poorman)
* [`rquery`](https://github.com/WinVector/rquery)
* [`SparkR`](https://CRAN.R-project.org/package=SparkR)
* [`sparklyr`](https://spark.rstudio.com)
* [`sqldf`](https://github.com/ggrothendieck/sqldf)
* [`table.express`](https://github.com/asardaes/table.express)
* [`tidyfast`](https://github.com/TysonStanley/tidyfast)
* [`tidyfst`](https://github.com/hope-data-science/tidyfst)
* [`tidyquery`](https://github.com/ianmcook/tidyquery)
* [`tidyr`](https://tidyr.tidyverse.org)
* [`tidytable`](https://github.com/markfairbanks/tidytable) (formerly `gdt`/`tidydt`)

Note `rqdatatable` has an "immediate mode" which allows direct application of pipeline stages without
pre-assembling the pipeline. "Immediate mode" is a convenience for ad-hoc analyses, and has some negative
performance impact, so we encourage users to build pipelines for most work. Some notes on the issue can be found
[here](https://github.com/WinVector/rqdatatable/blob/master/extras/ImmediateIssue.md).

`rqdatatable` implements the `rquery` grammar in the style of a "Turing or Cook reduction" (implementing the result in terms of multiple oracle calls to the related system).

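As a rough sketch of the "immediate mode" versus pre-assembled pipeline distinction noted above: the first form applies an operator directly to a `data.frame`, while the second builds the operator tree once from a table description and then applies it. The data frame `d` and the columns `x` and `y` are made up for illustration, and the sketch assumes `extend_nse()` accepts a `data.frame` directly when `rqdatatable` is attached.

```{r}
library("rqdatatable")

# illustrative data (not from the package documentation)
d <- data.frame(x = c(1, 2, 3))

# immediate mode: the operator is applied directly to the data.frame
res_immediate <- d %.>%
  extend_nse(., y := x + 1)

# pre-assembled pipeline: build once from a table description, then apply
ops <- local_td(d) %.>%
  extend_nse(., y := x + 1)
res_pipeline <- d %.>% ops
```

Both forms are expected to yield the same columns; the pre-assembled `ops` can be reused on other tables that have the same column structure.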
`rqdatatable` is intended for "simple column names". In particular, because `rqdatatable` often uses `eval()` to work over `data.table`, escape characters such as "`\`" and "`\\`" are not reliable in column names. Also, `rqdatatable` does not support tables with no columns.
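
One possible way to guard against the column-name issues above is to normalize names with base R before building a pipeline. This is a minimal sketch using only base R; the helper `simple_names()` is hypothetical and not part of `rqdatatable`, and `make.names()` is only one of several reasonable renaming choices.

```{r}
# hypothetical helper (not part of rqdatatable): map arbitrary column
# names to syntactically valid, unique R names using base R
simple_names <- function(nms) {
  make.names(nms, unique = TRUE)
}

# illustrative table with a space and a backslash in its column names
d <- data.frame(1:2, 3:4)
names(d) <- c("a b", "x\\y")

# rqdatatable does not support tables with no columns, so check first
stopifnot(ncol(d) > 0)

names(d) <- simple_names(names(d))
print(names(d))
```

Renaming up front keeps the original data intact while giving the pipeline names that do not rely on escape characters.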