Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://edgararuiz.github.io/dbplot/
Simplifies plotting of database and sparklyr data
- Host: GitHub
- URL: https://edgararuiz.github.io/dbplot/
- Owner: edgararuiz
- Fork: true (edgararuiz-zz/dbplot)
- Created: 2021-01-12T01:47:49.000Z (about 4 years ago)
- Default Branch: master
- Last Pushed: 2020-07-29T15:15:11.000Z (over 4 years ago)
- Last Synced: 2024-09-23T01:18:36.798Z (4 months ago)
- Homepage: https://edgararuiz.github.io/dbplot/
- Size: 3.12 MB
- Stars: 8
- Watchers: 2
- Forks: 0
- Open Issues: 1
Metadata Files:
- Readme: README.Rmd
Awesome Lists containing this project
- awesome-ggplot2 - dbplot
README
---
output: github_document
---

# dbplot
```{r, setup, include = FALSE}
library(dplyr)
library(dbplot)
library(nycflights13)

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)

#knitr::opts_chunk$set(fig.height = 3.5, fig.width = 4, fig.align = 'center')
```

[![Build Status](https://travis-ci.org/edgararuiz/dbplot.svg?branch=master)](https://travis-ci.org/edgararuiz/dbplot)
[![CRAN\_Status\_Badge](http://www.r-pkg.org/badges/version/dbplot)](https://cran.r-project.org/package=dbplot)
[![Coverage status](https://codecov.io/gh/edgararuiz/dbplot/branch/master/graph/badge.svg)](https://codecov.io/github/edgararuiz/dbplot?branch=master)

- [Installation](#installation)
- [Connecting to a data source](#connecting-to-a-data-source)
- [Example](#example)
- [`ggplot`](#ggplot)
  - [Histogram](#histogram)
  - [Raster](#raster)
  - [Bar Plot](#bar-plot)
  - [Line plot](#line-plot)
  - [Boxplot](#boxplot)
- [Calculation functions](#calculation-functions)
- [`db_bin()`](#db_bin)

Leverages `dplyr` to process the calculations of a plot inside a database. This package provides helper functions that abstract the work at three levels:

1. Functions that output a `ggplot2` object
2. Functions that output a `data.frame` object with the calculations
3. A function that creates the formula needed to calculate bins for a Histogram or a Raster plot

## Installation
You can install the released version from CRAN:
```{r, eval = FALSE}
# install.packages("dbplot")
```

Or the development version from GitHub, using the `remotes` package:
```{r, eval = FALSE}
# install.packages("remotes")
# remotes::install_github("edgararuiz/dbplot")
```

## Connecting to a data source
- For more information on how to connect to databases, including Hive, please visit http://db.rstudio.com
- To use Spark, please visit the `sparklyr` official website: http://spark.rstudio.com
## Example
In addition to database connections, the functions work with `sparklyr`. A local `RSQLite` database will be used for the examples in this README.
```{r}
library(DBI)
library(odbc)
library(dplyr)

con <- dbConnect(RSQLite::SQLite(), ":memory:")
db_flights <- copy_to(con, nycflights13::flights, "flights")
```

## `ggplot`
### Histogram
By default `dbplot_histogram()` creates a 30 bin histogram
```{r}
library(ggplot2)

db_flights %>%
  dbplot_histogram(distance)
```

Use `binwidth` to fix the bin size
```{r}
db_flights %>%
  dbplot_histogram(distance, binwidth = 400)
```

Because it outputs a `ggplot2` object, more customization can be done
```{r}
db_flights %>%
  dbplot_histogram(distance, binwidth = 400) +
  labs(title = "Flights - Distance traveled") +
  theme_bw()
```

### Raster
To visualize two continuous variables, we typically resort to a Scatter plot. However, this may not be practical when visualizing millions or billions of dots representing the intersections of the two variables. A Raster plot may be a better option, because it concentrates the intersections into squares that are easier to parse visually.
A Raster plot basically does the same as a Histogram. It takes two continuous variables and creates discrete 2-dimensional bins represented as squares in the plot. It then determines either the number of rows inside each square or processes some aggregation, like an average.
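The binning described above can be sketched on a small local data frame with plain `dplyr`. This is only an illustration of the idea, not the code `dbplot` generates: the toy columns `x`, `y`, `z` and the fixed square size `width` are assumptions made for the example, whereas `dbplot_raster()` derives its squares from the `resolution` argument.

```{r}
library(dplyr)

# Toy data standing in for two continuous columns (x, y) and a value (z)
pts <- tibble(
  x = c(0.5, 1.2, 1.8, 3.1, 3.9, 0.1),
  y = c(0.2, 0.9, 1.1, 3.5, 3.7, 3.8),
  z = c(10, 20, 30, 40, 50, 60)
)

# Assumed side length of each square
width <- 2

squares <- pts %>%
  mutate(
    # Snap each point to the lower-left corner of its square
    x_bin = floor(x / width) * width,
    y_bin = floor(y / width) * width
  ) %>%
  group_by(x_bin, y_bin) %>%
  # Either count the rows in each square or aggregate another column
  summarise(n = n(), avg_z = mean(z), .groups = "drop")

squares
```

Each row of `squares` is one square of the raster: `n` is the row count and `avg_z` is an example aggregation.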
- If no `fill` argument is passed, the default calculation will be count, `n()`
```{r}
db_flights %>%
  dbplot_raster(sched_dep_time, sched_arr_time)
```

- Pass an aggregation formula that can run inside the database
```{r}
db_flights %>%
  dbplot_raster(
    sched_dep_time,
    sched_arr_time,
    mean(distance, na.rm = TRUE)
  )
```

- Increase or decrease the `resolution` argument for more or less definition; it defaults to 100
```{r}
db_flights %>%
  dbplot_raster(
    sched_dep_time,
    sched_arr_time,
    mean(distance, na.rm = TRUE),
    resolution = 20
  )
```

### Bar Plot
- `dbplot_bar()` defaults to a tally() of each value in a discrete variable
```{r}
db_flights %>%
  dbplot_bar(origin)
```

- Pass a formula, along with a column name, to be calculated for each value in the discrete variable
```{r}
db_flights %>%
  dbplot_bar(origin, avg_delay = mean(dep_delay, na.rm = TRUE))
```

### Line plot
- `dbplot_line()` defaults to a tally() of each value in a discrete variable
```{r}
db_flights %>%
  dbplot_line(month)
```

- Pass a formula to be calculated for each value in the discrete variable
```{r}
db_flights %>%
  dbplot_line(month, avg_delay = mean(dep_delay, na.rm = TRUE))
```

### Boxplot
It expects a discrete variable to group by, and a continuous variable to calculate the percentiles and IQR. It doesn't calculate outliers. It has been tested with the following connections:
- MS SQL Server
- PostgreSQL
- Oracle
- `sparklyr`

Here is an example using `dbplot_boxplot()` with a local data frame:
```{r}
nycflights13::flights %>%
  dbplot_boxplot(origin, distance)
```

## Calculation functions
If a more customized plot is needed, the data that underpins the plots can also be accessed:
1. `db_compute_bins()` - Returns a data frame with the bins and count per bin
2. `db_compute_count()` - Returns a data frame with the count per discrete value
3. `db_compute_raster()` - Returns a data frame with the results per x/y intersection
4. `db_compute_raster2()` - Returns same as `db_compute_raster()` function plus the coordinates of the x/y boxes
5. `db_compute_boxplot()` - Returns a data frame with boxplot calculations

```{r}
db_flights %>%
  db_compute_bins(arr_delay)
```

The data can be piped to a plot
```{r}
db_flights %>%
  filter(arr_delay < 100, arr_delay > -50) %>%
  db_compute_bins(arr_delay) %>%
  ggplot() +
  geom_col(aes(arr_delay, count, fill = count))
```

## `db_bin()`
Uses 'rlang' to build the formula needed to create the bins of a numeric variable in an un-evaluated fashion. This way, the formula can be then passed inside a dplyr verb.
```{r}
db_bin(var)
```

```{r}
db_flights %>%
  group_by(x = !!db_bin(arr_delay)) %>%
  tally()
```

```{r}
db_flights %>%
  filter(!is.na(arr_delay)) %>%
  group_by(x = !!db_bin(arr_delay)) %>%
  tally() %>%
  collect() %>%
  ggplot() +
  geom_col(aes(x, n))
```

```{r}
dbDisconnect(con)
```
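As a rough illustration of the un-evaluated approach that `db_bin()` takes, the sketch below builds a binning expression with `rlang` and injects it into a `dplyr` verb with `!!`. The helper `bin_expr()` and its fixed-width formula are hypothetical and written only for this example; they are not part of `dbplot`, and `db_bin()` builds a different formula.

```{r}
library(dplyr)
library(rlang)

# Hypothetical helper (not part of dbplot): returns an unevaluated
# expression that maps a variable to the lower edge of a fixed-width bin.
bin_expr <- function(var, binwidth) {
  var <- enexpr(var)  # capture the column name without evaluating it
  expr(floor((!!var) / (!!binwidth)) * (!!binwidth))
}

# Toy data standing in for a remote table
delays <- tibble(arr_delay = c(-12, 3, 8, 27, 31, 95))

# The expression is only evaluated once it is injected into group_by()
binned <- delays %>%
  group_by(bin = !!bin_expr(arr_delay, 30)) %>%
  tally()

binned
```

Because the helper returns an expression rather than a computed vector, the same pattern lets `dplyr` translate the formula for whichever backend the table lives in.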