https://edgararuiz.github.io/dbplot/

Simplifies plotting of database and sparklyr data

Host: GitHub
URL: https://edgararuiz.github.io/dbplot/
Owner: edgararuiz
Fork: true (edgararuiz-zz/dbplot)
Created: 2021-01-12T01:47:49.000Z (over 5 years ago)
Default Branch: master
Last Pushed: 2020-07-29T15:15:11.000Z (almost 6 years ago)
Last Synced: 2024-09-23T01:18:36.798Z (almost 2 years ago)
Homepage: https://edgararuiz.github.io/dbplot/
Size: 3.12 MB
Stars: 8
Watchers: 2
Forks: 0
Open Issues: 1
Metadata Files:
- Readme: README.Rmd


---
output: github_document
---

# dbplot 

```{r, setup, include = FALSE}

library(dplyr)

library(dbplot)

library(nycflights13)

knitr::opts_chunk$set(

  collapse = TRUE,

  comment = "#>",

  fig.path = "man/figures/README-",

  out.width = "100%"

)

#knitr::opts_chunk$set(fig.height = 3.5, fig.width =  4, fig.align = 'center')

```

[![Build Status](https://travis-ci.org/edgararuiz/dbplot.svg?branch=master)](https://travis-ci.org/edgararuiz/dbplot)

[![CRAN\_Status\_Badge](http://www.r-pkg.org/badges/version/dbplot)](https://cran.r-project.org/package=dbplot)

[![Coverage status](https://codecov.io/gh/edgararuiz/dbplot/branch/master/graph/badge.svg)](https://codecov.io/github/edgararuiz/dbplot?branch=master)

-   [Installation](#installation)

-   [Connecting to a data source](#connecting-to-a-data-source)

-   [Example](#example)

-   [`ggplot`](#ggplot)

    -   [Histogram](#histogram)

    -   [Raster](#raster)

    -   [Bar Plot](#bar-plot)

    -   [Line plot](#line-plot)

    -   [Boxplot](#boxplot)

-   [Calculation functions](#calculation-functions)

-   [`db_bin()`](#db_bin)

Leverages `dplyr` to process the calculations of a plot inside a database.  This package provides helper functions that abstract the work at three levels:

    

1. Functions that output a `ggplot2` object

2. Functions that output a `data.frame` object with the calculations

3. A function that creates the formula needed to calculate bins for a Histogram or a Raster plot

## Installation

You can install the released version from CRAN:

```{r, eval = FALSE}

install.packages("dbplot")

```

Or the development version from GitHub, using the `remotes` package:

```{r, eval = FALSE}

# install.packages("remotes")

remotes::install_github("edgararuiz/dbplot")

```

## Connecting to a data source

- For more information on how to connect to databases, including Hive, please visit http://db.rstudio.com 

- To use Spark, please visit the `sparklyr` official website: http://spark.rstudio.com

## Example

In addition to database connections, the functions work with `sparklyr`. A local `RSQLite` database will be used for the examples in this README.  

```{r}

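library(DBI)
library(dplyr)

# In-memory SQLite database; no external server or ODBC driver is needed
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Copy the flights data into the database and keep a remote table reference
db_flights <- copy_to(con, nycflights13::flights, "flights")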

```

## `ggplot`

### Histogram

By default, `dbplot_histogram()` creates a 30-bin histogram:

```{r}

library(ggplot2)

db_flights %>% 

  dbplot_histogram(distance)

```

Use `binwidth` to set a fixed bin width instead of a number of bins:

```{r}

db_flights %>% 

  dbplot_histogram(distance, binwidth = 400)

```

Because the function returns a `ggplot2` object, the plot can be customized further with additional `ggplot2` layers:

```{r}

db_flights %>% 

  dbplot_histogram(distance, binwidth = 400) +

  labs(title = "Flights - Distance traveled") +

  theme_bw()

```

### Raster

To visualize two continuous variables, we typically resort to a Scatter plot. However, this may not be practical when visualizing millions or billions of dots representing the intersections of the two variables. A Raster plot may be a better option, because it concentrates the intersections into squares that are easier to parse visually.

A Raster plot works much like a Histogram extended to two dimensions. It takes two continuous variables and creates discrete 2-dimensional bins, represented as squares in the plot. It then either counts the rows that fall inside each square or computes an aggregation over them, such as an average.

- If no `fill` argument is passed, the default calculation will be count, `n()`

```{r}

db_flights %>%

  dbplot_raster(sched_dep_time, sched_arr_time) 

```

- Pass an aggregation formula that can run inside the database

```{r}

db_flights %>%

  dbplot_raster(

    sched_dep_time, 

    sched_arr_time, 

    mean(distance, na.rm = TRUE)

    ) 

```

- Use the `resolution` argument to control the level of detail: higher values produce more, smaller squares and lower values produce fewer, larger ones. It defaults to 100.

```{r}

db_flights %>%

  dbplot_raster(

    sched_dep_time, 

    sched_arr_time, 

    mean(distance, na.rm = TRUE),

    resolution = 20

    ) 

```
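
As a rough illustration of the two-dimensional binning described in this section, the sketch below assigns each row of a small local data frame to a square and then counts or averages within each square. It uses base R only; the `toy` data, the `bin_edge()` helper, and the choice to fold the maximum value into the last bin are assumptions made for the example and may differ from what `dbplot_raster()` actually runs in the database.

```{r}
# Illustrative data only: two continuous variables and a value to aggregate
toy <- data.frame(
  x = c(1, 2, 3, 11, 12, 19, 20),
  y = c(5, 6, 25, 7, 28, 29, 30),
  z = c(10, 20, 30, 40, 50, 60, 70)
)

resolution <- 2  # number of bins along each axis

# Hypothetical helper: map each value to the lower edge of its equal-width bin.
# The maximum is folded into the last bin so it does not open an extra one.
bin_edge <- function(v, bins) {
  width <- (max(v) - min(v)) / bins
  idx <- pmin(floor((v - min(v)) / width), bins - 1)
  min(v) + idx * width
}

toy$x_bin <- bin_edge(toy$x, resolution)
toy$y_bin <- bin_edge(toy$y, resolution)

# One row per occupied square: row count, and mean of z
cell_n <- aggregate(z ~ x_bin + y_bin, data = toy, FUN = length)
cell_mean <- aggregate(z ~ x_bin + y_bin, data = toy, FUN = mean)

cell_n
cell_mean
```

With a larger `resolution` the squares get smaller and more numerous, which is the trade-off the `resolution` argument exposes.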

### Bar Plot

- `dbplot_bar()` defaults to a `tally()` of each value in a discrete variable

```{r}

db_flights %>%

  dbplot_bar(origin)

```

- Pass a named aggregation formula (`name = formula`) to compute it for each value in the discrete variable; the name becomes the column name of the result

```{r}

db_flights %>%

  dbplot_bar(origin, avg_delay =  mean(dep_delay, na.rm = TRUE))

```

### Line plot

- `dbplot_line()` defaults to a `tally()` of each value in a discrete variable

```{r}

db_flights %>%

  dbplot_line(month)

```

- Pass a named aggregation formula to compute it for each value in the discrete variable

```{r}

db_flights %>%

  dbplot_line(month, avg_delay = mean(dep_delay, na.rm = TRUE))

```
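
The two modes used by the bar and line plots, a row count per discrete value or a named aggregation per discrete value, can be sketched on a local vector with base R. This is only an illustration of the calculation, not the code that `dbplot` sends to the database; the `toy` data frame and its values are made up for the example.

```{r}
# Illustrative data only (not taken from nycflights13)
toy <- data.frame(
  origin = c("EWR", "EWR", "JFK", "LGA", "LGA", "LGA"),
  dep_delay = c(10, NA, 5, 0, 20, 40)
)

# Default mode: number of rows per discrete value
counts <- table(toy$origin)

# Aggregation mode: one summary value per discrete value, ignoring NAs
avg_delay <- tapply(toy$dep_delay, toy$origin, mean, na.rm = TRUE)

counts
avg_delay
```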

### Boxplot

`dbplot_boxplot()` expects a discrete variable to group by, and a continuous variable for which to calculate the percentiles and IQR. It does not calculate outliers. It has been tested with the following connections:

- MS SQL Server

- PostgreSQL

- Oracle

- `sparklyr`

Here is an example using `dbplot_boxplot()` with a local data frame:

```{r}

nycflights13::flights %>%

  dbplot_boxplot(origin, distance)

```
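
For reference, the per-group statistics that a boxplot is drawn from (quartiles, IQR, and whisker ends) can be sketched in base R as follows. The `toy` data and the `box_stats()` helper are hypothetical, and the whisker rule shown is the conventional 1.5 * IQR rule; databases use their own percentile functions, so results from `dbplot_boxplot()` may not match this sketch exactly.

```{r}
# Illustrative data only
toy <- data.frame(
  origin = rep(c("EWR", "JFK"), each = 5),
  distance = c(100, 200, 300, 400, 500, 200, 400, 600, 800, 1000)
)

# Hypothetical helper: quartiles and IQR for one group.
# Whiskers are clamped to the data range and to 1.5 * IQR beyond the box.
box_stats <- function(v, coef = 1.5) {
  q <- quantile(v, c(0.25, 0.5, 0.75), names = FALSE, na.rm = TRUE)
  iqr <- q[3] - q[1]
  c(
    lower = q[1], middle = q[2], upper = q[3], iqr = iqr,
    ymin = max(min(v, na.rm = TRUE), q[1] - coef * iqr),
    ymax = min(max(v, na.rm = TRUE), q[3] + coef * iqr)
  )
}

# One column of statistics per group
stats <- sapply(split(toy$distance, toy$origin), box_stats)
stats
```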

## Calculation functions

If a more customized plot is needed, the data that underpins the plots can also be accessed:

1. `db_compute_bins()` - Returns a data frame with the bins and count per bin

2. `db_compute_count()` - Returns a data frame with the count per discrete value

3. `db_compute_raster()` -  Returns a data frame with the results per x/y intersection

4. `db_compute_raster2()` -  Returns same as `db_compute_raster()` function plus the coordinates of the x/y boxes

5. `db_compute_boxplot()` -  Returns a data frame with boxplot calculations

```{r}

db_flights %>%

  db_compute_bins(arr_delay) 

```

The data can be piped to a plot

```{r}

db_flights %>%

  filter(arr_delay < 100 , arr_delay > -50) %>%

  db_compute_bins(arr_delay) %>%

  ggplot() +

  geom_col(aes(arr_delay, count, fill = count))

```

## `db_bin()`

Uses `rlang` to build, without evaluating it, the formula needed to create the bins of a numeric variable. The resulting expression can then be spliced into a `dplyr` verb with `!!`.

```{r}

db_bin(var)

```
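
To make the arithmetic behind such a formula concrete, here is a minimal base R sketch of equal-width binning applied to a plain numeric vector. The helper name `bin_lower_edge()` is hypothetical, and the sketch evaluates immediately, whereas `db_bin()` returns an unevaluated expression for the database to run; treat it as an approximation of the idea rather than the exact expression printed above.

```{r}
# Hypothetical helper: equal-width binning on a local numeric vector
bin_lower_edge <- function(v, bins = 30) {
  lo <- min(v, na.rm = TRUE)
  width <- (max(v, na.rm = TRUE) - lo) / bins
  idx <- floor((v - lo) / width)
  # Fold the maximum value into the last bin instead of opening an extra one
  idx <- ifelse(idx == bins, bins - 1, idx)
  # Each value is replaced by the lower edge of its bin
  lo + idx * width
}

edges <- bin_lower_edge(c(0, 5, 10, 15, 20), bins = 4)
edges
```

Grouping by the lower edge of each bin, as the `group_by()` calls in this section do, yields one row per bin that can be counted or aggregated.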

```{r}

db_flights %>%

  group_by(x = !! db_bin(arr_delay)) %>%

  tally()

```

```{r}

db_flights %>%

  filter(!is.na(arr_delay)) %>%

  group_by(x = !! db_bin(arr_delay)) %>%

  tally() %>%

  collect() %>%

  ggplot() +

  geom_col(aes(x, n))

```

```{r}

dbDisconnect(con)

```
