{"id":13858162,"url":"https://github.com/edgararuiz-zz/dbplot","last_synced_at":"2025-04-14T19:44:05.200Z","repository":{"id":81867644,"uuid":"100206682","full_name":"edgararuiz-zz/dbplot","owner":"edgararuiz-zz","description":"Simplifies plotting of database and sparklyr data","archived":false,"fork":false,"pushed_at":"2020-07-29T15:15:11.000Z","size":3269,"stargazers_count":124,"open_issues_count":14,"forks_count":20,"subscribers_count":6,"default_branch":"master","last_synced_at":"2025-03-28T08:04:47.064Z","etag":null,"topics":["databases","dbplot","ggplot2","r","rlang","sparklyr","visualization"],"latest_commit_sha":null,"homepage":"https://edgararuiz.github.io/dbplot/","language":"R","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/edgararuiz-zz.png","metadata":{"files":{"readme":"README.Rmd","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null}},"created_at":"2017-08-13T21:50:29.000Z","updated_at":"2024-11-01T22:39:53.000Z","dependencies_parsed_at":null,"dependency_job_id":"d789e66c-c8e5-48ee-8acc-62d13558e5f0","html_url":"https://github.com/edgararuiz-zz/dbplot","commit_stats":null,"previous_names":[],"tags_count":6,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/edgararuiz-zz%2Fdbplot","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/edgararuiz-zz%2Fdbplot/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/edgararuiz-zz%2Fdbplot/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/edgararuiz-zz%2Fdbplot/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owne
rs/edgararuiz-zz","download_url":"https://codeload.github.com/edgararuiz-zz/dbplot/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248949807,"owners_count":21188154,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["databases","dbplot","ggplot2","r","rlang","sparklyr","visualization"],"created_at":"2024-08-05T03:01:58.821Z","updated_at":"2025-04-14T19:44:05.166Z","avatar_url":"https://github.com/edgararuiz-zz.png","language":"R","readme":"---\noutput: github_document\n---\n\n# dbplot \u003cimg src=\"man/figures/logo.png\" align=\"right\" alt=\"\" width=\"220\" /\u003e\n\n```{r, setup, include = FALSE}\nlibrary(dplyr)\nlibrary(dbplot)\nlibrary(nycflights13)\n\nknitr::opts_chunk$set(\n  collapse = TRUE,\n  comment = \"#\u003e\",\n  fig.path = \"man/figures/README-\",\n  out.width = \"100%\"\n)\n\n#knitr::opts_chunk$set(fig.height = 3.5, fig.width =  4, fig.align = 'center')\n```\n\n\n[![Build Status](https://travis-ci.org/edgararuiz/dbplot.svg?branch=master)](https://travis-ci.org/edgararuiz/dbplot)\n[![CRAN\\_Status\\_Badge](http://www.r-pkg.org/badges/version/dbplot)](https://cran.r-project.org/package=dbplot)\n[![Coverage status](https://codecov.io/gh/edgararuiz/dbplot/branch/master/graph/badge.svg)](https://codecov.io/github/edgararuiz/dbplot?branch=master)\n\n-   [Installation](#installation)\n-   [Connecting to a data source](#connecting-to-a-data-source)\n-   [Example](#example)\n-   [`ggplot`](#ggplot)\n    -   [Histogram](#histogram)\n    -   [Raster](#raster)\n    -   [Bar Plot](#bar-plot)\n    -   
[Line plot](#line-plot)\n    -   [Boxplot](#boxplot)\n-   [Calculation functions](#calculation-functions)\n-   [`db_bin()`](#db_bin)\n\nLeverages `dplyr` to process the calculations of a plot inside a database.  This package provides helper functions that abstract the work at three levels:\n    \n1. Functions that output a `ggplot2` object\n2. Functions that output a `data.frame` object with the calculations\n3. A function that creates the formula needed to calculate bins for a Histogram or a Raster plot\n\n## Installation\n\nYou can install the released version from CRAN:\n```{r, eval = FALSE}\n# install.packages(\"dbplot\")\n```\n\nOr the development version from GitHub, using the `remotes` package:\n```{r, eval = FALSE}\n# install.packages(\"remotes\")\n# remotes::install_github(\"edgararuiz/dbplot\")\n```\n\n\n## Connecting to a data source\n\n- For more information on how to connect to databases, including Hive, please visit http://db.rstudio.com \n\n- To use Spark, please visit the `sparklyr` official website: http://spark.rstudio.com\n\n## Example\n\nIn addition to database connections, the functions work with `sparklyr`. A local `RSQLite` database will be used for the examples in this README.  
\n\n\n```{r}\nlibrary(DBI)\nlibrary(odbc)\nlibrary(dplyr)\n\ncon \u003c- dbConnect(RSQLite::SQLite(), \":memory:\")\ndb_flights \u003c- copy_to(con, nycflights13::flights, \"flights\")\n```\n\n## `ggplot`\n\n### Histogram\n\nBy default `dbplot_histogram()` creates a 30 bin histogram\n\n```{r}\nlibrary(ggplot2)\n\ndb_flights %\u003e% \n  dbplot_histogram(distance)\n```\n\nUse `binwidth` to fix the bin size\n\n```{r}\ndb_flights %\u003e% \n  dbplot_histogram(distance, binwidth = 400)\n```\n\nBecause it outputs a `ggplot2` object, more customization can be done\n\n```{r}\ndb_flights %\u003e% \n  dbplot_histogram(distance, binwidth = 400) +\n  labs(title = \"Flights - Distance traveled\") +\n  theme_bw()\n```\n\n### Raster\n\nTo visualize two continuous variables, we typically resort to a Scatter plot. However, this may not be practical when visualizing millions or billions of dots representing the intersections of the two variables. A Raster plot may be a better option, because it concentrates the intersections into squares that are easier to parse visually.\n\nA Raster plot basically does the same as a Histogram. It takes two continuous variables and creates discrete 2-dimensional bins represented as squares in the plot. It then determines either the number of rows inside each square or processes some aggregation, like an average.\n\n\n- If no `fill` argument is passed, the default calculation will be count, `n()`\n```{r}\ndb_flights %\u003e%\n  dbplot_raster(sched_dep_time, sched_arr_time) \n```\n\n\n- Pass an aggregation formula that can run inside the database\n```{r}\ndb_flights %\u003e%\n  dbplot_raster(\n    sched_dep_time, \n    sched_arr_time, \n    mean(distance, na.rm = TRUE)\n    ) \n```\n\n- Increase or decrease for more, or less, definition.  
The `resolution` argument controls that; it defaults to 100 \n```{r}\ndb_flights %\u003e%\n  dbplot_raster(\n    sched_dep_time, \n    sched_arr_time, \n    mean(distance, na.rm = TRUE),\n    resolution = 20\n    ) \n```\n\n### Bar Plot\n\n- `dbplot_bar()` defaults to a `tally()` of each value in a discrete variable\n```{r}\ndb_flights %\u003e%\n  dbplot_bar(origin)\n```\n\n\n- Pass a formula, and a column name, that will be calculated for each value in the discrete variable\n```{r}\ndb_flights %\u003e%\n  dbplot_bar(origin, avg_delay =  mean(dep_delay, na.rm = TRUE))\n```\n\n### Line plot\n\n- `dbplot_line()` defaults to a `tally()` of each value in a discrete variable\n```{r}\ndb_flights %\u003e%\n  dbplot_line(month)\n```\n\n- Pass a formula that will be calculated for each value in the discrete variable\n```{r}\ndb_flights %\u003e%\n  dbplot_line(month, avg_delay = mean(dep_delay, na.rm = TRUE))\n```\n\n### Boxplot\n\n`dbplot_boxplot()` expects a discrete variable to group by, and a continuous variable to calculate the percentiles and IQR. It doesn't calculate outliers. It has been tested with the following connections:\n\n- MS SQL Server\n- PostgreSQL\n- Oracle\n- `sparklyr`\n\nHere is an example using `dbplot_boxplot()` with a local data frame:\n\n```{r}\nnycflights13::flights %\u003e%\n  dbplot_boxplot(origin, distance)\n```\n\n\n\n## Calculation functions\n\nIf a more customized plot is needed, the data that underpins the plots can also be accessed:\n\n1. `db_compute_bins()` - Returns a data frame with the bins and count per bin\n2. `db_compute_count()` - Returns a data frame with the count per discrete value\n3. `db_compute_raster()` -  Returns a data frame with the results per x/y intersection\n4. `db_compute_raster2()` -  Returns the same as `db_compute_raster()`, plus the coordinates of the x/y boxes\n5. 
`db_compute_boxplot()` -  Returns a data frame with boxplot calculations\n\n\n```{r}\ndb_flights %\u003e%\n  db_compute_bins(arr_delay) \n```\n\nThe data can be piped to a plot\n\n```{r}\ndb_flights %\u003e%\n  filter(arr_delay \u003c 100 , arr_delay \u003e -50) %\u003e%\n  db_compute_bins(arr_delay) %\u003e%\n  ggplot() +\n  geom_col(aes(arr_delay, count, fill = count))\n```\n\n\n## `db_bin()`\n\nUses `rlang` to build the formula needed to create the bins of a numeric variable in an unevaluated fashion. This way, the formula can then be passed inside a `dplyr` verb.\n\n```{r}\ndb_bin(var)\n```\n\n\n```{r}\ndb_flights %\u003e%\n  group_by(x = !! db_bin(arr_delay)) %\u003e%\n  tally()\n```\n\n```{r}\ndb_flights %\u003e%\n  filter(!is.na(arr_delay)) %\u003e%\n  group_by(x = !! db_bin(arr_delay)) %\u003e%\n  tally() %\u003e%\n  collect() %\u003e%\n  ggplot() +\n  geom_col(aes(x, n))\n```\n\n```{r}\ndbDisconnect(con)\n```\n\n","funding_links":[],"categories":["R"],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fedgararuiz-zz%2Fdbplot","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fedgararuiz-zz%2Fdbplot","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fedgararuiz-zz%2Fdbplot/lists"}