https://github.com/casualcomputer/sql.mechanic
Functions that generate SQL queries that summarize high-dimensional tables stored in various databases (e.g. Microsoft SQL Servers, Netezza, DB2, Postgres, Oracle, MySQL, etc.).
https://github.com/casualcomputer/sql.mechanic
data-analysis data-quality-checks data-science database mysql netezza oracle postgres quality-control r sql sql-server
Last synced: 4 months ago
JSON representation
Functions that generate SQL queries that summarize high-dimensional tables stored in various databases (e.g. Microsoft SQL Servers, Netezza, DB2, Postgres, Oracle, MySQL, etc.).
- Host: GitHub
- URL: https://github.com/casualcomputer/sql.mechanic
- Owner: casualcomputer
- License: gpl-3.0
- Created: 2021-06-19T05:06:49.000Z (almost 4 years ago)
- Default Branch: main
- Last Pushed: 2023-03-04T00:47:31.000Z (about 2 years ago)
- Last Synced: 2024-08-13T07:11:58.742Z (8 months ago)
- Topics: data-analysis, data-quality-checks, data-science, database, mysql, netezza, oracle, postgres, quality-control, r, sql, sql-server
- Language: R
- Homepage:
- Size: 90.8 KB
- Stars: 2
- Watchers: 1
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.Rmd
- License: LICENSE
Awesome Lists containing this project
- jimsghstars - casualcomputer/sql.mechanic - Functions that generate SQL queries that summarize high-dimensional tables stored in various databases (e.g. Microsoft SQL Servers, Netezza, DB2, Postgres, Oracle, MySQL, etc.). (R)
README
---
title: "Summarizing large tables with SQL"
author: "Henry Luan"
output: rmarkdown::github_document
vignette: >
%\VignetteIndexEntry{Vignette Title}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---```{r setup, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>"
)
```## Overview
The core helper function **get_summary_codes** in the `sql.mechanic` package takes in a string that specifies a table's database name, schema name, and table name. It outputs an SQL query that summarizes the table's column statistics.
If you don't have much time to read or just want a quick solution, jump to [**Example 2**](#example-2-automatically-summarize-tables-in-your-databases).
## Intended User
- If you work with tables inside databases hosted on powerful servers (usually on-premise) but have limited compute resources (e.g. RAM, GPU, CPU) for advanced BI tools (e.g. Python, R, SAS, etc.).
- If it's much cheaper for you to use database servers than analytics servers (e.g. those for Python, R, etc.) either on-premise or on the cloud.
## Limitations
- You probably want to understand how [**Example 2**](#example-2-automatically-summarize-tables-in-your-databases)affect the CPU and disk usage of your database server, to avoid bad surprises on your server's resource usage.
- If you use the setup of [**Example 2**](#example-2-automatically-summarize-tables-in-your-databases)on a cloud database, you **MUST** do some testing, to understand how the example affects the CPU and disk usage of your cloud resource. Please avoid potentially expensive mistakes.
- Currently, the function only works with Microsoft SQL Server and Netezza databases. Feel free to contribute to the codes, if interested.
## Credits
Specail credits: the R codes in this package are built on top of Gordon S. Linoff's book *Data Analysis Using SQL and Excel, 2nd Edition*. His work has been a tremendous inspiration for the creation of this package.
## Example 1: Generate SQL queries and execute them in DMBS
### Step 1: Install packages
You can install the library from my GitHub. If you have concerns regarding the package's security, you can download, check, and use the "get_summary_codes.R" file directly.
```{r, message=FALSE, warning=FALSE}
# Install package
library(devtools)
install_github("casualcomputer/sql.mechanic",quiet=TRUE)# Alternative: use the 'get_summary_codes.R' file only
# source("get_summary_codes.R")
```### Step 2: Generate SQL quires
The following codes 1) generate the SQL queries you need to summarize a table, and 2) copy (Ctrl+C) the codes to your clipboard. All you have to do is paste it to your SQL editor and execute the queries.
```{r, fig.show='hold'}
library(sql.mechanic)#SQL codes for basic summary, Netezza database
sql_basic_netezza = get_summary_codes("DB_NAME.SCHEMA_NAME.TABLE_NAME", type="basic", dbtype="Netezza")
#SQL codes for advanced summary, Netezza database
sql_advanced_netezza = get_summary_codes("DB_NAME.SCHEMA_NAME.TABLE_NAME", type="advanced", dbtype="Netezza")
#SQL codes for basic summary, Microsoft SQL Server
sql_basic_mssql = get_summary_codes("DB_NAME.SCHEMA_NAME.TABLE_NAME", type="basic", dbtype="MSSQL")#SQL codes for advanced summary, Microsoft SQL Server
sql_advanced_mssql = get_summary_codes("DB_NAME.SCHEMA_NAME.TABLE_NAME", type="advanced", dbtype="MSSQL")
#copy some of the sql queries to the clipboard
writeClipboard(sql_basic_netezza)
```### Step 3: Paste the codes in your clipboard and run it in SQL
In case you are curious, the SQL copied to your clipboard looks like this.
```{sql, eval = FALSE}
SELECT REPLACE(REPLACE(REPLACE(' SELECT '''' as colname,
COUNT(*) as numvalues,
MAX(freqnull) as freqnull,
CAST(MIN(minval) as CHAR(100)) as minval,
SUM(CASE WHEN = minval THEN freq ELSE 0 END) as numminvals,
CAST(MAX(maxval) as CHAR(100)) as maxval,
SUM(CASE WHEN = maxval THEN freq ELSE 0 END) as nummaxvals,
SUM(CASE WHEN freq =1 THEN 1 ELSE 0 END) as numuniquesFROM (SELECT , COUNT(*) as freq
FROM SCHEMA_NAME. GROUP BY ) osum
CROSS JOIN (SELECT MIN() as minval, MAX() as maxval, SUM(CASE WHEN IS NULL THEN 1 ELSE 0 END) as freqnull
FROM (SELECT FROM SCHEMA_NAME.) osum
) summary',
'', column_name),
'', 'TABLE_NAME'),
'',
(CASE WHEN ordinal_position = 1 THEN ''
ELSE 'UNION ALL' END)) as codes_data_summary
FROM (SELECT table_name, case when regexp_like(column_name,'[a-z.]|GROUP') then '"'||column_name||'"'
else column_name end as column_name , ordinal_position
FROM information_schema.columns
WHERE table_name ='TABLE_NAME') a;
```### Step 4: Copy, paste and execute the query results from Step 3.
## Example 2: Automatically summarize tables in your databases {#example-2-automatically-summarize-tables-in-your-databases}
This example shows you how you can summarize tables as you did in Example 1, with only a few lines of R codes.
```{r, fig.show='hold',eval=FALSE}
# Install package
library(devtools)
install_github("casualcomputer/sql.mechanic",quiet=TRUE)# Alternative: use the 'get_summary_codes.R' file only
# source("get_summary_codes.R")# Load packages
library(sql.mechanic)
library(odbc)
library(DBI)# Connect to database(s)
## Method 1: prompting user (you need to make some changes here)
con <- dbConnect(odbc(),
Driver = "SQL Server",
Server = "mysqlhost",
Database = "mydbname",
UID = "myuser",
PWD = "Database password",
Port = 1433, encoding = 'windows-1252') #'windows-1252' allows French to display properly## Alternative Method: Using a DSN
#con <- dbConnect(odbc::odbc(), "DNS_NAMES", encoding = 'windows-1252')sql_query = get_summary_codes("DB_NAME.SCHEMA_NAME.TABLE_NAME", type="basic", dbtype="Netezza")
res = dbSendQuery(con, sql_query) # part of Step 3 in "Example 1"
sql_query_mod = dbFetch(res) # part of Step 3 in "Example 1"res = dbSendQuery(con, sql_query_mod) # part of Step 4 in "Example 1"
output_table = dbFetch(res) # part of Step 4 in "Example 1"
print(output) #desired summary tabledbDisconnect(con) #close database connection
```