An open API service indexing awesome lists of open source software.

https://github.com/casualcomputer/sql.mechanic

Functions that generate SQL queries that summarize high-dimensional tables stored in various databases (e.g. Microsoft SQL Servers, Netezza, DB2, Postgres, Oracle, MySQL, etc.).
https://github.com/casualcomputer/sql.mechanic

data-analysis data-quality-checks data-science database mysql netezza oracle postgres quality-control r sql sql-server

Last synced: 4 months ago
JSON representation

Functions that generate SQL queries that summarize high-dimensional tables stored in various databases (e.g. Microsoft SQL Servers, Netezza, DB2, Postgres, Oracle, MySQL, etc.).

Awesome Lists containing this project

README

        

---
title: "Summarizing large tables with SQL"
author: "Henry Luan"
output: rmarkdown::github_document
vignette: >
%\VignetteIndexEntry{Vignette Title}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>"
)
```

## Overview

The core helper function **get_summary_codes** in the `sql.mechanic` package takes in a string that specifies a table's database name, schema name, and table name. It outputs an SQL query that summarizes the table's column statistics.

If you don't have much time to read or just want a quick solution, jump to [**Example 2**](#example-2-automatically-summarize-tables-in-your-databases).

## Intended User

- If you work with tables inside databases hosted on powerful servers (usually on-premise) but have limited compute resources (e.g. RAM, GPU, CPU) for advanced BI tools (e.g. Python, R, SAS, etc.).

- If it's much cheaper for you to use database servers than analytics servers (e.g. those for Python, R, etc.) either on-premise or on the cloud.

## Limitations

- You probably want to understand how [**Example 2**](#example-2-automatically-summarize-tables-in-your-databases)affect the CPU and disk usage of your database server, to avoid bad surprises on your server's resource usage.

- If you use the setup of [**Example 2**](#example-2-automatically-summarize-tables-in-your-databases)on a cloud database, you **MUST** do some testing, to understand how the example affects the CPU and disk usage of your cloud resource. Please avoid potentially expensive mistakes.

- Currently, the function only works with Microsoft SQL Server and Netezza databases. Feel free to contribute to the codes, if interested.

## Credits

Specail credits: the R codes in this package are built on top of Gordon S. Linoff's book *Data Analysis Using SQL and Excel, 2nd Edition*. His work has been a tremendous inspiration for the creation of this package.

## Example 1: Generate SQL queries and execute them in DMBS

### Step 1: Install packages

You can install the library from my GitHub. If you have concerns regarding the package's security, you can download, check, and use the "get_summary_codes.R" file directly.

```{r, message=FALSE, warning=FALSE}
# Install package
library(devtools)
install_github("casualcomputer/sql.mechanic",quiet=TRUE)

# Alternative: use the 'get_summary_codes.R' file only
# source("get_summary_codes.R")
```

### Step 2: Generate SQL quires

The following codes 1) generate the SQL queries you need to summarize a table, and 2) copy (Ctrl+C) the codes to your clipboard. All you have to do is paste it to your SQL editor and execute the queries.

```{r, fig.show='hold'}
library(sql.mechanic)

#SQL codes for basic summary, Netezza database
sql_basic_netezza = get_summary_codes("DB_NAME.SCHEMA_NAME.TABLE_NAME", type="basic", dbtype="Netezza")

#SQL codes for advanced summary, Netezza database
sql_advanced_netezza = get_summary_codes("DB_NAME.SCHEMA_NAME.TABLE_NAME", type="advanced", dbtype="Netezza")

#SQL codes for basic summary, Microsoft SQL Server
sql_basic_mssql = get_summary_codes("DB_NAME.SCHEMA_NAME.TABLE_NAME", type="basic", dbtype="MSSQL")

#SQL codes for advanced summary, Microsoft SQL Server
sql_advanced_mssql = get_summary_codes("DB_NAME.SCHEMA_NAME.TABLE_NAME", type="advanced", dbtype="MSSQL")

#copy some of the sql queries to the clipboard
writeClipboard(sql_basic_netezza)
```

### Step 3: Paste the codes in your clipboard and run it in SQL

In case you are curious, the SQL copied to your clipboard looks like this.

```{sql, eval = FALSE}
SELECT REPLACE(REPLACE(REPLACE(' SELECT '''' as colname,
COUNT(*) as numvalues,
MAX(freqnull) as freqnull,
CAST(MIN(minval) as CHAR(100)) as minval,
SUM(CASE WHEN = minval THEN freq ELSE 0 END) as numminvals,
CAST(MAX(maxval) as CHAR(100)) as maxval,
SUM(CASE WHEN = maxval THEN freq ELSE 0 END) as nummaxvals,
SUM(CASE WHEN freq =1 THEN 1 ELSE 0 END) as numuniques

FROM (SELECT , COUNT(*) as freq
FROM SCHEMA_NAME. GROUP BY ) osum
CROSS JOIN (SELECT MIN() as minval, MAX() as maxval, SUM(CASE WHEN IS NULL THEN 1 ELSE 0 END) as freqnull
FROM (SELECT FROM SCHEMA_NAME.) osum
) summary',
'', column_name),
'', 'TABLE_NAME'),
'',
(CASE WHEN ordinal_position = 1 THEN ''
ELSE 'UNION ALL' END)) as codes_data_summary
FROM (SELECT table_name, case when regexp_like(column_name,'[a-z.]|GROUP') then '"'||column_name||'"'
else column_name end as column_name , ordinal_position
FROM information_schema.columns
WHERE table_name ='TABLE_NAME') a;
```

### Step 4: Copy, paste and execute the query results from Step 3.

## Example 2: Automatically summarize tables in your databases {#example-2-automatically-summarize-tables-in-your-databases}

This example shows you how you can summarize tables as you did in Example 1, with only a few lines of R codes.

```{r, fig.show='hold',eval=FALSE}
# Install package
library(devtools)
install_github("casualcomputer/sql.mechanic",quiet=TRUE)

# Alternative: use the 'get_summary_codes.R' file only
# source("get_summary_codes.R")

# Load packages
library(sql.mechanic)
library(odbc)
library(DBI)

# Connect to database(s)
## Method 1: prompting user (you need to make some changes here)
con <- dbConnect(odbc(),
Driver = "SQL Server",
Server = "mysqlhost",
Database = "mydbname",
UID = "myuser",
PWD = "Database password",
Port = 1433, encoding = 'windows-1252') #'windows-1252' allows French to display properly

## Alternative Method: Using a DSN
#con <- dbConnect(odbc::odbc(), "DNS_NAMES", encoding = 'windows-1252')

sql_query = get_summary_codes("DB_NAME.SCHEMA_NAME.TABLE_NAME", type="basic", dbtype="Netezza")

res = dbSendQuery(con, sql_query) # part of Step 3 in "Example 1"
sql_query_mod = dbFetch(res) # part of Step 3 in "Example 1"

res = dbSendQuery(con, sql_query_mod) # part of Step 4 in "Example 1"
output_table = dbFetch(res) # part of Step 4 in "Example 1"
print(output) #desired summary table

dbDisconnect(con) #close database connection
```