{"id":14066956,"url":"https://github.com/casualcomputer/sql.mechanic","last_synced_at":"2025-07-30T00:31:47.741Z","repository":{"id":90504562,"uuid":"378328099","full_name":"casualcomputer/sql.mechanic","owner":"casualcomputer","description":"Functions that generate SQL queries that summarize high-dimensional tables stored in various databases (e.g. Microsoft SQL Servers, Netezza, DB2, Postgres, Oracle, MySQL, etc.).","archived":false,"fork":false,"pushed_at":"2023-03-04T00:47:31.000Z","size":93,"stargazers_count":2,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2024-12-04T07:36:10.596Z","etag":null,"topics":["data-analysis","data-quality-checks","data-science","database","mysql","netezza","oracle","postgres","quality-control","r","sql","sql-server"],"latest_commit_sha":null,"homepage":"","language":"R","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/casualcomputer.png","metadata":{"files":{"readme":"README.Rmd","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-06-19T05:06:49.000Z","updated_at":"2023-03-03T23:17:15.000Z","dependencies_parsed_at":null,"dependency_job_id":"98ecf3d8-abf3-421b-8b5c-b46c78a45ae8","html_url":"https://github.com/casualcomputer/sql.mechanic","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/casualcomputer/sql.mechanic","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/casualcomputer%2Fsql.mechanic","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/casualcomputer%2Fsql.mechanic/t
ags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/casualcomputer%2Fsql.mechanic/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/casualcomputer%2Fsql.mechanic/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/casualcomputer","download_url":"https://codeload.github.com/casualcomputer/sql.mechanic/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/casualcomputer%2Fsql.mechanic/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":267785735,"owners_count":24144120,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-07-29T02:00:12.549Z","response_time":2574,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-analysis","data-quality-checks","data-science","database","mysql","netezza","oracle","postgres","quality-control","r","sql","sql-server"],"created_at":"2024-08-13T07:05:21.117Z","updated_at":"2025-07-30T00:31:46.378Z","avatar_url":"https://github.com/casualcomputer.png","language":"R","readme":"---\ntitle: \"Summarizing large tables with SQL\"\nauthor: \"Henry Luan\"\noutput: rmarkdown::github_document \nvignette: \u003e\n  %\\VignetteIndexEntry{Vignette Title}\n  %\\VignetteEngine{knitr::rmarkdown}\n  %\\VignetteEncoding{UTF-8}\n---\n\n```{r setup, include = FALSE}\nknitr::opts_chunk$set(\n  collapse = TRUE,\n  comment = 
\"#\u003e\"\n)\n```\n\n## Overview\n\nThe core helper function **get_summary_codes** in the `sql.mechanic` package takes in a string that specifies a table's database name, schema name, and table name. It outputs an SQL query that summarizes the table's column statistics.\n\nIf you don't have much time to read or just want a quick solution, jump to [**Example 2**](#example-2-automatically-summarize-tables-in-your-databases).\n\n## Intended User\n\n-   If you work with tables inside databases hosted on powerful servers (usually on-premise) but have limited compute resources (e.g. RAM, GPU, CPU) for advanced BI tools (e.g. Python, R, SAS, etc.).\n\n-   If it's much cheaper for you to use database servers than analytics servers (e.g. those for Python, R, etc.) either on-premise or on the cloud.\n\n## Limitations\n\n-   You probably want to understand how [**Example 2**](#example-2-automatically-summarize-tables-in-your-databases) affects the CPU and disk usage of your database server, to avoid bad surprises on your server's resource usage.\n\n-   If you use the setup of [**Example 2**](#example-2-automatically-summarize-tables-in-your-databases) on a cloud database, you **MUST** do some testing, to understand how the example affects the CPU and disk usage of your cloud resource. Please avoid potentially expensive mistakes.\n\n-   Currently, the function only works with Microsoft SQL Server and Netezza databases. Feel free to contribute to the code, if interested.\n\n## Credits\n\nSpecial credits: the R code in this package is built on top of Gordon S. Linoff's book *Data Analysis Using SQL and Excel, 2nd Edition*. His work has been a tremendous inspiration for the creation of this package.\n\n## Example 1: Generate SQL queries and execute them in a DBMS\n\n### Step 1: Install packages\n\nYou can install the library from my GitHub. 
If you have concerns regarding the package's security, you can download, check, and use the \"get_summary_codes.R\" file directly.\n\n```{r,  message=FALSE, warning=FALSE}\n# Install package\nlibrary(devtools)\ninstall_github(\"casualcomputer/sql.mechanic\",quiet=TRUE)\n\n# Alternative: use the 'get_summary_codes.R' file only\n  # source(\"get_summary_codes.R\")\n```\n\n### Step 2: Generate SQL queries\n\nThe following code 1) generates the SQL queries you need to summarize a table, and 2) copies (Ctrl+C) one of them to your clipboard. All you have to do is paste it into your SQL editor and execute it.\n\n```{r, fig.show='hold'}\nlibrary(sql.mechanic)\n\n#SQL codes for basic summary, Netezza database \n    sql_basic_netezza = get_summary_codes(\"DB_NAME.SCHEMA_NAME.TABLE_NAME\", type=\"basic\", dbtype=\"Netezza\")   \n    \n#SQL codes for advanced summary, Netezza database \n    sql_advanced_netezza = get_summary_codes(\"DB_NAME.SCHEMA_NAME.TABLE_NAME\", type=\"advanced\", dbtype=\"Netezza\")   \n    \n#SQL codes for basic summary, Microsoft SQL Server \n    sql_basic_mssql = get_summary_codes(\"DB_NAME.SCHEMA_NAME.TABLE_NAME\", type=\"basic\", dbtype=\"MSSQL\")  \n\n#SQL codes for advanced summary, Microsoft SQL Server\n    sql_advanced_mssql = get_summary_codes(\"DB_NAME.SCHEMA_NAME.TABLE_NAME\", type=\"advanced\", dbtype=\"MSSQL\")  \n    \n#copy some of the sql queries to the clipboard    \n    writeClipboard(sql_basic_netezza) \n```\n\n### Step 3: Paste the code from your clipboard and run it in your SQL editor\n\nIn case you are curious, the SQL copied to your clipboard looks like this.\n\n```{sql, eval = FALSE}\nSELECT REPLACE(REPLACE(REPLACE('\u003cstart\u003e SELECT ''\u003ccol\u003e'' as colname,\n                               COUNT(*) as numvalues,\n                               MAX(freqnull) as freqnull,\n                               CAST(MIN(minval) as CHAR(100)) as minval,\n                               SUM(CASE WHEN \u003ccol\u003e = minval THEN freq ELSE 
0 END) as numminvals,\n                               CAST(MAX(maxval) as CHAR(100)) as maxval,\n                               SUM(CASE WHEN \u003ccol\u003e = maxval THEN freq ELSE 0 END) as nummaxvals,\n                               SUM(CASE WHEN freq =1 THEN 1 ELSE 0 END) as numuniques\n\n                               FROM (SELECT \u003ccol\u003e, COUNT(*) as freq\n                               FROM SCHEMA_NAME.\u003ctab\u003e GROUP BY \u003ccol\u003e) osum\n                                                   CROSS JOIN (SELECT MIN(\u003ccol\u003e) as minval, MAX(\u003ccol\u003e) as maxval, SUM(CASE WHEN \u003ccol\u003e IS NULL THEN 1 ELSE 0 END) as freqnull\n                                                   FROM (SELECT \u003ccol\u003e FROM SCHEMA_NAME.\u003ctab\u003e) osum\n                                                   ) summary',\n                               '\u003ccol\u003e', column_name),\n                               '\u003ctab\u003e', 'TABLE_NAME'),\n                               '\u003cstart\u003e',\n                               (CASE WHEN ordinal_position = 1 THEN ''\n                               ELSE 'UNION ALL' END)) as codes_data_summary\n                               FROM (SELECT table_name, case when regexp_like(column_name,'[a-z.]|GROUP')  then '\"'||column_name||'\"'\n                                                             else column_name end as column_name  , ordinal_position\n                               FROM information_schema.columns\n                               WHERE table_name ='TABLE_NAME') a;\n```\n\n### Step 4: Copy, paste and execute the query results from Step 3.\n\n## Example 2: Automatically summarize tables in your databases {#example-2-automatically-summarize-tables-in-your-databases}\n\nThis example shows you how you can summarize tables as you did in Example 1, with only a few lines of R codes.\n\n```{r, fig.show='hold',eval=FALSE}\n# Install 
package\nlibrary(devtools)\ninstall_github(\"casualcomputer/sql.mechanic\",quiet=TRUE)\n\n# Alternative: use the 'get_summary_codes.R' file only\n  # source(\"get_summary_codes.R\")\n\n# Load packages  \nlibrary(sql.mechanic)\nlibrary(odbc)\nlibrary(DBI)\n\n# Connect to database(s)\n## Method 1: prompting user (you need to make some changes here)\n  con \u003c- dbConnect(odbc(),\n                   Driver = \"SQL Server\",\n                   Server = \"mysqlhost\",\n                   Database = \"mydbname\",\n                   UID = \"myuser\",\n                   PWD = \"Database password\",\n                   Port = 1433, encoding = 'windows-1252') #'windows-1252' allows French to display properly\n\n## Alternative Method: Using a DSN\n  #con \u003c- dbConnect(odbc::odbc(), \"DSN_NAME\", encoding = 'windows-1252')\n\nsql_query = get_summary_codes(\"DB_NAME.SCHEMA_NAME.TABLE_NAME\", type=\"basic\", dbtype=\"Netezza\") \n\nres = dbSendQuery(con, sql_query) # part of Step 3 in \"Example 1\"\nsql_query_mod = paste(dbFetch(res)[[1]], collapse = \" \") # part of Step 3: join the generated per-column queries into one statement\ndbClearResult(res) # release the result set before sending the next query\n\nres = dbSendQuery(con, sql_query_mod) # part of Step 4 in \"Example 1\"\noutput_table = dbFetch(res) # part of Step 4 in \"Example 1\"\ndbClearResult(res)\nprint(output_table) #desired summary table\n\ndbDisconnect(con) #close database connection\n```\n","funding_links":[],"categories":["R"],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcasualcomputer%2Fsql.mechanic","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcasualcomputer%2Fsql.mechanic","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcasualcomputer%2Fsql.mechanic/lists"}