Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.

MangoTheCat/dbloadss: Test SQL Server Load Times
https://github.com/MangoTheCat/dbloadss
- Host: GitHub
- URL: https://github.com/MangoTheCat/dbloadss
- Owner: MangoTheCat
- License: other
- Created: 2018-04-10T14:33:36.000Z (over 6 years ago)
- Default Branch: master
- Last Pushed: 2018-06-08T14:54:48.000Z (over 6 years ago)
- Last Synced: 2024-06-05T02:33:08.072Z (5 months ago)
- Language: R
- Size: 173 KB
- Stars: 1
- Watchers: 3
- Forks: 4
- Open Issues: 0
Metadata Files:
- Readme: README.Rmd
- License: LICENSE
Awesome Lists containing this project
- jimsghstars - MangoTheCat/dbloadss - Test SQL Server Load Times (R)
README
---
output:
  md_document:
    variant: markdown_github
---

```{r, echo = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "README-"
)
```

# dbloadss
This package and its accompanying scripts test load times into MS SQL Server 17 (SS) using three methods:
1. Push from R with `RODBC`
2. Push from R with `odbc`
3. Pull from SS with a stored procedure

This package exists mainly for method 3. Methods 1 and 2 can be done from a script, although the package is a useful place to keep some functions.
# Test Database
The code for this post runs on my Windows 10 laptop, where I have a local SQL Server 17 instance running, with a database called `ml`.
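Before timing anything, it's worth checking that R can see the database. This is a minimal sketch of a connectivity check, assuming the same server and database names used in the chunks below:

```r
# Quick connectivity check against the local instance (a sketch;
# the server and database names match the chunks below).
library(DBI)
con <- dbConnect(odbc::odbc(),
                 driver   = "SQL Server",
                 server   = "localhost\\SQL17ML",
                 database = "ml")
dbGetQuery(con, "SELECT @@VERSION;")
dbDisconnect(con)
```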
# Load Times
We're going to push a big data frame into SQL Server using three methods. The data set is a simple randomly generated data frame.
```{r rand_data}
library(dbloadss)

random_data_set(n_rows = 5, n_cols = 5)
```

R is pretty quick at this sort of thing, so we don't really need to worry about how long it takes to make a big data frame.
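For reference, a generator along these lines would produce that shape of data. This is a hedged sketch, not the package's implementation; the real `random_data_set()` is exported by `dbloadss`:

```r
# Sketch of a random data generator; the actual random_data_set()
# lives in the dbloadss package. Column names follow the COL_n
# pattern declared in the stored procedure's result set below.
random_data_set <- function(n_rows, n_cols) {
  cols <- replicate(n_cols, runif(n_rows), simplify = FALSE)
  names(cols) <- paste0("COL_", seq_len(n_cols))
  as.data.frame(cols)
}
```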
```{r time_create}
n_rows <- 3000000
system.time({
random_data <- random_data_set(n_rows = n_rows, n_cols = 5)
})
```

But what we're really interested in is how fast we can push this to SQL Server.
## RODBC
`RODBC` was, for a long time, the standard way to connect to SQL Server from R. It's a great package that makes it easy to send queries and collect results, and it handles type conversions pretty well. However, it is slow at pushing data in. The fastest I could manage was the `sqlSave` function with the safeties off; I'd be very interested to hear if there's a better method. For 3m rows it's a no-go, so scaling back to 30k rows we get:
```{r rodbc}
library(RODBC)

db <- odbcDriverConnect('driver={SQL Server};server=localhost\\SQL17ML;database=ml;trusted_connection=true')
n_rodbc <- 30000
odbcQuery(db, "drop table randData;")
time30k <- system.time({
RODBC::sqlSave(
db,
dat = random_data[1:n_rodbc,],
tablename = "randData",
rownames = FALSE,
fast = TRUE,
safer = FALSE
)
})
odbcClose(db)

time30k
```

Over 2 minutes! Write time has been roughly linear for me, so the total for 3m rows would be a few hours.
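That estimate is a straight linear scale-up of the measured time, along these lines (a sketch using the objects from the chunks above):

```r
# Linear extrapolation from the measured 30k-row write to 3m rows.
# time30k, n_rows and n_rodbc come from the chunks above.
est_seconds <- time30k[["elapsed"]] * (n_rows / n_rodbc)
est_seconds / 3600  # estimated hours to push all 3m rows
```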
## ODBC
`odbc` is a relatively new package from RStudio which provides a DBI-compliant ODBC interface. It is considerably faster for writes. Here we'll push the full 3m rows.
```{r odbc}
library(DBI)

dbi <- dbConnect(odbc::odbc(),
driver = "SQL Server",
server="localhost\\SQL17ML",
database = "ml")time3modbc <- system.time({
dbWriteTable(dbi,
name = "randData",
value = random_data,
overwrite = TRUE)
})
dbDisconnect(dbi)

time3modbc
```

## SQL Server External Script
An alternative approach is to use the new features in SQL Server 17 (and 16) for calling out to R scripts from SQL. This is done via the `sp_execute_external_script` command, which we will wrap in a stored procedure. This is what that looks like for me:
```sql
use [ml];

DROP PROC IF EXISTS generate_random_data;
GO
CREATE PROC generate_random_data(@nrow int)
AS
BEGIN
EXEC sp_execute_external_script
@language = N'R'
, @script = N'
library(dbloadss)
random_data <- random_data_set(n_rows = nrow_r, n_cols = 5)
'
, @output_data_1_name = N'random_data'
, @params = N'@nrow_r int'
, @nrow_r = @nrow
WITH RESULT SETS ((
"COL_1" float not null,
"COL_2" float not null,
"COL_3" float not null,
"COL_4" float not null,
"COL_5" float not null));
END;
GO
```

We then call the stored procedure with another query (skipping a step that clears the table in between tests; a sketch of that step appears at the end of this section).
```sql
INSERT INTO randData
EXEC [dbo].[generate_random_data] @nrow = 3000000
```
![SQL Server Timer](SStime.jpg)
This runs in 34 seconds. My best guess for the performance increase is that the data is serialised more efficiently. More to investigate.
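The clearing step skipped above isn't shown in this README. As an assumption, any statement that empties `randData` between runs will do, for example issued from R over the same connection settings as earlier:

```r
# Hypothetical reset between timed runs: empty the target table.
# Connection details match the odbc chunk earlier in this README.
library(DBI)
con <- dbConnect(odbc::odbc(),
                 driver   = "SQL Server",
                 server   = "localhost\\SQL17ML",
                 database = "ml")
dbExecute(con, "TRUNCATE TABLE randData;")
dbDisconnect(con)
```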
# License
MIT © Mango Solutions