https://github.com/fstpackage/synthetic
R package for dataset generation and benchmarking
- Host: GitHub
- URL: https://github.com/fstpackage/synthetic
- Owner: fstpackage
- License: agpl-3.0
- Created: 2019-08-27T09:52:38.000Z (over 5 years ago)
- Default Branch: develop
- Last Pushed: 2020-01-20T15:11:13.000Z (about 5 years ago)
- Last Synced: 2024-08-13T07:15:34.779Z (8 months ago)
- Language: R
- Homepage:
- Size: 206 KB
- Stars: 20
- Watchers: 3
- Forks: 1
- Open Issues: 12
Metadata Files:
- Readme: README.Rmd
- License: LICENSE
Awesome Lists containing this project
- jimsghstars - fstpackage/synthetic - R package for dataset generation and benchmarking (R)
README
---
output:
  github_document
editor_options:
  chunk_output_type: console
---

```{r, echo = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "README-"
)
```
[Build Status](https://travis-ci.org/fstpackage/synthetic)
[AppVeyor Build Status](https://ci.appveyor.com/project/fstpackage/synthetic/branch/develop)
[License: AGPL v3](https://www.gnu.org/licenses/agpl-3.0)
[Lifecycle: experimental](https://www.tidyverse.org/lifecycle/#experimental)
[Codecov](https://codecov.io/gh/fstpackage/synthetic)

## Overview
```{r, echo = FALSE}
set.seed(87617)
```

The `synthetic` package provides tooling to greatly simplify the creation of synthetic datasets for testing purposes. Its features include:
* Creation of _dataset templates_ that can be used to generate arbitrary large datasets
* Creation of _column templates_ that can be used to define column data with custom range and distribution
* Automatic creation of dataset templates from existing datasets
* Many pre-defined templates to help you generate synthetic datasets with little effort
* Extended benchmark framework to help test the performance of serialization solutions such as `fst`, `arrow`, `fread` / `fwrite`, `sqlite`, etc.

By using a standardized method of serialization benchmarking, benchmark results become more reliable and easier to compare across solutions, as can be seen further down in this introduction.
## Synthetic datasets
Most `R` users will probably be familiar with the _iris_ dataset as it's widely used in package examples and tutorials:
```{r, message = FALSE}
library(dplyr)

iris %>%
  as_tibble()
```

But what if you need a million-row dataset for your purposes? The `synthetic` package makes that straightforward. Simply define a _dataset template_ using `synthetic_table()`:
```{r}
library(synthetic)

# define a synthetic table
synt_table <- synthetic_table(iris)
```

With the template, you can generate any number of rows:
```{r}
synt_table %>%
  generate(1e6) # a million rows
```

You can also select specific columns:
```{r}
synt_table %>%
  generate(1e6, "Species") # single column
```

## Creating your own template

If you want to generate a dataset with specific column characteristics, you can use _column templates_ to specify each column directly:
```{r}
# define a custom template
synt_table <- synthetic_table(
  Logical = template_logical(true_false_na_ratio = c(85, 10, 5)),
  Integer = template_integer(max_value = 100L),
  Real = template_numerical_uniform(0.01, 100, max_distinct_values = 20)
)

synt_table %>%
  generate(10)
```

## Benchmarking serialization
Benchmarks performed with `synthetic` have the following features:
* Each measurement of serialization speed uses a unique dataset (_avoid disk caching_)
* A read is not executed immediately after a write of the same dataset (_avoid disk caching_)
* All (column-) data is generated on the fly using predefined generators (_no need to download large test sets_)
* A wide range of data profiles can be used for the creation of synthetic data (_understand dependencies on data format and profile_)
* Object and file sizes are recorded and speeds are calculated automatically (_reproducible results_)
* A progress bar shows percentage done and time remaining (_know when to go and get a cup of coffee_)
* Only the actual serialization speed is benchmarked (_measure only what must be measured_)
* Multithreaded solutions are measured correctly (_unlike with some benchmark techniques_)

But most importantly, with `synthetic`, complex benchmarks are reduced to a few simple statements, increasing your productivity and reproducibility!
## Walkthrough: setting up a benchmark
Many claims are made about the performance of serializers and databases, but the truth is that every solution has its own strengths and weaknesses.
_some more text here_
Define the template of a test dataset:
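No template definition appears at this point in the README, so the chunk below is a minimal sketch that reuses the column templates shown earlier. The name `generator` matches the variable used in the benchmark calls that follow, but the column choices are illustrative assumptions, not the authors' actual benchmark setup.

```{r, eval = FALSE}
# illustrative dataset template; the column choices here are assumptions,
# not the package authors' benchmark configuration
generator <- synthetic_table(
  Logical = template_logical(true_false_na_ratio = c(85, 10, 5)),
  Integer = template_integer(max_value = 100L),
  Real = template_numerical_uniform(0.01, 100, max_distinct_values = 20)
)
```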
Do some benchmarking on the _fst_ format:
```{r, eval = FALSE}
library(dplyr)

synthetic_bench() %>%
  bench_generators(generator) %>%
  bench_streamers(streamer_fst()) %>%
  bench_rows(1e7) %>%
  collect()
```

Congratulations, that's your first structured benchmark :-)
Now, let's add a second _streamer_ and allow for two different dataset sizes:
```{r, eval = FALSE}
synthetic_bench() %>%
  bench_generators(generator) %>%
  bench_streamers(streamer_fst(), streamer_parquet()) %>% # two streamers
  bench_rows(1e7, 5e7) %>%
  collect()
```

As you can see, although benchmarking two solutions at different sizes is more complex than a single-solution benchmark, with `synthetic` it's just a matter of expanding some of the arguments.
Let's add two more _streamers_ and bring compression settings into the mix:
```{r, eval = FALSE}
synthetic_bench() %>%
  bench_generators(generator) %>%
  bench_streamers(streamer_rds(), streamer_fst(), streamer_parquet(), streamer_feather()) %>%
  bench_rows(1e7, 5e7) %>%
  bench_compression(50, 80) %>%
  collect()
```
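The collected measurements can then be explored with ordinary `dplyr` verbs. The snippet below is only a sketch: it assumes `collect()` returns a plain data frame, and the column names `streamer` and `time` are hypothetical placeholders that may differ from the package's actual output.

```{r, eval = FALSE}
library(dplyr)

# hypothetical post-processing; `streamer` and `time` are assumed column
# names, not documented output of synthetic_bench()
results <- synthetic_bench() %>%
  bench_generators(generator) %>%
  bench_streamers(streamer_fst(), streamer_parquet()) %>%
  bench_rows(1e7) %>%
  collect()

# summarise the median measured time per streamer
results %>%
  group_by(streamer) %>%
  summarise(median_time = median(time))
```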