---
output:
github_document
editor_options:
chunk_output_type: console
---

```{r, echo = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "README-"
)
```

[![Linux/OSX Build Status](https://travis-ci.org/fstpackage/synthetic.svg?branch=develop)](https://travis-ci.org/fstpackage/synthetic)
[![Windows Build status](https://ci.appveyor.com/api/projects/status/o983hmredcg3ww91/branch/develop?svg=true)](https://ci.appveyor.com/project/fstpackage/synthetic/branch/develop)
[![License: AGPL v3](https://img.shields.io/badge/License-AGPL%20v3-blue.svg)](https://www.gnu.org/licenses/agpl-3.0)
[![Lifecycle: experimental](https://img.shields.io/badge/lifecycle-experimental-blue.svg)](https://www.tidyverse.org/lifecycle/#experimental)
[![codecov](https://codecov.io/gh/fstpackage/synthetic/branch/develop/graph/badge.svg)](https://codecov.io/gh/fstpackage/synthetic)

## Overview

```{r, echo = FALSE}
set.seed(87617)
```

The `synthetic` package provides tooling to greatly simplify the creation of synthetic datasets for testing purposes. Its features include:

* Creation of _dataset templates_ that can be used to generate arbitrarily large datasets
* Creation of _column templates_ that can be used to define column data with custom range and distribution
* Automatic creation of dataset templates from existing datasets
* Many pre-defined templates to help you generate synthetic datasets with little effort
* An extended benchmark framework to help test the performance of serialization options such as `fst`, `arrow`, `fread` / `fwrite`, `sqlite`, etc.

By using a standardized method of serialization benchmarking, benchmark results become more reliable and easier to compare across solutions, as shown further down in this introduction.

## Synthetic datasets

Most `R` users will probably be familiar with the _iris_ dataset as it's widely used in package examples and tutorials:

```{r, message = FALSE}
library(dplyr)

iris %>%
  as_tibble()
```

But what if you need a million-row dataset for your purposes? The `synthetic` package makes that straightforward. Simply define a _dataset template_ using `synthetic_table()`:

```{r}
library(synthetic)

# define a synthetic table
synt_table <- synthetic_table(iris)
```

With the template, you can generate any number of rows:

```{r}
synt_table %>%
  generate(1e6)  # a million rows
```

You can also select specific columns:

```{r}
synt_table %>%
  generate(1e6, "Species")  # single column
```

## Creating your own template

If you want to generate a dataset with specific column characteristics, you can use _column templates_ to specify each column directly:

```{r}
# define a custom template
synt_table <- synthetic_table(
  Logical = template_logical(true_false_na_ratio = c(85, 10, 5)),
  Integer = template_integer(max_value = 100L),
  Real    = template_numerical_uniform(0.01, 100, max_distinct_values = 20)
  # Factor = template_string_random(5, 8)
)

synt_table %>%
  generate(10)
```

## Benchmarking serialization

Benchmarks performed with `synthetic` have the following features:

* Each measurement of serialization speed uses a unique dataset (_avoid disk caching_)
* A read is not executed immediately after a write of the same dataset (_avoid disk caching_)
* All (column-) data is generated on the fly using predefined generators (_no need to download large test sets_)
* A wide range of data profiles can be used for the creation of synthetic data (_understand dependencies on data format and profile_)
* Object and file sizes are recorded and speeds are automatically calculated (_reproducible results_)
* A progress bar shows percentage done and time remaining (_know when to go and get a cup of coffee_)
* Only the actual serialization speed is benchmarked (_measure only what must be measured_)
* Multithreaded solutions are correctly measured (_unlike some benchmark techniques_)

But most importantly, with the use of `synthetic`, complex benchmarks are reduced to a few simple statements, increasing your productivity and reproducibility!

## Walkthrough: setting up a benchmark

Many claims are made about the performance of serializers and databases, but the truth is that all solutions have their own strengths and weaknesses.

_some more text here_

Define the template of a test dataset:
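
The generator definition itself is not shown in this README, so as a minimal sketch we assume that a dataset template like the one created earlier with `synthetic_table()` can be passed to `bench_generators()` in the benchmarks below:

```{r, eval = FALSE}
# minimal sketch: reuse a dataset template as the 'generator' used in the
# benchmarks below (assumption: bench_generators() accepts such a template)
generator <- synthetic_table(iris)
```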

Do some benchmarking on the _fst_ format:

```{r, eval = FALSE}
library(dplyr)

synthetic_bench() %>%
  bench_generators(generator) %>%
  bench_streamers(streamer_fst()) %>%
  bench_rows(1e7) %>%
  collect()
```

Congratulations, that's your first structured benchmark :-)

Now, let's add a second _streamer_ and allow for two different dataset sizes:

```{r, eval = FALSE}
synthetic_bench() %>%
  bench_generators(generator) %>%
  bench_streamers(streamer_fst(), streamer_parquet()) %>%  # two streamers
  bench_rows(1e7, 5e7) %>%
  collect()
```

As you can see, although benchmarking two solutions at two different sizes is more complex than the single-solution benchmark, with `synthetic` it's just a matter of expanding some of the arguments.

Let's add two more _streamers_ and add compression settings to the mix:

```{r, eval = FALSE}
synthetic_bench() %>%
  bench_generators(generator) %>%
  bench_streamers(streamer_rds(), streamer_fst(), streamer_parquet(), streamer_feather()) %>%
  bench_rows(1e7, 5e7) %>%
  bench_compression(50, 80) %>%
  collect()
```