{"id":14068987,"url":"https://github.com/fstpackage/synthetic","last_synced_at":"2026-01-17T17:56:56.464Z","repository":{"id":224199688,"uuid":"204672865","full_name":"fstpackage/synthetic","owner":"fstpackage","description":"R package for dataset generation and benchmarking","archived":false,"fork":false,"pushed_at":"2020-01-20T15:11:13.000Z","size":211,"stargazers_count":20,"open_issues_count":12,"forks_count":1,"subscribers_count":3,"default_branch":"develop","last_synced_at":"2024-12-04T10:38:52.309Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"R","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"agpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/fstpackage.png","metadata":{"files":{"readme":"README.Rmd","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null}},"created_at":"2019-08-27T09:52:38.000Z","updated_at":"2024-07-28T08:31:19.000Z","dependencies_parsed_at":"2024-02-24T13:54:01.140Z","dependency_job_id":null,"html_url":"https://github.com/fstpackage/synthetic","commit_stats":null,"previous_names":["fstpackage/synthetic"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/fstpackage/synthetic","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/fstpackage%2Fsynthetic","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/fstpackage%2Fsynthetic/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/fstpackage%2Fsynthetic/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/fstpackage%2Fsynthetic/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/fstpackage","download_url":"https://codeload.gi
thub.com/fstpackage/synthetic/tar.gz/refs/heads/develop","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/fstpackage%2Fsynthetic/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":267815187,"owners_count":24148356,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-07-30T02:00:09.044Z","response_time":70,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-13T07:06:31.561Z","updated_at":"2026-01-17T17:56:56.451Z","avatar_url":"https://github.com/fstpackage.png","language":"R","funding_links":[],"categories":["R"],"sub_categories":[],"readme":"---\noutput:\n  github_document\neditor_options: \n  chunk_output_type: console\n---\n\n\u003c!-- README.md is generated from README.Rmd. 
Please edit that file --\u003e\n\n```{r, echo = FALSE}\nknitr::opts_chunk$set(\n  collapse = TRUE,\n  comment = \"#\u003e\",\n  fig.path = \"README-\"\n)\n```\n\n\u003cimg src=\"synthetic.png\" align=\"right\" height=\"221\" width=\"192\" /\u003e\n\n[![Linux/OSX Build Status](https://travis-ci.org/fstpackage/synthetic.svg?branch=develop)](https://travis-ci.org/fstpackage/synthetic)\n[![Windows Build status](https://ci.appveyor.com/api/projects/status/o983hmredcg3ww91/branch/develop?svg=true)](https://ci.appveyor.com/project/fstpackage/synthetic/branch/develop)\n[![License: AGPL v3](https://img.shields.io/badge/License-AGPL%20v3-blue.svg)](https://www.gnu.org/licenses/agpl-3.0)\n[![Lifecycle: experimental](https://img.shields.io/badge/lifecycle-experimental-blue.svg)](https://www.tidyverse.org/lifecycle/#experimental)\n[![codecov](https://codecov.io/gh/fstpackage/synthetic/branch/develop/graph/badge.svg)](https://codecov.io/gh/fstpackage/synthetic)\n\n## Overview\n\n```{r, echo = FALSE}\nset.seed(87617)\n```\n\nThe `synthetic` package provides tooling to greatly simplify the creation of synthetic datasets for testing purposes. 
Its features include:\n\n* Creation of _dataset templates_ that can be used to generate arbitrarily large datasets\n* Creation of _column templates_ that can be used to define column data with custom range and distribution\n* Automatic creation of dataset templates from existing datasets\n* Many pre-defined templates to help you generate synthetic datasets with little effort\n* Extended benchmark framework to help test the performance of serialization options such as `fst`, `arrow`, `fread` / `fwrite`, `sqlite`, etc.\n\nBy using a standardized method of serialization benchmarking, benchmark results become more reliable and easier to compare across solutions, as can be seen further down in this introduction.\n\n## Synthetic datasets\n\nMost `R` users will probably be familiar with the _iris_ dataset, as it's widely used in package examples and tutorials:\n\n```{r, message = FALSE}\nlibrary(dplyr)\n\niris %\u003e%\n  as_tibble()\n```\n\nBut what if you need a million-row dataset for your purposes? The `synthetic` package makes that straightforward. 
Simply define a _dataset template_ using `synthetic_table()`:\n\n```{r}\nlibrary(synthetic)\n\n# define a synthetic table\nsynt_table \u003c- synthetic_table(iris)\n```\n\nWith the template, you can generate any number of rows:\n\n```{r}\nsynt_table %\u003e%\n  generate(1e6) # a million rows\n```\n\nYou can also select specific columns:\n\n```{r}\nsynt_table %\u003e%\n  generate(1e6, \"Species\")  # single column\n```\n\n## Creating your own template\n\nIf you want to generate a dataset with specific column characteristics, you can use _column templates_ to specify each column directly:\n\n```{r}\n# define a custom template\nsynt_table \u003c- synthetic_table(\n  Logical = template_logical(true_false_na_ratio = c(85, 10, 5)),\n  Integer = template_integer(max_value = 100L),\n  Real    = template_numerical_uniform(0.01, 100, max_distict_values = 20)\n  # ,\n  # Factor  = template_string_random(5, 8, ))\n)\n\nsynt_table %\u003e%\n  generate(10)\n```\n\n\n## Benchmarking serialization\n\nBenchmarks performed with `synthetic` have the following features:\n\n* Each measurement of serialization speed uses a unique dataset (_avoid disk caching_)\n* A read is not executed immediately after a write of the same dataset (_avoid disk caching_)\n* All column data is generated on the fly using predefined generators (_no need to download large test sets_)\n* A wide range of data profiles can be used for the creation of synthetic data (_understand dependencies on data format and profile_)\n* Object and file sizes are recorded and speeds are calculated automatically (_reproducible results_)\n* A progress bar shows percentage done and time remaining (_know when to go and get a cup of coffee_)\n* Only the actual serialization speed is benchmarked (_measure only what must be measured_)\n* Multithreaded solutions are correctly measured (_unlike some benchmark techniques_)\n\nMost importantly, with `synthetic`, complex benchmarks are reduced to a few simple 
statements, increasing your productivity and reproducibility!\n\n\n## Walkthrough: setting up a benchmark\n\nA lot of claims are made about the performance of serializers and databases, but the truth is that all solutions have their own strengths and weaknesses.\n\n_some more text here_\n\nDefine the template of a test dataset:\n\n\nDo some benchmarking on the _fst_ format:\n\n```{r, eval = FALSE}\nlibrary(dplyr)\n\nsynthetic_bench() %\u003e%\n  bench_generators(generator) %\u003e%\n  bench_streamers(streamer_fst()) %\u003e%\n  bench_rows(1e7) %\u003e%\n  collect()\n```\n\nCongratulations, that's your first structured benchmark :-)\n\nNow, let's add a second _streamer_ and allow for two different dataset sizes:\n\n```{r, eval = FALSE}\nsynthetic_bench() %\u003e%\n  bench_generators(generator) %\u003e%\n  bench_streamers(streamer_fst(), streamer_parguet()) %\u003e%  # two streamers\n  bench_rows(1e7, 5e7) %\u003e%\n  collect()\n```\n\nAs you can see, although benchmarking two solutions at different sizes is more complex than the single-solution benchmark, with `synthetic` it's just a matter of expanding some of the arguments.\n\nLet's add two more _streamers_ and add compression settings to the mix:\n\n```{r, eval = FALSE}\nsynthetic_bench() %\u003e%\n  bench_generators(generator) %\u003e%\n  bench_streamers(streamer_rds(), streamer_fst(), streamer_parguet(), streamer_feather()) %\u003e%\n  bench_rows(1e7, 5e7) %\u003e%\n  bench_compression(50, 80) %\u003e%\n  collect()\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffstpackage%2Fsynthetic","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ffstpackage%2Fsynthetic","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffstpackage%2Fsynthetic/lists"}