{"id":28532127,"url":"https://github.com/databrickslabs/dbldatagen","last_synced_at":"2025-07-07T13:31:15.188Z","repository":{"id":40953664,"uuid":"198418889","full_name":"databrickslabs/dbldatagen","owner":"databrickslabs","description":"Generate relevant synthetic data quickly for your projects.  The Databricks Labs synthetic data generator (aka `dbldatagen`) may be used to generate large simulated / synthetic data sets for test, POCs, and other uses in Databricks environments including in Delta Live Tables pipelines","archived":false,"fork":false,"pushed_at":"2025-05-12T21:52:23.000Z","size":11606,"stargazers_count":407,"open_issues_count":28,"forks_count":74,"subscribers_count":16,"default_branch":"master","last_synced_at":"2025-06-09T15:43:33.033Z","etag":null,"topics":["data-generation","databricks","datagen","datageneration","datagenerator","delta-live-tables","deltalake","faker","pyspark","python","spark","spark-streaming","synthetic-data"],"latest_commit_sha":null,"homepage":"https://databrickslabs.github.io/dbldatagen","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/databrickslabs.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":".github/CODEOWNERS","security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2019-07-23T11:42:56.000Z","updated_at":"2025-06-02T13:23:53.000Z","dependencies_parsed_at":"2023-10-04T06:04:05.818Z","dependency_job_id":"563d2917-639a-45f7-a988-0a920cd5b259","html_url":"https://github.com/databrickslabs/dbldatagen","commit_stats":{"total_commits":253,"total_committers":9,"mean_commits":28.11111111111111,"dds":0.2569169
960474308,"last_synced_commit":"d5ddb8e71ae5278c82ccf544facf141c7a8a7d35"},"previous_names":["databrickslabs/data-generator"],"tags_count":48,"template":false,"template_full_name":null,"purl":"pkg:github/databrickslabs/dbldatagen","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/databrickslabs%2Fdbldatagen","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/databrickslabs%2Fdbldatagen/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/databrickslabs%2Fdbldatagen/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/databrickslabs%2Fdbldatagen/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/databrickslabs","download_url":"https://codeload.github.com/databrickslabs/dbldatagen/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/databrickslabs%2Fdbldatagen/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":264085372,"owners_count":23555186,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-generation","databricks","datagen","datageneration","datagenerator","delta-live-tables","deltalake","faker","pyspark","python","spark","spark-streaming","synthetic-data"],"created_at":"2025-06-09T15:31:01.350Z","updated_at":"2025-07-07T13:31:15.179Z","avatar_url":"https://github.com/databrickslabs.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Databricks Labs Data Generator (`dbldatagen`) \n\n\u003c!-- Top bar will be 
removed from PyPi packaged versions --\u003e\n\u003c!-- Dont remove: exclude package --\u003e\n[Documentation](https://databrickslabs.github.io/dbldatagen/public_docs/index.html) |\n[Release Notes](CHANGELOG.md) |\n[Examples](examples) |\n[Tutorial](tutorial) \n\u003c!-- Dont remove: end exclude package --\u003e\n\n[![build](https://github.com/databrickslabs/dbldatagen/workflows/build/badge.svg?branch=master)](https://github.com/databrickslabs/dbldatagen/actions?query=workflow%3Abuild+branch%3Amaster)\n[![PyPi package](https://img.shields.io/pypi/v/dbldatagen?color=green)](https://pypi.org/project/dbldatagen/)\n[![codecov](https://codecov.io/gh/databrickslabs/dbldatagen/branch/master/graph/badge.svg)](https://codecov.io/gh/databrickslabs/dbldatagen)\n[![PyPi downloads](https://img.shields.io/pypi/dm/dbldatagen?label=PyPi%20Downloads)](https://pypistats.org/packages/dbldatagen)\n[![lines of code](https://tokei.rs/b1/github/databrickslabs/dbldatagen)](https://github.com/databrickslabs/dbldatagen)\n\n\u003c!-- \n[![Language grade: Python](https://img.shields.io/lgtm/grade/python/g/databrickslabs/dbldatagen.svg?logo=lgtm\u0026logoWidth=18)](https://lgtm.com/projects/g/databrickslabs/dbldatagen/context:python)\n[![downloads](https://img.shields.io/github/downloads/databrickslabs/dbldatagen/total.svg)](https://hanadigital.github.io/grev/?user=databrickslabs\u0026repo=dbldatagen)\n--\u003e\n\n## Project Description\nThe `dbldatagen` Databricks Labs project is a Python library for generating synthetic data within the Databricks \nenvironment using Spark. 
The generated data may be used for testing, benchmarking, demos, and many \nother purposes.\n\nIt operates by defining a data generation specification in code that controls \nhow the synthetic data is generated.\nThe specification may incorporate the use of existing schemas or create data in an ad-hoc fashion.\n\nIt has no dependencies on any libraries that are not already installed in the Databricks \nruntime, and you can use it from Scala, R or other languages by defining\na view over the generated data.\n\n### Feature Summary\nIt supports:\n* Generating synthetic data at scale up to billions of rows within minutes using appropriately sized clusters \n* Generating repeatable, predictable data supporting the need for producing multiple tables, Change Data Capture, \nmerge and join scenarios with consistency between primary and foreign keys\n* Generating synthetic data for all of the \nSpark SQL supported primitive types as a Spark data frame which may be persisted, \nsaved to external storage or \nused in other computations\n* Generating ranges of dates, timestamps, and numeric values\n* Generation of discrete values - both numeric and text\n* Generation of values at random and based on the values of other fields \n(either based on the `hash` of the underlying values or the values themselves)\n* Ability to specify a distribution for random data generation \n* Generating arrays of values for ML-style feature arrays\n* Applying weights to the occurrence of values\n* Generating values to conform to a schema or independent of an existing schema\n* Use of SQL expressions in synthetic data generation\n* Plugin mechanism to allow use of third-party libraries such as Faker\n* Use within a Databricks Delta Live Tables pipeline as a synthetic data generation source\n* Generating synthetic data generation code from an existing schema or data (experimental)\n* Use of standard datasets for quick generation of synthetic data\n\nDetails of these features can be found in the online 
[documentation](https://databrickslabs.github.io/dbldatagen/public_docs/index.html). \n\n## Documentation\n\nPlease refer to the [online documentation](https://databrickslabs.github.io/dbldatagen/public_docs/index.html) for \ndetails of use and many examples.\n\nRelease notes and details of the latest changes for this specific release\ncan be found in the GitHub repository\n[here](https://github.com/databrickslabs/dbldatagen/blob/release/v0.4.0post2/CHANGELOG.md).\n\n## Installation\n\nUse `pip install dbldatagen` to install the PyPI package.\n\nWithin a Databricks notebook, invoke the following in a notebook cell:\n```commandline\n%pip install dbldatagen\n```\n\nThe `pip install` command can be invoked within a Databricks notebook or a Delta Live Tables pipeline, \nand it even works on the Databricks Community Edition.\n\nThe documentation [installation notes](https://databrickslabs.github.io/dbldatagen/public_docs/installation_notes.html) \ncontain details of installation using alternative mechanisms.\n\n## Compatibility \nThe Databricks Labs Data Generator framework can be used with PySpark 3.3.0 and Python 3.9.21 or later. These are \ncompatible with the Databricks runtime 11.3 LTS and later releases. For full Unity Catalog support, \nwe recommend using Databricks runtime 13.2 or later (Databricks 13.3 LTS or above preferred).\n\nFor full library compatibility for a specific Databricks Spark release, see the Databricks \nrelease notes for library compatibility:\n\n- https://docs.databricks.com/release-notes/runtime/releases.html\n\nWhen using the Databricks Labs Data Generator on \"Unity Catalog\" enabled Databricks environments, \nthe Data Generator requires the use of `Single User` or `No Isolation Shared` access modes when using Databricks \nruntimes prior to release 13.2. This is because some needed features are not available in `Shared` \nmode (for example, use of third-party libraries, use of Python UDFs) in these releases. 
\nDepending on settings, the `Custom` access mode may be supported.\n\nUnity Catalog `Shared` access mode is supported from Databricks runtime release 13.2\nonwards.\n\nSee the following documentation for more information:\n\n- https://docs.databricks.com/data-governance/unity-catalog/compute.html\n\n## Using the Data Generator\nTo use the data generator, install the library using the `%pip install` method or install the Python wheel directly \nin your environment.\n\nOnce the library has been installed, you can use it to generate a data frame composed of synthetic data.\n\nThe easiest way to use the data generator is to use one of the standard datasets, which can be further customized\nfor your use case.\n\n```python\nimport dbldatagen as dg\n\n# Build one million rows from the standard \"basic/user\" dataset\ndf = dg.Datasets(spark, \"basic/user\").get(rows=1_000_000).build()\nnum_rows = df.count()\n```\n\nYou can also define fully custom data sets using the `DataGenerator` class.\n\nFor example:\n\n```python\nimport dbldatagen as dg\nfrom pyspark.sql.types import IntegerType, FloatType, StringType\n\ncolumn_count = 10\ndata_rows = 1000 * 1000\n\n# Each withColumn call adds one column (or numColumns columns) to the specification\ndf_spec = (dg.DataGenerator(spark, name=\"test_data_set1\", rows=data_rows,\n                            partitions=4)\n           .withIdOutput()\n           .withColumn(\"r\", FloatType(),\n                       expr=\"floor(rand() * 350) * (86400 + 3600)\",\n                       numColumns=column_count)\n           .withColumn(\"code1\", IntegerType(), minValue=100, maxValue=200)\n           .withColumn(\"code2\", IntegerType(), minValue=0, maxValue=10)\n           .withColumn(\"code3\", StringType(), values=['a', 'b', 'c'])\n           .withColumn(\"code4\", StringType(), values=['a', 'b', 'c'],\n                       random=True)\n           # weights make 'a' roughly nine times as likely as 'b' or 'c'\n           .withColumn(\"code5\", StringType(), values=['a', 'b', 'c'],\n                       random=True, weights=[9, 1, 1])\n           )\n
\ndf = df_spec.build()\nnum_rows = df.count()\n```\nRefer to the [online documentation](https://databrickslabs.github.io/dbldatagen/public_docs/index.html) for further \nexamples. \n\nThe GitHub repository also contains further examples in the `examples` directory.\n\n## Spark and Databricks Runtime Compatibility\nThe `dbldatagen` package is intended to be compatible with recent LTS versions of the Databricks runtime, including \nolder LTS versions from 11.3 LTS onward. It also aims to be compatible with Delta Live Tables runtimes, \nincluding `current` and `preview`. \n\nWhile we don't specifically drop support for older runtimes, changes in PySpark APIs or\nAPIs from dependent packages such as `numpy`, `pandas`, `pyarrow`, and `pyparsing` may cause issues with older\nruntimes. \n\nBy design, installing `dbldatagen` does not install releases of dependent packages in order \nto preserve the curated set of packages pre-installed in any Databricks runtime environment.\n\nWhen building on local environments, the build process uses the `Pipfile` and requirements files to determine \nthe package versions for releases and unit tests. \n\n## Project Support\nPlease note that all projects released under [`Databricks Labs`](https://www.databricks.com/learn/labs)\n are provided for your exploration only, and are not formally supported by Databricks with Service Level Agreements \n(SLAs).  They are provided AS-IS, and we do not make any guarantees of any kind.  Please do not submit a support ticket \nrelating to any issues arising from the use of these projects.\n\nAny issues discovered through the use of this project should be filed as issues on the GitHub Repo.  \nThey will be reviewed as time permits, but there are no formal SLAs for support.\n\n\n## Feedback\n\nIssues with the application?  Found a bug?  
Have a great idea for an addition?\nFeel free to file an [issue](https://github.com/databrickslabs/dbldatagen/issues/new).\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdatabrickslabs%2Fdbldatagen","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdatabrickslabs%2Fdbldatagen","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdatabrickslabs%2Fdbldatagen/lists"}