{"id":44979647,"url":"https://github.com/edanalytics/dbt_synth_data","last_synced_at":"2026-02-18T18:03:20.386Z","repository":{"id":200566041,"uuid":"536768455","full_name":"edanalytics/dbt_synth_data","owner":"edanalytics","description":"A dbt package for creating synthetic data.","archived":false,"fork":false,"pushed_at":"2025-01-16T15:54:00.000Z","size":1192,"stargazers_count":2,"open_issues_count":4,"forks_count":3,"subscribers_count":5,"default_branch":"main","last_synced_at":"2025-01-16T17:15:59.295Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/edanalytics.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-09-14T21:41:50.000Z","updated_at":"2024-07-31T12:55:49.000Z","dependencies_parsed_at":null,"dependency_job_id":"a615d35b-9a2c-443e-909f-cdd09b6cb791","html_url":"https://github.com/edanalytics/dbt_synth_data","commit_stats":null,"previous_names":["edanalytics/dbt_synth_data"],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/edanalytics/dbt_synth_data","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/edanalytics%2Fdbt_synth_data","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/edanalytics%2Fdbt_synth_data/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/edanalytics%2Fdbt_synth_data/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/edanalytics%2Fdbt_synth_data/manifests","owner_url":"https:/
/repos.ecosyste.ms/api/v1/hosts/GitHub/owners/edanalytics","download_url":"https://codeload.github.com/edanalytics/dbt_synth_data/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/edanalytics%2Fdbt_synth_data/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29588777,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-02-18T16:55:40.614Z","status":"ssl_error","status_checked_at":"2026-02-18T16:55:37.558Z","response_time":162,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2026-02-18T18:03:18.443Z","updated_at":"2026-02-18T18:03:20.377Z","avatar_url":"https://github.com/edanalytics.png","language":null,"funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003c!-- Logo/image --\u003e\n![dbt_synth_data](assets/dalle-mini_small_laptop_on_a_white_background_showing_fake_data_in_a_spreadsheet.png)\n\nThis is a [`dbt`](https://www.getdbt.com/) package for creating synthetic data. Currently it supports [Snowflake](https://www.snowflake.com/en/), [Postgres](https://www.postgresql.org/), [DuckDB](https://duckdb.org/), and [SQLite](https://www.sqlite.org/index.html) (with the [`stats` extension](https://docs.getdbt.com/reference/warehouse-setups/sqlite-setup#sqlite-extensions)). 
Other backends may be added eventually.\n\nAll the magic happens in `macros/*`.\n\n# Table of Contents  \n* [About](#about)\n* [Installation](#installation)\n* [Architecture](#architecture)\n* [Simple example](#simple-example)\n* [Distributions](#distributions)\n* [Column types](#column-types)\n* [Advanced usage](#advanced-usage)\n* [Datasets](#datasets)\n* [Performance](#performance)\n* [Changelog](#changelog)\n* [Contributing](#contributing)\n* [License](#license)\n\n\n# About\nRobert Fehrmann (CTO at Snowflake) has a couple of good blog posts about [generating random integers and strings](https://www.snowflake.com/blog/synthetic-data-generation-at-scale-part-1/) or [dates and times](https://www.snowflake.com/blog/synthetic-data-generation-at-scale-part-2/) in Snowflake, which the [basic column types](#basic-column-types) in this package emulate.\n\nHowever, creating more realistic synthetic data requires more complex data [column types](#column-types), advanced random [distributions](#distributions), supporting [datasets](#datasets), and [references](#reference-column-types) to other models.\n\n`dbt_synth_data` provides many macros to facilitate building out realistic synthetic data. It builds up a series of CTEs and joins from a base of randomly-generated values - see [Architecture](#architecture) for details. `dbt_synth_data` is powerful, especially on Snowflake - it can create billions of rows and hundreds of GB of data. See [Performance](#performance) for details.\n\n## Philosophy\nThere are generally two approaches to creating synthetic or \"fake\" data:\n1. start with real data, de-identify it, and possibly \"fuzz\" or \"jitter\" some values\n1. start with nothing and synthesize data by describing it, including distributions and correlations in the data\n\n(Recent research has proposed a hybrid approach, where a \"nearby\" or similar synthetic data row (2) is selected for each real, de-identified row (1)... 
but adequately defining \"nearby\" is difficult.)\n\nApproach (1) can be dangerous, susceptible to re-identification and other adversarial attacks. `dbt_synth_data` implements approach (2) *only*.\n\n## Intended Use\nSynthetic data generated with `dbt_synth_data` can be useful for testing user interfaces, demoing applications, performance-tuning operational systems, preparing training and other materials with realistic data, and potentially other uses.\n\n## Limitations\nThe synthetic data created using `dbt_synth_data` should not be mistaken for fully realistic data that reflects all correlations that may be present in the real world. Therefore **please do not use data generated using this package to train ML models!**\n\n## Supported backends\nThis package currently supports the following backends:\n* `snowflake` (with `pip install dbt-snowflake`)\n* `postgres` (with `pip install dbt-postgres`)\n* `sqlite` (with `pip install dbt-sqlite`)\n* `duckdb` (with `pip install dbt-duckdb`)\n\n\n# Installation\n1. add `dbt_synth_data` to your `packages.yml`\n1. run `dbt deps`\n1. run `dbt seed`\n1. add `\"dbt_packages/dbt_synth/macros\"` to your `dbt_project.yml`'s `macro-paths`\n1. build your synthetic models as documented below\n1. run `dbt run`\n\n\n\n# Architecture\n\nCTEs, joins, and fields defined by `synth_column_*()` are temporarily stored in `dbt`'s [`target` object](https://docs.getdbt.com/reference/dbt-jinja-functions/target) during parse/run time, as this is one of the few dbt objects that persist and are scoped across `macro`s. 
Finally, `synth_table()` stitches everything together into a query of the general form\n```sql\n-- [various CTEs as required for selecting seed data or values from other models]\nbase as (\n    select\n        -- base CTE includes a row_number, which facilitates generating integer or date sequences, primary keys, and more\n        row_number() over (order by 1) as __row_number\n    from table(generator( rowcount =\u003e [rows] )) -- snowflake\n    -- from generate_series( 1, [rows] ) as s(idx) -- postgres, sqlite\n),\njoin0 as (\n     select\n        base.__row_number,\n        -- randomness source fields, such as\n        UNIFORM(0::float, 1::float, RANDOM()) as field1__rand, -- snowflake\n        -- RANDOM() as field1__rand, -- postgres\n        -- [similar *__rand fields for other columns as required]\n    from base\n),\n-- arbitrarily many further joins to the CTEs defined above\njoinN as (\n    select\n        join[N-1].*, -- all fields from prior joins, plus:\n        CTEx.field2,\n        CTEx.field3\n    from join[N-1]\n        left join CTEx on ... -- something involving join[N-1].*__rand\n),\nsynth_table as (\n    select\n        field1,\n        field2,\n        field3\n        -- only the fields we actually want to keep in the final table\n        -- (intermediate fields, including *__rand, are dropped)\n    from joinN\n)\n```\n**Note:** with SQLite, the behavior of `random()` within CTEs and joins is non-deterministic, due to how the query optimizer works - see [this link](https://stackoverflow.com/questions/64328853/sqlite-random-function-in-cte) for details. Therefore, on SQLite only, temporary tables are created (`CREATE TEMP TABLE ...`) instead of most of the CTEs mentioned above. Only the final `synth_table` CTE is created, so the `with` syntax shown below still works. 
Temporary tables are deleted when the `dbt run` completes.\n\n\n# Simple example\nConsider the example model `orders.sql` below:\n```sql\nwith\n{{ synth_column_primary_key(name='order_id') }}\n{{ synth_column_foreign_key(name='product_id', model_name='products', column='product_id') }}\n{{ synth_column_distribution(name='status', \n    distribution=synth_distribution(class='discrete', type='probabilities',\n        probabilities={\"New\":0.2, \"Shipped\":0.5, \"Returned\":0.2, \"Lost\":0.1}\n    )\n) }}\n{{ synth_column_integer(name='quantity', min=1, max=10) }}\n{{ synth_table(rows = 5000) }}\nselect * from synth_table\n```\nThe model begins by defining the columns we want in the table, including:\n* `order_id` is the primary key on the table - it will contain a unique hash value per row\n* `product_id` is a foreign key to the `products` table - values in this column will be uniformly-distributed, valid primary keys of the `products` table\n* each order has a `status` with several possible values, whose prevalence/likelihoods are given by a discrete probability distribution\n* `quantity` is the number of units of the product ordered, a uniformly-distributed integer from 1-10\n\nThen a CTE called `synth_table` with 5000 rows of synthetic data is created and we select the results.\n\nNote that the user must provide the opening `with` for CTEs and a final `select * from synth_table` - this allows flexibility to add your own CTEs at the top or bottom of the model, as well as arbitrary post-processing of columns produced by `dbt_synth_data` - see [Advanced Usage](#advanced-usage) for more details.\n\n\n\n# Distributions\nThis package provides the following distributions:\n\n### Continuous Distributions\n\u003cdetails\u003e\n\u003csummary\u003e\u003ccode\u003euniform\u003c/code\u003e\u003c/summary\u003e\n\nGenerates [uniformly-distributed](https://en.wikipedia.org/wiki/Continuous_uniform_distribution) real numbers.\n```python\n    
synth_distribution_continuous_uniform(min=0.6, max=7.9)\n```\nDefault `min` is `0.0`. Default `max` is `1.0`. `min` and `max` are inclusive.\n\n![Example of continuous uniform distribution](/assets/continuous_uniform.png)\n**Above:** Histogram of a continuous uniform distribution (1M values).\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003ccode\u003enormal\u003c/code\u003e\u003c/summary\u003e\n\nGenerates [normally-distributed (Gaussian)](https://en.wikipedia.org/wiki/Normal_distribution) real numbers.\n```python\n    synth_distribution_continuous_normal(mean=5, stddev=0.5)\n```\nDefault `mean` is `0.0`, default `stddev` is `1.0`.\n\n![Example of continuous normal distribution](/assets/continuous_normal.png)\n**Above:** Histogram of a continuous normal distribution (1M values).\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003ccode\u003eexponential\u003c/code\u003e\u003c/summary\u003e\n\nGenerates [exponentially-distributed](https://en.wikipedia.org/wiki/Exponential_distribution) real numbers.\n```python\n    synth_distribution_continuous_exponential(lambda=5.0)\n```\nDefault `lambda` is `1.0`.\n\n![Example of continuous exponential distribution](/assets/continuous_exponential.png)\n**Above:** Histogram of a continuous exponential distribution (1M values).\n\u003c/details\u003e\n\n\n### Discrete Distributions\n\u003cdetails\u003e\n\u003csummary\u003e\u003ccode\u003ebernoulli\u003c/code\u003e\u003c/summary\u003e\n\nGenerates integers (`0` and `1`) according to a [Bernoulli distribution](https://en.wikipedia.org/wiki/Bernoulli_distribution).\n```python\n    synth_distribution_discrete_bernoulli(p=0.3)\n```\nDefault `p` is `0.5`.\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003ccode\u003ebinomial\u003c/code\u003e\u003c/summary\u003e\n\nGenerates integers according to a [Binomial distribution](https://en.wikipedia.org/wiki/Binomial_distribution).\n```python\n    synth_distribution_discrete_binomial(n=100, 
p=0.3)\n```\nDefault `n` is `10`, default `p` is `0.5`.\n\nNote that the implementation is approximate, based on a normal distribution (see [here](https://en.wikipedia.org/wiki/Binomial_distribution#Normal_approximation)). For small `n` or `p` near `0` or `1`, normally-distributed values may be `\u003c 0` or `\u003e n`, which is impossible in a binomial distribution. These long-tail values are rare, so, while not completely correct, we use\n* `abs()` to shift those `\u003c 0`\n* `mod(..., n+1)` to shift those `\u003e n`\n\nThis may artificially increase the frequency of small values. However, the approximation is close if `n*p` and `n*(1-p)` are large.\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003ccode\u003eweights\u003c/code\u003e\u003c/summary\u003e\n\nGenerates discrete values according to user-defined weights.\n```python\n    synth_distribution_discrete_weights(values=[1,3,5,7,9], weights=[1,1,6,3,1])\n```\n`values` is a required list of strings, floats, or integers; it has no default.\n\n`weights` is an optional list of integers. Its length should be the same as the length of `values`. If `weights` is omitted, each of the `values` will be equally likely. Otherwise, the integers indicate likelihood; in the example above, the value `5` will be about six times as prevalent as the value `9`.\n\nAvoid using `weights` with a large sum; this will generate long `case` statements which can run slowly.\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003ccode\u003eprobabilities\u003c/code\u003e\u003c/summary\u003e\n\nGenerates discrete values according to a user-defined probability set.\n```python\n    synth_distribution_discrete_probabilities(probabilities={\"1\":0.15, \"5\":0.5, \"8\": 0.35})\n```\n`probabilities` is required and has no default. 
It may be\n* a list (array) such as `[0.05, 0.8, 0.15]`, in which case the (zero-based) indices are the integer values generated\n* or a dictionary (key-value) structure such as `{ \"1\":0.05, \"3\":0.8, \"7\":0.15 }` with integer keys (specified as strings in order to be valid JSON), in which case the keys are the integers generated\n\nYou may also specify string or float keys in your `probabilities` dict to generate those values instead of integers; however, you must specify the additional parameter `keys_type=\"varchar\"` (or similar) so that the value types are correct. For example:\n```python\n    synth_distribution_discrete_probabilities(probabilities={\"cat\":0.3, \"dog\":0.5, \"parrot\":0.2}, keys_type=\"varchar\")\n```\n\n`probabilities` must sum to `1.0`.\n\nNote that, because values are generated using `case` statements, supplying `probabilities` with many digits of precision will run slower. For example, `probabilities=[0.1, 0.3, 0.6]` will generate something like\n```sql\ncase floor( 10*random() )\n    when 0 then 0\n    when 1 then 1\n    when 2 then 1\n    when 3 then 1\n    when 4 then 2\n    when 5 then 2\n    when 6 then 2\n    when 7 then 2\n    when 8 then 2\n    when 9 then 2\nend\n```\nwhile `probabilities=[0.101, 0.301, 0.598]` will generate something like\n```sql\ncase floor( 1000*random() )\n    when 0 then 0\n    when ...\n    when 99 then 0\n    when 100 then 0\n    when 101 then 1\n    when ...\n    when 400 then 1\n    when 401 then 1\n    when 402 then 2\n    when 403 then 2\n    ...\n    when 998 then 2\n    when 999 then 2\nend\n```\nwhich takes longer for the database engine to evaluate.\n\nIn practice, avoid specifying `probabilities` with more than 4 digits of precision.\n\u003c/details\u003e\n\n## Discretizing Continuous Distributions\n\nAny of the [continuous distributions](#continuous-distributions) listed above can be made discrete using the following 
mechanisms:\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003ccode\u003ediscretize_floor\u003c/code\u003e\u003c/summary\u003e\n\nConverts values from [continuous distributions](#continuous-distributions) to (discrete) integers by applying the `floor()` function.\n```python\n    synth_distribution_discretize_floor(\n        distribution=synth_distribution(class='...', type='...', ...),\n    )\n```\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003ccode\u003ediscretize_ceil\u003c/code\u003e\u003c/summary\u003e\n\nConverts values from [continuous distributions](#continuous-distributions) to (discrete) integers by applying the `ceil()` function.\n```python\n    synth_distribution_discretize_ceil(\n        distribution=synth_distribution(class='...', type='...', ...),\n    )\n```\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003ccode\u003ediscretize_round\u003c/code\u003e\u003c/summary\u003e\n\nConverts values from [continuous distributions](#continuous-distributions) to discrete values by applying the `round()` function.\n```python\n    synth_distribution_discretize_round(\n        distribution=synth_distribution(class='...', type='...', ...),\n        precision=0\n    )\n```\n`precision` indicates the number of digits to round to.\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003ccode\u003ediscretize_width_bucket\u003c/code\u003e\u003c/summary\u003e\n\n**Note** that SQLite doesn't support `width_bucket()`; you will get an error if you try to use this function on SQLite.\n\nConverts values from [continuous distributions](#continuous-distributions) to discrete values by bucketing them. Buckets are specified by `from` and `to` bounds and either `count` (the number of buckets) or `size` (the target bucket size).\n\nFor some distributions (like `uniform`), the bounds may be strict - values outside the bounds are impossible. 
For other distributions (like `exponential`), specifying strict `from` and `to` bounds may be difficult. For this reason, if `strict_bounds=False`, the first bucket (index `0`) will represent values below `from`. Likewise, the last bucket (index `count`) will represent values above `to`. (`strict_bounds` defaults to `True`.) It is up to you to choose reasonable and useful `from` and `to` bounds for discretization.\n\n`labels` may be \n* unspecified, in which case values will be mapped to the (1-based) bucket index (0-based if `strict_bounds=False`)\n* the string \"lower_bound\", in which case values will be mapped to the lower bound of the bucket (or `-Infinity` for the first bucket, if `strict_bounds=False`)\n* the string \"upper_bound\", in which case values will be mapped to the upper bound of the bucket (or `+Infinity` for the last bucket, if `strict_bounds=False`)\n* the string \"bucket_range\", in which case values will be mapped to a string of the format \"[lower_bound] - [upper_bound]\" for each bucket (`lower_bound` may be `-Infinity` and `upper_bound` may be `Infinity` if `strict_bounds=False`)\n  * optionally specify the `bucket_range_separator` string that separates the upper and lower bucket bounds (default is \" - \")\n* the string \"bucket_average\", in which case values will be mapped to the bucket midpoint, i.e. the average of its bounds (or `from` for the first bucket and `to` for the last bucket, if `strict_bounds=False`)\n* a list of (string or numeric) bucket labels (the list must be equal in length to the number of buckets)\n\nFor all but the last option, you may optionally specify a `label_precision`, which is the number of digits to which bounds are rounded. 
(Default is `4`.)\n\n**Examples:**\n```python\n    synth_distribution_discretize_width_bucket(\n        distribution=synth_distribution(class='...', type='...', ...),\n        from=0.0, to=1.5, count=20, labels='lower_bound'\n    )\n```\n```python\n    synth_distribution_discretize_width_bucket(\n        distribution=synth_distribution(class='...', type='...', ...),\n        from=0.0, to=1.5, size=0.1\n    )\n```\n```python\n    synth_distribution_discretize_width_bucket(\n        distribution=synth_distribution(class='...', type='...', ...),\n        from=0.0, to=1.5, count=5, strict_bounds=False,\n        labels=['\u003c 0.0', '0.0 to 0.5', '0.5 to 1.0', '1.0 to 1.5', '\u003e 1.5']\n    )\n```\n\u003c/details\u003e\n\n\n## Constructing Complex Distributions\nThis package provides the following mechanisms for composing several distributions:\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003ccode\u003eunion\u003c/code\u003e\u003c/summary\u003e\n\nGenerates values from several distributions with optional `weights`. If `weights` is omitted, each distribution is equally likely.\n```python\n    {{ synth_distribution_union(\n        synth_distribution(class='...', type='...', ...),\n        synth_distribution(class='...', type='...', ...),\n        weights=[1, 2, ...]\n    ) }}\n```\nUp to 10 distributions may be unioned. 
(Compose the macro to union more.)\n\nFor example, make a [bimodal distribution](https://en.wikipedia.org/wiki/Multimodal_distribution) as follows:\n```python\n{{ synth_table(\n  rows = 100000,\n  columns = [\n    synth_column_distribution(name=\"continuous_bimodal\",\n        distribution=synth_distribution_union(\n            synth_distribution(class='continuous', type='normal', mean=5.0, stddev=1.0),\n            synth_distribution(class='continuous', type='normal', mean=8.0, stddev=1.0),\n            weights=[1, 2]\n        )\n    ),\n  ]\n) }}\n{{ config(post_hook=synth_get_post_hooks())}}\n```\nHere, values will come from the union of the two normal distributions, with the second distribution twice as likely as the first.\n\n![Example of continuous bimodal distribution](/assets/continuous_bimodal.png)\n**Above:** Histogram of a continuous bimodal distribution composed of the union of two normal distributions (1M values).\n\n![Example of union of continuous normal distributions](/assets/continuous_union_normals.png)\n**Above:** Histogram of the union of three continuous normal distributions (1M values).\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003ccode\u003eaverage\u003c/code\u003e\u003c/summary\u003e\n\nGenerates values from the (optionally weighted) average of values from several distributions. If `weights` is omitted, each distribution contributes equally to the average.\n```python\n    {{ synth_distribution_average(\n        synth_distribution(class='...', type='...', ...),\n        synth_distribution(class='...', type='...', ...),\n        weights=[1, 2, ...]\n    ) }}\n```\nUp to 10 distributions may be averaged. 
(Compose the macro to average more.)\n\n![Example of continuous average distribution](/assets/continuous_average_exponential_normal.png)\n**Above:** Histogram of a continuous average distribution composed of a normal and an exponential distribution (1M values).\n\u003c/details\u003e\n\n\n## Making distributions configurable\n`dbt` doesn't allow macro calls in [project `vars`](https://docs.getdbt.com/docs/build/project-variables), but `dbt_synth_data` gets around this limitation and allows you to configure distributions in your `vars` and then parse and use them in your models. Consider the following example:\n\n```yaml\n...\nvars:\n  teacher_student_ratio:\n    synth_distribution_union():\n      d0:\n        synth_distribution_continuous_normal():\n          mean: 15\n          stddev: 5\n      d1:\n        synth_distribution_continuous_normal():\n          mean: 20\n          stddev: 5\n      weights: [1, 2]\n```\nYou can use this distribution via `synth_var()` in `models/schools.sql` as follows:\n```sql\nwith\n{{ synth_column_primary_key(name='school_id') }}\n{{ synth_column_integer(name=\"current_enrollment\", min=100, max=2000) }}\n{{ synth_column_distribution(name='teacher_student_ratio', \n    distribution=synth_var('teacher_student_ratio')\n) }}\n{{ synth_column_integer(name='year_founded', min=1937, max=2022) }}\n{{ synth_table(rows = 500) }}\nselect * from synth_table\n```\n\nBesides using `synth_distribution_union()` and `synth_distribution_average()`, you can also combine and compose distributions using `synth_expression()` like so:\n```yaml\n...\nvars:\n  teacher_student_ratio:\n    synth_expression:\n      expression: greatest( 5, 10 + $0 + ln($1) )\n      p0:\n        synth_distribution_continuous_normal():\n          mean: 5\n          stddev: 1.5\n      p1:\n        synth_distribution_continuous_normal():\n          mean: 10\n          stddev: 2\n```\n\n\n# Column types\nThis package provides the following column types:\n\n\n## Basic column 
types\nBasic column types, which are quite performant.\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003ccode\u003eboolean\u003c/code\u003e\u003c/summary\u003e\n\nGenerates boolean values.\n```python\n{{ synth_column_boolean(name=\"is_complete\", pct_true=0.2) }}\n```\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003ccode\u003einteger\u003c/code\u003e\u003c/summary\u003e\n\nGenerates integer values.\n\nFor uniformly-distributed values, simply specify `min` and `max`:\n```python\n{{ synth_column_integer(name=\"event_year\", min=2000, max=2020) }}\n```\n\nFor non-uniformly-distributed values, specify a discretized distribution:\n```python\n{{ synth_column_distribution(name=\"event_year\",\n    distribution=synth_distribution_discretize_floor(\n        distribution=synth_distribution_continuous_normal(mean=2010, stddev=2.5)\n    )\n) }}\n```\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003ccode\u003enumeric\u003c/code\u003e\u003c/summary\u003e\n\nGenerates numeric values.\n```python\n{{ synth_column_numeric(name=\"price\", min=1.99, max=999.99, precision=2) }}\n```\n\nFor non-uniformly-distributed values, specify a distribution rounded to the desired `precision`:\n```python\n{{ synth_column_distribution(name=\"price\",\n    distribution=synth_distribution_discretize_round(\n        distribution=synth_distribution_continuous_normal(mean=500, stddev=180),\n        precision=2\n    )\n) }}\n```\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003ccode\u003estring\u003c/code\u003e\u003c/summary\u003e\n\nGenerates random strings.\n```python\n{{ synth_column_string(name=\"password\", min_length=10, max_length=20) }}\n```\nString characters will include `A-Z`, `a-z`, and `0-9`.\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003ccode\u003edate\u003c/code\u003e\u003c/summary\u003e\n\nGenerates date values.\n```python\n{{ synth_column_date(name=\"birth_date\", min='1938-01-01', 
max='1994-12-31') }}\n```\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003ccode\u003einteger sequence\u003c/code\u003e\u003c/summary\u003e\n\nGenerates an integer sequence (value is incremented at each row).\n```python\n{{ synth_column_integer_sequence(name=\"day_of_year\", step=1, start=1) }}\n```\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003ccode\u003edate sequence\u003c/code\u003e\u003c/summary\u003e\n\nGenerates a date sequence.\n```python\n{{ synth_column_date_sequence(name=\"calendar_date\", start_date='2020-08-10', step=3)}}\n```\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003ccode\u003eprimary key\u003c/code\u003e\u003c/summary\u003e\n\nGenerates a primary key column. (Values are distinct hash strings.)\n```python\n{{ synth_column_primary_key(name=\"product_id\") }}\n```\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003ccode\u003evalue\u003c/code\u003e\u003c/summary\u003e\n\nGenerates the same (single, static) value for every row.\n```python\n{{ synth_column_value(name=\"is_registered\", value='Yes') }}\n```\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003ccode\u003evalues\u003c/code\u003e\u003c/summary\u003e\n\nGenerates values from a list of possible values, with optional probability weighting.\n```python\n{{ synth_column_values(name=\"academic_subject\",\n    values=['Mathematics', 'Science', 'English Language Arts', 'Social Studies'],\n    probabilities=[0.2, 0.3, 0.15, 0.35]\n) }}\n```\nIf `probabilities` are omitted, every value is equally likely.\n\n(Uses `synth_distribution_discrete_probabilities()` under the hood.)\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003ccode\u003eexpression\u003c/code\u003e\u003c/summary\u003e\n\nGenerates values based on an expression (which may refer to other columns, or invoke SQL functions).\n```python\n{{ synth_column_expression(name='week_of_calendar_year',\n    
expression=\"DATE_PART('week', calendar_date)::int\"\n) }}\n```\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003ccode\u003emapping\u003c/code\u003e\u003c/summary\u003e\n\nGenerates values by mapping from an `expression` involving existing columns to values in a dictionary.\n```python\n{{ synth_column_mapping(name='day_type', expression='is_school_day',\n    mapping=({ true:'Instructional day', false:'Non-instructional day' })\n) }}\n```\n\u003c/details\u003e\n\n\n## Statistical column types\nStatistical column types can be used to create advanced statistical relationships between tables and columns.\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003ccode\u003ecorrelation\u003c/code\u003e\u003c/summary\u003e\n\nGenerates two or more columns with correlated values.\n```python\n{% set birthyear_grade_correlations = ({\n    \"columns\": {\n        \"birth_year\": [ 2010, 2009, 2008, 2007, 2006, 2005, 2004 ],\n        \"grade\": [ 'Eighth grade', 'Ninth grade', 'Tenth grade', 'Eleventh grade', 'Twelfth grade' ]\n    },\n    \"probabilities\": [\n        [ 0.02, 0.00, 0.00, 0.00, 0.00 ],\n        [ 0.15, 0.02, 0.00, 0.00, 0.00 ],\n        [ 0.03, 0.15, 0.02, 0.00, 0.00 ],\n        [ 0.00, 0.03, 0.15, 0.02, 0.00 ],\n        [ 0.00, 0.00, 0.03, 0.15, 0.02 ],\n        [ 0.00, 0.00, 0.00, 0.03, 0.15 ],\n        [ 0.00, 0.00, 0.00, 0.00, 0.03 ]\n    ]\n    })\n%}\nwith\n{{ synth_column_primary_key(name='k_student') }}\n{{ synth_column_correlation(data=birthyear_grade_correlations, column='birth_year') }}\n{{ synth_column_correlation(data=birthyear_grade_correlations, column='grade') }}\n{{ synth_table(rows=var('num_students')) }}\nselect * from synth_table\n```\nTo create correlated columns, you must specify a `data` object representing the correlation, which contains\n* `columns`, a mapping of column names to their possible values.\n* `probabilities`, a hypercube with dimension equal to the number of `columns`, the elements of which sum to `1.0`, indicating 
the probability of each possible combination of values for the `columns`. The outermost elements of the `probabilities` hypercube correspond to the values of the first column; the innermost elements of the hypercube correspond to the values of the last column. The size of each dimension of the hypercube must equal the number of values for its corresponding column.\n\nConstructing a `probabilities` hypercube of dimension more than two or three can be difficult; we recommend adding (temporary) comments and using indentation to keep track of columns, values, and dimensions.\n\u003c/details\u003e\n\n\n## Reference column types\nColumn types which reference values in other columns of the same or a different table.\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003ccode\u003eforeign key\u003c/code\u003e\u003c/summary\u003e\n\nGenerates values that are a primary key of another table.\n```python\n{{ synth_column_foreign_key(name='product_id', model_name='products', column='id') }}\n```\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003ccode\u003elookup\u003c/code\u003e\u003c/summary\u003e\n\nGenerates values based on looking up values from one column in another table.\n```python\n{{ synth_column_lookup(name='gender', model_name='synth_firstnames', value_cols='first_name', from_col='name', to_col='gender', do_ref=True) }}\n```\n`do_ref` defaults to true, meaning that `model_name` will be wrapped in dbt's `{{ ref(model_name) }}`. 
However, you can set `do_ref=False` to reference a local CTE instead.\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003ccode\u003eselect\u003c/code\u003e\u003c/summary\u003e\n\nGenerates values by selecting them from another table, optionally weighted using a specified column of the other table.\n```python\n{{ synth_column_select(\n    name='random_adjective',\n    model_name=\"synth_words\",\n    value_cols=\"word\",\n    distribution=\"weighted\",\n    weight_col=\"prevalence\",\n    filter=\"part_of_speech like '%ADJ%'\",\n    do_ref=True\n) }}\n```\nThe above will generate randomly-chosen adjectives (based on the specified `filter`), weighted by prevalence.\n\n`do_ref` defaults to true, meaning that `model_name` will be wrapped in dbt's `{{ ref(model_name) }}`. However, you can set `do_ref=False` to reference a local CTE instead.\n\u003c/details\u003e\n\n\n## Advanced column types\nAdvanced column types use real-world data which is maintained in the `seeds/` directory. Some effort has been made to make these data sets:\n* **Generalized**, rather than specific to a particular country, region, language, etc. For example, the *words* dictionary contains common words from many common languages, not just English.\n* **Statistically rich**, with associated metadata which makes the data more useful by capturing various distributions embedded in the data. For example, the *countries* list includes the (approximate) population and land area of each country, which facilitates generating country lists weighted according to these features. Likewise, the *cities* list has the latitude and longitude coordinates for each city, which facilitates generating fairly realistic coordinates for synthetic addresses.\n\nAdvanced column types may all specify a `distribution=\"weighted\"` and `weight_col=\"population\"` (or similar) to skew value distributions. 
They may also specify `filter`, which is a SQL `where` expression narrowing down the pool of data values that will be used. Finally, they may specify a `filter_expressions` dictionary which allows dynamic filtering based on expressions which can involve row values from other columns. If, for example, we are creating a country column and pass `filter_expressions` as\n```json\n{\n    \"country_name\": \"INITCAP(my_country_col)\",\n    \"geo_region_code\": \"my_geo_region_col\"\n}\n```\nthen a `WHERE` clause like this will result:\n```sql\nsynth_countries.country_name=INITCAP(my_country_col)\nAND synth_countries.geo_region_code=my_geo_region_col\n```\n(`filter_expressions` and `filter` - if any - are combined via logical `AND`.)\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003ccode\u003ecity\u003c/code\u003e\u003c/summary\u003e\n\nGenerates a city, selected from the `synth_cities` seed table.\n```python\n{{ synth_column_city(name='city', distribution=\"weighted\", weight_col=\"population\", filter=\"timezone like 'Europe/%'\") }}\n```\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003ccode\u003egeo region\u003c/code\u003e\u003c/summary\u003e\n\nGenerates a geo region (state, province, or territory), selected from the `synth_geo_regions` seed table.\n```python\n{{ synth_column_geo_region(name='geo_region', distribution=\"weighted\", weight_col=\"population\", filter=\"country='United States'\") }}\n```\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003ccode\u003ecountry\u003c/code\u003e\u003c/summary\u003e\n\nGenerates a country, selected from the `synth_countries` seed table.\n```python\n{{ synth_column_country(name='country', distribution=\"weighted\", weight_col=\"population\", filter=\"continent='Europe'\") }}\n```\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003ccode\u003efirst name\u003c/code\u003e\u003c/summary\u003e\n\nGenerates a first name, selected from the `synth_firstnames` seed 
table.\n```python\n{{ synth_column_firstname(name='first_name', filter=\"gender='Male'\") }}\n```\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003ccode\u003elast name\u003c/code\u003e\u003c/summary\u003e\n\nGenerates a last name, selected from the `synth_lastnames` seed table.\n```python\n{{ synth_column_lastname(name='last_name') }}\n```\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003ccode\u003eword\u003c/code\u003e\u003c/summary\u003e\n\nGenerates a single word, selected from the `synth_words` seed table.\n```python\n{{ synth_column_word(name='random_word', language_code=\"en\", distribution=\"weighted\", pos=[\"NOUN\", \"VERB\"], filter=\"LENGTH(word)\u003e3\") }}\n```\nThe above generates a randomly-selected English noun or verb, weighted according to frequency, of at least four characters.\n\nRather than `language_code` you may specify `language` (such as `language=\"English\"`), but a language *must* be specified with one of these parameters. 
See [Words (Datasets)](#words) for a list of supported languages and parts of speech.\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003ccode\u003ewords\u003c/code\u003e\u003c/summary\u003e\n\nGenerates several words, selected from the `synth_words` seed table.\n```python\n{{ synth_column_words(name='random_phrase', language_code=\"en\", distribution=\"uniform\", n=5) }}\n```\nThe above generates a random string of five words, uniformly distributed, with the first letter of each word capitalized.\n\nAlternatively, you can generate words using format strings, for example\n```python\n{{ synth_column_words(name='course_title', language_code=\"en\", distribution=\"uniform\", format_strings=[\n    \"{ADV} learning for {ADJ} {NOUN}s\",\n    \"{ADV} {VERB} {NOUN} course\"\n]) }}\n```\nThis will generate sets of words according to one of the format strings you specify.\n\nNote that this data type is constructed by separately generating a single word `n` times (or, for `format_strings`, the set union of all word instances from any format string) and then concatenating them together, which can be slow if `n` is large (or you have many tokens in your `format_strings`).\n\nRather than `language_code` you may specify `language` (such as `language=\"English\"`), but a language *must* be specified with one of these parameters. See [Words (Datasets)](#words) for a list of supported languages and parts of speech.\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003ccode\u003elanguage\u003c/code\u003e\u003c/summary\u003e\n\nGenerates a spoken language (name or 2- or 3-letter code), selected from the `synth_languages` seed table.\n```python\n{{ synth_column_language(name='random_lang', type=\"name\", distribution=\"weighted\") }}\n```\nThe optional `type` (which defaults to `name`) can take values `name` (the full English name of the language, e.g. *Spanish*), `code2` (the ISO 639-1 two-letter code for the language, e.g. 
`es`), or `code3` (the ISO 639-3 three-letter code for the language, e.g. `spa`).\n\u003c/details\u003e\n\n\n## Composite column types\nComposite column types put together several other column types into a more complex data type.\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003ccode\u003eaddress\u003c/code\u003e\u003c/summary\u003e\n\nGenerates an address, based on `city`, `geo region`, `country`, `words`, and other values.\n\nCreating a column `myaddress` using this macro will also create intermediate columns `myaddress__street_address`, `myaddress__city`, `myaddress__geo_region`, and `myaddress__postal_code` (or whatever `parts` you specify). You can then add `add_update_hook()`s that reference these intermediate columns if you'd like. For example:\n```python\n{{ synth_column_primary_key(name='k_person') }}\n{{ synth_column_firstname(name='first_name') }}\n{{ synth_column_lastname(name='last_name') }}\n{{ synth_column_address(name='home_address', countries=['United States'],\n    parts=['street_address', 'city', 'geo_region', 'country', 'postal_code']) }}\n{{ synth_column_expression(name='home_address_street', expression=\"home_address__street_address\") }}\n{{ synth_column_expression(name='home_address_city', expression=\"home_address__city\") }}\n{{ synth_column_expression(name='home_address_geo_region', expression=\"home_address__geo_region\") }}\n{{ synth_column_expression(name='home_address_country', expression=\"home_address__country\") }}\n{{ synth_column_expression(name='home_address_postal_code', expression=\"home_address__postal_code\") }}\n\n{{ synth_table(rows = 100) }}\n{{ synth_add_cleanup_hook(\"alter table {{this}} drop column home_address\") or \"\" }}\n```\n\nAlternatively, you may use something like\n\n```python\n{{ synth_column_primary_key(name='k_person') }}\n{{ synth_column_firstname(name='first_name') }}\n{{ synth_column_lastname(name='last_name') }}\n{{ synth_column_address(name='home_address_street', countries=['United States'], 
parts=['street_address']) }}\n{{ synth_column_address(name='home_address_city', countries=['United States'], parts=['city']) }}\n{{ synth_column_address(name='home_address_geo_region', countries=['United States'], parts=['geo_region']) }}\n{{ synth_column_address(name='home_address_country', countries=['United States'], parts=['country']) }}\n{{ synth_column_address(name='home_address_postal_code', countries=['United States'], parts=['postal_code']) }}\n{{ synth_table(rows = 100) }}\n```\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003ccode\u003ephone_number\u003c/code\u003e\u003c/summary\u003e\n\nGenerates a phone number in the format `(123) 456-7890`.\n\n```python\n{{ synth_column_phone_number(name=\"phone_number\") }}\n```\n\u003c/details\u003e\n\n\n# Advanced usage\n\n## Combining columns with expressions\nOccasionally you may want to build up a more complex column's values from several simpler ones. This is easily done with an expression column, for example:\n```sql\n{{ synth_column_primary_key(name=\"k_person\") }}\n{{ synth_column_firstname(name='first_name') }}\n{{ synth_column_lastname(name='last_name') }}\n{{ synth_column_expression(name='full_name', expression=\"first_name || ' ' || last_name\") }}\n{{ synth_remove(collection=\"final_fields\", key=\"first_name\") }}\n{{ synth_remove(collection=\"final_fields\", key=\"last_name\") }}\n{{ synth_table(rows = 100) }}\n```\nNote that you may want to \"clean up\" by suppressing some of your intermediate columns, as shown with the `synth_remove()` calls in the example above.\n\n## Creating temporary columns\nYou may also want to modify another table *only after this one is built*. This is also possible using cleanup hooks.\n\nFor example, suppose you want to create `products` and `orders`, but you want some `products` to be exponentially more popular (that is, to receive more `orders`) than others. This is possible by\n1. 
creating a `products` model with an extra popularity column:\n    ```sql\n    {{ synth_column_primary_key(name=\"k_product\") }}\n    {{ synth_column_string(name=\"name\", min_length=10, max_length=20) }}\n    {{ synth_column_distribution(name=\"popularity\",\n        distribution=synth_distribution(class='continuous', type='exponential', lambda=0.05)\n    ) }}\n    {{ synth_table(rows=50) }}\n    ```\n1. creating an `orders` model with a `synth_column_select()` to `products` using your popularity column, then use a cleanup hook to drop the `popularity` column:\n    ```sql\n    {{ synth_column_primary_key(name=\"k_order\") }}\n    {{ synth_column_select(name=\"k_product\", lookup_table=\"products\", \n        value_col=\"k_product\", distribution=\"weighted\", weight_col=\"popularity\") }}\n    {{ synth_column_distribution(name=\"status\",\n        distribution=synth_distribution(class='discrete', type='probabilities',\n            probabilities={\"New\":0.2, \"Shipped\":0.5, \"Returned\":0.2, \"Lost\":0.1}\n        )\n    ) }}\n    {{ synth_column_integer(name=\"num_ordered\", min=1, max=10) }}\n\n    {{ synth_add_cleanup_hook(\n        'alter table {{target.database}}.{{target.schema}}.products drop column popularity'\n    ) }}\n\n    {{ synth_table(rows=5000) }}\n    ```\n    Note that the cleanup hook *must* go after any column definitions that rely on it, and before the `synth_table()` call.\n\n## Random seed\nWith Snowflake only (not other backends), you can [specify a random seed](https://docs.snowflake.com/en/sql-reference/functions/random#arguments). This package uses the dbt var `{{ var(\"synth_randseed\") }}` (which defaults to `10000`) and increments it each time `random()` is called. [Snowflake asserts](https://docs.snowflake.com/en/sql-reference/functions/random#usage-notes) that even with a fixed seed, \"there is no guarantee that RANDOM will generate the same set of values each time\"; however in our testing it generally does. 
This means that (1) repeated `dbt run`s with the same seed will likely generate the same or similar data and (2) if you want new or different data, you should consider changing the `synth_randseed` var.\n\n## Configurable distributions\n`dbt` allows configuration to be defined in the `vars` section of your `dbt_project.yml`, but dynamic values are not supported (they must be numbers, strings, lists, or dictionaries, but not macro invocations). However, it can be very useful to make various distributions in your synthetic data configurable. This is possible by defining them in the `vars` section using a specific format and then referencing them using the `synth_var()` macro provided by this package.\n\nFor example, in your `dbt_project.yml`:\n```yaml\n...\nvars:\n  my_complicated_custom_distribution:\n    synth_distribution_discretize_ceil():\n      distribution:\n        synth_expression():\n          # this ensures that the value is \u003e= 1\n          expression: greatest(1, 1 + $0)\n          p0:\n            # average of an exponential and normal distribution\n            # result is a skewed distribution, peaking around 1000\n            synth_distribution_average():\n              d0:\n                synth_distribution_continuous_exponential():\n                  lambda: 0.0002\n              d1:\n                synth_distribution_continuous_normal():\n                  mean: 1100\n                  stddev: 400\n              weights: [1,2]\n```\nand then in your model:\n```sql\nwith\n...\n{{ synth_column_distribution(name=\"my_column\",\n    distribution=synth_var('my_complicated_custom_distribution')\n) }}\n{{ synth_table(rows=1000) }}\n```\nWhen defining `vars` this way:\n* reference a macro by name, with `()` at the end\n* you may only reference macros for available [distributions](#distributions) and [discretizations](#discretizing-continuous-distributions)\n* macro parameters must be passed by name\n* macro invocations may be nested arbitrarily deep\n* values may 
be combined using `synth_expression()` with parameters `expression` and `p0` up to `p9` which `expression` references as `$0` up to `$9`\n\n\n# Datasets\n\n## Words\nThe word list in `seeds/synth_words.csv` contains 70k words \u0026ndash; the top 5k most common words from each of the following 14 languages:\n* Bulgarian (`bg`)\n* Czech (`cs`)\n* Danish (`da`)\n* Dutch (`nl`)\n* English (`en`)\n* Finnish (`fi`)\n* French (`fr`)\n* German (`de`)\n* Hungarian (`hu`)\n* Indonesian (`id`)\n* Italian (`it`)\n* Portuguese (`pt`)\n* Slovenian (`sv`)\n* Spanish (`es`)\n\nWith each word is associated a **frequency**, which is a value between 0 and 1 representing the frequency with which the word appears in common usage of the language, and a **part of speech** for the word, which is one of:\n* ADJ: adjective\n* ADP: adposition\n* ADV: adverb\n* AUX: auxiliary verb\n* CONJ: coordinating conjunction\n* DET: determiner\n* INTJ: interjection\n* NOUN: noun\n* NUM: numeral\n* PART: particle\n* PRON: pronoun\n* PROPN: proper noun\n* PUNCT: punctuation\n* SCONJ: subordinating conjunction\n* SYM: symbol\n* VERB: verb\n* X: other\n\nSome words may functionally belong to multiple parts of speech; this dataset uses only the single most common.\n\nThe dataset is constructed based on word lists and frequencies from [`wordfreq`](https://github.com/rspeer/wordfreq) and part-of-speech tagging from [`polyglot`](https://polyglot.readthedocs.io/en/latest/POS.html). 
Language availability is based on the set intersection of the languages supported by these two libraries.\n\nYou may run into an error when loading this data using `dbt seed` on SQLite. [An issue](https://github.com/codeforkjeff/dbt-sqlite/issues/35) has been raised with the `dbt-sqlite` adapter to solve this; in the meantime, you must manually reduce the seed batch size to load `synth_words` in SQLite.\n\n## Languages\nThe language list in `seeds/synth_languages.csv` contains 222 commonly-spoken (living) languages. For each language, it gives the ISO 639-1 and ISO 639-3 language codes, the approximate number of speakers, and a list of countries in which the language is predominantly spoken. Country names are consistent with those in the countries dataset at `seeds/synth_countries.csv`.\n\nThe dataset is assembled primarily from Wikipedia, including [this list of official languages by country](https://en.wikipedia.org/wiki/List_of_official_languages_by_country_and_territory), and the specific pages for each individual language.\n\n\n# Performance\nHere we provide approximate benchmarks for synthetic data generation, using the models found in `example_models/*.sql`, for the various supported backends.\n\n| Model | Columns | Rows | Snowflake runtime, size | Postgres runtime, size | SQLite runtime, size | DuckDB runtime, size |\n| --- | --- | --- | --- | --- | --- | --- |\n| [distributions](https://github.com/edanalytics/dbt_synth_data/blob/main/example_models/distributions.sql) | 13-15 |  10k |    1.95s, 804KB |    0.77s, 1.7MB |   0.29s, 1.13MB |  0.20s, 1.76MB |\n| [distributions](https://github.com/edanalytics/dbt_synth_data/blob/main/example_models/distributions.sql) | 13-15 |   1M |    7.15s,  73MB |    8.93s, 166MB |   8.70s,  115MB |  16.0s,  189MB |\n| [distributions](https://github.com/edanalytics/dbt_synth_data/blob/main/example_models/distributions.sql) | 13-15 | 100M |   66.19s, 7.2GB | 14.76min,  16GB | 16.6min, 11.2GB |              - |\n| 
[distributions](https://github.com/edanalytics/dbt_synth_data/blob/main/example_models/distributions.sql) | 13-15 |  10B |  95.5min, 765GB |               - |               - |              - |\n|  |  |  |  |  |  |  |\n| [columns](https://github.com/edanalytics/dbt_synth_data/blob/main/example_models/columns.sql)             |    28 |  10k |   20.2s,  2.2MB |   6.5min, 4.6MB |  37.26s, 3.92MB |  0.82s, 2.25MB |\n| [columns](https://github.com/edanalytics/dbt_synth_data/blob/main/example_models/columns.sql)             |    28 | 100k |   69.0s, 21.2MB |  64.9min,  46MB |  9.3min, 39.1MB | 12.44s, 18.2MB |\n| [columns](https://github.com/edanalytics/dbt_synth_data/blob/main/example_models/columns.sql)             |    28 |   1M | 10.2min,  109MB |               - | 77.3min,  392MB |              - |\n| [columns](https://github.com/edanalytics/dbt_synth_data/blob/main/example_models/columns.sql)             |    28 |  10M | 27.3min,  654MB |               - |               - |              - |\n|  |  |  |  |  |  |  |\n| [customers](https://github.com/edanalytics/dbt_synth_data/blob/main/example_models/customers.sql)         |     8 |  100 |   7.07s, 36.5KB |    1.34s,  32KB |   0.67s,   10KB |  0.20s,  1.0MB |\n| [products](https://github.com/edanalytics/dbt_synth_data/blob/main/example_models/products.sql)           |     3 |   50 |   4.01s, 16.0KB |    1.09s,  16KB |   0.43s,    4KB |  0.11s,  256KB |\n| [stores](https://github.com/edanalytics/dbt_synth_data/blob/main/example_models/stores.sql)               |     5 |    2 |   4.96s,  4.0KB |    0.68s,  16KB |   0.45s,    4KB |  0.11s,  256KB |\n| [orders](https://github.com/edanalytics/dbt_synth_data/blob/main/example_models/orders.sql)               |     4 | 1000 |   5.26s, 59.5KB |    0.66s, 120KB |   0.26s,   24KB |  0.14s,  256KB |\n| [inventory](https://github.com/edanalytics/dbt_synth_data/blob/main/example_models/inventory.sql)         |     4 |  100 |   2.76s, 21.5KB |    0.58s,  24KB |   0.20s,    4KB |  
0.13s,  256KB |\n|  |  |  |  |  |  |  |\n| [customers](https://github.com/edanalytics/dbt_synth_data/blob/main/example_models/customers.sql)         |     8 |  10k |   4.89s,  960KB |   58.11s, 1.7MB |   8.09s, 1.16MB |  0.43s,  2.0MB |\n| [products](https://github.com/edanalytics/dbt_synth_data/blob/main/example_models/products.sql)           |     3 |   5k |   2.57s,  275KB |   41.33s, 544KB |   3.63s,  248KB |  0.25s,  1.0MB |\n| [stores](https://github.com/edanalytics/dbt_synth_data/blob/main/example_models/stores.sql)               |     5 |  200 |   2.25s,   32KB |    1.84s,  40KB |   0.79s,   20KB |  0.18s,  1.3MB |\n| [orders](https://github.com/edanalytics/dbt_synth_data/blob/main/example_models/orders.sql)               |     4 | 100k |   3.63s,  5.3MB |  36.2min,  10MB |  19.52s,  2.2MB |  0.76s,  2.3MB |\n| [inventory](https://github.com/edanalytics/dbt_synth_data/blob/main/example_models/inventory.sql)         |     4 |   1M |  18.75s, 60.3MB |  35.9min, 134MB |  3.6min, 18.7MB |  19.3s, 25.9MB |\n|  |  |  |  |  |  |  |\n| [customers](https://github.com/edanalytics/dbt_synth_data/blob/main/example_models/customers.sql)         |    8 |    1M |  58.75s, 57.6MB |   1.55hr, 163MB | 11.0min,  118MB | 67.09s, 68.5MB |\n| [products](https://github.com/edanalytics/dbt_synth_data/blob/main/example_models/products.sql)           |    3 |   50k |  11.51s,  2.4MB |  6.76min, 4.9MB |  33.54s, 2.49MB | 0.56s,  2.75MB |\n| [stores](https://github.com/edanalytics/dbt_synth_data/blob/main/example_models/stores.sql)               |    5 |   20k |   3.54s,  1.3MB |  1.86min, 2.5MB |  12.82s, 1.56MB | 0.28s,   2.0MB |\n| [orders](https://github.com/edanalytics/dbt_synth_data/blob/main/example_models/orders.sql)               |    4 |   50M |  2.24hr,  1.0GB |               - |               - |              - |\n| [inventory](https://github.com/edanalytics/dbt_synth_data/blob/main/example_models/inventory.sql)         |    4 |  100M |   6.3hr,  2.5GB |               - |  
             - |              - |\n\nMissing values in the table above denote either failed runs (DuckDB kills a process that uses too much memory) or runs that took too long (much more than a couple of hours).\n\nSnowflake runtimes are using a single Xsmall warehouse. Postgres runtimes are using an AWS RDS small instance. SQLite and DuckDB runtimes are using a Lenovo laptop with Intel i-5 2.6GHz processor, 16GB RAM, and 500GB SSD.\n\n## Performance comments\nSome takeaways from the above data include\n* generating *large* data (\u003e 50 GB) is really only possible using Snowflake\n* generating *small* data (\u003c 1GB) is usually fastest using DuckDB or SQLite\n* model complexity (number of columns, and especially joins/references to other tables) significantly influences runtime\n\n\n# Changelog\nComing soon!\n\n\n\n# Contributing\nBugfixes and new features (such as additional transformation operations) are gratefully accepted via pull requests here on GitHub.\n\n## Contributions\n* Cover image created with [DALL \u0026bull; E mini](https://huggingface.co/spaces/dalle-mini/dalle-mini)\n\n\n\n# License\nSee [License](LICENSE).\n\n\n\n# Todo\n- [ ] fix address so it selects a city, then uses the country (and geo_region) for that city, rather than a (different) random country (and geo_region)\n- [ ] implement other [distributions](#distributions)... Poisson, Gamma, Power law/Pareto, Multinomial?\n- [ ] flesh out more seeds (and corresponding data columns) and composite columns (email address, IP address, user agent strings, file_name, URL, etc.)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fedanalytics%2Fdbt_synth_data","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fedanalytics%2Fdbt_synth_data","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fedanalytics%2Fdbt_synth_data/lists"}