https://github.com/paulf-999/snowflake_synthetic_data_generator
Python scripts to create synthetic data in Snowflake based upon an input list of table schemas.
https://github.com/paulf-999/snowflake_synthetic_data_generator
fake-data fake-data-generator faker snowflake synthetic-data
Last synced: 3 months ago
JSON representation
Python scripts to create synthetic data in Snowflake based upon an input list of table schemas.
- Host: GitHub
- URL: https://github.com/paulf-999/snowflake_synthetic_data_generator
- Owner: paulf-999
- Created: 2022-11-07T03:47:19.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2023-04-04T07:28:09.000Z (about 2 years ago)
- Last Synced: 2025-01-27T23:48:22.612Z (4 months ago)
- Topics: fake-data, fake-data-generator, faker, snowflake, synthetic-data
- Language: Python
- Homepage:
- Size: 151 KB
- Stars: 3
- Watchers: 1
- Forks: 0
- Open Issues: 1
-
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
# Snowflake Synthetic Data Generator
Python scripts to generate synthetic data in Snowflake, based upon a list of input tables provided. The scripts rely on reading the input table schemas to understand the name & data types of columns, to in turn use the Python library `Faker` to generate meaningful fake data.
## Summary
* For each input table listed in `ip/config.yaml`, these scripts generate X amount of fake data records.
* Where X is based upon the value of `num_records_to_generate` in `ip/config.yaml`.
* The script reads the table schemas of each input table list provided to understand the name & data types of inputs columns, to in turn generate fake data.
* The script initially focusses on detecting the 5 groupings of Snowflake data types (see [Summary of Data Types | docs.snowflake.com](https://docs.snowflake.com/en/sql-reference/intro-summary-data-types.html)) listed below:
* String data types (i.e.: `VARCHAR`, `TEXT` and `STRING`)
* Numeric data types (i.e.: `NUMBER`, `NUMERIC` and `DECIMAL`)
* Date data types (i.e., `DATE`, `DATETIME`, `TIME`, `TIMESTAMP`, `TIMESTAMP_{LTZ}{_NTZ}{_TZ}`)
* Boolean
* Binary
* The script then inserts the generated fake data into each of the target tables.---
### Technologies Used
* `Makefile`
* `Python`
* Python packages (see `requirements.txt`):
* `snowflake-connector-python` - to query the Snowflake DB from the python script.
* `Faker` - to generate fake data (see `py/dt_data_generation.py`).
* `pandas` - to write Snowflake query output to Pandas data frames.
* `pyyaml` & `yq` - to parse input from `config.yaml`
* `cryptography` - to render an input private RSA key to query the Snowflake DB.## Prerequisites
Before you begin, ensure you have met the following requirements:
1. Install the prerequisite python packages by running: `make deps`.
2. Provide values for each of the keys in ip/config.yaml. For a description breakdown of each of the input args, see `ip/README.md`.## How-to Run
Run `make run` to:
* Write the table schemas for each input table listed underneath the `input_tbls` key in `ip/config.yaml`.
* Note: the table schema output is written to `tmp/{input_tbl}.csv`. See the python function `get_table_schema()` in `snowflake_query.py` for more details.
* Generate X amount of fake data records (based upon the value of `num_records_to_generate` in `ip/config.yaml`) for each data type within each input table
* Note: the generated fake data for each table is written to `op/{input_tbl}.csv`. See the function `generate_fake_data()` in `gen_fake_data.py` for more details.
* Insert the generated fake data into each of the target tables listed underneath the `input_tbls` key in `ip/config.yaml`.
generate_fake_data
* See the function `insert_fake_data()` in `gen_fake_data.py` for more details.---
## Credits
This is an adapted version of the following [README](https://gist.github.com/DomPizzie/7a5ff55ffa9081f2de27c315f5018afc).