{"id":49956373,"url":"https://github.com/hyangminj/ddl2data","last_synced_at":"2026-05-18T00:13:03.061Z","repository":{"id":348638966,"uuid":"1196806243","full_name":"hyangminj/ddl2data","owner":"hyangminj","description":"Turn any SQL schema into realistic test data — instantly","archived":false,"fork":false,"pushed_at":"2026-04-13T15:50:27.000Z","size":98,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-04-13T17:41:26.306Z","etag":null,"topics":["bigquery","cli","data-engineering","date-testing","dynamodb","etl","faker","postgresql","python","rds","synthetic-data","test-data-generation"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/hyangminj.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-03-31T04:00:08.000Z","updated_at":"2026-04-13T15:50:30.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/hyangminj/ddl2data","commit_stats":null,"previous_names":["hyangminj/datagen","hyangminj/ddl2data"],"tags_count":5,"template":false,"template_full_name":null,"purl":"pkg:github/hyangminj/ddl2data","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hyangminj%2Fddl2data","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hyangminj%2Fddl2data/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hyangminj%2Fddl2data/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hyangminj%2Fddl2data/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/hyangminj","download_url":"https://codeload.github.com/hyangminj/ddl2data/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hyangminj%2Fddl2data/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33160176,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-17T22:39:12.733Z","status":"ssl_error","status_checked_at":"2026-05-17T22:39:10.741Z","response_time":107,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bigquery","cli","data-engineering","date-testing","dynamodb","etl","faker","postgresql","python","rds","synthetic-data","test-data-generation"],"created_at":"2026-05-18T00:12:58.326Z","updated_at":"2026-05-18T00:13:03.044Z","avatar_url":"https://github.com/hyangminj.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# ddl2data\n\nTurn any SQL schema into realistic test data — instantly.\n\n- PyPI package: `ddl2data`\n- CLI command: `ddl2data`\n- Python module: `ddl2data`\n\n`ddl2data` turns SQL DDL, a live relational schema, or a live DynamoDB table schema into fake but structured data that is useful for load tests, end-to-end pipeline checks, local development, and staging environment seeding.\n\nFor relational schemas it parses tables and foreign keys, generates rows in parent-before-child order, applies type-aware defaults plus optional distributions, validates the generated data, and writes it out as SQL, JSON, CSV, or Parquet. It can also insert generated rows directly into a relational database through SQLAlchemy. For DynamoDB it can load key and index metadata from a real table and emit typed DynamoDB JSON payloads.\n\n---\n\n## Why use it\n\nUse `ddl2data` when you need to:\n\n- stress a pipeline or service with large synthetic datasets\n- generate relational fixtures from existing DDL\n- seed staging or local environments while preserving foreign-key order\n- inspect a live schema and generate data without hand-writing metadata\n- generate DynamoDB-shaped typed JSON from a real table definition\n- export the same dataset shape in SQL, JSON, CSV, Parquet, or DynamoDB JSON\n- produce validation reports before loading data elsewhere\n\n---\n\n## How it works\n\n```text\nSchema input\n  -\u003e parse DDL, inspect a live relational DB, or inspect a DynamoDB table\n  -\u003e build metadata and dependency order\n  -\u003e generate rows with Faker and optional distributions\n  -\u003e validate FK, null, unique, and optional CHECK constraints\n  -\u003e write SQL/JSON/CSV/Parquet/DynamoDB JSON or insert into a DB\n```\n\nCore modules:\n\n- `ddl2data/parser/ddl.py`: SQL DDL parsing via `sqlglot`\n- `ddl2data/parser/introspect.py`: relational schema introspection via SQLAlchemy\n- `ddl2data/parser/dynamodb.py`: DynamoDB schema loading via `boto3`\n- `ddl2data/generator/base.py`: main row generation engine\n- `ddl2data/generator/dist.py`: distribution parsing and sampling\n- `ddl2data/writer/`: SQL, JSON, CSV, Parquet, and DynamoDB JSON writers\n- `ddl2data/validation.py` and `ddl2data/report.py`: validation and reporting\n\n---\n\n## Core capabilities\n\n- Input sources:\n  - SQL DDL via `--ddl`\n  - live relational schema introspection via `--schema-from-db --db-url ...`\n  - live DynamoDB table schema via `--schema-from-dynamodb --dynamodb-table ...`\n- Relationship-aware generation:\n  - foreign-key dependency graph\n  - parent tables generated before child tables\n  - FK-only tables can use the Polars generation path\n- Output formats:\n  - PostgreSQL `INSERT`\n  - MySQL `INSERT`\n  - SQLite `INSERT`\n  - BigQuery `INSERT` and `INSERT ALL`\n  - JSON\n  - CSV\n  - Parquet\n  - DynamoDB typed JSON\n- Validation and reporting:\n  - FK integrity checks\n  - non-null checks\n  - unique collision checks\n  - optional CHECK validation with `--strict-checks`\n  - JSON report output via `--report-path`\n- Generation controls:\n  - per-table row overrides with `--table-rows`\n  - reproducible runs with `--seed`\n  - config-file driven runs with JSON, TOML, or YAML\n  - per-column and per-table distribution overrides with `--dist`\n  - optional `python` or `polars` engine selection\n\n---\n\n## Install\n\nRecommended for CLI use:\n\n```bash\npipx install ddl2data\n```\n\nOr install with `pip`:\n\n```bash\npip install ddl2data\n```\n\nOptional Polars support:\n\n```bash\npip install \"ddl2data[polars]\"\n```\n\nDynamoDB schema loading requires `boto3`:\n\n```bash\npip install boto3\n```\n\nFrom source:\n\n```bash\ngit clone https://github.com/hyangminj/ddl2data.git\ncd ddl2data\npython3 -m venv .venv\n. .venv/bin/activate\n.venv/bin/python -m pip install -e .\n```\n\nContributor setup with test and Polars extras:\n\n```bash\n.venv/bin/python -m pip install -e \".[test,polars]\"\n```\n\nSanity check:\n\n```bash\nddl2data --help\n```\n\nGitHub Releases also include wheel (`.whl`) and source (`.tar.gz`) artifacts.\n\n---\n\n## Quick start\n\nMinimal schema:\n\n```sql\nCREATE TABLE users (\n  id INT PRIMARY KEY,\n  email VARCHAR(100) UNIQUE NOT NULL,\n  age INT,\n  tier VARCHAR(10)\n);\n\nCREATE TABLE orders (\n  id INT PRIMARY KEY,\n  user_id INT NOT NULL,\n  amount NUMERIC,\n  FOREIGN KEY (user_id) REFERENCES users(id)\n);\n```\n\nGenerate JSON:\n\n```bash\nddl2data --ddl schema.sql --rows 100 --out json --output-path data.json\n```\n\nGenerate PostgreSQL inserts:\n\n```bash\nddl2data --ddl schema.sql --rows 100 --out postgres --output-path seed.sql\n```\n\nGenerate CSV files, one per table:\n\n```bash\nddl2data --ddl schema.sql --rows 100 --out csv --output-path ./csv_out\n```\n\nWith this schema, `users` is generated first and `orders.user_id` is filled from generated `users.id` values.\n\n---\n\n## Common workflows\n\n### Generate SQL for different targets\n\n```bash\nddl2data --ddl schema.sql --rows 100 --out postgres\nddl2data --ddl schema.sql --rows 100 --out mysql\nddl2data --ddl schema.sql --rows 100 --out sqlite\nddl2data --ddl schema.sql --rows 100 --out bigquery\n```\n\n### Read schema directly from a relational database\n\n```bash\nddl2data \\\n  --schema-from-db \\\n  --db-url postgresql+psycopg://user:pass@localhost:5432/mydb \\\n  --rows 100 \\\n  --out postgres\n```\n\nLimit introspection to selected tables:\n\n```bash\nddl2data \\\n  --schema-from-db \\\n  --db-url postgresql+psycopg://user:pass@localhost:5432/mydb \\\n  --tables users,orders,events \\\n  --rows 100 \\\n  --out json\n```\n\n### Insert generated rows directly into a relational database\n\n```bash\nddl2data \\\n  --ddl schema.sql \\\n  --rows 1000 \\\n  --insert \\\n  --db-url postgresql+psycopg://user:pass@localhost:5432/mydb\n```\n\n### Generate DynamoDB typed JSON from a live table schema\n\n```bash\nddl2data \\\n  --schema-from-dynamodb \\\n  --dynamodb-table users \\\n  --dynamodb-region us-east-1 \\\n  --dynamodb-extra-attr email:string \\\n  --dynamodb-extra-attr score:int \\\n  --rows 100 \\\n  --out dynamodb-json \\\n  --output-path users.jsonl\n```\n\nNotes:\n\n- key attributes and GSI or LSI key attributes are inferred from the live table\n- use `--dynamodb-extra-attr name:type` to add non-key fields to generated output\n- supported extra attribute aliases include `string`, `uuid`, `date`, `datetime`, `numeric`, `float`, `int`, and `boolean`\n\n### Control row counts per table\n\n```bash\nddl2data \\\n  --ddl schema.sql \\\n  --rows 100 \\\n  --table-rows users=20,orders=500 \\\n  --table-rows events=2000 \\\n  --out json\n```\n\n- `--rows` stays the global default\n- `--table-rows table=count` overrides individual tables\n- config files can use a `table_rows` map\n\n### Generate a validation report\n\n```bash\nddl2data \\\n  --ddl schema.sql \\\n  --rows 500 \\\n  --strict-checks \\\n  --report-path report.json \\\n  --out json \\\n  --output-path data.json\n```\n\nThe report includes counts and sample issues for:\n\n- FK violations\n- non-null violations\n- unique collisions\n- supported CHECK violations when `--strict-checks` is enabled\n\n### Use Parquet or the Polars engine\n\nParquet output:\n\n```bash\nddl2data \\\n  --ddl schema.sql \\\n  --rows 100 \\\n  --out parquet \\\n  --parquet-compression zstd \\\n  --output-path ./parquet_out\n```\n\nPolars generation and write path:\n\n```bash\nddl2data \\\n  --ddl schema.sql \\\n  --rows 100000 \\\n  --out csv \\\n  --engine polars \\\n  --output-path ./csv_out\n```\n\nEngine behavior:\n\n- tables with CHECK constraints fall back to the Python row-wise generator\n- tables with unique or primary-key constraints fall back to the Python row-wise generator\n- FK-only tables can remain on the Polars path\n- `--out parquet` requires the optional `polars` dependency regardless of `--engine`\n\n### BigQuery-specific output\n\n```bash\nddl2data --ddl schema.sql --rows 100 --out bigquery --output-path out.sql\nddl2data --ddl schema.sql --rows 100 --out bigquery --bq-insert-all --output-path out_insert_all.sql\n```\n\nBigQuery output supports:\n\n- dataset-qualified table names like `dataset.table`\n- `INSERT ALL` rendering via `--bq-insert-all`\n- typed `DATE` and `TIMESTAMP` literals\n\n---\n\n## Config file example\n\n```bash\nddl2data --config ddl2data.toml\n```\n\nExample `ddl2data.toml`:\n\n```toml\nddl = \"schema.sql\"\nrows = 500\nout = \"postgres\"\nseed = 42\nstrict_checks = true\ntable_rows = { users = 100, orders = 500, events = 2000 }\ndist = [\n  \"users.age:normal,mean=33,std=7\",\n  \"orders.amount:pareto,alpha=1.7,xm=1\",\n]\n```\n\nCLI arguments override config file values.\n\n---\n\n## Distribution overrides\n\nSyntax:\n\n```text\n--dist \u003ccolumn_or_table.column\u003e:\u003ckind\u003e,k=v,k=v\n```\n\nSupported distribution kinds:\n\n- `normal`\n- `poisson`\n- `weighted`\n- `exponential`\n- `pareto`\n- `zipf`\n- `peak`\n\nExamples:\n\n```bash\n# global column rule\n--dist age:normal,mean=35,std=7\n\n# table-qualified rule (higher priority than global)\n--dist users.age:normal,mean=30,std=6\n\n# poisson counts\n--dist daily_orders:poisson,lambda=3\n\n# weighted categorical values\n--dist tier:weighted,A=60%,B=30%,C=10%\n\n# heavy-tail behavior\n--dist amount:pareto,alpha=1.5,xm=1\n--dist category_rank:zipf,skew=1.8,n=200\n\n# peak-hour timestamps\n--dist created_at:peak,hours=9-11,18-20\n```\n\nPriority when both exist:\n\n1. `table.column`\n2. `column`\n\n---\n\n## CHECK-aware generation\n\n`ddl2data` uses practical heuristics for common CHECK constraints during generation and can optionally validate them again with `--strict-checks`.\n\nSupported forms include:\n\n- numeric comparison: `age \u003e= 18`, `qty \u003e 0`, `score \u003c= 100`, `x != 0`\n- ranges: `price BETWEEN 10 AND 20`\n- enum lists: `status IN ('A', 'B', 'C')`\n- regex-like checks: `code ~ '^[A-Z]{2}$'`, `REGEXP_LIKE(code, '^[0-9]+$')`\n- nested compound forms built from supported expressions with `AND` and `OR`\n\nThis is intentionally best-effort, not a full SQL expression engine.\n\n---\n\n## Reproducible runs\n\nUse `--seed` to stabilize Python random and Faker output:\n\n```bash\nddl2data --ddl schema.sql --rows 100 --seed 42 --out json\n```\n\n---\n\n## Development and testing\n\n### Local setup\n\n```bash\npython3 -m venv .venv\n. .venv/bin/activate\n.venv/bin/python -m pip install -e \".[test,polars]\"\n```\n\nIf you only need the core package:\n\n```bash\n.venv/bin/python -m pip install -e .\n```\n\nIf you need DynamoDB schema loading outside the test extra:\n\n```bash\n.venv/bin/python -m pip install boto3\n```\n\n### Useful test commands\n\nRun the full suite:\n\n```bash\n.venv/bin/python -m pytest\n```\n\nRun only non-integration tests:\n\n```bash\n.venv/bin/python -m pytest -m \"not integration\"\n```\n\nRun one file:\n\n```bash\n.venv/bin/python -m pytest tests/test_parser_graph.py\n```\n\nRun one integration target:\n\n```bash\n.venv/bin/python -m pytest tests/test_integration_postgres.py\n.venv/bin/python -m pytest tests/test_integration_dynamodb.py\n.venv/bin/python -m pytest tests/test_integration_bigquery.py\n```\n\nAvailable markers:\n\n- `integration`\n- `postgres`\n- `dynamodb`\n- `bigquery`\n\n### Local integration services\n\nThe repo includes `docker-compose.yml` for PostgreSQL and LocalStack DynamoDB:\n\n```bash\ndocker compose up -d postgres localstack\ndocker compose ps\n```\n\nService summary:\n\n- PostgreSQL: `localhost:5432`\n- LocalStack DynamoDB: `http://localhost:4566`\n\nSuggested `.env.test`:\n\n```dotenv\nTEST_POSTGRES_URL=postgresql+psycopg2://testuser:testpass@localhost:5432/testdb\nAWS_ACCESS_KEY_ID=test\nAWS_SECRET_ACCESS_KEY=test\nAWS_DEFAULT_REGION=us-east-1\nDYNAMODB_ENDPOINT_URL=http://localhost:4566\nTEST_BQ_PROJECT=your-gcp-project-id\nTEST_BQ_DATASET=ddl2data_integration_test\n# GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account.json\n```\n\n`.env.test` is already ignored and is the right place for local test-only credentials.\n\n### BigQuery authentication\n\nTwo common options work for the integration tests:\n\nApplication Default Credentials:\n\n```bash\ngcloud auth application-default login\n```\n\nService-account key file:\n\n```bash\nexport GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account.json\n```\n\nIf `TEST_BQ_PROJECT` is missing or credentials are unavailable, the BigQuery integration tests skip automatically.\n\n---\n\n## CLI reference\n\n```bash\nddl2data [--ddl schema.sql\n        | --schema-from-db --db-url URL [--tables t1,t2]\n        | --schema-from-dynamodb --dynamodb-table NAME]\n        [--dynamodb-region REGION]\n        [--dynamodb-extra-attr name:type]\n        [--config config.toml]\n        [--rows 100]\n        [--table-rows users=20,orders=500] [--table-rows events=2000]\n        [--out postgres|mysql|sqlite|bigquery|json|csv|parquet|dynamodb-json]\n        [--engine python|polars]\n        [--bq-insert-all]\n        [--output-path PATH]\n        [--insert --db-url URL]\n        [--dist ...] [--dist ...]\n        [--seed INT]\n        [--report-path report.json]\n        [--strict-checks]\n        [--parquet-compression snappy|zstd|lz4|gzip|none]\n```\n\nFlag summary:\n\n- `--config`: load defaults from JSON, TOML, YAML, or YML\n- `--ddl`: input DDL file\n- `--schema-from-db`: inspect relational table metadata from a live database\n- `--tables`: optional comma-separated table filter for relational introspection mode\n- `--schema-from-dynamodb`: inspect a live DynamoDB table definition\n- `--dynamodb-table`: DynamoDB table name for `--schema-from-dynamodb`\n- `--dynamodb-region`: AWS region for DynamoDB schema loading\n- `--dynamodb-extra-attr`: add synthetic non-key DynamoDB attributes, repeatable\n- `--rows`: global default rows per table\n- `--table-rows`: per-table row overrides\n- `--out`: output format, default `postgres`\n- `--engine`: `python` or `polars`\n- `--bq-insert-all`: BigQuery `INSERT ALL ... SELECT 1;` mode\n- `--output-path`: output file path or output directory depending on format\n- `--db-url`: SQLAlchemy database URL for introspection or direct insert\n- `--insert`: insert generated rows directly into the target relational database\n- `--dist`: distribution overrides\n- `--seed`: deterministic generation seed\n- `--report-path`: write a JSON report with validation summary\n- `--strict-checks`: validate supported CHECK constraints after generation\n- `--parquet-compression`: Parquet compression codec\n\n---\n\n## Limitations\n\n- DDL parsing and relational DB introspection cover common schemas well, but advanced dialect-specific features are still partial.\n- DynamoDB schema loading models key and index attributes plus explicitly declared extra attributes; it does not infer full document structure from item samples.\n- CHECK-aware generation and `--strict-checks` cover a practical subset, not arbitrary SQL expressions.\n- Function-heavy or cross-column CHECK expressions are best-effort rather than fully modeled.\n- `--engine polars` still falls back to Python for tables with CHECK, unique, or primary-key constraints.\n\n---\n\n## Future work\n\n- More distribution types for realistic synthetic workloads\n- Additional pipeline edge-case generation such as skew, timestamp boundaries, and duplicate-heavy batches\n- Broader dialect coverage and richer schema inference for advanced database features\n\n---\n\n## License\n\nApache License 2.0. See [LICENSE](LICENSE).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhyangminj%2Fddl2data","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fhyangminj%2Fddl2data","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhyangminj%2Fddl2data/lists"}