https://github.com/hyangminj/ddl2data

Turn any SQL schema into realistic test data — instantly
https://github.com/hyangminj/ddl2data

bigquery cli data-engineering date-testing dynamodb etl faker postgresql python rds synthetic-data test-data-generation

Last synced: about 2 months ago
JSON representation

Turn any SQL schema into realistic test data — instantly

Host: GitHub
URL: https://github.com/hyangminj/ddl2data
Owner: hyangminj
License: apache-2.0
Created: 2026-03-31T04:00:08.000Z (3 months ago)
Default Branch: main
Last Pushed: 2026-04-13T15:50:27.000Z (3 months ago)
Last Synced: 2026-04-13T17:41:26.306Z (3 months ago)
Topics: bigquery, cli, data-engineering, date-testing, dynamodb, etl, faker, postgresql, python, rds, synthetic-data, test-data-generation
Language: Python
Homepage:
Size: 95.7 KB
Stars: 1
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

Awesome Lists containing this project

README

# ddl2data

Turn any SQL schema into realistic test data — instantly.

- PyPI package: `ddl2data`
- CLI command: `ddl2data`
- Python module: `ddl2data`

`ddl2data` turns SQL DDL, a live relational schema, or a live DynamoDB table schema into fake but structured data that is useful for load tests, end-to-end pipeline checks, local development, and staging environment seeding.

For relational schemas it parses tables and foreign keys, generates rows in parent-before-child order, applies type-aware defaults plus optional distributions, validates the generated data, and writes it out as SQL, JSON, CSV, or Parquet. It can also insert generated rows directly into a relational database through SQLAlchemy. For DynamoDB it can load key and index metadata from a real table and emit typed DynamoDB JSON payloads.

---

## Why use it

Use `ddl2data` when you need to:

- stress a pipeline or service with large synthetic datasets
- generate relational fixtures from existing DDL
- seed staging or local environments while preserving foreign-key order
- inspect a live schema and generate data without hand-writing metadata
- generate DynamoDB-shaped typed JSON from a real table definition
- export the same dataset shape in SQL, JSON, CSV, Parquet, or DynamoDB JSON
- produce validation reports before loading data elsewhere

---

## How it works

```text
Schema input
-> parse DDL, inspect a live relational DB, or inspect a DynamoDB table
-> build metadata and dependency order
-> generate rows with Faker and optional distributions
-> validate FK, null, unique, and optional CHECK constraints
-> write SQL/JSON/CSV/Parquet/DynamoDB JSON or insert into a DB
```

Core modules:

- `ddl2data/parser/ddl.py`: SQL DDL parsing via `sqlglot`
- `ddl2data/parser/introspect.py`: relational schema introspection via SQLAlchemy
- `ddl2data/parser/dynamodb.py`: DynamoDB schema loading via `boto3`
- `ddl2data/generator/base.py`: main row generation engine
- `ddl2data/generator/dist.py`: distribution parsing and sampling
- `ddl2data/writer/`: SQL, JSON, CSV, Parquet, and DynamoDB JSON writers
- `ddl2data/validation.py` and `ddl2data/report.py`: validation and reporting

---

## Core capabilities

- Input sources:
- SQL DDL via `--ddl`
- live relational schema introspection via `--schema-from-db --db-url ...`
- live DynamoDB table schema via `--schema-from-dynamodb --dynamodb-table ...`
- Relationship-aware generation:
- foreign-key dependency graph
- parent tables generated before child tables
- FK-only tables can use the Polars generation path
- Output formats:
- PostgreSQL `INSERT`
- MySQL `INSERT`
- SQLite `INSERT`
- BigQuery `INSERT` and `INSERT ALL`
- JSON
- CSV
- Parquet
- DynamoDB typed JSON
- Validation and reporting:
- FK integrity checks
- non-null checks
- unique collision checks
- optional CHECK validation with `--strict-checks`
- JSON report output via `--report-path`
- Generation controls:
- per-table row overrides with `--table-rows`
- reproducible runs with `--seed`
- config-file driven runs with JSON, TOML, or YAML
- per-column and per-table distribution overrides with `--dist`
- optional `python` or `polars` engine selection

---

## Install

Recommended for CLI use:

```bash
pipx install ddl2data
```

Or install with `pip`:

```bash
pip install ddl2data
```

Optional Polars support:

```bash
pip install "ddl2data[polars]"
```

DynamoDB schema loading requires `boto3`:

```bash
pip install boto3
```

From source:

```bash
git clone https://github.com/hyangminj/ddl2data.git
cd ddl2data
python3 -m venv .venv
. .venv/bin/activate
.venv/bin/python -m pip install -e .
```

Contributor setup with test and Polars extras:

```bash
.venv/bin/python -m pip install -e ".[test,polars]"
```

Sanity check:

```bash
ddl2data --help
```

GitHub Releases also include wheel (`.whl`) and source (`.tar.gz`) artifacts.

---

## Quick start

Minimal schema:

```sql
CREATE TABLE users (
id INT PRIMARY KEY,
email VARCHAR(100) UNIQUE NOT NULL,
age INT,
tier VARCHAR(10)
);

CREATE TABLE orders (
id INT PRIMARY KEY,
user_id INT NOT NULL,
amount NUMERIC,
FOREIGN KEY (user_id) REFERENCES users(id)
);
```

Generate JSON:

```bash
ddl2data --ddl schema.sql --rows 100 --out json --output-path data.json
```

Generate PostgreSQL inserts:

```bash
ddl2data --ddl schema.sql --rows 100 --out postgres --output-path seed.sql
```

Generate CSV files, one per table:

```bash
ddl2data --ddl schema.sql --rows 100 --out csv --output-path ./csv_out
```

With this schema, `users` is generated first and `orders.user_id` is filled from generated `users.id` values.

---

## Common workflows

### Generate SQL for different targets

```bash
ddl2data --ddl schema.sql --rows 100 --out postgres
ddl2data --ddl schema.sql --rows 100 --out mysql
ddl2data --ddl schema.sql --rows 100 --out sqlite
ddl2data --ddl schema.sql --rows 100 --out bigquery
```

### Read schema directly from a relational database

```bash
ddl2data \
--schema-from-db \
--db-url postgresql+psycopg://user:pass@localhost:5432/mydb \
--rows 100 \
--out postgres
```

Limit introspection to selected tables:

```bash
ddl2data \
--schema-from-db \
--db-url postgresql+psycopg://user:pass@localhost:5432/mydb \
--tables users,orders,events \
--rows 100 \
--out json
```

### Insert generated rows directly into a relational database

```bash
ddl2data \
--ddl schema.sql \
--rows 1000 \
--insert \
--db-url postgresql+psycopg://user:pass@localhost:5432/mydb
```

### Generate DynamoDB typed JSON from a live table schema

```bash
ddl2data \
--schema-from-dynamodb \
--dynamodb-table users \
--dynamodb-region us-east-1 \
--dynamodb-extra-attr email:string \
--dynamodb-extra-attr score:int \
--rows 100 \
--out dynamodb-json \
--output-path users.jsonl
```

Notes:

- key attributes and GSI or LSI key attributes are inferred from the live table
- use `--dynamodb-extra-attr name:type` to add non-key fields to generated output
- supported extra attribute aliases include `string`, `uuid`, `date`, `datetime`, `numeric`, `float`, `int`, and `boolean`

### Control row counts per table

```bash
ddl2data \
--ddl schema.sql \
--rows 100 \
--table-rows users=20,orders=500 \
--table-rows events=2000 \
--out json
```

- `--rows` stays the global default
- `--table-rows table=count` overrides individual tables
- config files can use a `table_rows` map

### Generate a validation report

```bash
ddl2data \
--ddl schema.sql \
--rows 500 \
--strict-checks \
--report-path report.json \
--out json \
--output-path data.json
```

The report includes counts and sample issues for:

- FK violations
- non-null violations
- unique collisions
- supported CHECK violations when `--strict-checks` is enabled

### Use Parquet or the Polars engine

Parquet output:

```bash
ddl2data \
--ddl schema.sql \
--rows 100 \
--out parquet \
--parquet-compression zstd \
--output-path ./parquet_out
```

Polars generation and write path:

```bash
ddl2data \
--ddl schema.sql \
--rows 100000 \
--out csv \
--engine polars \
--output-path ./csv_out
```

Engine behavior:

- tables with CHECK constraints fall back to the Python row-wise generator
- tables with unique or primary-key constraints fall back to the Python row-wise generator
- FK-only tables can remain on the Polars path
- `--out parquet` requires the optional `polars` dependency regardless of `--engine`

### BigQuery-specific output

```bash
ddl2data --ddl schema.sql --rows 100 --out bigquery --output-path out.sql
ddl2data --ddl schema.sql --rows 100 --out bigquery --bq-insert-all --output-path out_insert_all.sql
```

BigQuery output supports:

- dataset-qualified table names like `dataset.table`
- `INSERT ALL` rendering via `--bq-insert-all`
- typed `DATE` and `TIMESTAMP` literals

---

## Config file example

```bash
ddl2data --config ddl2data.toml
```

Example `ddl2data.toml`:

```toml
ddl = "schema.sql"
rows = 500
out = "postgres"
seed = 42
strict_checks = true
table_rows = { users = 100, orders = 500, events = 2000 }
dist = [
"users.age:normal,mean=33,std=7",
"orders.amount:pareto,alpha=1.7,xm=1",
]
```

CLI arguments override config file values.

---

## Distribution overrides

Syntax:

```text
--dist :,k=v,k=v
```

Supported distribution kinds:

- `normal`
- `poisson`
- `weighted`
- `exponential`
- `pareto`
- `zipf`
- `peak`

Examples:

```bash
# global column rule
--dist age:normal,mean=35,std=7

# table-qualified rule (higher priority than global)
--dist users.age:normal,mean=30,std=6

# poisson counts
--dist daily_orders:poisson,lambda=3

# weighted categorical values
--dist tier:weighted,A=60%,B=30%,C=10%

# heavy-tail behavior
--dist amount:pareto,alpha=1.5,xm=1
--dist category_rank:zipf,skew=1.8,n=200

# peak-hour timestamps
--dist created_at:peak,hours=9-11,18-20
```

Priority when both exist:

1. `table.column`
2. `column`

---

## CHECK-aware generation

`ddl2data` uses practical heuristics for common CHECK constraints during generation and can optionally validate them again with `--strict-checks`.

Supported forms include:

- numeric comparison: `age >= 18`, `qty > 0`, `score <= 100`, `x != 0`
- ranges: `price BETWEEN 10 AND 20`
- enum lists: `status IN ('A', 'B', 'C')`
- regex-like checks: `code ~ '^[A-Z]{2}$'`, `REGEXP_LIKE(code, '^[0-9]+$')`
- nested compound forms built from supported expressions with `AND` and `OR`

This is intentionally best-effort, not a full SQL expression engine.

---

## Reproducible runs

Use `--seed` to stabilize Python random and Faker output:

```bash
ddl2data --ddl schema.sql --rows 100 --seed 42 --out json
```

---

## Development and testing

### Local setup

```bash
python3 -m venv .venv
. .venv/bin/activate
.venv/bin/python -m pip install -e ".[test,polars]"
```

If you only need the core package:

```bash
.venv/bin/python -m pip install -e .
```

If you need DynamoDB schema loading outside the test extra:

```bash
.venv/bin/python -m pip install boto3
```

### Useful test commands

Run the full suite:

```bash
.venv/bin/python -m pytest
```

Run only non-integration tests:

```bash
.venv/bin/python -m pytest -m "not integration"
```

Run one file:

```bash
.venv/bin/python -m pytest tests/test_parser_graph.py
```

Run one integration target:

```bash
.venv/bin/python -m pytest tests/test_integration_postgres.py
.venv/bin/python -m pytest tests/test_integration_dynamodb.py
.venv/bin/python -m pytest tests/test_integration_bigquery.py
```

Available markers:

- `integration`
- `postgres`
- `dynamodb`
- `bigquery`

### Local integration services

The repo includes `docker-compose.yml` for PostgreSQL and LocalStack DynamoDB:

```bash
docker compose up -d postgres localstack
docker compose ps
```

Service summary:

- PostgreSQL: `localhost:5432`
- LocalStack DynamoDB: `http://localhost:4566`

Suggested `.env.test`:

```dotenv
TEST_POSTGRES_URL=postgresql+psycopg2://testuser:testpass@localhost:5432/testdb
AWS_ACCESS_KEY_ID=test
AWS_SECRET_ACCESS_KEY=test
AWS_DEFAULT_REGION=us-east-1
DYNAMODB_ENDPOINT_URL=http://localhost:4566
TEST_BQ_PROJECT=your-gcp-project-id
TEST_BQ_DATASET=ddl2data_integration_test
# GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account.json
```

`.env.test` is already ignored and is the right place for local test-only credentials.

### BigQuery authentication

Two common options work for the integration tests:

Application Default Credentials:

```bash
gcloud auth application-default login
```

Service-account key file:

```bash
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account.json
```

If `TEST_BQ_PROJECT` is missing or credentials are unavailable, the BigQuery integration tests skip automatically.

---

## CLI reference

Flag summary:

- `--config`: load defaults from JSON, TOML, YAML, or YML
- `--ddl`: input DDL file
- `--schema-from-db`: inspect relational table metadata from a live database
- `--tables`: optional comma-separated table filter for relational introspection mode
- `--schema-from-dynamodb`: inspect a live DynamoDB table definition
- `--dynamodb-table`: DynamoDB table name for `--schema-from-dynamodb`
- `--dynamodb-region`: AWS region for DynamoDB schema loading
- `--dynamodb-extra-attr`: add synthetic non-key DynamoDB attributes, repeatable
- `--rows`: global default rows per table
- `--table-rows`: per-table row overrides
- `--out`: output format, default `postgres`
- `--engine`: `python` or `polars`
- `--bq-insert-all`: BigQuery `INSERT ALL ... SELECT 1;` mode
- `--output-path`: output file path or output directory depending on format
- `--db-url`: SQLAlchemy database URL for introspection or direct insert
- `--insert`: insert generated rows directly into the target relational database
- `--dist`: distribution overrides
- `--seed`: deterministic generation seed
- `--report-path`: write a JSON report with validation summary
- `--strict-checks`: validate supported CHECK constraints after generation
- `--parquet-compression`: Parquet compression codec

---

## Limitations

- DDL parsing and relational DB introspection cover common schemas well, but advanced dialect-specific features are still partial.
- DynamoDB schema loading models key and index attributes plus explicitly declared extra attributes; it does not infer full document structure from item samples.
- CHECK-aware generation and `--strict-checks` cover a practical subset, not arbitrary SQL expressions.
- Function-heavy or cross-column CHECK expressions are best-effort rather than fully modeled.
- `--engine polars` still falls back to Python for tables with CHECK, unique, or primary-key constraints.

---

## Future work

- More distribution types for realistic synthetic workloads
- Additional pipeline edge-case generation such as skew, timestamp boundaries, and duplicate-heavy batches
- Broader dialect coverage and richer schema inference for advanced database features

---

## License

Apache License 2.0. See [LICENSE](LICENSE).

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/hyangminj/ddl2data

Awesome Lists containing this project

README