https://github.com/opendsr-std/seedfaker
Deterministic synthetic data generator for realistic, correlated, and noisy test records across 68 locales. Rust CLI/Python/Node.js/Browser WASM/Go/PHP/Ruby/MCP
- Host: GitHub
- URL: https://github.com/opendsr-std/seedfaker
- Owner: opendsr-std
- License: other
- Created: 2026-04-08T23:19:50.000Z (about 2 months ago)
- Default Branch: main
- Last Pushed: 2026-04-12T10:03:44.000Z (about 1 month ago)
- Last Synced: 2026-04-12T11:12:44.170Z (about 1 month ago)
- Topics: ai-mcp, cli, database-seeding, deterministic, fake-data, faker, faker-js, faker-provider, fixtures, locale, mcp, mock-data, pii, rust, synthetic-data, synthetic-dataset, test-data-generator
- Language: Rust
- Homepage:
- Size: 1.59 MB
- Stars: 18
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
README
# seedfaker
Deterministic synthetic data generator. Same seed, same output — across CLI, Python, Node.js, Go, PHP, Ruby, WASM.
200+ fields, 68 locales, multi-table FK, expressions, templates, streaming, `replace` for anonymising existing data.
## Highlights
- **Deterministic across 7 runtimes** — CLI, Python, Node, Go, PHP, Ruby, WASM. Same seed → byte-identical output. `--fingerprint` catches algorithm drift. [→](#determinism)
- **Multi-table FK** — anchors (`users.id:zipf`), dereference (`customer_id->email`), self-reference, `ctx:strict` identity correlation. [→](docs/multi-table.md)
- **Distributed** — `--shard I/N` on three hosts, concatenate, bit-identical to single-host. No coordinator. [→](#distributed-generation)
- **Database ingest** — `seedfaker | psql "\COPY"`, no files, constant memory. [→](guides/seed-database.md)
- **TB-scale** — 1 GB into Postgres in 9 s on an 8-core machine ([benchmark](benchmarks/payments_5gb.sh)); 1 TB ≈ 4.3 h. [→](guides/seed-large-database.md)
- **Throughput** — ~90 MB/s per core (TPC-H dbgen parity), 403 MB/s on 8 threads. Reproducible in [`benchmarks/`](benchmarks/).
- **In-place anonymisation** — `seedfaker replace email ssn < dump.csv`. Same value + seed = same replacement; cross-file joins survive. [→](docs/replace.md)
- **ML/LLM datasets** — `--annotated` (byte-offset spans), `--corrupt` (15 noise types), templates (prompt/completion), multi-table FK (conversations, RAG). [→](guides/training-data.md)
- **Locale-aware PII** — Luhn credit cards, IBAN check digits, 48 gov-ID formats, 68 locales, native scripts. [→](docs/fields.md)
## Contents
- [Install](#install)
- [Library](#library)
- [CLI](#cli)
- [Multi-table and FK](#multi-table-and-fk)
- [Distributed generation](#distributed-generation)
- [Bulk load into a database](#bulk-load-into-a-database)
- [Anonymise existing data](#anonymise-existing-data)
- [Annotated output for ML](#annotated-output-for-ml)
- [Determinism](#determinism)
- [Packages and bindings](#packages-and-bindings)
- [Documentation](#documentation)
- [Quick start](#quick-start)
- [Guides](#guides)
- [Benchmarks](#benchmarks)
- [License](#license)
## Install
One of:
```bash
pip install seedfaker # Python
npm install @opendsr/seedfaker # Node.js
go get github.com/opendsr-std/seedfaker-go # Go
composer require opendsr/seedfaker # PHP
gem install seedfaker # Ruby
npm install @opendsr/seedfaker-wasm # Browser (WASM)
brew install opendsr-std/tap/seedfaker # CLI (macOS / Linux)
cargo install seedfaker # CLI (from source)
npm install -g @opendsr/seedfaker-cli # CLI (npm)
```
All packages wrap the same Rust core and produce byte-identical output for a given seed. See [Packages and bindings](#packages-and-bindings) for per-package documentation.
## Library
One value:
```python
from seedfaker import SeedFaker
sf = SeedFaker(seed="test")
sf.field("email") # "janet.marsh@inbox.com"
sf.field("phone", e164=True) # "+14155551234"
sf.field("credit-card", space=True) # "4174 0785 8323 6433"
```
One record, with `ctx="strict"` locking every field to one identity:
```python
sf.record(["name", "email", "phone"], ctx="strict")
# {"name": "Janet Marsh", "email": "janet.marsh@inbox.com", "phone": "+1 (957) 226-4272"}
```
Batch:
```python
sf.records(["name", "email", "phone"], n=1000, ctx="strict")
```
Locales, weighted mix, native script:
```python
SeedFaker(seed="test", locale="de").field("name") # "Baldur Adler"
SeedFaker(seed="test", locale="ja").field("name") # "石本 和彦"
SeedFaker(seed="test", locale="en=7,de=2,fr=1") # weighted
```
The Node.js API mirrors the Python one:
```js
const sf = new SeedFaker({ seed: "test", locale: "en" });
sf.records(["name", "email"], { n: 1000, ctx: "strict" });
```
Full API: [docs/library](docs/library.md). Locale list: [docs/context](docs/context.md).
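The Highlights mention Luhn-valid credit cards. As a quick sanity check, the card number from the "One value" example can be verified with a few lines of plain Python. This is a generic Luhn checksum, not code taken from seedfaker; `luhn_valid` is a hypothetical helper name.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]  # ignore spaces/dashes
    if not digits:
        return False
    total = 0
    # Walk right to left, doubling every second digit.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9  # same as summing the two digits of the product
        total += d
    return total % 10 == 0


print(luhn_valid("4174 0785 8323 6433"))  # True  (value from the example above)
print(luhn_valid("4174 0785 8323 6434"))  # False (last digit changed)
```

A check like this is handy in test suites that want to assert generated card numbers are well-formed without depending on a payment library.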
## CLI
```bash
seedfaker name email phone --seed test --until 2025 -n 1000
seedfaker name email phone --format csv --seed test --until 2025 -n 1000
seedfaker name email phone --format jsonl --seed test --until 2025 -n 1000
seedfaker name email --ctx strict -l ja,zh --abc native -n 10
```
Pipe directly into a database:
```bash
seedfaker name email phone --format sql=users -n 1000000 --seed staging --until 2025 | psql mydb
```
Arithmetic between columns:
```bash
seedfaker price=amount:1..500:plain qty=integer:1..20 "total=price*qty" \
  --format csv --seed ci -n 3 --until 2025
# price,qty,total
# 424.49,14,5942.86
# 459.67,3,1379.01
# 309.44,12,3713.28
```
Presets for common log/data shapes:
```bash
seedfaker run nginx --rate 5000 --seed demo -n 0 > access.log
seedfaker run payment --format jsonl --seed bench -n 1000 --until 2025
```
All flags: [docs/cli](docs/cli.md). Field syntax: [docs/fields](docs/fields.md). Configs: [docs/configs](docs/configs.md). Presets: [docs/presets](docs/presets.md).
## Multi-table and FK
```yaml
# shop.yaml
users:
  columns:
    id: serial
    name: first-name
    email: email
  options: { count: 1000, ctx: strict }
orders:
  columns:
    id: serial
    customer_id: users.id:zipf
    customer_name: customer_id->name
    customer_email: customer_id->email
    total: amount:usd:1..5000
  options: { count: 50000 }
```
```bash
seedfaker run shop.yaml --all --output-dir ./data --format csv
```
- `users.id:zipf` — FK anchor with power-law distribution. Use `:zipf=N` for a tunable exponent; omit `:zipf` for a uniform distribution.
- `customer_id->email` — FK dereference; resolves to the email of the same parent row selected by `customer_id`.
- Self-referencing FK supported (`employees.manager_id: employees.id`).
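To make the effect of a power-law FK anchor concrete, here is a minimal sketch of Zipf-weighted parent selection using only the Python standard library. It illustrates the distribution shape only; the sampler, the default exponent, and the function name `zipf_foreign_keys` are assumptions, not seedfaker internals.

```python
import random
from collections import Counter


def zipf_foreign_keys(parent_count: int, child_count: int,
                      exponent: float = 1.0, seed: int = 0) -> list[int]:
    """Pick `child_count` parent ids (1..parent_count) with Zipf-like skew."""
    rng = random.Random(seed)  # seeded, so the same inputs repeat the same picks
    parent_ids = list(range(1, parent_count + 1))
    # The parent at rank r gets weight 1 / r**exponent.
    weights = [1.0 / rank ** exponent for rank in parent_ids]
    return rng.choices(parent_ids, weights=weights, k=child_count)


# Same sizes as shop.yaml: 1000 users, 50000 orders.
customer_ids = zipf_foreign_keys(parent_count=1000, child_count=50000)
top_id, top_orders = Counter(customer_ids).most_common(1)[0]
print(top_id, top_orders)  # a handful of low ids should hold most of the orders
```

With a uniform anchor every user would receive about 50 orders; under this skew a few "heavy" customers dominate, which is closer to real order tables.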
Details: [docs/multi-table](docs/multi-table.md), [docs/expressions](docs/expressions.md).
For bulk-loading a real database at GB/TB scale see [guides/seed-large-database](guides/seed-large-database.md).
## Distributed generation
Determinism enables horizontal scale without coordination. `--shard I/N` emits a disjoint, contiguous slice of the full `serial` range; the same seed on different hosts produces non-overlapping output. Concatenating all N shards in order (first shard's header retained, the rest generated with `--no-header`) yields output byte-identical to an unsharded run.
Three hosts, one dataset:
```bash
# host-a
seedfaker run shop.yaml --table events --seed prod -n 1_000_000_000 \
  --shard 0/3 --format csv > events.part0.csv
# host-b
seedfaker run shop.yaml --table events --seed prod -n 1_000_000_000 \
  --shard 1/3 --format csv --no-header > events.part1.csv
# host-c
seedfaker run shop.yaml --table events --seed prod -n 1_000_000_000 \
  --shard 2/3 --format csv --no-header > events.part2.csv
```
Collect and concatenate:
```bash
cat events.part0.csv events.part1.csv events.part2.csv > events.csv
# Same bytes, same SHA-256 as:
seedfaker run shop.yaml --table events --seed prod -n 1_000_000_000 --format csv
```
No shared state between hosts. No coordinator. No post-processing merge step. Each host is CPU-bound on its own slice and finishes independently.
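The reason concatenation works can be sketched in a few lines: if every row depends only on the seed and its record number, then any split into contiguous ranges reproduces the full output. The split formula and the `fake_row` stand-in below are assumptions for illustration; how seedfaker distributes a remainder across shards is not stated here.

```python
import hashlib


def shard_range(total: int, index: int, count: int) -> range:
    """Contiguous slice of record numbers [0, total) owned by shard index/count."""
    if not 0 <= index < count:
        raise ValueError("index must satisfy 0 <= index < count")
    start = total * index // count
    stop = total * (index + 1) // count
    return range(start, stop)


def fake_row(seed: str, n: int) -> str:
    """Stand-in generator: the row depends only on (seed, n), never on other rows."""
    digest = hashlib.sha256(f"{seed}:{n}".encode()).hexdigest()
    return f"{n},{digest[:8]}"


total = 10
full = [fake_row("prod", n) for n in range(total)]
parts = [[fake_row("prod", n) for n in shard_range(total, i, 3)] for i in range(3)]

print([len(p) for p in parts])   # [3, 3, 4]
print(sum(parts, []) == full)    # True: shards concatenated in order equal the full run
```

Because no shard needs to know what another shard produced, the hosts never have to communicate.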
Per-host generation can also use `--threads N` on top of `--shard`, stacking process and in-process parallelism:
```bash
seedfaker ... --shard 0/3 --threads 8 --format csv > events.part0.csv
```
Details on which mechanism to pick and how they compose: [docs/cli § Sharding and threads](docs/cli.md#sharding-and-threads), [guides/seed-large-database](guides/seed-large-database.md).
## Bulk load into a database
Pipe generated CSV straight into `COPY FROM STDIN` — no intermediate files, constant memory:
```bash
seedfaker run shop.yaml --table users --format csv \
  | psql "$PGURL" -q -c "\COPY users (id,name,email) FROM STDIN WITH (FORMAT csv, HEADER true)"
```
For GB/TB-scale loads, work in two phases: load into a table with no constraints or indexes (phase 1), then add them back afterwards (phase 2).
```sql
CREATE UNLOGGED TABLE users (id UUID NOT NULL, name TEXT, email TEXT);
-- load rows with COPY FROM STDIN (no PK, no FK, no indexes)
ALTER TABLE users SET LOGGED;
ALTER TABLE users ADD PRIMARY KEY (id);
```
Reason: Postgres constraint and index maintenance is per-row during INSERT/COPY; deferring to a single post-load scan is dramatically faster. seedfaker guarantees id uniqueness by construction, so phase-1 validation is wasted work.
`--shard I/N` splits one table's generation into N disjoint serial ranges. Run multiple `seedfaker | psql` pipelines in parallel into the same table: each `COPY` takes a `ROW EXCLUSIVE` table lock, which does not conflict with itself, so concurrent `COPY` into one table is supported.
```bash
# 4 shards into the same table, concurrent
for i in 0 1 2 3; do
  seedfaker run shop.yaml --table events --format csv --shard "$i/4" \
    | psql "$PGURL" -q -c "\COPY events (id,ts,user_id) FROM STDIN WITH (FORMAT csv, HEADER true)" &
done
wait
```
The reference benchmark [`benchmarks/payments_5gb.sh`](benchmarks/payments_5gb.sh) implements this pattern end-to-end: 10-table payment dataset, Dockerised Postgres 17 with tuned settings, per-table shard pool, Postgres-side WAL / checkpoint / cache-hit counters.
```bash
./benchmarks/payments_5gb.sh # ~100 MB, default
./benchmarks/payments_5gb.sh --scale 50 --shards 3 # ~5 GB with 3-way sharding of the big tables
./benchmarks/payments_5gb.sh --cleanup
```
Full workflow, tuning rationale, per-knob cost table, cross-engine notes (MySQL, ClickHouse, SQLite): [guides/seed-large-database](guides/seed-large-database.md).
## Anonymise existing data
Replace specific columns in existing CSV or JSONL, keeping other columns untouched and preserving referential integrity across files:
```bash
$ echo 'name,email,ssn
Alice,alice@corp.com,123-45-6789' | seedfaker replace email ssn --seed anon
name,email,ssn
Alice,nolan.moreno.xxy@icloud.com,404-16-7659
```
Same value + same seed yields the same replacement every run, so joining `users.email` and `events.email` (after masking each independently) still matches. Details: [docs/replace](docs/replace.md).
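The join-preserving property follows from the replacement being a pure function of the original value and the seed. A minimal sketch of that idea, using a keyed hash from the standard library rather than seedfaker's own generator (the function name and the `example.invalid` output format are assumptions):

```python
import hashlib
import hmac


def pseudonymise_email(value: str, seed: str) -> str:
    """Map an email to a stable fake address keyed by `seed`."""
    mac = hmac.new(seed.encode(), value.encode(), hashlib.sha256).hexdigest()
    return f"user-{mac[:12]}@example.invalid"


users = [{"id": "1", "email": "alice@corp.com"}]
events = [{"type": "login", "email": "alice@corp.com"}]

# Mask each "file" independently, using the same seed.
masked_users = [{**row, "email": pseudonymise_email(row["email"], "anon")} for row in users]
masked_events = [{**row, "email": pseudonymise_email(row["email"], "anon")} for row in events]

# The join on email still matches after masking.
joined = [(u["id"], e["type"]) for u in masked_users for e in masked_events
          if u["email"] == e["email"]]
print(joined)  # [('1', 'login')]
```

Changing the seed changes every replacement, so files that must stay joinable need to be masked with the same seed.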
## Annotated output for ML
`--annotated` emits JSONL with byte-offset spans, suitable for NER / PII training sets:
```bash
$ seedfaker name email ssn --annotated --seed demo -n 1 --until 2025
{"text":"Paulina Laca\tim.ivana@eunet.rs\t9580255797203","spans":[{"s":0,"e":12,"f":"name","v":"Paulina Laca"},{"s":13,"e":30,"f":"email","v":"im.ivana@eunet.rs"},{"s":31,"e":44,"f":"ssn","v":"9580255797203"}]}
```
Combine with `--corrupt low|mid|high|extreme` for noisy training data. Details: [docs/annotated](docs/annotated.md), [docs/corruption](docs/corruption.md).
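Before feeding annotated output to a training pipeline, it can be worth checking that every span actually points at its value. The sketch below does this for the sample line shown above. It assumes spans are half-open `[s, e)` offsets into the UTF-8 bytes of `text`, which matches the sample but is inferred rather than taken from the reference docs; `check_spans` is a hypothetical helper.

```python
import json

# The sample line from above (raw string so that \t stays a JSON escape).
line = r'{"text":"Paulina Laca\tim.ivana@eunet.rs\t9580255797203","spans":[{"s":0,"e":12,"f":"name","v":"Paulina Laca"},{"s":13,"e":30,"f":"email","v":"im.ivana@eunet.rs"},{"s":31,"e":44,"f":"ssn","v":"9580255797203"}]}'


def check_spans(jsonl_line: str) -> list[tuple[str, str]]:
    """Return (field, value) pairs after verifying each span against the text."""
    record = json.loads(jsonl_line)
    raw = record["text"].encode("utf-8")  # offsets are bytes, not characters
    labelled = []
    for span in record["spans"]:
        piece = raw[span["s"]:span["e"]].decode("utf-8")
        if piece != span["v"]:
            raise ValueError(f"span {span} does not match text slice {piece!r}")
        labelled.append((span["f"], piece))
    return labelled


print(check_spans(line))
# [('name', 'Paulina Laca'), ('email', 'im.ivana@eunet.rs'), ('ssn', '9580255797203')]
```

Slicing the encoded bytes rather than the string matters for native-script locales, where one character can occupy several bytes.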
## Determinism
Each value is derived from `(seed, record_number, field_name)`. Consequences:
- Adding a field does not change values of existing fields.
- Reordering fields in the config does not change values.
- The same seed produces byte-identical output across languages and versions within the same algorithm fingerprint.
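The first two consequences fall out of keying each value on its own `(seed, record_number, field_name)` triple. A minimal sketch of that scheme, with SHA-256 standing in for whatever derivation seedfaker actually uses (the hash choice, the separator, and the function names are assumptions):

```python
import hashlib


def derive(seed: str, record_number: int, field_name: str) -> int:
    """Derive a 64-bit value from the (seed, record_number, field_name) triple."""
    key = f"{seed}\x00{record_number}\x00{field_name}".encode()
    return int.from_bytes(hashlib.sha256(key).digest()[:8], "big")


def record(seed: str, record_number: int, fields: list[str]) -> dict[str, int]:
    """Build one record; each field is derived independently of the others."""
    return {name: derive(seed, record_number, name) for name in fields}


a = record("test", 0, ["name", "email"])
b = record("test", 0, ["phone", "email", "name"])  # field added, order changed

print(a["name"] == b["name"], a["email"] == b["email"])  # True True
```

A generator that instead pulled values from one shared random stream would shift every later field whenever a field was added or moved.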
Pin the fingerprint in CI to detect algorithm changes:
```bash
seedfaker --fingerprint
# sf0-158dc9f79ce46b43
```
Details: [docs/determinism](docs/determinism.md), [docs/context](docs/context.md) (identity correlation).
## Packages and bindings
| Language / runtime | Install | Registry | Local docs |
| ------------------ | -------------------------------------------- | -------------------------------------------------------------------------------------------------------- | ------------------------------------- |
| Python | `pip install seedfaker` | [pypi.org/project/seedfaker](https://pypi.org/project/seedfaker/) | [packages/pip](packages/pip/) |
| Node.js | `npm install @opendsr/seedfaker` | [npmjs.com/package/@opendsr/seedfaker](https://www.npmjs.com/package/@opendsr/seedfaker) | [packages/npm](packages/npm/) |
| Go | `go get github.com/opendsr-std/seedfaker-go` | [pkg.go.dev/github.com/opendsr-std/seedfaker-go](https://pkg.go.dev/github.com/opendsr-std/seedfaker-go) | [packages/go](packages/go/) |
| PHP | `composer require opendsr/seedfaker` | [packagist.org/packages/opendsr/seedfaker](https://packagist.org/packages/opendsr/seedfaker) | [packages/php](packages/php/) |
| Ruby | `gem install seedfaker` | [rubygems.org/gems/seedfaker](https://rubygems.org/gems/seedfaker) | [packages/ruby](packages/ruby/) |
| Browser (WASM) | `npm install @opendsr/seedfaker-wasm` | [npmjs.com/package/@opendsr/seedfaker-wasm](https://www.npmjs.com/package/@opendsr/seedfaker-wasm) | [packages/wasm](packages/wasm/) |
| CLI (npm) | `npm install -g @opendsr/seedfaker-cli` | [npmjs.com/package/@opendsr/seedfaker-cli](https://www.npmjs.com/package/@opendsr/seedfaker-cli) | [packages/npm-cli](packages/npm-cli/) |
| CLI (Homebrew) | `brew install opendsr-std/tap/seedfaker` | [github.com/opendsr-std/homebrew-tap](https://github.com/opendsr-std/homebrew-tap) | [docs/cli](docs/cli.md) |
| CLI (Cargo) | `cargo install seedfaker` | [crates.io/crates/seedfaker](https://crates.io/crates/seedfaker) | [docs/cli](docs/cli.md) |
All packages wrap the same Rust core. API surface is intentionally identical across languages except for idiomatic naming.
## Documentation
Reference: [docs/](docs/).
| | |
| ---------------- | --------------------------------------------------------------------------------------------------------- |
| **Start here** | [Quick start](docs/quick-start.md) |
| **CLI** | [Commands and flags](docs/cli.md) · [Determinism](docs/determinism.md) |
| **Fields** | [Syntax and modifiers](docs/fields.md) · [Field reference (200+)](docs/field-reference.md) |
| **Configs** | [YAML configs](docs/configs.md) · [Multi-table](docs/multi-table.md) · [Expressions](docs/expressions.md) |
| **Output** | [Templates](docs/templates.md) · [Annotated](docs/annotated.md) · [Streaming](docs/streaming.md) |
| **Data quality** | [Context](docs/context.md) · [Corruption](docs/corruption.md) · [Replace](docs/replace.md) |
| **Presets** | [Built-in presets](docs/presets.md) (nginx, payment, auth, postgres, syslog, medical, …) |
| **Integrations** | [Library API](docs/library.md) · [MCP](docs/mcp.md) |
Workflows: [guides/](guides/). Runnable examples: [examples/](examples/).
## Quick start
```bash
pip install seedfaker
python -c 'from seedfaker import SeedFaker; print(SeedFaker(seed="demo").record(["name","email"]))'
```
Or with the CLI:
```bash
brew install opendsr-std/tap/seedfaker
seedfaker name email phone --seed demo --until 2025 -n 5
```
Then: [docs/quick-start](docs/quick-start.md) for the 10-minute walkthrough, [docs/cli](docs/cli.md) for flags, [docs/fields](docs/fields.md) for field syntax.
## Guides
End-to-end workflows in [guides/](guides/):
| | |
| ----------------------------------------------------------- | ------------------------------------------------------------------------ |
| [Seed a database](guides/seed-database.md) | Postgres/MySQL staging DB with multi-table FK |
| [Seed a large database](guides/seed-large-database.md) | GB/TB bulk load — parallel COPY, UNLOGGED, tuning |
| [Distributed generation](guides/distributed-generation.md) | Multi-host sharded generation without coordination |
| [Anonymise production data](guides/anonymize-data.md) | `replace` on CSV/JSONL, FK integrity across files |
| [Training and evaluation datasets](guides/training-data.md) | NER/PII, LLM fine-tuning, eval with ground truth, red-team, multilingual |
| [Reproducible datasets](guides/reproducible-datasets.md) | Deterministic fixtures, CI, fingerprint guard |
| [Library usage](guides/library-usage.md) | Python / Node.js SDK patterns |
| [Mock API server](guides/mock-api-server.md) | Express / FastAPI mock endpoint |
| [API load testing](guides/api-load-testing.md) | Rate-limited streaming, corruption |
| [MCP for AI agents](guides/mcp-ai-agents.md) | Claude / Cursor / VS Code integration |
## Benchmarks
Reproducible throughput measurements, install scripts, per-field breakdowns, and an end-to-end Postgres load benchmark (`payments_5gb.sh`): [benchmarks/](benchmarks/).
## License
MIT
---
> [README](README.md) · [Docs](docs/) · [Guides](guides/) · [Packages](packages/)