{"id":48645121,"url":"https://github.com/opendsr-std/seedfaker","last_synced_at":"2026-05-03T21:02:50.235Z","repository":{"id":350106705,"uuid":"1205387882","full_name":"opendsr-std/seedfaker","owner":"opendsr-std","description":" Deterministic synthetic data generator for realistic, correlated, and noisy test records across 68 locales. Rust   CLI/Python/Node.js/Browser WASM/Go/PHP/Ruby/MCP                                                                                                                                       ","archived":false,"fork":false,"pushed_at":"2026-04-12T10:03:44.000Z","size":1670,"stargazers_count":18,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-04-12T11:12:44.170Z","etag":null,"topics":["ai-mcp","cli","database-seeding","deterministic","fake-data","faker","faker-js","faker-provider","fixtures","locale","mcp","mock-data","pii","rust","synthetic-data","synthetic-dataset","test-data-generator"],"latest_commit_sha":null,"homepage":"","language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/opendsr-std.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-04-08T23:19:50.000Z","updated_at":"2026-04-12T10:03:49.000Z","dependencies_parsed_at":null,"dependency_job_id":"4ec20f24-9d96-448a-947b-5807050c8fcd","html_url":"https://github.com/opendsr-std/seedfaker","commit_stats":null,"previous_names":["opendsr-std/seedfaker"],"ta
gs_count":18,"template":false,"template_full_name":null,"purl":"pkg:github/opendsr-std/seedfaker","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/opendsr-std%2Fseedfaker","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/opendsr-std%2Fseedfaker/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/opendsr-std%2Fseedfaker/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/opendsr-std%2Fseedfaker/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/opendsr-std","download_url":"https://codeload.github.com/opendsr-std/seedfaker/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/opendsr-std%2Fseedfaker/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31751705,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-13T09:16:15.125Z","status":"ssl_error","status_checked_at":"2026-04-13T09:16:05.023Z","response_time":93,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while 
reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai-mcp","cli","database-seeding","deterministic","fake-data","faker","faker-js","faker-provider","fixtures","locale","mcp","mock-data","pii","rust","synthetic-data","synthetic-dataset","test-data-generator"],"created_at":"2026-04-10T02:28:16.024Z","updated_at":"2026-05-03T21:02:49.589Z","avatar_url":"https://github.com/opendsr-std.png","language":"Rust","funding_links":[],"categories":[],"sub_categories":[],"readme":"# seedfaker\n\nDeterministic synthetic data generator. Same seed, same output — across CLI, Python, Node.js, Go, PHP, Ruby, WASM.\n\n200+ fields, 68 locales, multi-table FK, expressions, templates, streaming, `replace` for anonymising existing data.\n\n## Highlights\n\n- **Deterministic across 7 runtimes** — CLI, Python, Node, Go, PHP, Ruby, WASM. Same seed → byte-identical output. `--fingerprint` catches algorithm drift. [→](#determinism)\n- **Multi-table FK** — anchors (`users.id:zipf`), dereference (`customer_id-\u003eemail`), self-reference, `ctx:strict` identity correlation. [→](docs/multi-table.md)\n- **Distributed** — `--shard I/N` on three hosts, concatenate, bit-identical to single-host. No coordinator. [→](#distributed-generation)\n- **Database ingest** — `seedfaker | psql \"\\COPY\"`, no files, constant memory. [→](guides/seed-database.md)\n- **TB-scale** — 1 GB into Postgres in 9 s on 8-core ([benchmark](benchmarks/payments_5gb.sh)); 1 TB ≈ 4.3 h. [→](guides/seed-large-database.md)\n- **Throughput** — ~90 MB/s per core (TPC-H dbgen parity), 403 MB/s on 8 threads. 
Reproducible in [`benchmarks/`](benchmarks/).\n- **In-place anonymisation** — `seedfaker replace email ssn \u003c dump.csv`. Same value + seed = same replacement; cross-file joins survive. [→](docs/replace.md)\n- **ML/LLM datasets** — `--annotated` (byte-offset spans), `--corrupt` (15 noise types), templates (prompt/completion), multi-table FK (conversations, RAG). [→](guides/training-data.md)\n- **Locale-aware PII** — Luhn credit cards, IBAN check digits, 48 gov-ID formats, 68 locales, native scripts. [→](docs/fields.md)\n\n## Contents\n\n- [Install](#install)\n- [Library](#library)\n- [CLI](#cli)\n- [Multi-table and FK](#multi-table-and-fk)\n- [Distributed generation](#distributed-generation)\n- [Bulk load into a database](#bulk-load-into-a-database)\n- [Anonymise existing data](#anonymise-existing-data)\n- [Annotated output for ML](#annotated-output-for-ml)\n- [Determinism](#determinism)\n- [Packages and bindings](#packages-and-bindings)\n- [Documentation](#documentation)\n- [Quick start](#quick-start)\n- [Guides](#guides)\n- [Benchmarks](#benchmarks)\n- [License](#license)\n\n## Install\n\nOne of:\n\n```bash\npip install seedfaker                          # Python\nnpm install @opendsr/seedfaker                 # Node.js\ngo get github.com/opendsr-std/seedfaker-go     # Go\ncomposer require opendsr/seedfaker             # PHP\ngem install seedfaker                          # Ruby\nnpm install @opendsr/seedfaker-wasm            # Browser (WASM)\nbrew install opendsr-std/tap/seedfaker         # CLI (macOS / Linux)\ncargo install seedfaker                        # CLI (from source)\nnpm install -g @opendsr/seedfaker-cli          # CLI (npm)\n```\n\nAll packages wrap the same Rust core and produce byte-identical output for a given seed. 
See [Packages and bindings](#packages-and-bindings) for per-package documentation.\n\n## Library\n\nOne value:\n\n```python\nfrom seedfaker import SeedFaker\nsf = SeedFaker(seed=\"test\")\nsf.field(\"email\")                  # \"janet.marsh@inbox.com\"\nsf.field(\"phone\", e164=True)       # \"+14155551234\"\nsf.field(\"credit-card\", space=True) # \"4174 0785 8323 6433\"\n```\n\nOne record, with `ctx=\"strict\"` locking every field to one identity:\n\n```python\nsf.record([\"name\", \"email\", \"phone\"], ctx=\"strict\")\n# {\"name\": \"Janet Marsh\", \"email\": \"janet.marsh@inbox.com\", \"phone\": \"+1 (957) 226-4272\"}\n```\n\nBatch:\n\n```python\nsf.records([\"name\", \"email\", \"phone\"], n=1000, ctx=\"strict\")\n```\n\nLocales, weighted mix, native script:\n\n```python\nSeedFaker(seed=\"test\", locale=\"de\").field(\"name\")        # \"Baldur Adler\"\nSeedFaker(seed=\"test\", locale=\"ja\").field(\"name\")        # \"石本 和彦\"\nSeedFaker(seed=\"test\", locale=\"en=7,de=2,fr=1\")          # weighted\n```\n\nNode.js API is identical:\n\n```js\nconst sf = new SeedFaker({ seed: \"test\", locale: \"en\" });\nsf.records([\"name\", \"email\"], { n: 1000, ctx: \"strict\" });\n```\n\nFull API: [docs/library](docs/library.md). 
Locale list: [docs/context](docs/context.md).\n\n## CLI\n\n```bash\nseedfaker name email phone --seed test --until 2025 -n 1000\nseedfaker name email phone --format csv --seed test --until 2025 -n 1000\nseedfaker name email phone --format jsonl --seed test --until 2025 -n 1000\nseedfaker name email --ctx strict -l ja,zh --abc native -n 10\n```\n\nPipe directly into a database:\n\n```bash\nseedfaker name email phone --format sql=users -n 1000000 --seed staging --until 2025 | psql mydb\n```\n\nArithmetic between columns:\n\n```bash\nseedfaker price=amount:1..500:plain qty=integer:1..20 \"total=price*qty\" \\\n  --format csv --seed ci -n 3 --until 2025\n# price,qty,total\n# 424.49,14,5942.86\n# 459.67,3,1379.01\n# 309.44,12,3713.28\n```\n\nPresets for common log/data shapes:\n\n```bash\nseedfaker run nginx   --rate 5000 --seed demo -n 0 \u003e access.log\nseedfaker run payment --format jsonl --seed bench -n 1000 --until 2025\n```\n\nAll flags: [docs/cli](docs/cli.md). Field syntax: [docs/fields](docs/fields.md). Configs: [docs/configs](docs/configs.md). Presets: [docs/presets](docs/presets.md).\n\n## Multi-table and FK\n\n```yaml\n# shop.yaml\nusers:\n  columns:\n    id: serial\n    name: first-name\n    email: email\n  options: { count: 1000, ctx: strict }\n\norders:\n  columns:\n    id: serial\n    customer_id: users.id:zipf\n    customer_name: customer_id-\u003ename\n    customer_email: customer_id-\u003eemail\n    total: amount:usd:1..5000\n  options: { count: 50000 }\n```\n\n```bash\nseedfaker run shop.yaml --all --output-dir ./data --format csv\n```\n\n- `users.id:zipf` — FK anchor with power-law distribution. 
`:zipf=N` for tunable exponent; omit for uniform.\n- `customer_id-\u003eemail` — FK dereference; resolves to the email of the same parent row selected by `customer_id`.\n- Self-referencing FK supported (`employees.manager_id: employees.id`).\n\nDetails: [docs/multi-table](docs/multi-table.md), [docs/expressions](docs/expressions.md).\n\nFor bulk-loading a real database at GB/TB scale see [guides/seed-large-database](guides/seed-large-database.md).\n\n## Distributed generation\n\nDeterminism enables horizontal scale without coordination. `--shard I/N` emits a disjoint, contiguous slice of the full `serial` range; the same seed on different hosts produces non-overlapping output. Concatenating all N shards (first shard's header retained, rest with `--no-header`) yields bytes bit-identical to an `N=1` run.\n\nThree hosts, one dataset:\n\n```bash\n# host-a\nseedfaker run shop.yaml --table events --seed prod -n 1_000_000_000 \\\n  --shard 0/3 --format csv \u003e events.part0.csv\n\n# host-b\nseedfaker run shop.yaml --table events --seed prod -n 1_000_000_000 \\\n  --shard 1/3 --format csv --no-header \u003e events.part1.csv\n\n# host-c\nseedfaker run shop.yaml --table events --seed prod -n 1_000_000_000 \\\n  --shard 2/3 --format csv --no-header \u003e events.part2.csv\n```\n\nCollect and concatenate:\n\n```bash\ncat events.part0.csv events.part1.csv events.part2.csv \u003e events.csv\n# Same bytes, same SHA-256 as:\nseedfaker run shop.yaml --table events --seed prod -n 1_000_000_000 --format csv\n```\n\nNo shared state between hosts. No coordinator. No post-processing merge step. Each host is CPU-bound on its own slice and finishes independently.\n\nPer-host generation can also use `--threads N` on top of `--shard`, stacking process and in-process parallelism:\n\n```bash\nseedfaker ... 
--shard 0/3 --threads 8 --format csv \u003e events.part0.csv\n```\n\nDetails on which mechanism to pick and how they compose: [docs/cli § Sharding and threads](docs/cli.md#sharding-and-threads), [guides/seed-large-database](guides/seed-large-database.md).\n\n## Bulk load into a database\n\nPipe generated CSV straight into `COPY FROM STDIN` — no intermediate files, constant memory:\n\n```bash\nseedfaker run shop.yaml --table users --format csv \\\n  | psql \"$PGURL\" -q -c \"\\COPY users (id,name,email) FROM STDIN WITH (FORMAT csv, HEADER true)\"\n```\n\nFor GB/TB-scale loads, work in two phases: load into a table with no primary key, foreign keys, or indexes (phase 1), then add them back afterwards (phase 2).\n\n```sql\nCREATE UNLOGGED TABLE users (id UUID NOT NULL, name TEXT, email TEXT);\n-- load rows with COPY FROM STDIN (no PK, no FK, no indexes)\nALTER TABLE users SET LOGGED;\nALTER TABLE users ADD PRIMARY KEY (id);\n```\n\nReason: Postgres maintains constraints and indexes per row during INSERT/COPY; deferring that work to a single post-load scan is dramatically faster. seedfaker guarantees id uniqueness by construction, so checking uniqueness during phase 1 is wasted work.\n\n`--shard I/N` splits one table's generation into N disjoint serial ranges. 
Run multiple `seedfaker | psql` pipelines in parallel into the same table — Postgres takes a RowExclusive lock per backend, not Exclusive, so concurrent `COPY` into one table is supported.\n\n```bash\n# 4 shards into the same table, concurrent\nfor i in 0 1 2 3; do\n  seedfaker run shop.yaml --table events --format csv --shard $i/4 \\\n    | psql \"$PGURL\" -q -c \"\\COPY events (id,ts,user_id) FROM STDIN WITH (FORMAT csv, HEADER true)\" \u0026\ndone\nwait\n```\n\nThe reference benchmark [`benchmarks/payments_5gb.sh`](benchmarks/payments_5gb.sh) implements this pattern end-to-end: 10-table payment dataset, Dockerised Postgres 17 with tuned settings, per-table shard pool, Postgres-side WAL / checkpoint / cache-hit counters.\n\n```bash\n./benchmarks/payments_5gb.sh                       # ~100 MB, default\n./benchmarks/payments_5gb.sh --scale 50 --shards 3 # ~5 GB with 3-way sharding of the big tables\n./benchmarks/payments_5gb.sh --cleanup\n```\n\nFull workflow, tuning rationale, per-knob cost table, cross-engine notes (MySQL, ClickHouse, SQLite): [guides/seed-large-database](guides/seed-large-database.md).\n\n## Anonymise existing data\n\nReplace specific columns in existing CSV or JSONL, keeping other columns untouched and preserving referential integrity across files:\n\n```bash\n$ echo 'name,email,ssn\nAlice,alice@corp.com,123-45-6789' | seedfaker replace email ssn --seed anon\nname,email,ssn\nAlice,nolan.moreno.xxy@icloud.com,404-16-7659\n```\n\nSame value + same seed yields the same replacement every run, so joining `users.email` and `events.email` (after masking each independently) still matches. 
Details: [docs/replace](docs/replace.md).\n\n## Annotated output for ML\n\n`--annotated` emits JSONL with byte-offset spans, suitable for NER / PII training sets:\n\n```bash\n$ seedfaker name email ssn --annotated --seed demo -n 1 --until 2025\n{\"text\":\"Paulina Laca\\tim.ivana@eunet.rs\\t9580255797203\",\"spans\":[{\"s\":0,\"e\":12,\"f\":\"name\",\"v\":\"Paulina Laca\"},{\"s\":13,\"e\":30,\"f\":\"email\",\"v\":\"im.ivana@eunet.rs\"},{\"s\":31,\"e\":44,\"f\":\"ssn\",\"v\":\"9580255797203\"}]}\n```\n\nCombine with `--corrupt low|mid|high|extreme` for noisy training data. Details: [docs/annotated](docs/annotated.md), [docs/corruption](docs/corruption.md).\n\n## Determinism\n\nEach value is derived from `(seed, record_number, field_name)`. Consequences:\n\n- Adding a field does not change values of existing fields.\n- Reordering fields in the config does not change values.\n- The same seed produces byte-identical output across languages and versions within the same algorithm fingerprint.\n\nPin the fingerprint in CI to detect algorithm changes:\n\n```bash\nseedfaker --fingerprint\n# sf0-158dc9f79ce46b43\n```\n\nDetails: [docs/determinism](docs/determinism.md), [docs/context](docs/context.md) (identity correlation).\n\n## Packages and bindings\n\n| Language / runtime | Install                                      | Registry                                                                                                 | Local docs                            |\n| ------------------ | -------------------------------------------- | -------------------------------------------------------------------------------------------------------- | ------------------------------------- |\n| Python             | `pip install seedfaker`                      | [pypi.org/project/seedfaker](https://pypi.org/project/seedfaker/)                                        | [packages/pip](packages/pip/)         |\n| Node.js            | `npm install @opendsr/seedfaker`             | 
[npmjs.com/package/@opendsr/seedfaker](https://www.npmjs.com/package/@opendsr/seedfaker)                 | [packages/npm](packages/npm/)         |\n| Go                 | `go get github.com/opendsr-std/seedfaker-go` | [pkg.go.dev/github.com/opendsr-std/seedfaker-go](https://pkg.go.dev/github.com/opendsr-std/seedfaker-go) | [packages/go](packages/go/)           |\n| PHP                | `composer require opendsr/seedfaker`         | [packagist.org/packages/opendsr/seedfaker](https://packagist.org/packages/opendsr/seedfaker)             | [packages/php](packages/php/)         |\n| Ruby               | `gem install seedfaker`                      | [rubygems.org/gems/seedfaker](https://rubygems.org/gems/seedfaker)                                       | [packages/ruby](packages/ruby/)       |\n| Browser (WASM)     | `npm install @opendsr/seedfaker-wasm`        | [npmjs.com/package/@opendsr/seedfaker-wasm](https://www.npmjs.com/package/@opendsr/seedfaker-wasm)       | [packages/wasm](packages/wasm/)       |\n| CLI (npm)          | `npm install -g @opendsr/seedfaker-cli`      | [npmjs.com/package/@opendsr/seedfaker-cli](https://www.npmjs.com/package/@opendsr/seedfaker-cli)         | [packages/npm-cli](packages/npm-cli/) |\n| CLI (Homebrew)     | `brew install opendsr-std/tap/seedfaker`     | [github.com/opendsr-std/homebrew-tap](https://github.com/opendsr-std/homebrew-tap)                       | [docs/cli](docs/cli.md)               |\n| CLI (Cargo)        | `cargo install seedfaker`                    | [crates.io/crates/seedfaker](https://crates.io/crates/seedfaker)                                         | [docs/cli](docs/cli.md)               |\n\nAll packages wrap the same Rust core. 
API surface is intentionally identical across languages except for idiomatic naming.\n\n## Documentation\n\nReference: [docs/](docs/).\n\n|                  |                                                                                                           |\n| ---------------- | --------------------------------------------------------------------------------------------------------- |\n| **Start here**   | [Quick start](docs/quick-start.md)                                                                        |\n| **CLI**          | [Commands and flags](docs/cli.md) · [Determinism](docs/determinism.md)                                    |\n| **Fields**       | [Syntax and modifiers](docs/fields.md) · [Field reference (200+)](docs/field-reference.md)                |\n| **Configs**      | [YAML configs](docs/configs.md) · [Multi-table](docs/multi-table.md) · [Expressions](docs/expressions.md) |\n| **Output**       | [Templates](docs/templates.md) · [Annotated](docs/annotated.md) · [Streaming](docs/streaming.md)          |\n| **Data quality** | [Context](docs/context.md) · [Corruption](docs/corruption.md) · [Replace](docs/replace.md)                |\n| **Presets**      | [Built-in presets](docs/presets.md) (nginx, payment, auth, postgres, syslog, medical, …)                  |\n| **Integrations** | [Library API](docs/library.md) · [MCP](docs/mcp.md)                                                       |\n\nWorkflows: [guides/](guides/). 
Runnable examples: [examples/](examples/).\n\n## Quick start\n\n```bash\npip install seedfaker\npython -c 'from seedfaker import SeedFaker; print(SeedFaker(seed=\"demo\").record([\"name\",\"email\"]))'\n```\n\nOr with the CLI:\n\n```bash\nbrew install opendsr-std/tap/seedfaker\nseedfaker name email phone --seed demo --until 2025 -n 5\n```\n\nThen: [docs/quick-start](docs/quick-start.md) for the 10-minute walkthrough, [docs/cli](docs/cli.md) for flags, [docs/fields](docs/fields.md) for field syntax.\n\n## Guides\n\nEnd-to-end workflows in [guides/](guides/):\n\n|                                                             |                                                                          |\n| ----------------------------------------------------------- | ------------------------------------------------------------------------ |\n| [Seed a database](guides/seed-database.md)                  | Postgres/MySQL staging DB with multi-table FK                            |\n| [Seed a large database](guides/seed-large-database.md)      | GB/TB bulk load — parallel COPY, UNLOGGED, tuning                        |\n| [Distributed generation](guides/distributed-generation.md)  | Multi-host sharded generation without coordination                       |\n| [Anonymise production data](guides/anonymize-data.md)       | `replace` on CSV/JSONL, FK integrity across files                        |\n| [Training and evaluation datasets](guides/training-data.md) | NER/PII, LLM fine-tuning, eval with ground truth, red-team, multilingual |\n| [Reproducible datasets](guides/reproducible-datasets.md)    | Deterministic fixtures, CI, fingerprint guard                            |\n| [Library usage](guides/library-usage.md)                    | Python / Node.js SDK patterns                                            |\n| [Mock API server](guides/mock-api-server.md)                | Express / FastAPI mock endpoint                                          |\n| [API load 
testing](guides/api-load-testing.md)              | Rate-limited streaming, corruption                                       |\n| [MCP for AI agents](guides/mcp-ai-agents.md)                | Claude / Cursor / VS Code integration                                    |\n\n## Benchmarks\n\nReproducible throughput measurements, install scripts, per-field breakdowns, and an end-to-end Postgres load benchmark (`payments_5gb.sh`): [benchmarks/](benchmarks/).\n\n## License\n\nMIT\n\n---\n\n\u003e [README](README.md) · [Docs](docs/) · [Guides](guides/) · [Packages](packages/)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fopendsr-std%2Fseedfaker","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fopendsr-std%2Fseedfaker","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fopendsr-std%2Fseedfaker/lists"}