{"id":42587012,"url":"https://github.com/ey-asu-rnd/syntheticdata","last_synced_at":"2026-03-07T00:01:23.747Z","repository":{"id":333696374,"uuid":"1134528871","full_name":"ey-asu-rnd/SyntheticData","owner":"ey-asu-rnd","description":"A high-performance, configurable synthetic data generator for complete enterprise simulation. Produces realistic, interconnected General Ledger Journal Entries, Chart of Accounts, SAP HANA-compatible ACDOCA event logs, document flows, subledger records, and ML-ready graph exports at scale (10K to 100M+ transactions).","archived":false,"fork":false,"pushed_at":"2026-02-27T20:47:34.000Z","size":18384,"stargazers_count":2,"open_issues_count":0,"forks_count":1,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-02-28T00:53:10.505Z","etag":null,"topics":["audit","coso","data-factory","digital-twin","enterprise-simulation","erp-data","esg","gaap","general-ledger","ifrs","ml-training","neo4j","process-mining","python","rust","sap","sox","synthetic-data","tax","teaching-tool"],"latest_commit_sha":null,"homepage":"https://ey-asu-rnd.github.io/SyntheticData/","language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ey-asu-rnd.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":"NOTICE","maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-01-14T21:00:20.000Z","updated_at":"2026-02-27T20:41:41.000Z","dependencies_parsed_at":null,"dependency_job_id":"683b7786-e844-4ed1-a162-2708f475a72f","html_url":"https://github.co
m/ey-asu-rnd/SyntheticData","commit_stats":null,"previous_names":["ey-asu-rnd/syntheticdata"],"tags_count":22,"template":false,"template_full_name":null,"purl":"pkg:github/ey-asu-rnd/SyntheticData","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ey-asu-rnd%2FSyntheticData","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ey-asu-rnd%2FSyntheticData/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ey-asu-rnd%2FSyntheticData/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ey-asu-rnd%2FSyntheticData/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ey-asu-rnd","download_url":"https://codeload.github.com/ey-asu-rnd/SyntheticData/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ey-asu-rnd%2FSyntheticData/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":30204109,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-03-06T19:07:06.838Z","status":"ssl_error","status_checked_at":"2026-03-06T18:57:34.882Z","response_time":250,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while 
reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["audit","coso","data-factory","digital-twin","enterprise-simulation","erp-data","esg","gaap","general-ledger","ifrs","ml-training","neo4j","process-mining","python","rust","sap","sox","synthetic-data","tax","teaching-tool"],"created_at":"2026-01-28T23:02:46.200Z","updated_at":"2026-03-07T00:01:23.739Z","avatar_url":"https://github.com/ey-asu-rnd.png","language":"Rust","funding_links":[],"categories":[],"sub_categories":[],"readme":"# DataSynth\n\n[![License](https://img.shields.io/badge/license-Apache%202.0-green.svg)](LICENSE)\n[![Rust](https://img.shields.io/badge/rust-1.88%2B-orange.svg)](https://www.rust-lang.org)\n[![CI](https://github.com/ey-asu-rnd/SyntheticData/actions/workflows/ci.yml/badge.svg)](https://github.com/ey-asu-rnd/SyntheticData/actions/workflows/ci.yml)\n\n**High-performance synthetic enterprise data generation for ML, audit analytics, and system testing.**\n\nDataSynth generates statistically realistic, fully interconnected enterprise financial data at scale. 
It produces coherent General Ledger journal entries, document flows, subledger records, banking transactions, process mining event logs, and graph exports — covering 20+ enterprise process families from Procure-to-Pay through ESG reporting.\n\nAll generated data respects accounting identities (debits = credits, Assets = Liabilities + Equity), follows empirical distributions (Benford's Law, log-normal mixtures), and maintains referential integrity across 100+ output tables.\n\n**Developed by [Ernst \u0026 Young Ltd.](https://www.ey.com/ch), Zurich, Switzerland**\n\n---\n\n## Table of Contents\n\n- [Quick Start](#quick-start)\n- [Key Capabilities](#key-capabilities)\n- [Architecture](#architecture)\n- [Installation](#installation)\n- [Configuration](#configuration)\n- [Output Structure](#output-structure)\n- [Python SDK](#python-sdk)\n- [Server \u0026 Deployment](#server--deployment)\n- [Desktop UI](#desktop-ui)\n- [Privacy-Preserving Fingerprinting](#privacy-preserving-fingerprinting)\n- [Use Cases](#use-cases)\n- [Performance](#performance)\n- [Documentation](#documentation)\n- [License](#license)\n\n---\n\n## Quick Start\n\n```bash\n# Build from source\ngit clone https://github.com/ey-asu-rnd/SyntheticData.git\ncd SyntheticData\ncargo build --release\n\n# Demo mode — generates a complete dataset with defaults\n./target/release/datasynth-data generate --demo --output ./demo-output\n\n# Or configure for your use case\n./target/release/datasynth-data init --industry manufacturing --complexity medium -o config.yaml\n./target/release/datasynth-data validate --config config.yaml\n./target/release/datasynth-data generate --config config.yaml --output ./output\n```\n\n---\n\n## Key Capabilities\n\n### Statistical Foundations\n\nDataSynth models real-world financial data characteristics from the ground up:\n\n- **Distribution engine** — Log-normal mixtures, Gaussian mixtures, Pareto, Weibull, Beta, and zero-inflated distributions with configurable components\n- **Copula 
correlations** — Cross-field dependency modeling via Gaussian, Clayton, Gumbel, Frank, and Student-t copulas\n- **Benford's Law** — First and second-digit compliance with configurable deviation for anomaly injection\n- **Temporal patterns** — Month-end/quarter-end/year-end volume spikes, intraday segments, business day calendars (15 regions), processing lags, and fiscal calendar support\n- **Regime changes** — Economic cycles, acquisition effects, and structural breaks in time series\n- **Industry profiles** — Pre-configured distributions for Retail, Manufacturing, Financial Services, Healthcare, and Technology\n\n### Enterprise Process Simulation\n\nEvery process chain generates its own master data, documents, and journal entries — all cross-referenced:\n\n| Process Family | Scope |\n|----------------|-------|\n| **General Ledger** | Journal entries, chart of accounts (small/medium/large), ACDOCA event logs |\n| **Procure-to-Pay** | Purchase requisitions, POs, goods receipts, vendor invoices, payments, three-way match |\n| **Order-to-Cash** | Sales orders, deliveries, customer invoices, receipts, dunning |\n| **Source-to-Contract** | Spend analysis, sourcing projects, supplier qualification, RFx, bids, contracts, scorecards |\n| **Hire-to-Retire** | Payroll runs, tax/deduction calculations, time \u0026 attendance, expense reports, benefit enrollment |\n| **Manufacturing** | Production orders, BOM explosion, routing operations, WIP costing, quality inspections, cycle counts |\n| **Financial Reporting** | Balance sheet, income statement, cash flow, changes in equity, KPIs, budget variance |\n| **Tax Accounting** | Multi-jurisdiction tax (Federal/State/Local), VAT/GST returns, ASC 740/IAS 12 provisions, FIN 48 uncertain positions, withholding |\n| **Treasury** | Cash positioning, probability-weighted forecasts, cash pooling, hedging (ASC 815/IFRS 9), debt covenants, netting |\n| **Project Accounting** | WBS hierarchies, cost lines, percentage-of-completion revenue, 
earned value (SPI/CPI/EAC), change orders |\n| **ESG / Sustainability** | GHG Scope 1/2/3 emissions, energy/water/waste, workforce diversity, safety metrics, GRI/SASB/TCFD disclosures |\n| **Intercompany** | IC matching, transfer pricing, consolidation eliminations, currency translation |\n| **Subledgers** | AR, AP, Fixed Assets, Inventory — each with GL reconciliation |\n| **Period Close** | Monthly close engine, depreciation runs, accruals, year-end closing entries |\n| **Banking / KYC / AML** | Customer personas, KYC profiles, AML typologies (structuring, layering, mule, funnel) |\n| **Sales** | Quote-to-order pipeline with win rate modeling and pricing negotiation |\n| **Bank Reconciliation** | Statement matching, outstanding checks, deposits in transit |\n| **Audit** | ISA-compliant engagements, workpapers, evidence, risk assessments, findings |\n\n### Accounting \u0026 Audit Standards\n\n- **Accounting frameworks** — US GAAP, IFRS, French GAAP (PCG), German GAAP (HGB/SKR04), and dual reporting\n- **Revenue recognition** — ASC 606 / IFRS 15 with contract generation, performance obligations, and SSP allocation\n- **Leases** — ASC 842 / IFRS 16 with ROU assets, lease liabilities, and classification\n- **Fair value** — ASC 820 / IFRS 13 Level 1/2/3 hierarchy\n- **Impairment** — ASC 360 / IAS 36 testing with fair value estimation\n- **Audit standards** — ISA (34 standards), PCAOB (19+ standards) with procedure mapping\n- **SOX compliance** — Section 302/404 assessments with deficiency classification and material weakness detection\n- **COSO 2013** — 5 components, 17 principles, maturity levels, entity-level and transaction-level controls\n- **Localized exports** — FEC (French) and GoBD (German) audit file formats\n\n### Interconnectivity \u0026 Relationships\n\n- **Multi-tier vendor networks** — Tier 1/2/3 supply chain with behavioral clusters (Strategic, Operational, Transactional, Problematic)\n- **Customer segmentation** — Enterprise/MidMarket/SMB/Consumer with 
Pareto-like revenue distribution and lifecycle stages\n- **Relationship strength** — Composite scoring from volume, count, duration, recency, and mutual connections\n- **Cross-process links** — P2P and O2C linked via inventory; payments linked to bank reconciliation\n- **Entity graphs** — 16 entity types, 26 relationship types with connectivity and clustering metrics\n\n### Fraud, Anomalies \u0026 Data Quality\n\n- **ACFE-aligned fraud taxonomy** — Asset misappropriation, corruption, and financial statement fraud with calibrated rates\n- **60+ anomaly types** — Fraud, errors, process issues, statistical outliers, and relational anomalies\n- **Collusion modeling** — 9 ring types with role-based conspirators, defection, and escalation dynamics\n- **Management override** — Senior-level fraud patterns with fraud triangle modeling\n- **Red flag generation** — 40+ probabilistic fraud indicators with Bayesian calibration\n- **Industry-specific patterns** — Manufacturing yield manipulation, retail sweethearting, healthcare upcoding\n- **Data quality variations** — Missing values (MCAR/MAR/MNAR), format variations, typos (keyboard-aware, OCR), duplicates, encoding issues\n- **Full labeling** — Every injected anomaly and quality issue is labeled for supervised ML training\n\n### Process \u0026 Behavioral Drift\n\n- **Organizational events** — Acquisitions, divestitures, mergers, reorganizations with volume multipliers\n- **Process evolution** — S-curve automation rollout, workflow changes, policy updates\n- **Technology transitions** — ERP migrations with phased rollout (parallel run, cutover, stabilization)\n- **Market drift** — Economic cycles, commodity price shocks, recession modeling\n- **Labeled drift events** — Ground truth labels with magnitude and detection difficulty for ML training\n\n### Machine Learning \u0026 Graph Export\n\n- **Graph formats** — PyTorch Geometric (.pt), Neo4j (CSV + Cypher), DGL, RustGraph JSON\n- **Multi-layer hypergraph** — 3-layer 
(Governance, Process Events, Accounting Network) with OCPM events as hyperedges\n- **Train/val/test splits** — Configurable data partitioning for ML pipelines\n- **Anomaly labels** — Fraud labels, quality issue labels, and drift labels in standardized format\n- **Counterfactual pairs** — (original, mutated) journal entry pairs for causal ML training\n\n### Process Mining\n\n- **OCEL 2.0** — Object-centric event logs in JSON/XML format\n- **XES 2.0** — XML export compatible with ProM, Celonis, Disco, and pm4py\n- **101+ activity types** across 12 process families with 65+ object types\n- **10 OCPM generators** — S2C, H2R, MFG, BANK, AUDIT, Bank Recon, Tax, Treasury, Project Accounting, ESG\n- **Process variants** — Happy path (75%), exception path (20%), error path (5%)\n\n### Advanced Generation\n\n| Capability | Description |\n|------------|-------------|\n| **LLM enrichment** | Pluggable `LlmProvider` trait (mock/OpenAI-compatible) for vendor names, descriptions, and anomaly explanations |\n| **Diffusion models** | Statistical diffusion with Langevin reverse process; linear/cosine/sigmoid schedules; hybrid blending |\n| **Causal models** | Structural causal models with do-calculus interventions and counterfactual abduction-action-prediction |\n| **Natural language config** | Generate YAML configurations from plain English descriptions |\n| **Scenario engine** | Built-in fraud packs: revenue_fraud, payroll_ghost, vendor_kickback, management_override, comprehensive |\n| **Counterfactual simulation** | 8 intervention types with causal DAG propagation and diff analysis |\n\n### Production Features\n\n- **REST / gRPC / WebSocket APIs** with streaming generation and backpressure handling\n- **Authentication** — API key (Argon2id), JWT/OIDC (RS256), role-based access control (Admin/Operator/Viewer)\n- **Quality gates** — Configurable pass/fail thresholds (strict/default/lenient) with 8 metrics\n- **Plugin SDK** — `GeneratorPlugin`, `SinkPlugin`, `TransformPlugin` traits 
with thread-safe registry\n- **Resource guards** — Memory, disk, and CPU monitoring with graceful degradation (Normal → Reduced → Minimal → Emergency)\n- **Deterministic generation** — Seeded ChaCha8 RNG for fully reproducible output\n- **Streaming output** — Async generation with configurable backpressure (block/drop_oldest/drop_newest/buffer)\n- **Data lineage** — Per-file checksums, lineage graph, W3C PROV-JSON export\n- **Country packs** — Pluggable JSON country configuration (US/DE/GB built-in) with holidays, names, tax, addresses\n- **Observability** — OpenTelemetry traces, Prometheus metrics, structured JSON logging\n- **Docker \u0026 Kubernetes** — Multi-stage distroless containers, Helm chart with HPA/PDB, Prometheus ServiceMonitor\n- **CI/CD** — 7-job GitHub Actions pipeline (fmt, clippy, cross-platform test, MSRV, security, coverage, benchmarks)\n- **EU AI Act** — Article 50 synthetic content marking and Article 10 data governance reports\n- **Fuzzing** — cargo-fuzz targets for config parsing, fingerprint loading, and validation\n- **Panic-free** — `#![deny(clippy::unwrap_used)]` enforced across all library crates\n\n### Ecosystem Integrations\n\n| Integration | Capability |\n|-------------|------------|\n| **Apache Airflow** | `DataSynthOperator`, `DataSynthSensor`, `DataSynthValidateOperator` for DAG orchestration |\n| **dbt** | Source YAML generation, seed export, project scaffolding |\n| **MLflow** | Generation runs as experiments with parameter, metric, and artifact logging |\n| **Apache Spark** | DataFrames with schema inference and temp view registration |\n\n---\n\n## Architecture\n\nDataSynth is a Rust workspace organized into 15 modular crates:\n\n```\ndatasynth-cli            CLI binary (generate, validate, init, info, fingerprint, scenario)\ndatasynth-server         REST / gRPC / WebSocket server with auth and rate limiting\ndatasynth-ui             Tauri + SvelteKit desktop application\n                │\ndatasynth-runtime        Generation 
orchestrator (parallel execution, resource guards, streaming)\n                │\ndatasynth-generators     50+ data generators across all process families\ndatasynth-banking        KYC / AML banking transaction generator\ndatasynth-ocpm           OCEL 2.0 / XES 2.0 process mining\ndatasynth-fingerprint    Privacy-preserving fingerprint extraction and synthesis\ndatasynth-standards      Accounting and audit standards (IFRS, US GAAP, ISA, SOX, PCAOB)\n                │\ndatasynth-graph          Graph export (PyTorch Geometric, Neo4j, DGL, RustGraph, Hypergraph)\ndatasynth-eval           Statistical evaluation, quality gates, auto-tuning\n                │\ndatasynth-config         Configuration schema, validation, industry presets\n                │\ndatasynth-core           Domain models, traits, distributions, resource guards\n                │\ndatasynth-output         Output sinks (CSV, JSON, NDJSON, Parquet + Zstd) with streaming\ndatasynth-test-utils     Test utilities, fixtures, mocks\n```\n\n---\n\n## Installation\n\n### From Source\n\n```bash\ngit clone https://github.com/ey-asu-rnd/SyntheticData.git\ncd SyntheticData\ncargo build --release\n```\n\nThe binary is available at `target/release/datasynth-data`.\n\n### Requirements\n\n- **Rust 1.88+**\n- **Desktop UI**: Node.js 18+ and platform-specific [Tauri prerequisites](https://tauri.app/start/prerequisites/)\n\n---\n\n## Configuration\n\nDataSynth uses YAML configuration with 30+ top-level sections. 
Generate a starter config with `init`:\n\n```bash\ndatasynth-data init --industry retail --complexity medium -o config.yaml\n```\n\n**Minimal configuration:**\n\n```yaml\nglobal:\n  seed: 42\n  industry: manufacturing\n  start_date: 2024-01-01\n  period_months: 12\n  group_currency: USD\n\ncompanies:\n  - code: \"1000\"\n    name: \"Headquarters\"\n    currency: USD\n    country: US\n\ntransactions:\n  target_count: 100000\n\noutput:\n  format: csv               # csv, json, parquet\n```\n\n**Enable specific modules by adding their sections:**\n\n```yaml\n# Fraud detection training data\nfraud:\n  enabled: true\n  fraud_rate: 0.005\nanomaly_injection:\n  enabled: true\n  total_rate: 0.02\n  generate_labels: true\n\n# Graph export for GNN training\ngraph_export:\n  enabled: true\n  formats: [pytorch_geometric, neo4j]\n\n# Statistical realism\ndistributions:\n  enabled: true\n  industry_profile: retail\n  amounts:\n    distribution_type: lognormal\n    benford_compliance: true\n  correlations:\n    enabled: true\n    copula_type: gaussian\n\n# Enterprise process chains\ndocument_flows:\n  enabled: true\nsource_to_pay:\n  enabled: true\nhr:\n  enabled: true\nmanufacturing:\n  enabled: true\nfinancial_reporting:\n  enabled: true\nesg:\n  enabled: true\n\n# Accounting standards\naccounting_standards:\n  enabled: true\n  framework: us_gaap         # us_gaap, ifrs, french_gaap, german_gaap, dual_reporting\n\n# Process mining\nocpm:\n  enabled: true\n  output:\n    ocel_json: true\n    xes: true\n```\n\n**Industry presets** (manufacturing, retail, financial_services, healthcare, technology) and **complexity levels** (small ~100 accounts, medium ~400, large ~2500) provide sensible defaults.\n\nSee the [Configuration Guide](docs/configuration.md) for the complete reference.\n\n---\n\n## Output Structure\n\nDataSynth generates 100+ interconnected output tables organized by domain:\n\n```\noutput/\n├── master_data/            Vendors, customers, materials, fixed assets, 
employees\n├── transactions/           Journal entries, ACDOCA, purchase orders, invoices, payments\n├── sourcing/               S2C pipeline (projects, RFx, bids, contracts, scorecards)\n├── subledgers/             AR, AP, Fixed Assets, Inventory detail records\n├── hr/                     Payroll runs, payslips, time entries, expense reports\n├── manufacturing/          Production orders, routing, quality inspections, cycle counts\n├── period_close/           Trial balances, accruals, depreciation, closing entries\n├── financial_reporting/    Balance sheet, income statement, cash flow, KPIs, budgets\n├── sales/                  Sales quotes and line items\n├── consolidation/          IC eliminations, currency translation\n├── fx/                     Exchange rates, CTA adjustments\n├── banking/                KYC profiles, bank transactions, reconciliation, AML labels\n├── process_mining/         OCEL 2.0 JSON, XES 2.0, process variants, reference models\n├── audit/                  Engagements, workpapers, evidence, risks, findings\n├── graphs/                 PyTorch Geometric, Neo4j, DGL, RustGraph, hypergraph\n├── labels/                 Anomaly, fraud, quality, and drift labels for ML\n├── tax/                    Jurisdictions, codes, returns, provisions, withholding\n├── treasury/               Cash positions, forecasts, hedging, debt, netting\n├── project_accounting/     Projects, WBS, costs, revenue, earned value, change orders\n├── esg/                    Emissions, energy, diversity, safety, disclosures\n├── controls/               Internal controls, COSO mappings, SoD rules\n└── standards/              Accounting contracts/leases/impairment, audit ISA/SOX\n```\n\n---\n\n## Python SDK\n\n```bash\ncd python \u0026\u0026 pip install -e \".[all]\"\n```\n\n```python\nfrom datasynth_py import DataSynth\nfrom datasynth_py import to_pandas, to_polars, list_tables\nfrom datasynth_py.config import blueprints\n\n# Generate with a preset blueprint\nconfig = 
blueprints.retail_small(companies=4, transactions=10000)\nresult = DataSynth().generate(config=config, output={\"format\": \"csv\", \"sink\": \"temp_dir\"})\n\n# Load as DataFrames\ntables = list_tables(result)                  # ['journal_entries', 'vendors', ...]\ndf = to_pandas(result, \"journal_entries\")\npl_df = to_polars(result, \"vendors\")\n\n# Async generation (async with must run inside a coroutine)\nimport asyncio\nfrom datasynth_py import AsyncDataSynth\n\nasync def main():\n    async with AsyncDataSynth() as synth:\n        return await synth.generate(config=config)\n\nresult = asyncio.run(main())\n\n# Fingerprint operations\nsynth = DataSynth()\nsynth.fingerprint.extract(\"./real_data/\", \"./fingerprint.dsf\", privacy_level=\"standard\")\nreport = synth.fingerprint.evaluate(\"./fingerprint.dsf\", \"./synthetic/\")\n```\n\n**Available blueprints:** `retail_small()`, `banking_medium()`, `manufacturing_large()`, `ml_training()`, `statistical_validation()`, `with_distributions()`, `with_llm_enrichment()`, `with_diffusion()`, `with_causal()`\n\n**Optional dependencies:** `[pandas]`, `[polars]`, `[jupyter]`, `[streaming]`, `[airflow]`, `[dbt]`, `[mlflow]`, `[spark]`, `[all]`\n\n---\n\n## Server \u0026 Deployment\n\n```bash\n# Start REST + gRPC server\ncargo run -p datasynth-server -- --rest-port 3000 --grpc-port 50051\n\n# With authentication\ncargo run -p datasynth-server -- --api-keys \"key1,key2\"\n\n# With JWT/OIDC (Keycloak, Auth0, Entra ID)\ncargo run -p datasynth-server --features jwt -- \\\n  --jwt-issuer \"https://auth.example.com\" \\\n  --jwt-audience \"datasynth-api\"\n```\n\n**API endpoints:**\n\n```bash\ncurl http://localhost:3000/health\ncurl http://localhost:3000/ready\ncurl http://localhost:3000/metrics\ncurl -H \"Authorization: Bearer \u003ckey\u003e\" -X POST http://localhost:3000/api/stream/start\n```\n\nWebSocket streaming: `ws://localhost:3000/ws/events`\n\n**Docker:**\n\n```bash\ndocker build -t datasynth:latest .\ndocker run -p 3000:3000 -p 50051:50051 datasynth:latest\n\n# Full stack with Prometheus + Grafana\ndocker compose up -d\n```\n\nSee 
the [Deployment Guide](deploy/README.md) for Docker, Kubernetes Helm chart, systemd, and reverse proxy configuration.\n\n---\n\n## Desktop UI\n\n```bash\ncd crates/datasynth-ui\nnpm install\nnpm run tauri dev\n```\n\nCross-platform Tauri + SvelteKit application with 40+ configuration pages, real-time streaming visualization, and preset management.\n\n---\n\n## Privacy-Preserving Fingerprinting\n\nExtract statistical fingerprints from real data with formal privacy guarantees, then generate matching synthetic data:\n\n```bash\n# Extract with differential privacy\ndatasynth-data fingerprint extract --input ./real_data.csv --output ./fp.dsf --privacy-level standard\n\n# Validate and evaluate\ndatasynth-data fingerprint validate ./fp.dsf\ndatasynth-data fingerprint evaluate --fingerprint ./fp.dsf --synthetic ./synthetic/\n```\n\n| Privacy Level | Epsilon (ε) | k-Anonymity | Description |\n|---------------|-------------|-------------|-------------|\n| minimal       | 5.0         | 3           | Higher utility, lower privacy |\n| standard      | 1.0         | 5           | Balanced (default) |\n| high          | 0.5         | 10          | Higher privacy |\n| maximum       | 0.1         | 20          | Maximum privacy |\n\nFeatures include Rényi DP and zCDP composition accounting, privacy budget management, federated fingerprinting for distributed data, membership inference attack testing, and cryptographic synthetic data certificates (HMAC-SHA256).\n\n---\n\n## Use Cases\n\n| Domain | Application |\n|--------|-------------|\n| **Fraud Detection** | Train supervised models with ACFE-aligned labeled fraud patterns and collusion networks |\n| **Graph Neural Networks** | Entity relationship graphs with typed edges for anomaly detection |\n| **AML / KYC Testing** | Banking transactions with structuring, layering, and mule typologies |\n| **Audit Analytics** | Validate audit procedures with known control exceptions and ISA/PCAOB mappings |\n| **Process Mining** | OCEL 2.0 and 
XES 2.0 event logs for process discovery and conformance checking |\n| **ERP Load Testing** | Realistic transaction volumes with proper document chains |\n| **SOX Compliance** | Internal control monitoring with COSO 2013 mappings and deficiency classification |\n| **Causal ML Research** | Interventional and counterfactual datasets with causal DAG propagation |\n| **Data Quality ML** | Train models to detect missing values, format variations, typos, and duplicates |\n| **ESG Reporting** | GHG emissions, diversity metrics, and GRI/SASB/TCFD disclosure data |\n| **Tax Compliance** | Multi-jurisdiction tax returns, provisions, and withholding records |\n| **Treasury Operations** | Cash positioning, hedging effectiveness, and debt covenant monitoring |\n\n---\n\n## Performance\n\n| Metric | Value |\n|--------|-------|\n| Single-threaded throughput | 200,000+ journal entries/second |\n| Parallel scaling | Linear with available CPU cores |\n| Memory model | Streaming generation with configurable backpressure |\n| Determinism | Fully reproducible via seeded ChaCha8 RNG |\n\n---\n\n## Documentation\n\n- [Configuration Guide](docs/configuration.md)\n- [API Reference](docs/api.md)\n- [Architecture Overview](docs/architecture.md)\n- [Python SDK Guide](docs/src/user-guide/python-wrapper.md)\n- [Deployment Guide](deploy/README.md)\n- [Fingerprinting Guide](docs/fingerprint/)\n- [Compliance \u0026 Regulatory](docs/src/compliance/README.md)\n- [Contributing](CONTRIBUTING.md)\n\n---\n\n## License\n\nCopyright 2024–2026 Michael Ivertowski, Ernst \u0026 Young Ltd., Zurich, Switzerland\n\nLicensed under the Apache License, Version 2.0. See [LICENSE](LICENSE) for details.\n\n---\n\n## Support\n\nCommercial support, custom development, and enterprise licensing are available. Contact [michael.ivertowski@ch.ey.com](mailto:michael.ivertowski@ch.ey.com).\n\n---\n\n*DataSynth is provided \"as is\" without warranty of any kind. It is intended for testing, development, and research purposes. 
Generated data should not be used as a substitute for real financial records.*\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fey-asu-rnd%2Fsyntheticdata","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fey-asu-rnd%2Fsyntheticdata","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fey-asu-rnd%2Fsyntheticdata/lists"}