{"id":40476512,"url":"https://github.com/williajm/forgery","last_synced_at":"2026-01-20T18:24:55.573Z","repository":{"id":330768417,"uuid":"1123353380","full_name":"williajm/forgery","owner":"williajm","description":"Rust-powered fake data generator for Python. ","archived":false,"fork":false,"pushed_at":"2026-01-19T21:43:24.000Z","size":304,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2026-01-20T04:11:45.092Z","etag":null,"topics":["test-data-generator"],"latest_commit_sha":null,"homepage":"","language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/williajm.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-12-26T17:37:04.000Z","updated_at":"2026-01-19T21:43:25.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/williajm/forgery","commit_stats":null,"previous_names":["williajm/forgery"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/williajm/forgery","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/williajm%2Fforgery","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/williajm%2Fforgery/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/williajm%2Fforgery/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/williajm%2Fforgery/manifests","owner_url":"https://
repos.ecosyste.ms/api/v1/hosts/GitHub/owners/williajm","download_url":"https://codeload.github.com/williajm/forgery/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/williajm%2Fforgery/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28609062,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-20T16:10:39.856Z","status":"ssl_error","status_checked_at":"2026-01-20T16:10:39.493Z","response_time":117,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["test-data-generator"],"created_at":"2026-01-20T18:24:55.379Z","updated_at":"2026-01-20T18:24:55.503Z","avatar_url":"https://github.com/williajm.png","language":"Rust","readme":"# forgery\n\n[![CI](https://github.com/williajm/forgery/actions/workflows/ci.yml/badge.svg)](https://github.com/williajm/forgery/actions/workflows/ci.yml)\n[![codecov](https://codecov.io/gh/williajm/forgery/branch/main/graph/badge.svg)](https://codecov.io/gh/williajm/forgery)\n[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)\n[![Python 3.11+](https://img.shields.io/badge/python-3.11+-blue.svg)](https://www.python.org/downloads/)\n[![Ruff](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/ruff/main/assets/badge/v2.json)](https://github.com/astral-sh/ruff)\n\n**Fake data at the speed of 
Rust.**\n\nA high-performance fake data generation library for Python, powered by Rust. Designed to be 50-100x faster than Faker for batch operations.\n\n## Installation\n\n```bash\n# Clone and install from source\ngit clone https://github.com/williajm/forgery.git\ncd forgery\npip install maturin\nmaturin develop --release\n```\n\n## Quick Start\n\n```python\nfrom forgery import fake\n\n# Generate 10,000 names in one fast call\nnames = fake.names(10_000)\n\n# Single values work too\nemail = fake.email()\nname = fake.name()\n\n# Deterministic output with seeding\nfake.seed(42)\ndata1 = fake.names(100)\nfake.seed(42)\ndata2 = fake.names(100)\nassert data1 == data2\n```\n\n## Features\n\n- **Batch-first design**: Generate thousands of values in a single call\n- **50-100x faster** than Faker for batch operations\n- **Multi-locale support**: 7 locales with locale-specific data\n- **Deterministic seeding**: Reproducible output for testing\n- **Type hints**: Full type stub support for IDE autocompletion\n- **Familiar API**: Method names match Faker for easy migration\n\n## Locale Support\n\nforgery supports 7 locales with locale-specific names, addresses, phone numbers, and more:\n\n| Locale | Language | Country |\n|--------|----------|---------|\n| `en_US` | English | United States (default) |\n| `en_GB` | English | United Kingdom |\n| `de_DE` | German | Germany |\n| `fr_FR` | French | France |\n| `es_ES` | Spanish | Spain |\n| `it_IT` | Italian | Italy |\n| `ja_JP` | Japanese | Japan |\n\n```python\nfrom forgery import Faker\n\n# Default locale is en_US\nfake = Faker()\nfake.names(5)  # American names\n\n# Use a different locale\ngerman = Faker(\"de_DE\")\ngerman.names(5)  # German names\n\njapanese = Faker(\"ja_JP\")\njapanese.addresses(3)  # Japanese addresses with prefecture\n```\n\nEach locale provides:\n- **Names**: First names, last names, and full names in the local language\n- **Addresses**: Cities, regions/states, postal codes in the correct format\n- **Phone 
numbers**: Country-specific formats and country codes\n- **Companies**: Local company names and job titles\n- **Colors**: Color names in the local language\n\n## API\n\n### Module-level functions (use default instance)\n\n```python\nfrom forgery import seed, names, emails, integers, uuids\nfrom forgery import name, email, integer, uuid  # single-value functions used below\n\nseed(42)  # Seed for reproducibility\n\n# Batch generation (fast path)\nnames(1000)           # list[str] of full names\nemails(1000)          # list[str] of email addresses\nintegers(1000, 0, 100)  # list[int] in range\nuuids(1000)           # list[str] of UUIDv4\n\n# Single values\nname()                # str\nemail()               # str\ninteger(0, 100)       # int\nuuid()                # str\n```\n\n### Faker class (independent instances)\n\n```python\nfrom forgery import Faker\n\n# Each instance has its own RNG state\nfake1 = Faker()\nfake2 = Faker()\n\nfake1.seed(42)\nfake2.seed(99)\n\n# Generate independently\nfake1.names(100)\nfake2.emails(100)\n```\n\n## Available Generators\n\n### Names \u0026 Identity\n\n| Batch | Single | Description |\n|-------|--------|-------------|\n| `names(n)` | `name()` | Full names (first + last) |\n| `first_names(n)` | `first_name()` | First names |\n| `last_names(n)` | `last_name()` | Last names |\n\n### Contact Information\n\n| Batch | Single | Description |\n|-------|--------|-------------|\n| `emails(n)` | `email()` | Email addresses |\n| `safe_emails(n)` | `safe_email()` | Safe domain emails (@example.com, etc.) |\n| `free_emails(n)` | `free_email()` | Free provider emails (@gmail.com, etc.) 
|\n| `phone_numbers(n)` | `phone_number()` | Phone numbers in (XXX) XXX-XXXX format |\n\n### Numbers \u0026 Identifiers\n\n| Batch | Single | Description |\n|-------|--------|-------------|\n| `integers(n, min, max)` | `integer(min, max)` | Random integers in range |\n| `floats(n, min, max)` | `float_(min, max)` | Random floats in range (Note: `float_` avoids shadowing Python's `float` builtin) |\n| `uuids(n)` | `uuid()` | UUID v4 strings |\n| `md5s(n)` | `md5()` | Random 32-char hex strings (MD5-like format, not cryptographic hashes) |\n| `sha256s(n)` | `sha256()` | Random 64-char hex strings (SHA256-like format, not cryptographic hashes) |\n\n### Dates \u0026 Times\n\n| Batch | Single | Description |\n|-------|--------|-------------|\n| `dates(n, start, end)` | `date(start, end)` | Random dates (YYYY-MM-DD) |\n| `datetimes(n, start, end)` | `datetime_(start, end)` | Random datetimes (ISO 8601). Note: `datetime_` avoids shadowing Python's `datetime` module |\n| `dates_of_birth(n, min_age, max_age)` | `date_of_birth(min_age, max_age)` | Birth dates for given age range |\n\n### Addresses\n\n| Batch | Single | Description |\n|-------|--------|-------------|\n| `street_addresses(n)` | `street_address()` | Street addresses (e.g., \"123 Main Street\") |\n| `cities(n)` | `city()` | City names |\n| `states(n)` | `state()` | State names |\n| `countries(n)` | `country()` | Country names |\n| `zip_codes(n)` | `zip_code()` | ZIP codes (5 or 9 digit) |\n| `addresses(n)` | `address()` | Full addresses |\n\n### Company \u0026 Business\n\n| Batch | Single | Description |\n|-------|--------|-------------|\n| `companies(n)` | `company()` | Company names |\n| `jobs(n)` | `job()` | Job titles |\n| `catch_phrases(n)` | `catch_phrase()` | Business catch phrases |\n\n### Network\n\n| Batch | Single | Description |\n|-------|--------|-------------|\n| `urls(n)` | `url()` | URLs with https:// |\n| `domain_names(n)` | `domain_name()` | Domain names |\n| `ipv4s(n)` | `ipv4()` | IPv4 
addresses |\n| `ipv6s(n)` | `ipv6()` | IPv6 addresses |\n| `mac_addresses(n)` | `mac_address()` | MAC addresses |\n\n### Finance\n\n| Batch | Single | Description |\n|-------|--------|-------------|\n| `credit_cards(n)` | `credit_card()` | Credit card numbers (valid Luhn) |\n| `ibans(n)` | `iban()` | IBAN numbers (valid checksum) |\n| `bics(n)` | `bic()` | BIC/SWIFT codes (8 or 11 characters) |\n| `bank_accounts(n)` | `bank_account()` | Bank account numbers (8-17 digits) |\n| `bank_names(n)` | `bank_name()` | Bank names (locale-specific) |\n\n### UK Banking\n\n| Batch | Single | Description |\n|-------|--------|-------------|\n| `sort_codes(n)` | `sort_code()` | UK sort codes (XX-XX-XX format) |\n| `uk_account_numbers(n)` | `uk_account_number()` | UK account numbers (exactly 8 digits) |\n| `transaction_amounts(n, min, max)` | `transaction_amount(min, max)` | Transaction amounts (2 decimal places) |\n| `transactions(n, balance, start, end)` | - | Full transaction records with running balance |\n\n### Passwords\n\n| Batch | Single | Description |\n|-------|--------|-------------|\n| `passwords(n, ...)` | `password(...)` | Random passwords with configurable character sets |\n\nPassword options:\n- `length`: Password length (default: 12)\n- `uppercase`: Include uppercase letters (default: True)\n- `lowercase`: Include lowercase letters (default: True)\n- `digits`: Include digits (default: True)\n- `symbols`: Include symbols (default: True)\n\n### Text \u0026 Lorem Ipsum\n\n| Batch | Single | Description |\n|-------|--------|-------------|\n| `sentences(n, word_count)` | `sentence(word_count)` | Lorem ipsum sentences |\n| `paragraphs(n, sentence_count)` | `paragraph(sentence_count)` | Lorem ipsum paragraphs |\n| `texts(n, min_chars, max_chars)` | `text(min_chars, max_chars)` | Text blocks with length limits |\n\n### Colors\n\n| Batch | Single | Description |\n|-------|--------|-------------|\n| `colors(n)` | `color()` | Color names |\n| `hex_colors(n)` | `hex_color()` | 
Hex color codes (#RRGGBB) |\n| `rgb_colors(n)` | `rgb_color()` | RGB tuples (r, g, b) |\n\n## Unique Value Generation\n\nFor batch methods that select from finite lists (names, cities, countries, etc.), you can request unique values:\n\n```python\nfrom forgery import Faker\n\nfake = Faker()\nfake.seed(42)\n\n# Generate 50 unique names (no duplicates)\nunique_names = fake.names(50, unique=True)\nassert len(unique_names) == len(set(unique_names))\n\n# Generate 20 unique cities\nunique_cities = fake.cities(20, unique=True)\n\n# Generate 50 unique countries\nunique_countries = fake.countries(50, unique=True)\n```\n\n**Important Notes:**\n\n- Unique generation will raise `ValueError` if you request more unique values than are available in the underlying data set.\n- **Performance:** Unique generation uses O(n) memory (stores all outputs in a HashSet) and can be O(n × 100) time in worst case due to retry logic. For very large unique batches, consider whether duplicates are actually problematic for your use case.\n\n## Financial Transaction Generation\n\nGenerate realistic bank transaction data with running balances:\n\n```python\nfrom forgery import Faker\n\nfake = Faker()\nfake.seed(42)\n\n# Generate 50 transactions from Jan to Mar 2024, starting with £1000 balance\ntxns = fake.transactions(50, 1000.0, \"2024-01-01\", \"2024-03-31\")\n\nfor txn in txns[:3]:\n    print(f\"{txn['date']} | {txn['transaction_type']:15} | {txn['amount']:\u003e10.2f} | {txn['balance']:\u003e10.2f}\")\n# 2024-01-03 | Card Payment    |    -42.50 |     957.50\n# 2024-01-05 | Direct Debit    |   -125.00 |     832.50\n# 2024-01-08 | Faster Payment  |   1250.00 |    2082.50\n```\n\nEach transaction dict contains:\n- `reference`: 8-character alphanumeric reference\n- `date`: Transaction date (YYYY-MM-DD)\n- `amount`: Transaction amount (negative for debits)\n- `transaction_type`: e.g., \"Card Payment\", \"Direct Debit\", \"Salary\"\n- `description`: Merchant or payee name\n- `balance`: Running 
balance after transaction\n\n## Structured Data Generation\n\nGenerate entire datasets with a single call using schema definitions:\n\n### records()\n\nReturns a list of dictionaries:\n\n```python\nfrom forgery import records, seed\n\nseed(42)\ndata = records(1000, {\n    \"id\": \"uuid\",\n    \"name\": \"name\",\n    \"email\": \"email\",\n    \"age\": (\"int\", 18, 65),\n    \"salary\": (\"float\", 30000.0, 150000.0),\n    \"hire_date\": (\"date\", \"2020-01-01\", \"2024-12-31\"),\n    \"bio\": (\"text\", 50, 200),\n    \"status\": (\"choice\", [\"active\", \"inactive\", \"pending\"]),\n})\n\n# data[0] = {\"id\": \"88917925-...\", \"name\": \"Austin Bell\", \"age\": 50, ...}\n```\n\n### records_tuples()\n\nReturns a list of tuples (faster, values in alphabetical key order):\n\n```python\nfrom forgery import records_tuples, seed\n\nseed(42)\ndata = records_tuples(1000, {\n    \"age\": (\"int\", 18, 65),\n    \"name\": \"name\",\n})\n# data[0] = (50, \"Ryan Grant\")  # (age, name) - alphabetical order\n```\n\n### records_arrow()\n\nReturns a PyArrow RecordBatch for high-performance data processing:\n\n```python\nimport pyarrow as pa\nfrom forgery import records_arrow, seed\n\nseed(42)\nbatch = records_arrow(100_000, {\n    \"id\": \"uuid\",\n    \"name\": \"name\",\n    \"age\": (\"int\", 18, 65),\n    \"salary\": (\"float\", 30000.0, 150000.0),\n})\n\n# batch is a pyarrow.RecordBatch\nprint(batch.num_rows)     # 100000\nprint(batch.num_columns)  # 4\nprint(batch.schema)\n# age: int64 not null\n# id: string not null\n# name: string not null\n# salary: double not null\n\n# Convert to pandas DataFrame\ndf = batch.to_pandas()\n\n# Or to Polars DataFrame\nimport polars as pl\ndf_polars = pl.from_arrow(batch)\n```\n\n**Note:** Requires `pyarrow` to be installed: `pip install pyarrow`\n\nThe `records_arrow()` function generates data in columnar format, which is more efficient\nfor large batches and integrates seamlessly with the Arrow ecosystem (PyArrow, 
Polars,\npandas, DuckDB, etc.).\n\n### Schema Field Types\n\n| Type | Syntax | Example |\n|------|--------|---------|\n| Simple types | `\"type_name\"` | `\"name\"`, `\"email\"`, `\"uuid\"`, `\"int\"`, `\"float\"` |\n| Integer range | `(\"int\", min, max)` | `(\"int\", 18, 65)` |\n| Float range | `(\"float\", min, max)` | `(\"float\", 0.0, 100.0)` |\n| Text with limits | `(\"text\", min_chars, max_chars)` | `(\"text\", 50, 200)` |\n| Date range | `(\"date\", start, end)` | `(\"date\", \"2020-01-01\", \"2024-12-31\")` |\n| Choice | `(\"choice\", [options])` | `(\"choice\", [\"a\", \"b\", \"c\"])` |\n\nAll simple types from the generators above are supported: `name`, `first_name`, `last_name`, `email`, `safe_email`, `free_email`, `phone`, `uuid`, `int`, `float`, `date`, `datetime`, `street_address`, `city`, `state`, `country`, `zip_code`, `address`, `company`, `job`, `catch_phrase`, `url`, `domain_name`, `ipv4`, `ipv6`, `mac_address`, `credit_card`, `iban`, `sentence`, `paragraph`, `text`, `color`, `hex_color`, `rgb_color`, `md5`, `sha256`.\n\n## Async Generation\n\nFor large datasets (millions of records), async methods prevent blocking the Python event loop:\n\n### records_async()\n\n```python\nimport asyncio\nfrom forgery import records_async, seed\n\nasync def main():\n    seed(42)\n    records = await records_async(1_000_000, {\n        \"id\": \"uuid\",\n        \"name\": \"name\",\n        \"email\": \"email\",\n    })\n    print(f\"Generated {len(records)} records\")\n\nasyncio.run(main())\n```\n\n### records_tuples_async()\n\n```python\nimport asyncio\nfrom forgery import records_tuples_async, seed\n\nasync def main():\n    seed(42)\n    records = await records_tuples_async(1_000_000, {\n        \"age\": (\"int\", 18, 65),\n        \"name\": \"name\",\n    })\n    return records\n\nasyncio.run(main())\n```\n\n### records_arrow_async()\n\n```python\nimport asyncio\nfrom forgery import records_arrow_async, seed\n\nasync def main():\n    seed(42)\n    batch = 
await records_arrow_async(1_000_000, {\n        \"id\": \"uuid\",\n        \"name\": \"name\",\n        \"salary\": (\"float\", 30000.0, 150000.0),\n    })\n    return batch.to_pandas()\n\nasyncio.run(main())\n```\n\nAll async methods accept an optional `chunk_size` parameter (default: 10,000) that controls how frequently control is yielded to the event loop. Smaller chunks yield more frequently but have slightly higher overhead.\n\n**Note:** Async methods use a snapshot of the RNG state at call time. The main Faker instance's RNG is not advanced, so calling the same async method twice with the same seed produces identical results. For unique results across multiple async calls, use different seeds or different Faker instances.\n\n**Arrow async chunking caveat:** For `records_arrow_async()`, when `n \u003e chunk_size`, the output differs from `records_arrow()` due to column-major RNG consumption within each chunk. If you need identical results to the sync version, set `chunk_size \u003e= n`. 
The `records_async()` and `records_tuples_async()` methods always match their sync counterparts regardless of chunk size.\n\n## Custom Providers\n\nRegister your own data providers for domain-specific generation:\n\n### Basic Custom Provider\n\n```python\nfrom forgery import Faker\n\nfake = Faker()\n\n# Register a uniform (equal probability) provider\nfake.add_provider(\"department\", [\"Engineering\", \"Sales\", \"HR\", \"Marketing\"])\n\n# Generate values\ndept = fake.generate(\"department\")\ndepts = fake.generate_batch(\"department\", 100)\n```\n\n### Weighted Custom Provider\n\n```python\n# Register a weighted provider (higher weights = more likely)\nfake.add_weighted_provider(\"status\", [\n    (\"active\", 80),    # 80% probability\n    (\"inactive\", 20),  # 20% probability\n])\n\n# Generate with weighted distribution\nstatuses = fake.generate_batch(\"status\", 1000)\n# Expect ~800 \"active\", ~200 \"inactive\"\n```\n\n### Custom Providers in Records\n\nCustom providers integrate seamlessly with `records()`:\n\n```python\nfrom forgery import Faker\n\nfake = Faker()\nfake.add_provider(\"department\", [\"Eng\", \"Sales\", \"HR\"])\nfake.add_weighted_provider(\"priority\", [(\"high\", 20), (\"medium\", 50), (\"low\", 30)])\n\ndata = fake.records(1000, {\n    \"id\": \"uuid\",\n    \"name\": \"name\",\n    \"department\": \"department\",  # Custom provider\n    \"priority\": \"priority\",      # Weighted custom provider\n})\n```\n\n### Provider Management\n\n```python\nfake.has_provider(\"department\")  # Check if provider exists\nfake.list_providers()            # List all custom provider names\nfake.remove_provider(\"department\")  # Remove a provider\n```\n\n### Module-level Convenience\n\n```python\nfrom forgery import add_provider, generate, generate_batch, seed\n\nseed(42)\nadd_provider(\"tier\", [\"gold\", \"silver\", \"bronze\"])\ntier = generate(\"tier\")\ntiers = generate_batch(\"tier\", 100)\n```\n\n**Note:** Custom provider names cannot conflict 
with built-in types (e.g., \"name\", \"email\", \"uuid\").\n\n## Performance\n\nBenchmark generating 100,000 items:\n\n```\n$ python tests/benchmarks/bench_vs_faker.py\n\nNames:\n  forgery.names(): 0.015s\n  Faker.name():    1.523s\n  Speedup: 101.5x\n\nEmails:\n  forgery.emails(): 0.021s\n  Faker.email():    2.134s\n  Speedup: 101.6x\n```\n\n## Seeding Contract\n\n- `seed(n)` affects the default `fake` instance only\n- Each `Faker` instance has its own independent RNG state\n- **Single-threaded determinism only**: Results are reproducible within one thread\n- **No cross-version guarantee**: Output may differ between forgery versions\n\n## Thread Safety\n\n**forgery is NOT thread-safe.** Each `Faker` instance maintains mutable RNG state.\n\nFor multi-threaded applications, create one `Faker` instance per thread:\n\n```python\nfrom concurrent.futures import ThreadPoolExecutor\nfrom forgery import Faker\n\ndef generate_names(seed: int) -\u003e list[str]:\n    fake = Faker()  # Create per-thread instance\n    fake.seed(seed)\n    return fake.names(1000)\n\nwith ThreadPoolExecutor(max_workers=4) as executor:\n    results = list(executor.map(generate_names, range(4)))\n```\n\nDo NOT share a `Faker` instance across threads.\n\n## Development\n\n```bash\n# Install Rust\ncurl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh\n\n# Install maturin\npip install maturin\n\n# Build and install locally\nmaturin develop --release\n\n# Run tests\ncargo test          # Rust tests\npytest              # Python tests\n\n# Run benchmarks\npython tests/benchmarks/bench_vs_faker.py\n```\n\n## License\n\nMIT\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fwilliajm%2Fforgery","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fwilliajm%2Fforgery","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fwilliajm%2Fforgery/lists"}