https://github.com/williajm/forgery

Rust-powered fake data generator for Python.
Host: GitHub
URL: https://github.com/williajm/forgery
Owner: williajm
License: mit
Created: 2025-12-26T17:37:04.000Z (5 months ago)
Default Branch: main
Last Pushed: 2026-01-19T21:43:24.000Z (4 months ago)
Last Synced: 2026-01-20T04:11:45.092Z (4 months ago)
Topics: test-data-generator
Language: Rust
Homepage:
Size: 297 KB
Stars: 1
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE

# forgery

[![CI](https://github.com/williajm/forgery/actions/workflows/ci.yml/badge.svg)](https://github.com/williajm/forgery/actions/workflows/ci.yml)

[![codecov](https://codecov.io/gh/williajm/forgery/branch/main/graph/badge.svg)](https://codecov.io/gh/williajm/forgery)

[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

[![Python 3.11+](https://img.shields.io/badge/python-3.11+-blue.svg)](https://www.python.org/downloads/)

[![Ruff](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/ruff/main/assets/badge/v2.json)](https://github.com/astral-sh/ruff)

**Fake data at the speed of Rust.**

A high-performance fake data generation library for Python, powered by Rust. Designed to be 50-100x faster than Faker for batch operations.

## Installation

```bash

pip install forgery

```

### From source (for development)

```bash

git clone https://github.com/williajm/forgery.git

cd forgery

pip install maturin

maturin develop --release

```

## Quick Start

```python

from forgery import fake

# Generate 10,000 names in one fast call

names = fake.names(10_000)

# Single values work too

email = fake.email()

name = fake.name()

# Deterministic output with seeding

fake.seed(42)

data1 = fake.names(100)

fake.seed(42)

data2 = fake.names(100)

assert data1 == data2

```

## Features

- **Batch-first design**: Generate thousands of values in a single call

- **50-100x faster** than Faker for batch operations

- **Multi-locale support**: 7 locales with locale-specific data

- **Deterministic seeding**: Reproducible output for testing

- **Type hints**: Full type stub support for IDE autocompletion

- **Familiar API**: Method names match Faker for easy migration

## Locale Support

forgery supports 7 locales with locale-specific names, addresses, phone numbers, and more:

| Locale | Language | Country |

|--------|----------|---------|

| `en_US` | English | United States (default) |

| `en_GB` | English | United Kingdom |

| `de_DE` | German | Germany |

| `fr_FR` | French | France |

| `es_ES` | Spanish | Spain |

| `it_IT` | Italian | Italy |

| `ja_JP` | Japanese | Japan |

```python

from forgery import Faker

# Default locale is en_US

fake = Faker()

fake.names(5)  # American names

# Use a different locale

german = Faker("de_DE")

german.names(5)  # German names

japanese = Faker("ja_JP")

japanese.addresses(3)  # Japanese addresses with prefecture

```

Each locale provides:

- **Names**: First names, last names, and full names in the local language

- **Addresses**: Cities, regions/states, postal codes in the correct format

- **Phone numbers**: Country-specific formats and country codes

- **Companies**: Local company names and job titles

- **Colors**: Color names in the local language

- **SSN/National IDs**: Country-specific formats (US SSN, UK NINO, DE Steuer-ID, etc.)

- **License plates**: Country-specific formats

## API

### Module-level functions (use default instance)

```python

from forgery import seed, names, emails, integers, uuids, name, email, integer, uuid

seed(42)  # Seed for reproducibility

# Batch generation (fast path)

names(1000)           # list[str] of full names

emails(1000)          # list[str] of email addresses

integers(1000, 0, 100)  # list[int] in range

uuids(1000)           # list[str] of UUIDv4

# Single values

name()                # str

email()               # str

integer(0, 100)       # int

uuid()                # str

```

### Faker class (independent instances)

```python

from forgery import Faker

# Each instance has its own RNG state

fake1 = Faker()

fake2 = Faker()

fake1.seed(42)

fake2.seed(99)

# Generate independently

fake1.names(100)

fake2.emails(100)

```

## Available Generators

### Names & Identity

| Batch | Single | Description |

|-------|--------|-------------|

| `names(n)` | `name()` | Full names (first + last) |

| `first_names(n)` | `first_name()` | First names |

| `last_names(n)` | `last_name()` | Last names |

### Contact Information

| Batch | Single | Description |

|-------|--------|-------------|

| `emails(n)` | `email()` | Email addresses |

| `safe_emails(n)` | `safe_email()` | Safe domain emails (@example.com, etc.) |

| `free_emails(n)` | `free_email()` | Free provider emails (@gmail.com, etc.) |

| `phone_numbers(n)` | `phone_number()` | Phone numbers in (XXX) XXX-XXXX format |

### Numbers & Identifiers

| Batch | Single | Description |

|-------|--------|-------------|

| `integers(n, min, max)` | `integer(min, max)` | Random integers in range |

| `floats(n, min, max)` | `float_(min, max)` | Random floats in range (Note: `float_` avoids shadowing Python's `float` builtin) |

| `uuids(n)` | `uuid()` | UUID v4 strings |

| `md5s(n)` | `md5()` | Random 32-char hex strings (MD5-like format, not cryptographic hashes) |

| `sha256s(n)` | `sha256()` | Random 64-char hex strings (SHA256-like format, not cryptographic hashes) |

### Dates & Times

| Batch | Single | Description |

|-------|--------|-------------|

| `dates(n, start, end)` | `date(start, end)` | Random dates (YYYY-MM-DD) |

| `datetimes(n, start, end)` | `datetime_(start, end)` | Random datetimes (ISO 8601). Note: `datetime_` avoids shadowing Python's `datetime` module |

| `dates_of_birth(n, min_age, max_age)` | `date_of_birth(min_age, max_age)` | Birth dates for given age range |

### Addresses

| Batch | Single | Description |

|-------|--------|-------------|

| `street_addresses(n)` | `street_address()` | Street addresses (e.g., "123 Main Street") |

| `cities(n)` | `city()` | City names |

| `states(n)` | `state()` | State names |

| `countries(n)` | `country()` | Country names |

| `zip_codes(n)` | `zip_code()` | ZIP codes (5 or 9 digit) |

| `addresses(n)` | `address()` | Full addresses |

### Company & Business

| Batch | Single | Description |

|-------|--------|-------------|

| `companies(n)` | `company()` | Company names |

| `jobs(n)` | `job()` | Job titles |

| `catch_phrases(n)` | `catch_phrase()` | Business catch phrases |

### Network

| Batch | Single | Description |

|-------|--------|-------------|

| `urls(n)` | `url()` | URLs with https:// |

| `domain_names(n)` | `domain_name()` | Domain names |

| `ipv4s(n)` | `ipv4()` | IPv4 addresses |

| `ipv6s(n)` | `ipv6()` | IPv6 addresses |

| `mac_addresses(n)` | `mac_address()` | MAC addresses |

### Web & HTML

| Batch | Single | Description |

|-------|--------|-------------|

| `url_paths(n)` | `url_path()` | URL paths (e.g., "/blog/products/42") |

| `url_slugs(n)` | `url_slug()` | URL slugs (e.g., "ultimate-guide-2024") |

| `query_strings(n)` | `query_string()` | Query strings (e.g., "?page=2&sort=date") |

| `meta_descriptions(n)` | `meta_description()` | HTML meta description tags |

| `og_tags_batch(n)` | `og_tags()` | Open Graph meta tag sets (multi-line) |

| `hreflang_tags_batch(n)` | `hreflang_tags()` | Hreflang link tag sets with x-default |

| `img_tags(n, ratio)` | `img_tag(ratio)` | Image tags (configurable missing alt ratio) |

| `content_type_headers(n)` | `content_type_header()` | Content-Type header values |

| `http_headers_batch(n)` | `http_headers()` | HTTP response header dicts |

| `robots_txts(n)` | `robots_txt()` | robots.txt file contents |

| `html_pages(n, ...)` | `html_page(...)` | Full HTML5 pages with configurable SEO elements |

| - | `website(pages, domain)` | Interlinked website (dict of URL → HTML) |

```python

from forgery import Faker

fake = Faker()

fake.seed(42)

# Generate a full HTML page with SEO elements

page = fake.html_page(

    headings=4,

    internal_links=5,

    images=3,

    include_og_tags=True,

    domain="mysite.com",

)

# Generate an interlinked website for crawl testing

site = fake.website(pages=20, domain="example.com")

# site = {"https://example.com/": "...", "https://example.com/blog/guide": "...", ...}

# Every page is reachable from the homepage via link traversal

```

### Finance

| Batch | Single | Description |

|-------|--------|-------------|

| `credit_cards(n)` | `credit_card()` | Credit card numbers (valid Luhn) |

| `credit_card_providers(n)` | `credit_card_provider()` | Card network name (Visa, Mastercard, Amex, Discover) |

| `credit_card_expires(n)` | `credit_card_expire()` | Expiry date in MM/YY format |

| `credit_card_security_codes(n)` | `credit_card_security_code()` | CVV: 3 digits (Visa/MC/Discover) or 4 digits (Amex) |

| `credit_card_fulls(n)` | `credit_card_full()` | Complete card info dict (provider, number, expire, security_code, name) |

| `ibans(n)` | `iban()` | IBAN numbers (valid checksum) |

| `bics(n)` | `bic()` | BIC/SWIFT codes (8 or 11 characters) |

| `bank_accounts(n)` | `bank_account()` | Bank account numbers (8-17 digits) |

| `bank_names(n)` | `bank_name()` | Bank names (locale-specific) |
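
The "valid Luhn" and "valid checksum" claims in the table can be spot-checked in a test suite. The following is a minimal standard-library sketch of the two public algorithms (Luhn and the IBAN mod-97 check); the helper names `luhn_valid` and `iban_valid` are hypothetical and are not part of forgery.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digits of `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if not digits:
        return False
    total = 0
    # Walk from the rightmost digit and double every second digit.
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def iban_valid(iban: str) -> bool:
    """Return True if `iban` passes the mod-97 check (length rules per country are not checked)."""
    compact = iban.replace(" ", "").upper()
    if len(compact) < 5 or not (compact.isascii() and compact.isalnum()):
        return False
    # Move the country code and check digits to the end.
    rearranged = compact[4:] + compact[:4]
    # int(ch, 36) maps 0-9 to 0-9 and A-Z to 10-35.
    numeric = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(numeric) % 97 == 1
```

In a test, every value returned by `credit_cards(n)` or `ibans(n)` would be expected to satisfy the matching helper.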

### Currency

| Batch | Single | Description |

|-------|--------|-------------|

| `currency_codes(n)` | `currency_code()` | ISO 4217 currency codes (e.g., "USD", "EUR") |

| `currency_names(n)` | `currency_name()` | Currency names in English (e.g., "United States Dollar") |

| `currencies(n)` | `currency()` | (code, name) tuples |

| `prices(n, min, max)` | `price(min, max)` | Prices with 2 decimal places |

### UK Banking

| Batch | Single | Description |

|-------|--------|-------------|

| `sort_codes(n)` | `sort_code()` | UK sort codes (XX-XX-XX format) |

| `uk_account_numbers(n)` | `uk_account_number()` | UK account numbers (exactly 8 digits) |

| `transaction_amounts(n, min, max)` | `transaction_amount(min, max)` | Transaction amounts (2 decimal places) |

| `transactions(n, balance, start, end)` | - | Full transaction records with running balance |

### Passwords

| Batch | Single | Description |

|-------|--------|-------------|

| `passwords(n, ...)` | `password(...)` | Random passwords with configurable character sets |

Password options:

- `length`: Password length (default: 12)

- `uppercase`: Include uppercase letters (default: True)

- `lowercase`: Include lowercase letters (default: True)

- `digits`: Include digits (default: True)

- `symbols`: Include symbols (default: True)

### Text & Lorem Ipsum

| Batch | Single | Description |

|-------|--------|-------------|

| `sentences(n, word_count)` | `sentence(word_count)` | Lorem ipsum sentences |

| `paragraphs(n, sentence_count)` | `paragraph(sentence_count)` | Lorem ipsum paragraphs |

| `texts(n, min_chars, max_chars)` | `text(min_chars, max_chars)` | Text blocks with length limits |

### Colors

| Batch | Single | Description |

|-------|--------|-------------|

| `colors(n)` | `color()` | Color names |

| `hex_colors(n)` | `hex_color()` | Hex color codes (#RRGGBB) |

| `rgb_colors(n)` | `rgb_color()` | RGB tuples (r, g, b) |

### Geographic

| Batch | Single | Description |

|-------|--------|-------------|

| `latitudes(n)` | `latitude()` | Random latitude in [-90.0, 90.0] |

| `longitudes(n)` | `longitude()` | Random longitude in [-180.0, 180.0] |

| `coordinates(n)` | `coordinate()` | (latitude, longitude) tuples |

### User Agents

| Batch | Single | Description |

|-------|--------|-------------|

| `user_agents(n)` | `user_agent()` | Random browser user agent string (any browser) |

| `chromes(n)` | `chrome()` | Chrome user agent string |

| `firefoxes(n)` | `firefox()` | Firefox user agent string |

| `safaris(n)` | `safari()` | Safari user agent string |

### Booleans

| Batch | Single | Description |

|-------|--------|-------------|

| `booleans(n, probability)` | `boolean(probability)` | Random booleans (default: 50% True) |

### String Pattern Templates

| Batch | Single | Description |

|-------|--------|-------------|

| `numerify_batch(pattern, n)` | `numerify(pattern)` | Replace `#` with random digits (0-9) |

| `letterify_batch(pattern, n)` | `letterify(pattern)` | Replace `?` with random lowercase letters (a-z) |

| `bothify_batch(pattern, n)` | `bothify(pattern)` | Replace `#` with digits and `?` with lowercase letters |

| `lexify_batch(pattern, n)` | `lexify(pattern)` | Replace `?` with random uppercase letters (A-Z) |

```python

from forgery import Faker

fake = Faker()

fake.numerify("###-###-####")   # "847-321-9056"

fake.letterify("??-??")         # "kx-bp"

fake.bothify("??-####")         # "mz-7314"

fake.lexify("???-###")          # "QWR-###" (only ? is replaced)

```
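
The substitution rules in the table can be expressed in a few lines of plain Python. This is only a reference sketch of the documented behaviour, using the standard `random` module; it is not forgery's Rust implementation, its output for a given seed will differ from forgery's, and the helper `fill_pattern` is a hypothetical name.

```python
import random
import string


def fill_pattern(pattern: str, rng: random.Random, replacements: dict[str, str]) -> str:
    """Replace each placeholder character with a random character from its alphabet."""
    return "".join(
        rng.choice(replacements[ch]) if ch in replacements else ch
        for ch in pattern
    )


def numerify(pattern: str, rng: random.Random) -> str:
    return fill_pattern(pattern, rng, {"#": string.digits})


def letterify(pattern: str, rng: random.Random) -> str:
    return fill_pattern(pattern, rng, {"?": string.ascii_lowercase})


def bothify(pattern: str, rng: random.Random) -> str:
    return fill_pattern(pattern, rng, {"#": string.digits, "?": string.ascii_lowercase})


def lexify(pattern: str, rng: random.Random) -> str:
    # Only "?" is replaced (with uppercase letters); "#" passes through unchanged.
    return fill_pattern(pattern, rng, {"?": string.ascii_uppercase})
```

Characters that are not placeholders (such as `-`) are copied through as-is, which is why `lexify("???-###")` keeps its trailing `-###`.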

### Barcode

| Batch | Single | Description |

|-------|--------|-------------|

| `ean13s(n)` | `ean13()` | EAN-13 barcodes (valid check digit) |

| `ean8s(n)` | `ean8()` | EAN-8 barcodes (valid check digit) |

| `upc_as(n)` | `upc_a()` | UPC-A barcodes (valid check digit) |

| `upc_es(n)` | `upc_e()` | UPC-E barcodes (valid check digit) |
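
To assert the "valid check digit" property in tests, the standard GS1 mod-10 scheme can be used for EAN-13, EAN-8, and UPC-A. The sketch below is plain Python with hypothetical helper names; it does not cover UPC-E, whose check digit is derived from the expanded UPC-A form.

```python
def gtin_check_digit(data_digits: str) -> int:
    """Compute the GS1 mod-10 check digit for the digits that precede it."""
    total = 0
    # Counting from the right, data digits are weighted 3, 1, 3, 1, ...
    for index, ch in enumerate(reversed(data_digits)):
        weight = 3 if index % 2 == 0 else 1
        total += int(ch) * weight
    return (10 - total % 10) % 10


def gtin_valid(code: str) -> bool:
    """Validate an EAN-8 (8 digits), UPC-A (12 digits), or EAN-13 (13 digits) code."""
    if not (code.isascii() and code.isdigit()) or len(code) not in (8, 12, 13):
        return False
    return gtin_check_digit(code[:-1]) == int(code[-1])
```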

### ISBN

| Batch | Single | Description |

|-------|--------|-------------|

| `isbn10s(n)` | `isbn10()` | ISBN-10 with hyphens (valid check digit, may end in X) |

| `isbn13s(n)` | `isbn13()` | ISBN-13 with hyphens (978/979 prefix, valid check digit) |
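
The check-digit rules mentioned in the table are the standard ISBN ones: a weighted sum modulo 11 for ISBN-10 (where a final `X` stands for 10) and the EAN-13 rule for ISBN-13. A minimal sketch for validating hyphenated output, with hypothetical helper names:

```python
def isbn10_valid(isbn: str) -> bool:
    """Validate an ISBN-10, with or without hyphens."""
    chars = isbn.replace("-", "").upper()
    if len(chars) != 10:
        return False
    total = 0
    for position, ch in enumerate(chars):
        if ch == "X" and position == 9:
            value = 10  # "X" is only allowed as the check character
        elif ch.isascii() and ch.isdigit():
            value = int(ch)
        else:
            return False
        total += value * (10 - position)  # weights 10, 9, ..., 1
    return total % 11 == 0


def isbn13_valid(isbn: str) -> bool:
    """Validate an ISBN-13, with or without hyphens."""
    chars = isbn.replace("-", "")
    if len(chars) != 13 or not (chars.isascii() and chars.isdigit()):
        return False
    # Weights alternate 1, 3 from the left; the check digit is included in the sum.
    total = sum(int(ch) * (1 if i % 2 == 0 else 3) for i, ch in enumerate(chars))
    return total % 10 == 0
```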

### File/System

| Batch | Single | Description |

|-------|--------|-------------|

| `file_names(n)` | `file_name()` | File names with extension (e.g., "report.pdf") |

| `file_extensions(n)` | `file_extension()` | File extensions (e.g., "pdf", "csv") |

| `mime_types(n)` | `mime_type()` | MIME types (e.g., "application/pdf") |

| `file_paths(n)` | `file_path_()` | File paths (e.g., "/home/user/documents/report.pdf") |

### Commerce/Product

| Batch | Single | Description |

|-------|--------|-------------|

| `product_names(n)` | `product_name()` | Product names (e.g., "Ergonomic Steel Chair") |

| `product_categories(n)` | `product_category()` | Product categories (e.g., "Electronics") |

| `departments(n)` | `department()` | Store departments (e.g., "Home & Garden") |

| `product_materials(n)` | `product_material()` | Product materials (e.g., "Cotton", "Steel") |

### SSN/National ID

| Batch | Single | Description |

|-------|--------|-------------|

| `ssns(n)` | `ssn()` | Locale-specific national ID numbers |

Formats by locale:

| Locale | Format | Example |

|--------|--------|---------|

| `en_US` | SSN (XXX-XX-XXXX) | `"123-45-6789"` |

| `en_GB` | NI Number (XX 99 99 99 X) | `"AB 12 34 56 C"` |

| `de_DE` | Steuer-ID (11 digits) | `"12345678901"` |

| `fr_FR` | NSS (15 digits with check key) | `"185076923400145"` |

| `es_ES` | DNI (8 digits + letter) | `"12345678Z"` |

| `it_IT` | Codice Fiscale (16 alphanumeric) | `"RSSMRA85M01H501Z"` |

| `ja_JP` | My Number (12 digits with check) | `"123456789012"` |

### Vehicle/Automotive

| Batch | Single | Description |

|-------|--------|-------------|

| `license_plates(n)` | `license_plate()` | Locale-specific license plates |

| `vehicle_makes(n)` | `vehicle_make()` | Vehicle manufacturers (e.g., "Toyota") |

| `vehicle_models(n)` | `vehicle_model()` | Vehicle models (e.g., "Camry") |

| `vehicle_years(n)` | `vehicle_year()` | Model years (1990-2026) |

| `vins(n)` | `vin()` | 17-character VINs (valid check digit, no I/O/Q) |

License plate formats by locale:

| Locale | Format | Example |

|--------|--------|---------|

| `en_US` | ABC-1234 | `"KHX-4829"` |

| `en_GB` | AB12 CDE | `"LM65 NXR"` |

| `de_DE` | X AB 1234 | `"B KL 3847"` |

| `fr_FR` | AB-123-CD | `"FG-482-HJ"` |

| `es_ES` | 1234 ABC | `"4829 FKH"` |

| `it_IT` | AB 123 CD | `"FG 482 HJ"` |

| `ja_JP` | 300 12-34 | `"500 38-47"` |
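
For the VINs listed in the generator table above, the "valid check digit, no I/O/Q" property can be verified with the standard position-9 check-digit scheme used for North American VINs. The sketch below assumes that is the scheme forgery follows; the helper names are hypothetical.

```python
# Transliteration table: digits map to themselves, letters to 1-9 (I, O, Q are not allowed).
VIN_VALUES: dict[str, int] = {str(d): d for d in range(10)}
VIN_VALUES.update(zip("ABCDEFGH", range(1, 9)))
VIN_VALUES.update(zip("JKLMN", range(1, 6)))
VIN_VALUES.update({"P": 7, "R": 9})
VIN_VALUES.update(zip("STUVWXYZ", range(2, 10)))

# Position weights; the check-digit position itself (index 8) has weight 0.
VIN_WEIGHTS = (8, 7, 6, 5, 4, 3, 2, 10, 0, 9, 8, 7, 6, 5, 4, 3, 2)


def vin_check_digit(vin: str) -> str:
    """Compute the expected check character for a 17-character VIN."""
    total = sum(VIN_VALUES[ch] * weight for ch, weight in zip(vin.upper(), VIN_WEIGHTS))
    remainder = total % 11
    return "X" if remainder == 10 else str(remainder)


def vin_valid(vin: str) -> bool:
    """Check length, allowed characters, and the check digit at position 9."""
    vin = vin.upper()
    if len(vin) != 17 or any(ch not in VIN_VALUES for ch in vin):
        return False
    return vin[8] == vin_check_digit(vin)
```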

### Package Registry Data

For seeding test databases of package registries (PyPI, npm, Maven, Cargo, RubyGems). Cross-ecosystem primitives share one API; ecosystem-specific shapes have their own methods.

**Cross-ecosystem primitives**

| Batch | Single | Description |

|-------|--------|-------------|

| `commit_shas(n)` | `commit_sha()` | 40-hex-char git commit SHA |

| `short_commit_shas(n)` | `short_commit_sha()` | 7-hex-char short SHA |

| `semvers(n)` | `semver()` | SemVer `MAJOR.MINOR.PATCH` |

| `semver_prereleases(n)` | `semver_prerelease()` | Pre-release (e.g. `1.2.3-alpha.1+build.5`) |

| `calvers(n)` | `calver()` | CalVer in mixed schemes (`YYYY.MM.DD`, `YY.MM`, ...) |

| `spdx_licenses(n)` | `spdx_license()` | SPDX identifier (50 common IDs) |

| `git_usernames(n)` | `git_username()` | GitHub/GitLab/Bitbucket-compatible username |
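
When asserting on these primitives in tests, simple shape checks are usually enough. The patterns below are a simplified sketch: the SemVer expression is looser than the official grammar (it does not reject numeric pre-release identifiers with leading zeros), and lowercase hex for commit SHAs is an assumption based on the sample output in this README. All names are hypothetical.

```python
import re

# MAJOR.MINOR.PATCH with optional "-prerelease" and "+build" parts.
SEMVER_RE = re.compile(
    r"(0|[1-9][0-9]*)\.(0|[1-9][0-9]*)\.(0|[1-9][0-9]*)"
    r"(?:-[0-9A-Za-z-]+(?:\.[0-9A-Za-z-]+)*)?"
    r"(?:\+[0-9A-Za-z-]+(?:\.[0-9A-Za-z-]+)*)?"
)
COMMIT_SHA_RE = re.compile(r"[0-9a-f]{40}")
SHORT_SHA_RE = re.compile(r"[0-9a-f]{7}")


def is_semver(value: str) -> bool:
    return SEMVER_RE.fullmatch(value) is not None


def is_commit_sha(value: str) -> bool:
    return COMMIT_SHA_RE.fullmatch(value) is not None


def is_short_commit_sha(value: str) -> bool:
    return SHORT_SHA_RE.fullmatch(value) is not None
```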

**Ecosystem-specific versions** (where SemVer alone doesn't cover the format)

| Batch | Single | Description |

|-------|--------|-------------|

| `pypi_versions(n)` | `pypi_version()` | PEP 440 (pre/post/dev releases) |

| `maven_versions(n)` | `maven_version()` | Maven version with qualifiers (`-SNAPSHOT`, `.RELEASE`, ...) |

**Version constraints**

| Batch | Single | Description |

|-------|--------|-------------|

| `pypi_version_specifiers(n)` | `pypi_version_specifier()` | PEP 440 (e.g. `>=1.2,<2.0`, `~=1.0`) |

| `npm_version_ranges(n)` | `npm_version_range()` | npm (e.g. `^1.2.3`, `~1.2.3`, `1.x`) |

| `cargo_version_reqs(n)` | `cargo_version_req()` | Cargo (e.g. `^1.0`, `~1.2`) |

| `maven_version_ranges(n)` | `maven_version_range()` | Maven (e.g. `[1.0,2.0)`) |

| `gem_version_requirements(n)` | `gem_version_requirement()` | RubyGems (e.g. `~> 1.2`) |

**Package identity**

| Batch | Single | Description |

|-------|--------|-------------|

| `pypi_package_names(n)` | `pypi_package_name()` | PEP 503 normalised (lowercase `[a-z0-9-]`) |

| `npm_package_names(n)` | `npm_package_name()` | Plain or `@scope/pkg` (~30% scoped) |

| `cargo_package_names(n)` | `cargo_package_name()` | Rust-ident flavour |

| `gem_names(n)` | `gem_name()` | RubyGems gem name |

| `maven_group_ids(n)` | `maven_group_id()` | Reverse domain (e.g. `com.example.tools`) |

| `maven_artifact_ids(n)` | `maven_artifact_id()` | Lowercase with hyphens |

| `maven_coordinates(n)` | `maven_coordinate()` | GAV (`group:artifact:version`) |
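
"PEP 503 normalised" means that runs of `-`, `_`, and `.` are collapsed to a single `-` and the name is lowercased. The sketch below uses the normalization expression given in PEP 503; `is_normalized` is a hypothetical helper for asserting that a generated name is already in that form.

```python
import re


def normalize_pypi_name(name: str) -> str:
    """Apply the PEP 503 name normalization."""
    return re.sub(r"[-_.]+", "-", name).lower()


def is_normalized(name: str) -> bool:
    """Return True if `name` is unchanged by normalization."""
    return name == normalize_pypi_name(name)
```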

**Full requirement lines**

| Batch | Single | Description |

|-------|--------|-------------|

| `pypi_requirements(n)` | `pypi_requirement()` | e.g. `requests>=2.0.0,<3.0.0` |

```python

from forgery import Faker

fake = Faker()

fake.seed(42)

fake.pypi_requirement()       # 'requests>=2.0.0,<3.0.0'

fake.maven_coordinate()       # 'com.example.tools:widget-core:1.2.3-SNAPSHOT'

fake.npm_package_name()       # '@types/fast-parser'

fake.spdx_license()           # 'Apache-2.0'

fake.git_username()           # 'tiny-logger42'

fake.commit_sha()             # 'a1b2c3d4e5f6a1b2c3d4e5f6a1b2c3d4e5f6a1b2'

```

The nine batch methods listed below accept `unique=True` for no-duplicate output, matching the `names(n, unique=True)` pattern. This is useful when seeding registry tables that have a unique-name constraint. Exhausting the combinatorial pool raises `ValueError`:

```python

fake.pypi_package_names(100, unique=True)   # 100 distinct package names

fake.maven_coordinates(500, unique=True)    # 500 distinct GAVs

fake.spdx_licenses(60, unique=True)         # ValueError: only 50 SPDX IDs available

```

Methods with `unique` support: `pypi_package_names`, `npm_package_names`, `cargo_package_names`, `gem_names`, `maven_group_ids`, `maven_artifact_ids`, `maven_coordinates`, `git_usernames`, `spdx_licenses`.

### Profile

| Batch | Single | Description |

|-------|--------|-------------|

| `profiles(n)` | `profile()` | Complete personal profiles (returns dict) |

Each profile dict contains: `first_name`, `last_name`, `name`, `email`, `phone`, `address`, `city`, `state`, `zip_code`, `country`, `company`, `job`, `date_of_birth`.

```python

from forgery import Faker

fake = Faker()

fake.seed(42)

p = fake.profile()

# {"first_name": "Ryan", "last_name": "Grant", "name": "Ryan Grant",

#  "email": "rgrant@example.com", "phone": "(555) 123-4567", ...}

```

## Unique Value Generation

For batch methods that select from finite lists (names, cities, countries, etc.), you can request unique values:

```python

from forgery import Faker

fake = Faker()

fake.seed(42)

# Generate 50 unique names (no duplicates)

unique_names = fake.names(50, unique=True)

assert len(unique_names) == len(set(unique_names))

# Generate 20 unique cities

unique_cities = fake.cities(20, unique=True)

# Generate 50 unique countries

unique_countries = fake.countries(50, unique=True)

```

**Important Notes:**

- Unique generation will raise `ValueError` if you request more unique values than are available in the underlying data set.

- **Performance:** Unique generation uses O(n) memory (all outputs are stored in a HashSet) and can take O(n × 100) time in the worst case because of retry logic. For very large unique batches, consider whether duplicates are actually a problem for your use case.
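
The cost model in these notes can be illustrated with a small rejection-sampling sketch. It assumes the "× 100" factor corresponds to a retry budget of roughly 100 attempts per requested value; that reading, and the function name `unique_batch`, are assumptions for illustration rather than a description of forgery's internals.

```python
import random
from typing import Callable, Hashable, TypeVar

T = TypeVar("T", bound=Hashable)


def unique_batch(generate: Callable[[], T], n: int, max_attempts_per_value: int = 100) -> list[T]:
    """Draw values until n distinct ones are collected or the retry budget runs out."""
    seen: set[T] = set()      # O(n) memory: every accepted value is remembered
    result: list[T] = []
    attempts_left = n * max_attempts_per_value  # O(n x 100) draws in the worst case
    while len(result) < n:
        if attempts_left == 0:
            raise ValueError(f"could not produce {n} unique values")
        attempts_left -= 1
        value = generate()
        if value not in seen:
            seen.add(value)
            result.append(value)
    return result
```

With a pool of three values, asking for three unique items succeeds quickly, while asking for four exhausts the budget and raises `ValueError`.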

## Financial Transaction Generation

Generate realistic bank transaction data with running balances:

```python

from forgery import Faker

fake = Faker()

fake.seed(42)

# Generate 50 transactions from Jan to Mar 2024, starting with £1000 balance

txns = fake.transactions(50, 1000.0, "2024-01-01", "2024-03-31")

for txn in txns[:3]:

    print(f"{txn['date']} | {txn['transaction_type']:15} | {txn['amount']:>10.2f} | {txn['balance']:>10.2f}")

# 2024-01-03 | Card Payment    |    -42.50 |     957.50

# 2024-01-05 | Direct Debit    |   -125.00 |     832.50

# 2024-01-08 | Faster Payment  |   1250.00 |    2082.50

```

Each transaction dict contains:

- `reference`: 8-character alphanumeric reference

- `date`: Transaction date (YYYY-MM-DD)

- `amount`: Transaction amount (negative for debits)

- `transaction_type`: e.g., "Card Payment", "Direct Debit", "Salary"

- `description`: Merchant or payee name

- `balance`: Running balance after transaction
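
Because each record carries both `amount` and `balance`, a test can confirm that the running balance is internally consistent. The sketch below assumes both fields are plain numbers (as the `:.2f` formatting above suggests) and uses the three sample rows from the example output; `balances_consistent` is a hypothetical helper.

```python
import math


def balances_consistent(transactions: list[dict], opening_balance: float) -> bool:
    """Check that each balance equals the previous balance plus the amount."""
    balance = opening_balance
    for txn in transactions:
        expected = balance + txn["amount"]
        # Allow for rounding to two decimal places.
        if not math.isclose(expected, txn["balance"], abs_tol=0.005):
            return False
        balance = txn["balance"]
    return True


# Rows copied from the example output above (other fields omitted).
sample = [
    {"date": "2024-01-03", "amount": -42.50, "balance": 957.50},
    {"date": "2024-01-05", "amount": -125.00, "balance": 832.50},
    {"date": "2024-01-08", "amount": 1250.00, "balance": 2082.50},
]
```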

## Structured Data Generation

Generate entire datasets with a single call using schema definitions:

### records()

Returns a list of dictionaries:

```python

from forgery import records, seed

seed(42)

data = records(1000, {

    "id": "uuid",

    "name": "name",

    "email": "email",

    "age": ("int", 18, 65),

    "salary": ("float", 30000.0, 150000.0),

    "hire_date": ("date", "2020-01-01", "2024-12-31"),

    "bio": ("text", 50, 200),

    "status": ("choice", ["active", "inactive", "pending"]),

})

# data[0] = {"id": "88917925-...", "name": "Austin Bell", "age": 50, ...}

```

### records_tuples()

Returns a list of tuples (faster, values in alphabetical key order):

```python

from forgery import records_tuples, seed

seed(42)

data = records_tuples(1000, {

    "age": ("int", 18, 65),

    "name": "name",

})

# data[0] = (50, "Ryan Grant")  # (age, name) - alphabetical order

```

### records_arrow()

Returns a PyArrow RecordBatch for high-performance data processing:

```python

import pyarrow as pa

from forgery import records_arrow, seed

seed(42)

batch = records_arrow(100_000, {

    "id": "uuid",

    "name": "name",

    "age": ("int", 18, 65),

    "salary": ("float", 30000.0, 150000.0),

})

# batch is a pyarrow.RecordBatch

print(batch.num_rows)     # 100000

print(batch.num_columns)  # 4

print(batch.schema)

# age: int64 not null

# id: string not null

# name: string not null

# salary: double not null

# Convert to pandas DataFrame

df = batch.to_pandas()

# Or to Polars DataFrame

import polars as pl

df_polars = pl.from_arrow(batch)

```

**Note:** Requires `pyarrow` to be installed: `pip install pyarrow`

The `records_arrow()` function generates data in columnar format, which is more efficient for large batches and integrates seamlessly with the Arrow ecosystem (PyArrow, Polars, pandas, DuckDB, etc.).

### Serialized Output Formats

Generate records directly as serialized strings or bytes, avoiding the overhead of creating Python objects just to serialize them.

#### records_csv()

Returns a CSV string with a header row (fields in alphabetical order):

```python

from forgery import records_csv, seed

seed(42)

csv_str = records_csv(1000, {

    "name": "name",

    "email": "email",

    "age": ("int", 18, 65),

})

# age,email,name

# 50,austin.bell@example.com,Austin Bell

# ...

```

#### records_json()

Returns a JSON array of objects:

```python

from forgery import records_json, seed

seed(42)

json_str = records_json(1000, {

    "name": "name",

    "age": ("int", 18, 65),

    "active": "boolean",

})

# [{"active":true,"age":50,"name":"Austin Bell"},...]

```

Integer and float values are JSON numbers, booleans are JSON booleans, and tuples (e.g., RGB colors, coordinates) become JSON arrays.

#### records_ndjson()

Returns newline-delimited JSON (one JSON object per line, no trailing newline):

```python

from forgery import records_ndjson, seed

seed(42)

ndjson_str = records_ndjson(1000, {

    "id": "uuid",

    "name": "name",

})

# {"id":"88917925-...","name":"Austin Bell"}

# {"id":"a3c1e7f2-...","name":"Maria Garcia"}

# ...

```
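
Reading the result back needs only the standard `json` module, one `json.loads` call per line. The sample string below is hand-written in the documented shape, with hypothetical placeholder IDs, since it is not real forgery output.

```python
import json


def parse_ndjson(text: str) -> list[dict]:
    """Parse newline-delimited JSON into a list of dicts, skipping empty lines."""
    return [json.loads(line) for line in text.split("\n") if line]


# Hand-written sample: one object per line, no trailing newline.
sample = '{"id":"id-1","name":"Austin Bell"}\n{"id":"id-2","name":"Maria Garcia"}'
rows = parse_ndjson(sample)
```

Skipping empty lines keeps the parser tolerant of input that does end with a newline.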

#### records_parquet()

Returns Parquet file content as bytes (uses the Arrow path internally).

**Note:** Like `records_arrow()`, this uses column-major generation. With a fixed seed and multi-column schema, the row data will differ from the row-major methods (`records_csv`, `records_json`, `records_ndjson`, `records_sql`).

```python

from forgery import records_parquet, seed

seed(42)

parquet_bytes = records_parquet(100_000, {

    "id": "uuid",

    "name": "name",

    "salary": ("float", 30000.0, 150000.0),

})

# Write to disk

with open("data.parquet", "wb") as f:

    f.write(parquet_bytes)

# Or load directly with PyArrow

import pyarrow.parquet as pq

import io

table = pq.read_table(io.BytesIO(parquet_bytes))

```

#### records_sql()

Returns ANSI SQL INSERT statements with properly escaped values:

```python

from forgery import records_sql, seed

seed(42)

sql = records_sql(1000, {

    "name": "name",

    "email": "email",

    "age": ("int", 18, 65),

}, "users")

# INSERT INTO "users" ("age", "email", "name") VALUES

# (50, 'austin.bell@example.com', 'Austin Bell'),

# ...

# (34, 'maria.garcia@gmail.com', 'Maria Garcia');

```

For large batches, multiple INSERT statements are generated with up to 1000 rows each. Column names are double-quoted and string values use single-quote escaping.
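
Output in this shape can be loaded into an in-memory SQLite database for a quick sanity check. The statement below is hand-written to mirror the documented format; the second row is a hypothetical example added to show single-quote escaping and is not real forgery output.

```python
import sqlite3

# Double-quoted identifiers; a single quote inside a string is written as two.
sample_sql = (
    'INSERT INTO "users" ("age", "email", "name") VALUES\n'
    "(50, 'austin.bell@example.com', 'Austin Bell'),\n"
    "(41, 'pat@example.com', 'Pat O''Brien');"
)

conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE "users" ("age" INTEGER, "email" TEXT, "name" TEXT)')
conn.executescript(sample_sql)  # runs every statement in the script
rows = conn.execute('SELECT "name", "age" FROM "users" ORDER BY "age"').fetchall()
conn.close()
```

The target table has to exist first, because the generated statements contain only `INSERT`s.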

### Streaming File Writer

For datasets that exceed available memory, `records_to_file()` generates records in bounded-memory chunks and writes each chunk to disk before generating the next. Memory usage is proportional to `chunk_size`, not total `n`.

```python

from forgery import Faker

fake = Faker()

fake.seed(42)

# Generate 100 million records — memory stays at ~500-800 MB

fake.records_to_file(

    100_000_000,

    {"id": "uuid", "name": "name", "amount": ("float", 0.01, 9999.99)},

    "transactions.parquet",

    chunk_size=1_000_000,  # records per chunk (default: 1M, max: 10M)

)

```

**Supported formats:** CSV (`.csv`), NDJSON (`.ndjson`/`.jsonl`), SQL (`.sql`), Parquet (`.parquet`). Format is auto-detected from the file extension, or set explicitly with `format="csv"`.

SQL format requires a `table` parameter:

```python

from forgery import records_to_file, seed

seed(42)

records_to_file(

    50_000_000,

    {"name": "name", "email": "email"},

    "users.sql",

    table="users",

    chunk_size=500_000,

)

```

**Progress callback** — track progress with an optional callback:

```python

from forgery import records_to_file, seed

seed(42)

records_to_file(

    10_000_000,

    {"name": "name", "email": "email"},

    "users.csv",

    on_progress=lambda written, total: print(f"\r{written/total:.0%}", end=""),

)

```
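
The chunked pattern behind the streaming writer, including the `(written, total)` callback shape used above, can be sketched in pure Python. This is an illustration of the idea only, using the standard `random` and `csv` modules with made-up columns; it is not forgery's implementation, and `write_chunked_csv` is a hypothetical name.

```python
import csv
import io
import random
from typing import Callable, Optional, TextIO


def write_chunked_csv(
    out: TextIO,
    n: int,
    chunk_size: int,
    seed: int = 42,
    on_progress: Optional[Callable[[int, int], None]] = None,
) -> None:
    """Write n rows in chunks so that only one chunk is held in memory at a time."""
    rng = random.Random(seed)
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["age", "score"])
    written = 0
    while written < n:
        size = min(chunk_size, n - written)
        # Only `size` rows exist in memory before they are flushed to `out`.
        chunk = [(rng.randint(18, 65), round(rng.uniform(0, 100), 2)) for _ in range(size)]
        writer.writerows(chunk)
        written += size
        if on_progress is not None:
            on_progress(written, n)


buffer = io.StringIO()
calls: list[tuple[int, int]] = []
write_chunked_csv(buffer, n=10, chunk_size=4, on_progress=lambda w, t: calls.append((w, t)))
```

Here the callback fires once per chunk, so ten rows with a chunk size of four report progress three times.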

**Memory estimation** — plan chunk sizes based on available RAM:

```python

from forgery import Faker

schema = {"id": "uuid", "name": "name", "amount": ("float", 0.01, 9999.99)}

est = Faker.estimate_memory(1_000_000, schema)

print(f"~{est / 1024**2:.0f} MB per 1M records")

```

All streaming formats use row-major generation, so the same seed produces identical data across CSV, NDJSON, SQL, and Parquet output.

### Schema Field Types

| Type | Syntax | Example |

|------|--------|---------|

| Simple types | `"type_name"` | `"name"`, `"email"`, `"uuid"`, `"int"`, `"float"` |

| Integer range | `("int", min, max)` | `("int", 18, 65)` |

| Float range | `("float", min, max)` | `("float", 0.0, 100.0)` |

| Text with limits | `("text", min_chars, max_chars)` | `("text", 50, 200)` |

| Date range | `("date", start, end)` | `("date", "2020-01-01", "2024-12-31")` |

| Choice | `("choice", [options])` | `("choice", ["a", "b", "c"])` |

All simple types from the generators above are supported: `name`, `first_name`, `last_name`, `email`, `safe_email`, `free_email`, `phone`, `uuid`, `int`, `float`, `date`, `datetime`, `street_address`, `city`, `state`, `country`, `zip_code`, `address`, `company`, `job`, `catch_phrase`, `url`, `domain_name`, `ipv4`, `ipv6`, `mac_address`, `credit_card`, `iban`, `sentence`, `paragraph`, `text`, `color`, `hex_color`, `rgb_color`, `md5`, `sha256`, `latitude`, `longitude`, `coordinate`, `boolean`, `ssn`, `file_name`, `file_extension`, `mime_type`, `file_path`, `license_plate`, `vehicle_make`, `vehicle_model`, `vehicle_year`, `vin`, `ean13`, `ean8`, `upc_a`, `upc_e`, `isbn10`, `isbn13`, `product_name`, `product_category`, `department`, `product_material`, `url_path`, `url_slug`, `query_string`.

## Async Generation

For large datasets (millions of records), async methods prevent blocking the Python event loop:

### records_async()

```python

import asyncio

from forgery import records_async, seed

async def main():

    seed(42)

    records = await records_async(1_000_000, {

        "id": "uuid",

        "name": "name",

        "email": "email",

    })

    print(f"Generated {len(records)} records")

asyncio.run(main())

```

### records_tuples_async()

```python

import asyncio

from forgery import records_tuples_async, seed

async def main():

    seed(42)

    records = await records_tuples_async(1_000_000, {

        "age": ("int", 18, 65),

        "name": "name",

    })

    return records

asyncio.run(main())

```

### records_arrow_async()

```python

import asyncio

from forgery import records_arrow_async, seed

async def main():

    seed(42)

    batch = await records_arrow_async(1_000_000, {

        "id": "uuid",

        "name": "name",

        "salary": ("float", 30000.0, 150000.0),

    })

    return batch.to_pandas()

asyncio.run(main())

```

All async methods accept an optional `chunk_size` parameter (default: 10,000) that controls how frequently control is yielded to the event loop. Smaller chunks yield more frequently but have slightly higher overhead.

**Note:** Async methods use a snapshot of the RNG state at call time. The main Faker instance's RNG is not advanced, so calling the same async method twice with the same seed produces identical results. For unique results across multiple async calls, use different seeds or different Faker instances.

**Arrow async chunking caveat:** For `records_arrow_async()`, when `n > chunk_size`, the output differs from `records_arrow()` due to column-major RNG consumption within each chunk. If you need identical results to the sync version, set `chunk_size >= n`. The `records_async()` and `records_tuples_async()` methods always match their sync counterparts regardless of chunk size.
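
Why chunking changes column-major output can be seen with a toy generator built on Python's `random` module. The sketch below is an analogy under that assumption, not forgery's RNG: a column-major generator fills each column of a chunk in turn, so the chunk size changes which random draw lands in which cell.

```python
import random


def column_major(seed: int, n: int, chunk_size: int) -> list[tuple[int, float]]:
    """Fill each column of a chunk completely before moving to the next column."""
    rng = random.Random(seed)
    rows: list[tuple[int, float]] = []
    for start in range(0, n, chunk_size):
        size = min(chunk_size, n - start)
        ages = [rng.randint(18, 65) for _ in range(size)]   # first column of the chunk
        scores = [rng.random() for _ in range(size)]        # then the second column
        rows.extend(zip(ages, scores))
    return rows


single_pass = column_major(42, n=6, chunk_size=6)
large_chunk = column_major(42, n=6, chunk_size=100)   # chunk_size >= n: same as single pass
small_chunks = column_major(42, n=6, chunk_size=3)    # chunk_size < n: draws are interleaved differently
```

The first three ages agree in both runs because they are the same first three draws, but the scores differ because they come from different positions in the random stream.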

## Custom Providers

Register your own data providers for domain-specific generation:

### Basic Custom Provider

```python

from forgery import Faker

fake = Faker()

# Register a uniform (equal probability) provider
fake.add_provider("team", ["Engineering", "Sales", "HR", "Marketing"])

# Generate values
team = fake.generate("team")
teams = fake.generate_batch("team", 100)

```

### Weighted Custom Provider

```python

# Register a weighted provider (higher weights = more likely)
fake.add_weighted_provider("status", [
    ("active", 80),    # 80% probability
    ("inactive", 20),  # 20% probability
])

# Generate with weighted distribution
statuses = fake.generate_batch("status", 1000)
# Expect ~800 "active", ~200 "inactive"

```
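
As a rough illustration of how weights translate into probabilities, the sketch below normalizes the same weights and samples with the standard library's `random.choices`. It assumes weights are relative (each weight divided by the total), which matches the 80/20 comments above but is not confirmed as forgery's internal method.

```python
import random
from collections import Counter

weighted = [("active", 80), ("inactive", 20)]

# Assumption: probability = weight / sum of all weights.
total = sum(weight for _, weight in weighted)
probabilities = {value: weight / total for value, weight in weighted}

# Draw 1,000 samples with a seeded standard-library RNG.
rng = random.Random(42)
values, weights = zip(*weighted)
counts = Counter(rng.choices(values, weights=weights, k=1000))

print(probabilities)
print(dict(counts))
```

The exact counts vary with the seed; only the long-run proportions are fixed by the weights.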

### Custom Providers in Records

Custom providers integrate seamlessly with `records()`:

```python

from forgery import Faker

fake = Faker()
fake.add_provider("team", ["Eng", "Sales", "HR"])
fake.add_weighted_provider("priority", [("high", 20), ("medium", 50), ("low", 30)])

data = fake.records(1000, {
    "id": "uuid",
    "name": "name",
    "team": "team",              # Custom provider
    "priority": "priority",      # Weighted custom provider
})

```

### Provider Management

```python

fake.has_provider("team")     # Check if provider exists
fake.list_providers()         # List all custom provider names
fake.remove_provider("team")  # Remove a provider

```

### Module-level Convenience

```python

from forgery import add_provider, generate, generate_batch, seed

seed(42)
add_provider("tier", ["gold", "silver", "bronze"])

tier = generate("tier")
tiers = generate_batch("tier", 100)

```

**Note:** Custom provider names cannot conflict with built-in types (e.g., "name", "email", "uuid").

## Performance

Benchmark generating 100,000 items:

```

Names:
  forgery.names():  0.015s
  Faker.name():     1.523s
  Speedup: 101x

Emails:
  forgery.emails():  0.021s
  Faker.email():     2.134s
  Speedup: 101x

```

Benchmark generating 1,000,000 items:

```

Names:
  forgery.names():   0.108s
  Faker.name():     47.111s
  Speedup: 436x

Emails:
  forgery.emails():   0.167s
  Faker.email():     46.984s
  Speedup: 281x

```
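
The speedup figures are simply the Faker time divided by the forgery time. The short script below recomputes them from the timings listed above; the `speedup` and `throughput` helpers are illustrative names, not part of forgery.

```python
# (forgery seconds, Faker seconds) copied from the benchmark output above.
timings = {
    ("names", 100_000): (0.015, 1.523),
    ("emails", 100_000): (0.021, 2.134),
    ("names", 1_000_000): (0.108, 47.111),
    ("emails", 1_000_000): (0.167, 46.984),
}


def speedup(forgery_s: float, faker_s: float) -> float:
    """How many times faster forgery was than Faker."""
    return faker_s / forgery_s


def throughput(items: int, seconds: float) -> float:
    """Items generated per second."""
    return items / seconds


speedups = {key: speedup(*pair) for key, pair in timings.items()}

for (kind, items), ratio in speedups.items():
    rate = throughput(items, timings[(kind, items)][0])
    print(f"{kind:6s} {items:>9,}: {ratio:6.1f}x, {rate:,.0f} items/s")
```

The reported values correspond to these ratios truncated to whole numbers (for example, 2.134 / 0.021 is about 101.6).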

## Seeding Contract

- `seed(n)` affects the default `fake` instance only

- Each `Faker` instance has its own independent RNG state

- **Single-threaded determinism only**: Results are reproducible within one thread

- **No cross-version guarantee**: Output may differ between forgery versions

## Parallel Generation

For large batches, enable parallel generation to split work across multiple CPU cores:

```python

from forgery import Faker

fake = Faker()
fake.seed(42)
fake.set_parallel(True)  # Auto-detect thread count

# All batch methods now run in parallel
names = fake.names(1_000_000)      # ~3.3x faster than sequential
emails = fake.emails(1_000_000)
uuids = fake.uuids(1_000_000)

# Explicit thread count (useful for reproducibility across machines)
fake.set_parallel(True, num_threads=4)

# Check current settings
fake.get_parallel()      # True
fake.get_num_threads()   # 4

# Disable parallel
fake.set_parallel(False)
```

```

**Determinism contract:**

- Same seed + same `num_threads` = identical output

- Changing `num_threads` produces different output

- `unique=True` always uses the sequential path
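
One way to see why the thread count affects output is a toy model in which the batch is split into one slice per worker and each worker gets its own seeded generator. This is a hypothetical scheme for illustration only: the seed derivation and the `parallel_model` function are assumptions, the workers run sequentially here, and forgery's actual partitioning may differ.

```python
import random


def parallel_model(n: int, seed: int, num_threads: int) -> list[int]:
    """Toy model: split n items across workers, each with a derived seed."""
    base, extra = divmod(n, num_threads)
    out: list[int] = []
    for worker in range(num_threads):
        count = base + (1 if worker < extra else 0)
        # Hypothetical per-worker seed derivation.
        rng = random.Random(seed * 1_000_003 + worker)
        out.extend(rng.randrange(1_000_000) for _ in range(count))
    return out


same_config = parallel_model(1_000, 42, 4) == parallel_model(1_000, 42, 4)
different_threads = parallel_model(1_000, 42, 4) != parallel_model(1_000, 42, 2)

print(same_config, different_threads)
```

In this model the slice boundaries move when `num_threads` changes, so the same seed fills different positions from different streams. Pinning `num_threads` keeps the boundaries, and therefore the output, fixed.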

**Performance (names benchmark):**

| Batch Size | Sequential | Parallel | Speedup |
|------------|------------|----------|---------|
| 10,000 | 443 µs | 753 µs | 0.6x (overhead) |
| 100,000 | 8.5 ms | 2.5 ms | **3.4x** |
| 1,000,000 | 83 ms | 25 ms | **3.3x** |

Auto-detection ensures parallelism is only used when beneficial (minimum 1,000 items per thread).
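
A threshold like this could be expressed as a small helper. The function below is a hypothetical sketch of the rule as stated (at least 1,000 items per thread), not forgery's actual heuristic; `effective_threads` and its parameters are illustrative names.

```python
def effective_threads(n: int, available: int, min_per_thread: int = 1_000) -> int:
    """Cap the thread count so each thread gets at least min_per_thread items."""
    return max(1, min(available, n // min_per_thread))


for n in (500, 3_500, 1_000_000):
    print(f"{n:>9,} items on 8 cores -> {effective_threads(n, 8)} thread(s)")
```

Under this rule a batch of 500 items stays on one thread, 3,500 items use three threads, and large batches use every available core.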

## Thread Safety

**forgery is NOT thread-safe.** Each `Faker` instance maintains mutable RNG state.

For multi-threaded applications, create one `Faker` instance per thread:

```python

from concurrent.futures import ThreadPoolExecutor

from forgery import Faker


def generate_names(seed: int) -> list[str]:
    fake = Faker()  # Create per-thread instance
    fake.seed(seed)
    return fake.names(1000)


with ThreadPoolExecutor(max_workers=4) as executor:
    results = list(executor.map(generate_names, range(4)))

```

Do NOT share a `Faker` instance across threads.

**Note:** `set_parallel(True)` uses Rayon's internal thread pool for parallel generation within a single `Faker` instance. This is different from sharing a `Faker` across Python threads, which remains unsafe.

## Development

```bash

# Install Rust
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

# Install maturin
pip install maturin

# Build and install locally
maturin develop --release

# Run tests
cargo test          # Rust tests
pytest              # Python tests

# Run benchmarks
python tests/benchmarks/bench_vs_faker.py

```

## License

MIT