{"id":50338835,"url":"https://github.com/dev360/crease","last_synced_at":"2026-05-29T15:30:42.340Z","repository":{"id":359142266,"uuid":"1242945326","full_name":"dev360/crease","owner":"dev360","description":"Excel parser and validator","archived":false,"fork":false,"pushed_at":"2026-05-20T18:36:28.000Z","size":181,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-05-20T19:50:04.487Z","etag":null,"topics":["data-engineering","etl","excel","pandas","pydantic","python","validation","xlsx","yaml"],"latest_commit_sha":null,"homepage":"https://dev360.github.io/crease/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"bsd-3-clause","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/dev360.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-05-18T22:59:35.000Z","updated_at":"2026-05-20T18:36:32.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/dev360/crease","commit_stats":null,"previous_names":["dev360/crease"],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/dev360/crease","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dev360%2Fcrease","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dev360%2Fcrease/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dev360%2Fcrease/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dev360%2Fcrease/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/dev360","download_url":"https://codeload.github.com/dev360/crease/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dev360%2Fcrease/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33659872,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-05-29T02:00:06.066Z","response_time":107,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-engineering","etl","excel","pandas","pydantic","python","validation","xlsx","yaml"],"created_at":"2026-05-29T15:30:39.497Z","updated_at":"2026-05-29T15:30:42.330Z","avatar_url":"https://github.com/dev360.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# crease\n\nDeclarative Excel-to-JSON extraction **and** validation. Apply a compact YAML\ntemplate to a spreadsheet file — `.xlsx`, `.xls`, `.xlsb`, or `.ods` — get\ncanonical JSON out plus structured per-cell errors. No spreadsheet-specific\ncode in your pipeline.\n\n📖 **Docs:** [dev360.github.io/crease](https://dev360.github.io/crease/)\n\n```bash\npip install crease\n```\n\n\u003e **Status: 1.0** — published to PyPI as [`crease`](https://pypi.org/project/crease/). API surface is stable; breaking changes will be marked `feat!:` and bump the major version. See [ROADMAP.md](ROADMAP.md) for what's next.\n\n---\n\n## Why this exists\n\nExcel parsing has well-documented failure modes that quietly cost money and\ntime. Crease is designed to make these *visible and structured* rather than\nsilent:\n\n| Failure mode | Cost when it happens | How Crease handles it |\n|---|---|---|\n| Excel autoconverts `SEPT2` (gene) to `2-Sep` (date) — affects ~20% of genomics papers | Wrong data downstream, no warning | `treat_as_text` on the field; validator emits `wrong_type` with `likely_cause: excel_autoconvert` |\n| Public Health England loses 15,706 COVID cases to silent row-overflow | 8-day contact-tracing gap during a pandemic | Per-template `min_data_rows` + `column_count_mismatch` detection; nothing fails silently |\n| JPMorgan loses $6B because a VaR model required manual copy-paste between sheets | The whole loss | Canonical JSON flows from xlsx to downstream pipelines — the copy step disappears |\n| Operator gets `#N/A` from VLOOKUP because `\"Acme Corp\"` had a trailing space | Hours of debugging | Always-on header normalization; per-field `normalize: trim` |\n| `N/A`, `TBD`, `-` in cells trigger `wrong_type` everywhere | False positives bury real issues | Layered `null_tokens` — library defaults handle the common ones, templates and fields tighten or loosen |\n| Headers move down a row when the customer adds a title line | Every subsequent file fails | `locate.header_anchor: \"Order ID\"` instead of `header_row: 3` |\n| Operator hides \"soft-deleted\" rows; downstream consumers still process them | Cf. Lehman/Barclays' 179 unwanted trading contracts (2008) | `locate.skip_hidden_rows: true` |\n| Customer sends `.xls` / `.xlsb` / `.ods` instead of `.xlsx` | Pipeline rejects the file; manual re-export | All four formats read out of the box — calamine handles the legacy ones |\n\nThe full catalog of patterns and supporting sources lives in\n[ANECDOTES.md](ANECDOTES.md). The design philosophy: **fail loudly with row\nand field coordinates rather than swallowing the failure into the canonical\noutput.**\n\n---\n\n## Quick start\n\nA template describes *where* the data lives and *what* the fields mean. The\nsame template drives both extraction (cells → canonical JSON) and validation\n(constraints → structured errors).\n\nThe API splits into three composable steps:\n\n```python\nimport crease\n\ntemplate = crease.Template.load(\"templates/orders.crease.yml\")\n\n# 1. Extract — turn cells into canonical JSON\nresult = crease.extract(\"incoming.xlsx\", template)\nresult.canonical[\"orders\"][0]\n# {\"order_id\": \"ORD-1001\", \"customer_email\": \"a@acme.com\",\n#  \"order_date\": \"2025-01-15\", \"quantity\": 10, \"unit_price\": 25.50}\n\n# 2. Validate — independent inspection step\nreport = crease.validate(result, template)\nreport.is_valid              # bool — true iff zero errors\nreport.errors()              # list[Error] — pydantic-shaped\n\n# Or do extract + validate together\nresult, report = crease.check(\"incoming.xlsx\", template)\n```\n\nTemplate paths are plain relative paths resolved from your working directory\n— no implicit \"same folder as the xlsx\" convention.\n\nThe template that produced the output above (note: `pattern:`, `minimum:` etc.\nare both *coercion hints* for extraction **and** *constraints* for validation):\n\n```yaml\n# templates/orders.crease.yml\ntemplate_id: orders\ndescription: Order export from acme.\n\nentities:\n  - name: order\n    cardinality: many\n    locate:\n      tab: Orders\n      orientation: flat\n      header_row: 0\n    fields:\n      - { name: order_id,       source_column: order_id,       type: string,  pattern: ^ORD-\\d{4}$ }\n      - { name: customer_email, source_column: customer_email, type: email }\n      - { name: order_date,     source_column: order_date,     type: date }\n      - { name: quantity,       source_column: quantity,       type: integer, minimum: 1 }\n      - { name: unit_price,     source_column: unit_price,     type: number,  minimum: 0 }\n```\n\n---\n\n## Getting your data out\n\n`result.canonical` is a plain dict — no extra dependencies, no opinions. When\nyou want something richer, opt in:\n\n```python\n# Iterate as dicts\nfor order in result.iter(\"order\"):\n    pipeline.send(order)\n\n# Project into a Pydantic model. Field matching is opportunistic by attribute\n# name: fields the model doesn't declare are dropped silently; type mismatches\n# raise crease.ValidationError.\nfrom pydantic import BaseModel\n\nclass Order(BaseModel):\n    order_id: str\n    quantity: int      # the model can be a subset of the template's fields\n\norders: list[Order] = result.to_pydantic(\"order\", model=Order)\n\n# Project into a pandas DataFrame\ndf = result.to_pandas(\"order\")\n```\n\nBy default, every projection method **halts** if extraction produced any\nerrors — the library's whole pitch is \"fail loudly with coordinates.\" To\nopportunistically recover and keep the rows that did map cleanly:\n\n```python\norders = result.to_pydantic(\"order\", model=Order, allow_partial=True)\n# rows that didn't validate are absent from `orders`.\n# they're listed in result.report.errors() with row/field coordinates.\n```\n\nFor `cardinality: one` entities, use `result.get(\"company\")` /\n`result.get(\"company\", model=Company)` instead — iteration over a single\nrecord is a category error.\n\n---\n\n## Streaming large files\n\nFor multi-hundred-thousand-row files, stream instead of materializing.\nStreaming takes the same `model=` and `allow_partial=` arguments as the\nmaterialized projections, so the shape stays symmetric:\n\n```python\n# Yields dicts\nfor order in crease.stream(\"big.xlsx\", template, entity=\"order\"):\n    pipeline.send(order)\n\n# Yields validated Pydantic instances\nfor order in crease.stream(\"big.xlsx\", template, entity=\"order\", model=Order):\n    pipeline.send(order)\n```\n\nMemory stays bounded (~10MB) regardless of file size. Errors accumulate on\nthe session report rather than being yielded inline — the iterator returns\nthe happy path; the report owns the sad path.\n\n---\n\n## Multi-entity files\n\nWhen one file has multiple shapes (cover sheet + per-region data tabs +\ntotals), declare each as its own entity:\n\n```yaml\nentities:\n  - name: company                       # one record from the cover tab\n    cardinality: one\n    locate: { tab: Cover, orientation: property_sheet, label_col: 0, value_col: 1 }\n    fields:\n      - { name: company_name,  source_label: Company,       type: string }\n      - { name: period,        source_label: Period,        type: string, pattern: ^Q[1-4]\\s\\d{4}$ }\n      - { name: contact_email, source_label: Contact,       type: email }\n\n  - name: order                         # many records from every \"Region - X\" tab\n    cardinality: many\n    locate:\n      tab_pattern: ^Region - (.+)$\n      orientation: flat\n      header_row: 3\n    fields:\n      - { name: order_id, source_column: Order ID, type: string,  pattern: ^ORD-\\d{4}$ }\n      - { name: customer, source_column: Customer, type: string }\n      - { name: total,    source_column: Total,    type: number,  minimum: 0 }\n    enrich:\n      - { field: region, source: tab_name_regex_group, group: 1 }\n\nignore_tabs: [Notes]\n```\n\nUse a session when you want both eager and streaming reads against the same file:\n\n```python\nwith crease.open(\"incoming.xlsx\", template) as session:\n    company = session.get(\"company\")                     # cardinality: one (eager)\n    for order in session.stream(\"order\", model=Order):   # cardinality: many (streaming)\n        pipeline.send({**order.model_dump(), \"_company\": company[\"company_name\"]})\n\n    if not session.report().is_valid:\n        log.warning(session.report().errors())\n```\n\n---\n\n## Repeating sections within one tab\n\nSome reports pack multiple sub-sections into a single tab — a weekly\nschedule with one sub-table per day, separated by a `=====` row or a\nrecurring title. The `blocks:` grammar (template `version: 2`) lets you\ndeclare the repeating region once, anchor each instance with start /\nend patterns, and capture per-section metadata that gets merged onto\nevery row in that section:\n\n```yaml\ntemplate_id: weekly_orders\nversion: 2\n\nblocks:\n  - name: daily_section\n    tab_pattern: ^W-\\d+$\n    starts_at: { column: D, cell_pattern: ^ORDER SCHEDULE$ }\n    ends_at:   { column: A, cell_pattern: ^={3,}$ }\n    captures:\n      - field: order_date\n        from: { column: D, cell_pattern: ^DAY (\\d+-\\d+-\\d+)$, regex_group: 1 }\n        type: date\n        date_formats: ['%m-%d-%Y']\n\nentities:\n  - name: order\n    block: daily_section                # ← scope this entity to each block instance\n    cardinality: many\n    locate:\n      orientation: flat\n      header_anchor: { text: ORDER_ID, match_mode: exact }\n    fields:\n      - { name: order_id, source_column: ORDER_ID, type: string, pattern: ^ORD-\\d{4}$ }\n      - { name: customer, source_column: CUSTOMER, type: string }\n      - { name: quantity, source_column: QUANTITY, type: integer, minimum: 1 }\n```\n\nOutput is flat — `order_date` from each section's DAY-row is merged\nonto every order row from that section. See\n[Repeating sections](docs/guides/blocks.md) for the full grammar.\n\n---\n\n## Scattered metadata (anchored layout)\n\nSome cover sheets sprinkle properties at irregular positions. Anchor each\nfield by the label text near it:\n\n```yaml\nentities:\n  - name: report\n    cardinality: one\n    locate: { tab: Cover, orientation: anchored }\n    fields:\n      - name: period\n        type: string\n        anchor: { label_match: \"Reporting Period\", value_at: right, offset: 1 }\n      - name: contact_email\n        type: email\n        anchor: { label_match: \"Contact\", value_at: right, offset: 1 }\n      - name: submitted_on\n        type: date\n        anchor: { label_match: \"Date sent\", value_at: right, offset: 1 }\n```\n\nSurvives the customer adding or removing rows between properties.\n\n---\n\n## Field types and constraints\n\n| Type | Notes |\n|---|---|\n| `string` | Free text. Add `pattern:` for regex enforcement |\n| `integer` | Coerced from int or float-with-no-fractional |\n| `number` | int or float |\n| `boolean` | Customize with `true_values: [Yes, Y, 1]`, `false_values: [No, N, 0]` |\n| `date` | Use `date_format: \"%m/%d/%Y\"` for ambiguous formats |\n| `datetime` | Same |\n| `email` | Built-in regex |\n| `uuid` | Built-in regex |\n| `url` | Built-in regex |\n\nPer-field options:\n\n```yaml\nfields:\n  - name: customer_email\n    source_column: Email\n    type: email\n    nullable: true                              # blanks allowed\n    null_tokens: [N/A, TBD, \"-\"]                # also treat these strings as null\n    normalize: trim                             # trim | lower | trim_lower\n```\n\n`null_tokens` is layered: library defaults (`N/A`, `TBD`, `-`, `—`, `(blank)`,\n`n/a`, `NaN`) → template-level → field-level. Override any layer, including\nsetting `null_tokens: []` to disable.\n\n---\n\n## CLI\n\n```bash\n# Extract to JSON\ncrease extract incoming.xlsx --template templates/orders.crease.yml \u003e out.json\n\n# Validate only (exit 0 if valid, 1 if cell-level errors, 2 if structural; tune with --fail-on)\ncrease validate incoming.xlsx --template templates/orders.crease.yml\n\n# Extract + validate together\ncrease check incoming.xlsx --template templates/orders.crease.yml --json\n\n# Stream a single entity to JSONL (true streaming, low memory)\ncrease stream incoming.xlsx --template templates/orders.crease.yml --entity order \u003e orders.jsonl\n\n# Batch over a folder — emits per-file JSON outputs plus an error report\ncrease batch ./inbox/ --template templates/orders.crease.yml \\\n  --out ./extracted/ --report ./report.csv\n\n# Run the test corpus (developer command)\ncrease test test_cases/\n```\n\n---\n\n## Installation\n\n```bash\npip install crease                  # core: extract + validate, returns dicts\npip install crease[pandas]          # adds result.to_pandas()\n```\n\nCore deps: python-calamine, openpyxl, pydantic, pyyaml. Pandas is an\n**optional extra** — if you only use `extract` and `to_pydantic`,\nyou don't pay for pandas. No LLM, no network calls at runtime.\n\n### Read backends\n\nCrease reads spreadsheets through two interchangeable backends:\n\n| Backend | Formats | When it's used |\n|---|---|---|\n| **calamine** (default) | `.xlsx`, `.xls`, `.xlsb`, `.ods` | Picked automatically. Fast (Rust under the hood) and GIL-releasing, so a `ThreadPoolExecutor` parallelizes multi-file reads. |\n| **openpyxl** | `.xlsx` only | Picked automatically when the template declares `locate.skip_hidden_rows: true` — only openpyxl exposes row-hidden cell metadata. |\n\nOverride the auto-selection with `engine=\"calamine\"` or `engine=\"openpyxl\"`\non `extract`, `get`, `stream`, `check`, and `crease.open`. Forcing calamine\non a `skip_hidden_rows` template emits a `UserWarning` and silently\ndegrades that feature to a no-op (calamine can't see the flag); use this\nonly when you're reading a non-xlsx file and have already verified hidden\nrows aren't present.\n\n### Local development\n\nThe repo uses [uv](https://docs.astral.sh/uv/) and the `src/` layout.\n\n```bash\n# 1. Clone\ngit clone git@github.com:dev360/crease.git\ncd crease\n\n# 2. Install (core + extras + test deps) into a uv-managed venv\nuv sync --all-extras --group test\n\n# 3. Run the corpus\nuv run pytest\n\n# 4. Optional: build the docs site locally\nuv run mkdocs serve     # http://localhost:8000\n\n# 5. Hook up pre-commit (runs ruff + conventional-commit on every commit)\nuv run pre-commit install\nuv run pre-commit run --all-files\n```\n\nIf you're not using uv, plain `pip` works fine against the venv of\nyour choice. PEP 735 `[dependency-groups]` requires pip ≥ 25.1, so we\ninstall the dev/test tools by name instead:\n\n```bash\npython3 -m venv .venv\nsource .venv/bin/activate\npip install -e \".[pandas]\"                       # editable install with the pandas extra\npip install pytest faker pre-commit ruff         # test + dev tools\npytest\n```\n\nTemplate authoring (by hand, by an LLM tool you build, by import from\nanother schema language) is out of scope for this library.\n\n---\n\n## Errors and validation\n\nErrors are pydantic-shaped — the same vocabulary anyone using Pydantic\nalready knows. Every constraint declared on a field is enforced at validation\ntime, with row and field coordinates attached.\n\n```python\nreport = crease.validate(result, template)\n\nreport.is_valid                   # bool — true iff zero errors\nreport.error_count()              # int\nreport.errors()                   # list[Error]\n\nerr = report.errors()[0]\nerr.type        # \"wrong_type\" — stable machine code, safe to route on\nerr.loc         # (\"order\", 47, \"customer_email\") — (entity, row, field)\nerr.msg         # human-readable\nerr.input       # the offending value\nerr.ctx         # extra context, e.g. {\"likely_cause\": \"excel_autoconvert\"}\nerr.severity    # \"cell\" | \"structural\"\n```\n\nThe halt-by-default projection methods raise `crease.ValidationError`, which\ncarries the same data:\n\n```python\ntry:\n    orders = result.to_pydantic(\"order\", model=Order)\nexcept crease.ValidationError as e:\n    e.errors()         # same list as report.errors() would have produced\n    e.error_count()\n```\n\n### Severity\n\n| Severity | Meaning | What you typically do |\n|---|---|---|\n| `structural` | The template can't even map the file (missing tab, header mapping failed, column count mismatch). | Bounce back to sender — the file is unusable as-is. |\n| `cell` | Per-row problem (missing value, wrong type, constraint violation). | Send to a human review queue with bad rows highlighted, or recover with `allow_partial=True`. |\n\n### Error type codes\n\n**Cell-level** (`severity: \"cell\"`):\n\n| `error.type` | Triggers when |\n|---|---|\n| `missing_required` | A non-nullable field has a blank value (after `null_tokens` collapse) |\n| `wrong_type` | Value can't coerce to the declared type. Includes `ctx.likely_cause: excel_autoconvert` when applicable |\n| `pattern_mismatch` | String doesn't match `pattern:` |\n| `enum_violation` | Value not in declared `enum:` |\n| `below_minimum`, `above_maximum` | Numeric range violation |\n| `empty_row` | Mid-data blank row |\n| `duplicate_row` | Row identical to a previous one |\n| `anchor_not_found` | Anchored field's label text not present in tab. `ctx.label_was: \"absent\"`. |\n| `anchor_value_blank` | Anchored field's label is present but the value cell is blank. Informational; only fires on `nullable: true` fields. `ctx.label_was: \"present\"`. |\n| `anchor_value_type_mismatch` | Anchor's label matched but the neighbor cell's shape didn't fit `anchor.value_type`. Surfaces the case where the operator put the wrong thing next to the label. |\n| `header_duplicated` | `source_column` matches multiple header cells in the same row; bind picked the first. Set `source_column_index:` on the field to choose a specific occurrence (0-indexed across the matches). |\n| `header_above_nonblank` | The row immediately above `header_row` has non-blank text in a column that also has a header. Surfaces the case where the operator pointed at the bottom of a two-row header. The column geometry is in `ctx.columns`. |\n| `low_data_density` | Entity's `locate.min_data_density` threshold not met across the extracted records. `ctx.density` and `ctx.threshold` carry the numbers. |\n| `boolean_alias_unknown` | Value didn't match `true_values`/`false_values` |\n| `model_field_missing_in_canonical` | A Pydantic model passed to `to_pydantic` requires a field the template doesn't produce |\n| `model_type_mismatch` | A Pydantic model's field type doesn't match the canonical value's type |\n\n**Structural** (`severity: \"structural\"`):\n\n| `error.type` | Triggers when |\n|---|---|\n| `missing_tab` | Template's `tab:` doesn't exist |\n| `tab_pattern_no_match` | `tab_pattern:` matched zero tabs |\n| `column_count_mismatch` | Header row has wrong number of columns |\n| `header_mapping_failed` | `source_column`/`source_label` not found |\n| `entity_missing` | Locate found nothing |\n| `multiple_rows_for_cardinality_one` | `cardinality: one` entity found \u003e1 row |\n| `unreadable_source` | Source file could not be opened. `extract()` does not raise for this; the failure lands in `report.errors()` so callers can handle it the same way they handle template-mapping failures. |\n\n---\n\n## Documentation\n\n- [`ROADMAP.md`](ROADMAP.md) — what's in v1, what's deferred\n- [`COVERAGE.md`](COVERAGE.md) — layouts and validation errors supported\n- [`CONVENTIONS.md`](CONVENTIONS.md) — Excel patterns we handle, with examples\n- [`test_cases/`](test_cases/) — labeled fixtures that double as the spec\n\n## License\n\nBSD 3-Clause. See [LICENSE](LICENSE).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdev360%2Fcrease","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdev360%2Fcrease","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdev360%2Fcrease/lists"}