{"id":16616881,"url":"https://github.com/sebastienrousseau/bankstatementparser","last_synced_at":"2026-04-11T01:29:15.936Z","repository":{"id":206291328,"uuid":"716009113","full_name":"sebastienrousseau/bankstatementparser","owner":"sebastienrousseau","description":"A comprehensive toolkit for finance and treasury specialists, featuring robust parsers for bank statements in CAMT (ISO 20022) format, enabling seamless processing and analysis of financial transactions.🐍","archived":false,"fork":false,"pushed_at":"2025-02-12T03:40:51.000Z","size":170,"stargazers_count":9,"open_issues_count":2,"forks_count":4,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-06-15T14:08:13.993Z","etag":null,"topics":["banking","camt","finance","iso20022","pain001","reporting","sepa","transactions","treasury"],"latest_commit_sha":null,"homepage":"http://bankstatementparser.com/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/sebastienrousseau.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":".github/funding.yml","license":"LICENSE","code_of_conduct":".github/CODE-OF-CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":".github/CODEOWNERS","security":".github/SECURITY.md","support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null},"funding":{"github":"sebastienrousseau","custom":"https://paypal.me/wwdseb"}},"created_at":"2023-11-08T09:38:15.000Z","updated_at":"2025-06-01T06:30:31.000Z","dependencies_parsed_at":null,"dependency_job_id":"826346cf-5fde-4a48-86f7-7383d4006710","html_url":"https://github.com/sebastienrousseau/bankstatementparser","commit_stats":{"total_commits":3,"total_committers":1,"mean_commits":3.0,"dds":0.0,"last_synced_commit":"36742ae49fe2e42136639934cacbc18dfdb6f81a
"},"previous_names":["sebastienrousseau/bankstatementparser"],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/sebastienrousseau/bankstatementparser","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sebastienrousseau%2Fbankstatementparser","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sebastienrousseau%2Fbankstatementparser/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sebastienrousseau%2Fbankstatementparser/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sebastienrousseau%2Fbankstatementparser/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/sebastienrousseau","download_url":"https://codeload.github.com/sebastienrousseau/bankstatementparser/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sebastienrousseau%2Fbankstatementparser/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":265711012,"owners_count":23815458,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["banking","camt","finance","iso20022","pain001","reporting","sepa","transactions","treasury"],"created_at":"2024-10-12T02:14:31.383Z","updated_at":"2026-04-11T01:29:15.927Z","avatar_url":"https://github.com/sebastienrousseau.png","language":"Python","readme":"# Bank Statement Parser\n\nParse bank statements across **six structured formats** (CAMT, PAIN.001, CSV, OFX/QFX, MT940) **and PDFs** — both digital and scanned — into a single unified 
`Transaction` model. ISO 20022 files take the deterministic path; PDFs fall through to a configurable LLM (Ollama by default, any LiteLLM-supported provider) and finally to a multimodal vision model for scanned/photocopied statements.\n\nBuilt for finance teams, treasury analysts, and fintech developers who need reliable, auditable extraction across the full spectrum of bank statement formats — without sending data to external services unless they explicitly opt in.\n\n[![PyPI](https://img.shields.io/pypi/pyversions/bankstatementparser.svg?style=for-the-badge\u0026v=0.0.6)](https://pypi.org/project/bankstatementparser/)\n[![PyPI Downloads](https://img.shields.io/pypi/dm/bankstatementparser.svg?style=for-the-badge)](https://pypi.org/project/bankstatementparser/)\n[![Codecov](https://img.shields.io/codecov/c/github/sebastienrousseau/bankstatementparser?style=for-the-badge)](https://codecov.io/github/sebastienrousseau/bankstatementparser?branch=main)\n[![License](https://img.shields.io/github/license/sebastienrousseau/bankstatementparser?style=for-the-badge)](LICENSE)\n\n## How it works\n\n`smart_ingest()` routes any input file through the cheapest viable extraction path. Deterministic parsers always run first ($0 cost). 
Text and vision LLMs are fallbacks for unstandardized PDFs — both are opt-in via separate install extras and can be swapped between any LiteLLM-supported provider (Ollama, Anthropic, OpenAI, Gemini, …).\n\n```mermaid\nflowchart TD\n    A[smart_ingest\u0026lpar;path\u0026rpar;] --\u003e B{detect_statement_format}\n    B -- CAMT/PAIN/OFX/MT940/CSV --\u003e C[Path A: deterministic parser\u003cbr/\u003e$0, fastest]\n    C --\u003e Z[IngestResult\u003cbr/\u003esource_method='deterministic']\n\n    B -- pdf or unknown --\u003e D[pypdf extract_text]\n    D --\u003e E{text len \u0026gt;= 50?}\n\n    E -- yes --\u003e F[Path B: text-LLM\u003cbr/\u003edefault ollama/llama3]\n    F --\u003e Y[IngestResult\u003cbr/\u003esource_method='llm']\n\n    E -- no --\u003e G[Path C: vision-LLM\u003cbr/\u003eopt-in via BSP_HYBRID_VISION_MODEL]\n    G --\u003e X[IngestResult\u003cbr/\u003esource_method='vision']\n\n    Z --\u003e V[verify_balance\u003cbr/\u003eGolden Rule]\n    Y --\u003e V\n    X --\u003e V\n    V --\u003e R[VERIFIED / DISCREPANCY / FAILED]\n```\n\nEvery extracted row carries an immutable `transaction_hash`, an audit-trail `source_method` tag, and (for LLM rows) a `confidence` score — see [Hybrid extraction](#hybrid-extraction-pdfs-included-v005) below for the full surface.\n\n## Key Features\n\n| Feature | Description |\n|---|---|\n| **6 structured formats** | CAMT.053, PAIN.001, CSV, OFX, QFX, MT940 |\n| **Hybrid PDF pipeline** *(v0.0.5)* | `smart_ingest()` routes digital PDFs through a text-LLM and scanned PDFs through a multimodal vision model. Deterministic parsers always tried first ($0 cost). |\n| **Local-first LLM** *(v0.0.5)* | Ollama is the default backend; switch to Anthropic, OpenAI, or any LiteLLM provider via `BSP_HYBRID_MODEL`. Vision is opt-in via `BSP_HYBRID_VISION_MODEL` — no surprise downloads. |\n| **Golden Rule verification** *(v0.0.5)* | Every result carries `opening + credits − debits == closing` status: `VERIFIED`, `DISCREPANCY`, or `FAILED`. 
|\n| **Idempotent dedup** *(v0.0.5)* | Every `Transaction` carries a stable `transaction_hash` (MD5 of date + normalized description + amount). `Deduplicator.dedupe_by_hash()` makes incremental ingestion safe to re-run. |\n| **Auto-detection** | `detect_statement_format()` identifies the format; `create_parser()` returns the right parser |\n| **PII redaction** | Names, IBANs, and addresses masked by default — opt in with `--show-pii` |\n| **Streaming** | `parse_streaming()` at 27,000+ tx/s (CAMT) and 52,000+ tx/s (PAIN.001) with bounded memory |\n| **Parallel** | `parse_files_parallel()` for multi-file batch processing across CPU cores |\n| **Secure ZIP** | `iter_secure_xml_entries()` rejects zip bombs, encrypted entries, and suspicious compression ratios |\n| **In-memory parsing** | `from_string()` and `from_bytes()` parse XML without touching disk |\n| **Export** | CSV, JSON, Excel (`.xlsx`), and optional Polars DataFrames |\n| **100% coverage** | 644 tests, 100% branch coverage, property-based fuzzing with Hypothesis |\n\n## Requirements\n\n- Python **3.10** through **3.14** (Python 3.9 was dropped in v0.0.6 — pin to v0.0.5 if you cannot upgrade your interpreter)\n- Poetry (for local development)\n\n## Install\n\n```bash\n# Core install — deterministic parsers only (CAMT, PAIN.001, CSV, OFX, QFX, MT940)\npip install bankstatementparser\n\n# Add the text-LLM path for digital PDFs (litellm + pypdf)\npip install 'bankstatementparser[hybrid]'\n\n# Add higher-fidelity table extraction (adds pdfplumber)\npip install 'bankstatementparser[hybrid-plus]'\n\n# Add the multimodal vision path for scanned/photocopied PDFs (adds pypdfium2)\npip install 'bankstatementparser[hybrid-vision]'\n```\n\nThe core install has zero AI dependencies. 
Every `[hybrid*]` extra is opt-in and pure-Python — no `poppler`, no system libraries, no GPU required.\n\n### Local Development\n\nClone and install on **macOS, Linux, or WSL**:\n\n```bash\ngit clone https://github.com/sebastienrousseau/bankstatementparser.git\ncd bankstatementparser\npython3 -m venv .venv\nsource .venv/bin/activate\npip install poetry\npoetry install --with dev\nmake install-hooks   # pre-commit hook runs `make verify` before every commit\n```\n\n## Quick Start\n\n### Parse a CAMT statement\n\n```python\nfrom bankstatementparser import CamtParser\n\nparser = CamtParser(\"statement.xml\")\ntransactions = parser.parse()\nprint(transactions)\n```\n\n```text\n   Amount Currency DrCr  Debtor Creditor      ValDt      AccountId\n 105678.5      SEK CRDT MUELLER          2010-10-18 50000000054910\n-200000.0      SEK DBIT                  2010-10-18 50000000054910\n  30000.0      SEK CRDT                  2010-10-18 50000000054910\n```\n\n### Parse a PAIN.001 payment file\n\n```python\nfrom bankstatementparser import Pain001Parser\n\nparser = Pain001Parser(\"payment.xml\")\npayments = parser.parse()\nprint(payments)\n```\n\n```text\n  PmtInfId PmtMtd  InstdAmt Currency  CdtrNm         EndToEndId\n  PMT-001  TRF     1500.00  EUR       ACME Corp      E2E-001\n  PMT-001  TRF     2300.50  EUR       Global Ltd     E2E-002\n```\n\n### Auto-detect the format\n\n```python\nfrom bankstatementparser import create_parser, detect_statement_format\n\nfmt = detect_statement_format(\"transactions.ofx\")\nparser = create_parser(\"transactions.ofx\", fmt)\nrecords = parser.parse()\n```\n\nWorks with `.xml`, `.csv`, `.ofx`, `.qfx`, and `.mt940` files.\n\n### Hybrid extraction (PDFs included) *(v0.0.5)*\n\n`smart_ingest()` is the single entry point that routes any file through the cheapest viable extraction path:\n\n```python\nfrom bankstatementparser.hybrid import smart_ingest\n\n# Path A — deterministic parser (free, fastest, $0)\nresult = 
smart_ingest(\"statement.xml\")\nprint(result.source_method)         # \"deterministic\"\n\n# Path B — text-LLM for digital PDFs (set BSP_HYBRID_MODEL=ollama/llama3)\nresult = smart_ingest(\"statement.pdf\")\nprint(result.source_method)         # \"llm\"\nprint(result.verification.status)   # VERIFIED | DISCREPANCY | FAILED\n\n# Path C — multimodal vision for scanned PDFs (set BSP_HYBRID_VISION_MODEL)\n# auto-routed when pypdf cannot extract enough text\nresult = smart_ingest(\"scan.pdf\")\nprint(result.source_method)         # \"vision\"\n```\n\nEvery row carries:\n\n- `source_method` — `\"deterministic\"`, `\"llm\"`, or `\"vision\"` for full audit provenance\n- `transaction_hash` — MD5 fingerprint of `date | normalized_description | amount`, ready for idempotent re-ingestion\n- `confidence` — float between 0 and 1 for LLM rows, `None` for deterministic\n- `raw_source_text` — best-effort source-text slice for the v0.0.6 review-mode UI\n\nA complete walkthrough with synthetic UK-bank PDFs, mock vs. live mode, and a Mermaid flow diagram lives in [`examples/hybrid/README.md`](examples/hybrid/README.md).\n\n### Parse from memory (no disk I/O)\n\n```python\nfrom bankstatementparser import CamtParser\n\nxml_bytes = download_from_sftp()  # your own function\nparser = CamtParser.from_bytes(xml_bytes, source_name=\"daily.xml\")\ntransactions = parser.parse()\n```\n\nPass only decompressed XML to `from_string()` or `from_bytes()`. 
For ZIP archives, use `iter_secure_xml_entries()`.\n\n### Parse XML files inside a ZIP archive\n\n```python\nfrom bankstatementparser import CamtParser, iter_secure_xml_entries\n\nfor entry in iter_secure_xml_entries(\"statements.zip\"):\n    parser = CamtParser.from_bytes(entry.xml_bytes, source_name=entry.source_name)\n    transactions = parser.parse()\n    print(entry.source_name, len(transactions), \"transactions\")\n```\n\nThe iterator enforces size limits, blocks encrypted entries, and rejects suspicious compression ratios before any XML parsing occurs.\n\n## PII Redaction\n\nPII (names, IBANs, addresses) is **redacted by default** in console output and streaming mode.\n\n```python\n# Redacted by default\nfor tx in parser.parse_streaming(redact_pii=True):\n    print(tx)  # Names and addresses show as ***REDACTED***\n\n# Opt in to see full data\nfor tx in parser.parse_streaming(redact_pii=False):\n    print(tx)\n```\n\nFile exports (CSV, JSON, Excel) always contain the full unredacted data.\n\n## Streaming\n\nProcess large files incrementally. Memory stays bounded regardless of file size — tested at 50,000 transactions with sub-2x memory scaling.\n\n```python\nfrom bankstatementparser import CamtParser\n\nparser = CamtParser(\"large_statement.xml\")\nfor transaction in parser.parse_streaming():\n    process(transaction)  # each transaction is a dict\n```\n\nWorks with both `CamtParser` and `Pain001Parser`. PAIN.001 files over 50 MB use chunk-based namespace stripping via a temporary file — the full document is never loaded into memory.\n\n## Performance\n\n| Metric | CAMT | PAIN.001 |\n|---|---|---|\n| **Throughput** | 27,000+ tx/s | 52,000+ tx/s |\n| **Per-transaction latency** | 37 us | 19 us |\n| **Time to first result** | \u003c 1 ms | \u003c 2 ms |\n| **Memory scaling** | Constant (1K–50K) | Constant (1K–50K) |\n\nPerformance is flat from 1,000 to 50,000 transactions. 
CI enforces minimum TPS and latency thresholds.\n\n## Parallel Parsing\n\nProcess multiple files simultaneously across CPU cores:\n\n```python\nfrom bankstatementparser import parse_files_parallel\n\nresults = parse_files_parallel([\n    \"statements/jan.xml\",\n    \"statements/feb.xml\",\n    \"statements/mar.xml\",\n])\n\nfor r in results:\n    print(r.path, r.status, len(r.transactions), \"rows\")\n```\n\nUses `ProcessPoolExecutor` to bypass the GIL. Each file is parsed in its own worker process. Auto-detects format per file, or force with `format_name=\"camt\"`.\n\n## Command Line\n\nAfter installation a `bankstatementparser` console script is available on `PATH`:\n\n```bash\n# Parse and display\nbankstatementparser --type camt --input statement.xml\n\n# Export to CSV\nbankstatementparser --type camt --input statement.xml --output transactions.csv\n\n# Stream with PII visible\nbankstatementparser --type camt --input statement.xml --streaming --show-pii\n\n# v0.0.5 — hybrid pipeline (auto-routes deterministic / text-LLM / vision)\nbankstatementparser --type ingest --input statement.pdf\nbankstatementparser --type ingest --input statement.pdf --output ledger.csv\n```\n\nSupports `--type camt`, `--type pain001`, and `--type ingest` (v0.0.5). The `python -m bankstatementparser.cli ...` invocation form continues to work for parity with older releases.\n\n## Deduplication\n\nDetect duplicate transactions across multiple sources:\n\n```python\nfrom bankstatementparser import CamtParser, Deduplicator\n\nparser = CamtParser(\"statement.xml\")\ndedup = Deduplicator()\nresult = dedup.deduplicate(dedup.from_dataframe(parser.parse()))\n\nprint(f\"Unique: {len(result.unique_transactions)}\")\nprint(f\"Exact duplicates: {len(result.exact_duplicates)}\")\nprint(f\"Suspected matches: {len(result.suspected_matches)}\")\n```\n\nThe `Deduplicator` uses deterministic hashing for exact matches and configurable similarity thresholds for suspected matches. 
Each match group includes a confidence score and reason for auditability.\n\n## Export\n\n```python\nparser = CamtParser(\"statement.xml\")\nparser.parse()\n\n# CSV\nparser.export_csv(\"output.csv\")\n\n# JSON (includes summary + transactions)\nparser.export_json(\"output.json\")\n\n# Excel\nparser.camt_to_excel(\"output.xlsx\")\n```\n\n### Polars (optional)\n\nConvert any parser output to a Polars DataFrame:\n\n```python\npolars_df = parser.to_polars()\nlazy_df = parser.to_polars_lazy()\n```\n\nInstall with `pip install bankstatementparser[polars]`.\n\n## Examples\n\nSee [`examples/`](examples/README.md) for 22 runnable scripts (14 deterministic + 8 hybrid):\n\n### Deterministic parsers\n\n| Example | What it demonstrates |\n|---|---|\n| `parse_camt_basic.py` | Load a CAMT.053 file and print transactions |\n| `parse_camt_from_string.py` | Parse CAMT from an in-memory XML string |\n| `inspect_camt.py` | Extract balances, stats, and summaries |\n| `export_camt.py` | Export to CSV and JSON |\n| `export_camt_excel.py` | Export to Excel workbook |\n| `stream_camt.py` | Stream transactions incrementally |\n| `parse_camt_zip.py` | Secure ZIP archive processing |\n| `parse_detected_formats.py` | Auto-detect CSV, OFX, MT940, and XML formats |\n| `parse_pain001_basic.py` | Parse a PAIN.001 payment file |\n| `export_pain001.py` | Export PAIN.001 to CSV and JSON |\n| `stream_pain001.py` | Stream payments incrementally |\n| `validate_input.py` | Validate file paths with InputValidator |\n| `compatibility_wrappers.py` | Legacy API wrappers |\n| `cli_examples.sh` | CLI commands for CAMT and PAIN.001 |\n\n### Hybrid pipeline *(v0.0.5)*\n\n| Example | What it demonstrates |\n|---|---|\n| `hybrid/generate_sample_pdfs.py` | Produce reproducible synthetic UK-bank PDFs (digital + scanned) |\n| `hybrid/01_smart_ingest_deterministic.py` | Path A — `smart_ingest()` against a CAMT.053 fixture, $0 cost |\n| `hybrid/02_smart_ingest_text_llm.py` | Path B — text-LLM extraction from a digital 
PDF (mock or live Ollama) |\n| `hybrid/03_smart_ingest_vision.py` | Path C — multimodal vision extraction with `LOW_TEXT_DENSITY` auto-routing |\n| `hybrid/04_golden_rule.py` | All three `verify_balance()` outcomes |\n| `hybrid/05_dedupe_recurring.py` | `normalize_description()` + `dedupe_by_hash()` for idempotent batching |\n| `hybrid/06_cli_walkthrough.sh` | Four flavours of the new `--type ingest` CLI subcommand |\n\nSee [`examples/hybrid/README.md`](examples/hybrid/README.md) for the full walkthrough including a Mermaid flow diagram, the cross-platform verification matrix, and the Ollama smoke-test results.\n\n## XML Tag Mapping\n\nSee [`docs/MAPPING.md`](docs/MAPPING.md) for a complete reference of ISO 20022 XML tags to DataFrame columns across all six formats. Use this when integrating with ERP systems or building reconciliation pipelines.\n\n## Project Layout\n\n```text\nbankstatementparser/   Source code (23 modules: deterministic core + hybrid + enrichment subpackages, 100% branch coverage)\nbankstatementparser/hybrid/   v0.0.5 PDF pipeline: orchestrator, llm_extractor, vision, pdf_text, prompts, verification\ndocs/compliance/       ISO 13485 validation, risk register, traceability matrix\nexamples/              14 deterministic + 8 hybrid runnable example scripts\nscripts/               SBOM generation, checksums, signature verification\ntests/                 644 tests (unit, integration, property-based, security, hybrid mocks)\n```\n\n## Security\n\nBank statement files contain sensitive financial and personal data. 
This library is designed with security as a primary constraint:\n\n- **XXE protection** — `resolve_entities=False`, `no_network=True`, `load_dtd=False`\n- **ZIP bomb protection** — compression ratio limits, entry size caps, encrypted entry rejection\n- **Path traversal prevention** — dangerous pattern blocklist, symlink resolution\n- **PII redaction** — default masking of names, IBANs, and addresses\n- **Signed commits** — enforced in CI via GitHub API verification\n- **Supply chain** — SHA-256 hash-locked dependencies, CycloneDX SBOM, build provenance attestation\n\nFor vulnerability reports, see [SECURITY.md](.github/SECURITY.md).\n\nFor the full compliance suite, see [`docs/compliance/`](docs/compliance/).\n\n## Verify the Repository\n\nRun the full validation suite locally:\n\n```bash\nruff check bankstatementparser tests examples scripts\npython -m mypy bankstatementparser\npython -m pytest\nbandit -r bankstatementparser examples scripts -q\n```\n\n## Contributing\n\nSigned commits required. See [CONTRIBUTING.md](CONTRIBUTING.md).\n\n## License\n\nApache License 2.0. See [LICENSE](LICENSE).\n\n## FAQ\n\n**What formats are supported?**\nCAMT.053, PAIN.001, CSV, OFX, QFX, and MT940, plus digital and scanned PDFs through the opt-in hybrid pipeline.\n\n**Does any data leave my infrastructure?**\nNot by default. The deterministic parsers make zero network calls, and the XML parsers enforce `no_network=True`. No cloud, no telemetry. The hybrid PDF path uses Ollama by default; data reaches an external service only if you explicitly configure a hosted LiteLLM provider.\n\n**Is PII redacted automatically?**\nYes. Names, IBANs, and addresses are masked by default in console output and streaming. File exports retain full data.\n\n**Is the extraction deterministic?**\nFor the deterministic parsers, yes. The same input produces byte-identical output, which is critical for financial auditing. Rows from the text-LLM and vision paths are tagged through `source_method` and carry a `confidence` score.\n\n**Can it handle large files?**\nYes. `parse_streaming()` is tested at 50,000 transactions (~25 MB) with bounded memory. 
Files over 50 MB use chunk-based streaming.\n\nSee [FAQ.md](FAQ.md) for the complete FAQ covering data privacy, technical specs, and treasury workflows.\n\n---\n\nTHE ARCHITECT ᛫ Sebastien Rousseau ᛫ https://sebastienrousseau.com\nTHE ENGINE ᛞ EUXIS ᛫ Enterprise Unified Execution Intelligence System ᛫ https://euxis.co\n","funding_links":["https://github.com/sponsors/sebastienrousseau","https://paypal.me/wwdseb"],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsebastienrousseau%2Fbankstatementparser","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsebastienrousseau%2Fbankstatementparser","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsebastienrousseau%2Fbankstatementparser/lists"}