{"id":19511478,"url":"https://github.com/datafog/datafog-python","last_synced_at":"2026-05-31T23:00:48.281Z","repository":{"id":222876364,"uuid":"657135455","full_name":"DataFog/datafog-python","owner":"DataFog","description":"Lightweight Python SDK for PII detection, redaction, and LLM guardrails, with fast regex defaults and optional NLP/OCR/Spark extras.","archived":false,"fork":false,"pushed_at":"2026-05-27T23:01:34.000Z","size":83215,"stargazers_count":61,"open_issues_count":11,"forks_count":14,"subscribers_count":1,"default_branch":"dev","last_synced_at":"2026-05-28T00:22:30.628Z","etag":null,"topics":["anonymization","compliance","data-privacy","data-protection","de-identification","gliner","guardrails","llm","ner","nlp","ocr","pii","pii-detection","pii-redaction","privacy-engineering","python","redaction","sdk","spacy"],"latest_commit_sha":null,"homepage":"https://datafog.ai","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":"sidmohan0/datafog-python","license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/DataFog.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.MD","contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":"SECURITY.md","support":null,"governance":null,"roadmap":"docs/roadmap.rst","authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2023-06-22T11:50:50.000Z","updated_at":"2026-05-27T16:19:15.000Z","dependencies_parsed_at":null,"dependency_job_id":"45d59fde-2796-4278-bb5f-29ebed8c6a05","html_url":"https://github.com/DataFog/datafog-python","commit_stats":null,"previous_names":["datafog/datafog-python"],"tags_count":56,"template":false,"template_full_name":null,"purl":"pkg:github/
DataFog/datafog-python","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DataFog%2Fdatafog-python","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DataFog%2Fdatafog-python/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DataFog%2Fdatafog-python/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DataFog%2Fdatafog-python/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/DataFog","download_url":"https://codeload.github.com/DataFog/datafog-python/tar.gz/refs/heads/dev","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DataFog%2Fdatafog-python/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33752286,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-05-31T02:00:06.040Z","response_time":95,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["anonymization","compliance","data-privacy","data-protection","de-identification","gliner","guardrails","llm","ner","nlp","ocr","pii","pii-detection","pii-redaction","privacy-engineering","python","redaction","sdk","spacy"],"created_at":"2024-11-10T23:21:09.608Z","updated_at":"2026-05-31T23:00:48.271Z","avatar_url":"https://github.com/DataFog.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"rea
dme":"# DataFog Python\n\nDataFog is a Python library for detecting and redacting personally identifiable information (PII).\n\nIt provides:\n\n- Fast structured PII detection via regex\n- Optional NER support via spaCy and GLiNER\n- A simple agent-oriented API for LLM applications\n- Backward-compatible `DataFog` and `TextService` classes\n\n## 4.5 Focus\n\nDataFog 4.5 is focused on lightweight text PII screening: a small core install,\nfast regex-based scan/redact helpers, explicit optional extras, and a clearer\npath toward future middleware use cases. Dedicated Sentry, OpenTelemetry,\nlogging-framework, and cloud DLP adapters are future-facing work and are not\npart of the 4.5 release.\n\n## Installation\n\n```bash\n# Core install (regex engine)\npip install datafog\n\n# Add spaCy support\npip install datafog[nlp]\n\n# Add GLiNER + spaCy support\npip install datafog[nlp-advanced]\n\n# Add local OCR support\npip install datafog[ocr]\n\n# Add Spark/distributed support\npip install datafog[distributed]\n\n# Everything\npip install datafog[all]\n```\n\nPython 3.13 support is certified for the core SDK, CLI, `nlp`,\n`nlp-advanced`, and `ocr` install profiles. Donut OCR still requires a model\nthat is available locally before runtime use. 
`distributed` and `all` are not\nnewly certified on Python 3.13 in the 4.5 line.\n\n## Quick Start\n\n```python\nimport datafog\n\ntext = \"Contact john@example.com or call (555) 123-4567\"\nclean = datafog.sanitize(text, engine=\"regex\")\nprint(clean)\n# Contact [EMAIL_1] or call [PHONE_1]\n```\n\n## For LLM Applications\n\n```python\nimport datafog\n\n# 1) Scan prompt text before sending to an LLM\nprompt = \"My SSN is 123-45-6789\"\nscan_result = datafog.scan_prompt(prompt, engine=\"regex\")\nif scan_result.entities:\n    print(f\"Detected {len(scan_result.entities)} PII entities\")\n\n# 2) Redact model output before returning it\noutput = \"Email me at jane.doe@example.com\"\nsafe_result = datafog.filter_output(output, engine=\"regex\")\nprint(safe_result.redacted_text)\n# Email me at [EMAIL_1]\n\n# 3) One-liner redaction\nprint(datafog.sanitize(\"Card: 4111-1111-1111-1111\", engine=\"regex\"))\n# Card: [CREDIT_CARD_1]\n```\n\n## German Structured PII\n\nGerman structured PII is country-specific and opt-in. 
Use explicit locale\nselection or entity-type filtering when you want German VAT IDs, German IBANs,\ntax IDs, postal codes, passports, or residence permits.\n\n```python\nimport datafog\n\ntext = \"Steuer-ID 12345678901 liegt vor.\"\n\nprint(datafog.scan(text, engine=\"regex\").entities)\n# []\n\nprint(datafog.scan(text, engine=\"regex\", locales=[\"de\"]).entities)\n# [Entity(type='DE_TAX_ID', text='12345678901', ...)]\n```\n\n### Guardrails\n\n```python\nimport datafog\n\n# Reusable guardrail object\nguard = datafog.create_guardrail(engine=\"regex\", on_detect=\"redact\")\n\n@guard\ndef call_llm() -\u003e str:\n    return \"Send to admin@example.com\"\n\nprint(call_llm())\n# Send to [EMAIL_1]\n```\n\n## Engines\n\nUse the engine that matches your accuracy and dependency constraints:\n\n- `regex`:\n  - Fastest and always available.\n  - Best for default structured entities: `EMAIL`, `PHONE`, `SSN`, `CREDIT_CARD`, `IP_ADDRESS`, `DATE`, `ZIP_CODE`.\n  - Use `locales=[\"de\"]` for German structured IDs such as `DE_VAT_ID`, `DE_IBAN`, `DE_TAX_ID`, `DE_POSTAL_CODE`, and passport or residence permit numbers.\n- `spacy`:\n  - Requires `pip install datafog[nlp]`.\n  - Useful for unstructured entities like person and organization names.\n- `gliner`:\n  - Requires `pip install datafog[nlp-advanced]`.\n  - Stronger NER coverage than regex for unstructured text.\n- `smart`:\n  - Cascades regex with optional NER engines.\n  - If optional deps are missing, it degrades gracefully and warns.\n\n## Optional OCR And Spark Surfaces\n\nDataFog 4.5 keeps the main package story centered on lightweight text PII\nscreening. 
OCR and Spark remain supported optional surfaces for users who\nalready rely on them, but they are not required for the core import, default\nscan/redact helpers, or guardrail helpers.\n\n- OCR:\n  - Install `datafog[ocr]` for local image OCR helpers.\n  - URL-based image downloading also needs `datafog[web,ocr]`.\n  - Tesseract usage requires the system `tesseract` binary.\n  - Python 3.13 is validated for the OCR install profile, Pillow,\n    pytesseract, and system Tesseract smoke checks.\n  - Donut OCR requires `datafog[nlp-advanced,ocr]` and a model already available\n    locally.\n- Spark:\n  - Install `datafog[distributed]` for `SparkService`.\n  - Spark PII UDF helpers also require `datafog[nlp]` and an installed spaCy\n    model.\n  - A Java runtime is required by PySpark.\n\nOCR and Spark are not deprecated. Their broader API and packaging overhaul is\ndeferred; the 4.5 goal is to keep them explicit, documented, and isolated from\nthe lightweight core path.\n\n## Backward-Compatible APIs\n\nThe existing public API remains available.\n\n### `DataFog` class\n\n```python\nfrom datafog import DataFog\n\nresult = DataFog().scan_text(\"Email john@example.com\")\nprint(result[\"EMAIL\"])\n```\n\n### `TextService` class\n\n```python\nfrom datafog.services import TextService\n\nservice = TextService(engine=\"regex\")\nresult = service.annotate_text_sync(\"Call (555) 123-4567\")\nprint(result[\"PHONE\"])\n```\n\n## CLI\n\n```bash\n# Scan text\ndatafog scan-text \"john@example.com\"\n\n# Redact text\ndatafog redact-text \"john@example.com\"\n\n# Replace text with pseudonyms\ndatafog replace-text \"john@example.com\"\n\n# Hash detected entities\ndatafog hash-text \"john@example.com\"\n\n# Enable German regex identifiers\ndatafog redact-text \"Steuer-ID 12345678901\" --locale de\n```\n\n## Telemetry\n\nDataFog telemetry is disabled by default.\n\nTo opt in:\n\n```bash\nexport DATAFOG_TELEMETRY=1\n```\n\nTo force telemetry off:\n\n```bash\nexport 
DATAFOG_NO_TELEMETRY=1\n# or\nexport DO_NOT_TRACK=1\n```\n\nTelemetry does not include input text or detected PII values.\n\n## Development\n\n```bash\ngit clone https://github.com/datafog/datafog-python\ncd datafog-python\npython -m venv .venv\nsource .venv/bin/activate  # Windows: .venv\\Scripts\\activate\npip install -e \".[all,dev]\"\npip install -r requirements-dev.txt\npytest tests/\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdatafog%2Fdatafog-python","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdatafog%2Fdatafog-python","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdatafog%2Fdatafog-python/lists"}