{"id":46236382,"url":"https://github.com/philterd/phileas-python","last_synced_at":"2026-03-05T21:01:08.465Z","repository":{"id":341246638,"uuid":"1168825598","full_name":"philterd/phileas-python","owner":"philterd","description":"A library to deidentify and redact PII, PHI, and other sensitive information from text.","archived":false,"fork":false,"pushed_at":"2026-02-28T17:29:16.000Z","size":3161,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-03-03T21:11:13.663Z","etag":null,"topics":["anonymize","deidentification","deidentify","phi","phileas","philter","pii","redaction"],"latest_commit_sha":null,"homepage":"https://www.philterd.ai","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/philterd.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-02-27T20:58:38.000Z","updated_at":"2026-02-28T17:29:05.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/philterd/phileas-python","commit_stats":null,"previous_names":["philterd/phileas-python"],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/philterd/phileas-python","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/philterd%2Fphileas-python","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/philterd%2Fphileas-python/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/G
itHub/repositories/philterd%2Fphileas-python/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/philterd%2Fphileas-python/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/philterd","download_url":"https://codeload.github.com/philterd/phileas-python/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/philterd%2Fphileas-python/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":30091556,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-03-04T19:41:02.502Z","status":"ssl_error","status_checked_at":"2026-03-04T19:40:05.550Z","response_time":59,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["anonymize","deidentification","deidentify","phi","phileas","philter","pii","redaction"],"created_at":"2026-03-03T19:07:59.394Z","updated_at":"2026-03-04T20:00:42.040Z","avatar_url":"https://github.com/philterd.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Phileas (python)\n\nA Python port of [Phileas (Java)](https://github.com/philterd/phileas) — a library to deidentify and redact PII, PHI, and other sensitive information from text.\n\n* Check out the [documentation](https://philterd.github.io/phileas-python/) for details and code examples.\n* Built by [Philterd](https://www.philterd.ai).\n* Commercial support and 
consulting is available - [contact us](https://www.philterd.ai).\n\n## Overview\n\nPhileas analyzes text searching for sensitive information such as email addresses, phone numbers, SSNs, credit card numbers, and many other types of PII/PHI. When sensitive information is identified, Phileas can manipulate it in a variety of ways: the information can be redacted, masked, hashed, or replaced with a static value. The user defines how to handle each type of sensitive information through policies (YAML or JSON).\n\nOther capabilities include referential integrity for redactions, conditional logic for redactions, and a CLI.\n\nPhileas requires no external dependencies (e.g. no ChatGPT/etc.) and is intended to be lightweight and easy to use.\n\n## Compatibility Notes\n\nNote that this port of [Phileas](https://github.com/philterd/phileas) is not 1:1 with the Java version. There are some differences:\n\n* This project includes a server that exposes redaction HTTP endpoints. For the Java version, the API server is [Philter](https://github.com/philterd/philter).\n* This project includes support for policies in YAML as well as JSON.\n* This project does not include all redaction strategies present in the Java version.\n* This project includes a CLI.\n* This project includes the ability to evaluate performance using precision and recall through a built-in evaluation tool.\n* This project does not include support for PDF documents which is present in the Java version.\n\n## Installation\n\n```bash\npip install phileas-redact\n```\n\nOr, to install in development mode from source:\n\n```bash\ngit clone https://github.com/philterd/phileas-python.git\ncd phileas-python\npip install -e \".[dev]\"\n```\n\n## Quick Start\n\n```python\nfrom phileas.policy.policy import Policy\nfrom phileas.services.filter_service import FilterService\n\n# Define a policy as a Python dict (or load from YAML)\npolicy_dict = {\n    \"name\": \"my-policy\",\n    \"identifiers\": {\n        
\"emailAddress\": {\n            \"emailAddressFilterStrategies\": [{\n                \"strategy\": \"REDACT\",\n                \"redactionFormat\": \"{{{REDACTED-%t}}}\"\n            }]\n        },\n        \"ssn\": {\n            \"ssnFilterStrategies\": [{\n                \"strategy\": \"REDACT\",\n                \"redactionFormat\": \"{{{REDACTED-%t}}}\"\n            }]\n        }\n    }\n}\n\npolicy = Policy.from_dict(policy_dict)\nservice = FilterService()\n\nresult = service.filter(\n    policy=policy,\n    context=\"my-context\",\n    document_id=\"doc-001\",\n    text=\"Contact john@example.com or call about SSN 123-45-6789.\"\n)\n\nprint(result.filtered_text)\n# Contact {{{REDACTED-email-address}}} or call about SSN {{{REDACTED-ssn}}}.\n\nfor span in result.spans:\n    print(f\"  [{span.filter_type}] '{span.text}' -\u003e '{span.replacement}' at {span.character_start}:{span.character_end}\")\n```\n\n## Supported PII / PHI Types\n\n| Policy Key | Filter Type | Description |\n|---|---|---|\n| `age` | `age` | Age references (e.g., \"35 years old\", \"aged 25\") |\n| `emailAddress` | `email-address` | Email addresses |\n| `creditCard` | `credit-card` | Credit card numbers (Visa, MC, AmEx, Discover, etc.) 
|\n| `ssn` | `ssn` | Social Security Numbers (SSNs) and TINs |\n| `phoneNumber` | `phone-number` | US phone numbers |\n| `ipAddress` | `ip-address` | IPv4 and IPv6 addresses |\n| `url` | `url` | HTTP/HTTPS URLs |\n| `zipCode` | `zip-code` | US ZIP codes (5-digit and ZIP+4) |\n| `vin` | `vin` | Vehicle Identification Numbers |\n| `bitcoinAddress` | `bitcoin-address` | Bitcoin addresses |\n| `bankRoutingNumber` | `bank-routing-number` | US ABA bank routing numbers |\n| `date` | `date` | Dates in common formats |\n| `macAddress` | `mac-address` | Network MAC addresses |\n| `currency` | `currency` | USD currency amounts |\n| `streetAddress` | `street-address` | US street addresses |\n| `trackingNumber` | `tracking-number` | UPS, FedEx, and USPS tracking numbers |\n| `driversLicense` | `drivers-license` | US driver's license numbers |\n| `ibanCode` | `iban-code` | International Bank Account Numbers (IBANs) |\n| `passportNumber` | `passport-number` | US passport numbers |\n| `patterns` | user-defined | Custom regex-based patterns (list of pattern filters) |\n\n## Policies\n\nA **policy** is a YAML (or Python dict) object that defines what sensitive information to identify and how to handle it.\n\n### Policy Structure\n\n```yaml\nname: my-policy\nidentifiers:\n  emailAddress:\n    enabled: true\n    emailAddressFilterStrategies:\n      - strategy: REDACT\n        redactionFormat: \"{{{REDACTED-%t}}}\"\n    ignored:\n      - noreply@example.com\nignored:\n  - safe-term\nignoredPatterns:\n  - \"\\\\d{3}-test-\\\\d{4}\"\n```\n\n### Filter Strategies\n\nEach filter type supports one or more strategies that define what to do with the identified information:\n\n| Strategy | Description | Example Output |\n|---|---|---|\n| `REDACT` | Replace with a redaction tag | `{{{REDACTED-email-address}}}` |\n| `MASK` | Replace each character with `*` | `***@*******.***` |\n| `STATIC_REPLACE` | Replace with a fixed string | `[REMOVED]` |\n| `HASH_SHA256_REPLACE` | Replace with the SHA-256 
hash | `a665a4592...` |\n| `LAST_4` | Mask all but the last 4 characters | `****6789` |\n| `SAME` | Leave the value unchanged (identify only) | `123-45-6789` |\n| `TRUNCATE` | Keep leading or trailing characters | `john@***` |\n| `ABBREVIATE` | Abbreviate the value | `J. S.` |\n\n### Strategy Options\n\n```yaml\nstrategy: REDACT\nredactionFormat: \"{{{REDACTED-%t}}}\"\nstaticReplacement: \"[REMOVED]\"\nmaskCharacter: \"*\"\nmaskLength: SAME\ntruncateLeaveCharacters: 4\ntruncateDirection: LEADING\ncondition: \"\"\n```\n\n- `%t` in `redactionFormat` is replaced by the filter type name.\n\n### Ignored Terms\n\nYou can specify terms that should never be redacted at the policy level or per-filter level:\n\n```python\npolicy_dict = {\n    \"name\": \"my-policy\",\n    \"identifiers\": {\n        \"emailAddress\": {\n            \"emailAddressFilterStrategies\": [{\"strategy\": \"REDACT\"}],\n            \"ignored\": [\"noreply@internal.com\"]\n        }\n    },\n    \"ignored\": [\"safe-global-term\"],\n    \"ignoredPatterns\": [\"\\\\d{3}-555-\\\\d{4}\"]\n}\n```\n\n### Pattern-Based Filters\n\nA policy can include a list of custom regex-based filters. Each pattern filter specifies a `pattern` (a regular expression) and an optional `label` used as the filter type in results. 
This is useful for identifying domain-specific PII that is not covered by the built-in filters.\n\n```python\npolicy_dict = {\n    \"name\": \"my-policy\",\n    \"identifiers\": {\n        \"patterns\": [\n            {\n                \"pattern\": \"\\\\d{3}-\\\\d{3}-\\\\d{3}\",\n                \"label\": \"custom-id\",\n                \"patternFilterStrategies\": [{\"strategy\": \"REDACT\"}]\n            }\n        ]\n    }\n}\n\npolicy = Policy.from_dict(policy_dict)\nresult = service.filter(policy, \"ctx\", \"doc1\", \"ID: 123-456-789\")\nprint(result.filtered_text)  # ID: {{{REDACTED-custom-id}}}\n```\n\nMultiple pattern filters can be included in the same policy:\n\n```python\n\"patterns\": [\n    {\"pattern\": \"\\\\d{3}-\\\\d{3}-\\\\d{3}\", \"label\": \"id-number\"},\n    {\"pattern\": \"[A-Z]{2}\\\\d{6}\", \"label\": \"passport-number\"}\n]\n```\n\n#### Pattern Filter Options\n\n| Field | Type | Description |\n|---|---|---|\n| `pattern` | `str` | Regular expression used to identify PII |\n| `label` | `str` | Filter type label used in spans (defaults to `\"pattern\"`) |\n| `patternFilterStrategies` | `list` | List of filter strategies (same as other filter types) |\n| `ignored` | `list` | Terms that should not be redacted even if they match |\n| `enabled` | `bool` | Whether the filter is active (default: `true`) |\n\n## Contexts and Referential Integrity\n\nEvery call to `FilterService.filter()` takes a **context** name. The context is a logical grouping that ties multiple documents together — for example, all documents belonging to a single patient, user, or case.\n\nPhileas uses the context to maintain **referential integrity**: once a PII token has been replaced, every subsequent occurrence of that same token in the same context receives the *identical* replacement. 
This ensures that redacted documents within a context remain internally consistent and can still be cross-referenced without revealing the underlying sensitive values.\n\n### How it works\n\nPhileas maintains a `ContextService` — a map of maps with the structure:\n\n```\ncontext_name → { token → replacement }\n```\n\nBefore applying any replacement, `FilterService` checks whether the token already has a stored replacement for the current context:\n\n- **Token found** — the stored replacement is used instead of generating a new one.\n- **Token not found** — the newly generated replacement is stored and then applied.\n\nThe default implementation is `InMemoryContextService`, which stores mappings in memory for the lifetime of the `FilterService` instance.\n\n### Using the default in-memory context service\n\n```python\nfrom phileas import FilterService\n\nservice = FilterService()  # uses InMemoryContextService automatically\n\n# Both calls operate in the same context, so 555-123-4567 always gets\n# the same replacement across documents.\nresult1 = service.filter(policy, \"patient-records\", \"doc1\", \"Call 555-123-4567 for info.\")\nresult2 = service.filter(policy, \"patient-records\", \"doc2\", \"Patient called 555-123-4567 back.\")\n```\n\n### Pre-seeding the context service\n\nYou can pre-populate the context service before filtering to force specific replacements:\n\n```python\nfrom phileas import FilterService, InMemoryContextService\n\nctx_svc = InMemoryContextService()\nctx_svc.put(\"patient-records\", \"john@example.com\", \"EMAIL-001\")\n\nservice = FilterService(context_service=ctx_svc)\n# john@example.com will always be replaced with EMAIL-001 in the \"patient-records\" context\n```\n\n### Providing a custom context service\n\nSubclass `AbstractContextService` to integrate any external store (e.g. 
Redis, a database):\n\n```python\nfrom phileas import FilterService, AbstractContextService\n\nclass RedisContextService(AbstractContextService):\n    def put(self, context: str, token: str, replacement: str) -\u003e None:\n        # store in Redis\n        ...\n\n    def get(self, context: str, token: str) -\u003e str | None:\n        # retrieve from Redis, return None if not found\n        ...\n\n    def contains(self, context: str, token: str) -\u003e bool:\n        # check existence in Redis\n        ...\n\nservice = FilterService(context_service=RedisContextService())\n```\n\n## API Reference\n\n### `FilterService`\n\n```python\nfrom phileas.services.filter_service import FilterService\n\nservice = FilterService(context_service=None)\nresult = service.filter(policy, context, document_id, text)\n```\n\n#### Constructor Parameters\n\n| Parameter | Type | Description |\n|---|---|---|\n| `context_service` | `AbstractContextService \\| None` | Context service implementation to use for referential integrity. Defaults to `InMemoryContextService` when `None`. 
|\n\n#### `filter()` Parameters\n\n| Parameter | Type | Description |\n|---|---|---|\n| `policy` | `Policy` | The policy to apply |\n| `context` | `str` | Named context that groups documents for referential integrity (e.g., a patient ID or session name) |\n| `document_id` | `str` | A unique identifier for the document being filtered |\n| `text` | `str` | The text to filter |\n\n#### Returns `FilterResult`\n\n| Attribute | Type | Description |\n|---|---|---|\n| `filtered_text` | `str` | The text with sensitive information replaced |\n| `spans` | `List[Span]` | Metadata about each identified piece of sensitive information |\n| `context` | `str` | The context passed to `filter()` |\n| `document_id` | `str` | The document ID passed to `filter()` |\n\n### `Span`\n\n| Attribute | Type | Description |\n|---|---|---|\n| `character_start` | `int` | Start index of the span in the original text |\n| `character_end` | `int` | End index of the span in the original text |\n| `filter_type` | `str` | The type of PII identified (e.g., `\"email-address\"`) |\n| `text` | `str` | The original text of the span |\n| `replacement` | `str` | The replacement value |\n| `confidence` | `float` | Confidence score (0.0–1.0) |\n| `ignored` | `bool` | Whether this span was marked as ignored (not replaced) |\n| `context` | `str` | The context |\n\n### `Policy`\n\n```python\nfrom phileas.policy.policy import Policy\n\n# From a dict\npolicy = Policy.from_dict({\"name\": \"default\", \"identifiers\": {...}})\n\n# From a JSON string\npolicy = Policy.from_json('{\"name\": \"default\", ...}')\n\n# To JSON\njson_str = policy.to_json()\n\n# To dict\nd = policy.to_dict()\n```\n\n### `AbstractContextService`\n\nAbstract base class for context service implementations. 
Subclass this to provide a custom backend.\n\n```python\nfrom phileas import AbstractContextService\n\nclass MyContextService(AbstractContextService):\n    def put(self, context: str, token: str, replacement: str) -\u003e None: ...\n    def get(self, context: str, token: str) -\u003e str | None: ...\n    def contains(self, context: str, token: str) -\u003e bool: ...\n```\n\n#### Methods\n\n| Method | Signature | Description |\n|---|---|---|\n| `put` | `(context, token, replacement) -\u003e None` | Store a replacement value for a token under the given context |\n| `get` | `(context, token) -\u003e str \\| None` | Return the stored replacement, or `None` if not found |\n| `contains` | `(context, token) -\u003e bool` | Return `True` if a replacement exists for the token in the given context |\n\n### `InMemoryContextService`\n\nDefault implementation of `AbstractContextService` backed by a `dict[str, dict[str, str]]`. Suitable for single-process, in-memory use.\n\n```python\nfrom phileas import InMemoryContextService\n\nctx_svc = InMemoryContextService()\nctx_svc.put(\"my-context\", \"john@example.com\", \"EMAIL-001\")\nctx_svc.get(\"my-context\", \"john@example.com\")      # \"EMAIL-001\"\nctx_svc.contains(\"my-context\", \"john@example.com\") # True\n```\n\n## Examples\n\n### Mask credit card numbers\n\n```python\npolicy_dict = {\n    \"name\": \"cc-mask\",\n    \"identifiers\": {\n        \"creditCard\": {\n            \"creditCardFilterStrategies\": [{\"strategy\": \"LAST_4\"}]\n        }\n    }\n}\npolicy = Policy.from_dict(policy_dict)\nresult = service.filter(policy, \"ctx\", \"doc1\", \"Card: 4111111111111111\")\nprint(result.filtered_text)  # Card: ************1111\n```\n\n### Hash SSNs\n\n```python\npolicy_dict = {\n    \"name\": \"ssn-hash\",\n    \"identifiers\": {\n        \"ssn\": {\n            \"ssnFilterStrategies\": [{\"strategy\": \"HASH_SHA256_REPLACE\"}]\n        }\n    }\n}\n```\n\n### Disable a filter\n\n```python\npolicy_dict = {\n    \"name\": 
\"no-url\",\n    \"identifiers\": {\n        \"url\": {\"enabled\": False}\n    }\n}\n```\n\n## CLI\n\nphileas ships a `phileas` command that performs redaction directly from the terminal.\n\n### Usage\n\n```\nphileas -p POLICY_FILE -c CONTEXT (-t TEXT | -f FILE) [options]\n```\n\n| Argument | Description |\n|---|---|\n| `-p / --policy FILE` | Path to a policy file (JSON or YAML). |\n| `-c / --context CONTEXT` | Context name for referential integrity. |\n| `-t / --text TEXT` | Text to redact (mutually exclusive with `--file`). |\n| `-f / --file FILE` | Path to a file to redact (mutually exclusive with `--text`). |\n| `-d / --document-id ID` | Optional document identifier (auto-generated if omitted). |\n| `-o / --output FILE` | Write redacted text to a file instead of stdout. |\n| `--spans` | Print span metadata as JSON to stderr. |\n| `--evaluate FILE` | Evaluate redaction quality against a JSON ground-truth file. Prints precision, recall, and F1 metrics to stdout. |\n\n### Examples\n\nRedact a string:\n\n```bash\nphileas -p policy.yaml -c my-context -t \"Contact john@example.com or call 800-555-1234.\"\n# Contact {{{REDACTED-email-address}}} or call {{{REDACTED-phone-number}}}.\n```\n\nRedact a file and write output to a new file:\n\n```bash\nphileas -p policy.yaml -c my-context -f report.txt -o report_redacted.txt\n```\n\nView span metadata for each detected item:\n\n```bash\nphileas -p policy.yaml -c my-context -t \"Email john@example.com.\" --spans\n```\n\n### Evaluation Mode\n\nUse `--evaluate FILE` to measure the redaction quality of a policy against a set of ground-truth annotations. Phileas runs the filter on the input text, compares the detected spans against the ground-truth spans, and prints precision, recall, and F1 metrics to stdout.\n\n```bash\nphileas -p policy.json -c my-context -t \"Email john@example.com.\" --evaluate gt.json\n```\n\nThe ground-truth file must be a JSON array of span objects, or a JSON object with a `\"spans\"` key. 
Each span must have `\"start\"` and `\"end\"` character positions; `\"type\"` is optional:\n\n```json\n[{\"start\": 6, \"end\": 22, \"type\": \"email-address\"}]\n```\n\n**Example output:**\n\n```\nEmail {{{REDACTED-email-address}}}.\n{\n  \"truePositives\": 1,\n  \"falsePositives\": 0,\n  \"falseNegatives\": 0,\n  \"precision\": 1.0,\n  \"recall\": 1.0,\n  \"f1\": 1.0\n}\n```\n\n## Running Tests\n\n```bash\npytest tests/ -v\n```\n\n## License\n\nCopyright 2026 Philterd, LLC.\n\nLicensed under the Apache License, Version 2.0. See [LICENSE](LICENSE) for details.\n\n\"Phileas\" and \"Philter\" are registered trademarks of Philterd, LLC.\n\nThis project is a Python port of [Phileas](https://github.com/philterd/phileas), which is also Apache-2.0 licensed.","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fphilterd%2Fphileas-python","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fphilterd%2Fphileas-python","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fphilterd%2Fphileas-python/lists"}