https://github.com/philterd/phileas-python
A library to deidentify and redact PII, PHI, and other sensitive information from text.
https://github.com/philterd/phileas-python
anonymize deidentification deidentify phi phileas philter pii redaction
Last synced: about 1 month ago
JSON representation
A library to deidentify and redact PII, PHI, and other sensitive information from text.
- Host: GitHub
- URL: https://github.com/philterd/phileas-python
- Owner: philterd
- License: other
- Created: 2026-02-27T20:58:38.000Z (about 1 month ago)
- Default Branch: main
- Last Pushed: 2026-02-28T17:29:16.000Z (about 1 month ago)
- Last Synced: 2026-03-03T21:11:13.663Z (about 1 month ago)
- Topics: anonymize, deidentification, deidentify, phi, phileas, philter, pii, redaction
- Language: Python
- Homepage: https://www.philterd.ai
- Size: 3.01 MB
- Stars: 1
- Watchers: 0
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# Phileas (python)
A Python port of [Phileas (Java)](https://github.com/philterd/phileas) — a library to deidentify and redact PII, PHI, and other sensitive information from text.
* Check out the [documentation](https://philterd.github.io/phileas-python/) or details and code examples.
* Built by [Philterd](https://www.philterd.ai).
* Commercial support and consulting is available - [contact us](https://www.philterd.ai).
## Overview
Phileas analyzes text searching for sensitive information such as email addresses, phone numbers, SSNs, credit card numbers, and many other types of PII/PHI. When sensitive information is identified, Phileas can manipulate it in a variety of ways: the information can be redacted, masked, hashed, or replaced with a static value. The user defines how to handle each type of sensitive information through policies (YAML or JSON).
Other capabilities include referential integrity for redactions, conditional logic for redactions, and a CLI.
Phileas requires no external dependencies (e.g. no ChatGPT/etc.) and is intended to be lightweight and easy to use.
## Compatibility Notes
Note that this port of [Phileas](https://github.com/philterd/phileas) is not 1:1 with the Java version. There are some differences:
* This project includes a server that exposes redaction HTTP endpoints. For the Java version, the API server is [Philter](https://github.com/philterd/philter).
* This project includes support for policies in YAML as well as JSON.
* This project does not include all redaction strategies present in the Java version.
* This project includes a CLI.
* This project includes the ability to evaluate performance using precision and recall through a built-in evaluation tool.
* This project does not include support for PDF documents which is present in the Java version.
## Installation
```bash
pip install phileas-redact
```
Or, to install in development mode from source:
```bash
git clone https://github.com/philterd/phileas-python.git
cd phileas-python
pip install -e ".[dev]"
```
## Quick Start
```python
from phileas.policy.policy import Policy
from phileas.services.filter_service import FilterService
# Define a policy as a Python dict (or load from YAML)
policy_dict = {
"name": "my-policy",
"identifiers": {
"emailAddress": {
"emailAddressFilterStrategies": [{
"strategy": "REDACT",
"redactionFormat": "{{{REDACTED-%t}}}"
}]
},
"ssn": {
"ssnFilterStrategies": [{
"strategy": "REDACT",
"redactionFormat": "{{{REDACTED-%t}}}"
}]
}
}
}
policy = Policy.from_dict(policy_dict)
service = FilterService()
result = service.filter(
policy=policy,
context="my-context",
document_id="doc-001",
text="Contact john@example.com or call about SSN 123-45-6789."
)
print(result.filtered_text)
# Contact {{{REDACTED-email-address}}} or call about SSN {{{REDACTED-ssn}}}.
for span in result.spans:
print(f" [{span.filter_type}] '{span.text}' -> '{span.replacement}' at {span.character_start}:{span.character_end}")
```
## Supported PII / PHI Types
| Policy Key | Filter Type | Description |
|---|---|---|
| `age` | `age` | Age references (e.g., "35 years old", "aged 25") |
| `emailAddress` | `email-address` | Email addresses |
| `creditCard` | `credit-card` | Credit card numbers (Visa, MC, AmEx, Discover, etc.) |
| `ssn` | `ssn` | Social Security Numbers (SSNs) and TINs |
| `phoneNumber` | `phone-number` | US phone numbers |
| `ipAddress` | `ip-address` | IPv4 and IPv6 addresses |
| `url` | `url` | HTTP/HTTPS URLs |
| `zipCode` | `zip-code` | US ZIP codes (5-digit and ZIP+4) |
| `vin` | `vin` | Vehicle Identification Numbers |
| `bitcoinAddress` | `bitcoin-address` | Bitcoin addresses |
| `bankRoutingNumber` | `bank-routing-number` | US ABA bank routing numbers |
| `date` | `date` | Dates in common formats |
| `macAddress` | `mac-address` | Network MAC addresses |
| `currency` | `currency` | USD currency amounts |
| `streetAddress` | `street-address` | US street addresses |
| `trackingNumber` | `tracking-number` | UPS, FedEx, and USPS tracking numbers |
| `driversLicense` | `drivers-license` | US driver's license numbers |
| `ibanCode` | `iban-code` | International Bank Account Numbers (IBANs) |
| `passportNumber` | `passport-number` | US passport numbers |
| `patterns` | user-defined | Custom regex-based patterns (list of pattern filters) |
## Policies
A **policy** is a YAML (or Python dict) object that defines what sensitive information to identify and how to handle it.
### Policy Structure
```yaml
name: my-policy
identifiers:
emailAddress:
enabled: true
emailAddressFilterStrategies:
- strategy: REDACT
redactionFormat: "{{{REDACTED-%t}}}"
ignored:
- noreply@example.com
ignored:
- safe-term
ignoredPatterns:
- "\\d{3}-test-\\d{4}"
```
### Filter Strategies
Each filter type supports one or more strategies that define what to do with the identified information:
| Strategy | Description | Example Output |
|---|---|---|
| `REDACT` | Replace with a redaction tag | `{{{REDACTED-email-address}}}` |
| `MASK` | Replace each character with `*` | `***@*******.***` |
| `STATIC_REPLACE` | Replace with a fixed string | `[REMOVED]` |
| `HASH_SHA256_REPLACE` | Replace with the SHA-256 hash | `a665a4592...` |
| `LAST_4` | Mask all but the last 4 characters | `****6789` |
| `SAME` | Leave the value unchanged (identify only) | `123-45-6789` |
| `TRUNCATE` | Keep leading or trailing characters | `john@***` |
| `ABBREVIATE` | Abbreviate the value | `J. S.` |
### Strategy Options
```yaml
strategy: REDACT
redactionFormat: "{{{REDACTED-%t}}}"
staticReplacement: "[REMOVED]"
maskCharacter: "*"
maskLength: SAME
truncateLeaveCharacters: 4
truncateDirection: LEADING
condition: ""
```
- `%t` in `redactionFormat` is replaced by the filter type name.
### Ignored Terms
You can specify terms that should never be redacted at the policy level or per-filter level:
```python
policy_dict = {
"name": "my-policy",
"identifiers": {
"emailAddress": {
"emailAddressFilterStrategies": [{"strategy": "REDACT"}],
"ignored": ["noreply@internal.com"]
}
},
"ignored": ["safe-global-term"],
"ignoredPatterns": ["\\d{3}-555-\\d{4}"]
}
```
### Pattern-Based Filters
A policy can include a list of custom regex-based filters. Each pattern filter specifies a `pattern` (a regular expression) and an optional `label` used as the filter type in results. This is useful for identifying domain-specific PII that is not covered by the built-in filters.
```python
policy_dict = {
"name": "my-policy",
"identifiers": {
"patterns": [
{
"pattern": "\\d{3}-\\d{3}-\\d{3}",
"label": "custom-id",
"patternFilterStrategies": [{"strategy": "REDACT"}]
}
]
}
}
policy = Policy.from_dict(policy_dict)
result = service.filter(policy, "ctx", "doc1", "ID: 123-456-789")
print(result.filtered_text) # ID: {{{REDACTED-custom-id}}}
```
Multiple pattern filters can be included in the same policy:
```python
"patterns": [
{"pattern": "\\d{3}-\\d{3}-\\d{3}", "label": "id-number"},
{"pattern": "[A-Z]{2}\\d{6}", "label": "passport-number"}
]
```
#### Pattern Filter Options
| Field | Type | Description |
|---|---|---|
| `pattern` | `str` | Regular expression used to identify PII |
| `label` | `str` | Filter type label used in spans (defaults to `"pattern"`) |
| `patternFilterStrategies` | `list` | List of filter strategies (same as other filter types) |
| `ignored` | `list` | Terms that should not be redacted even if they match |
| `enabled` | `bool` | Whether the filter is active (default: `true`) |
## Contexts and Referential Integrity
Every call to `FilterService.filter()` takes a **context** name. The context is a logical grouping that ties multiple documents together — for example, all documents belonging to a single patient, user, or case.
Phileas uses the context to maintain **referential integrity**: once a PII token has been replaced, every subsequent occurrence of that same token in the same context receives the *identical* replacement. This ensures that redacted documents within a context remain internally consistent and can still be cross-referenced without revealing the underlying sensitive values.
### How it works
Phileas maintains a `ContextService` — a map of maps with the structure:
```
context_name → { token → replacement }
```
Before applying any replacement, `FilterService` checks whether the token already has a stored replacement for the current context:
- **Token found** — the stored replacement is used instead of generating a new one.
- **Token not found** — the newly generated replacement is stored and then applied.
The default implementation is `InMemoryContextService`, which stores mappings in memory for the lifetime of the `FilterService` instance.
### Using the default in-memory context service
```python
from phileas import FilterService
service = FilterService() # uses InMemoryContextService automatically
# Both calls operate in the same context, so 555-123-4567 always gets
# the same replacement across documents.
result1 = service.filter(policy, "patient-records", "doc1", "Call 555-123-4567 for info.")
result2 = service.filter(policy, "patient-records", "doc2", "Patient called 555-123-4567 back.")
```
### Pre-seeding the context service
You can pre-populate the context service before filtering to force specific replacements:
```python
from phileas import FilterService, InMemoryContextService
ctx_svc = InMemoryContextService()
ctx_svc.put("patient-records", "john@example.com", "EMAIL-001")
service = FilterService(context_service=ctx_svc)
# john@example.com will always be replaced with EMAIL-001 in the "patient-records" context
```
### Providing a custom context service
Subclass `AbstractContextService` to integrate any external store (e.g. Redis, a database):
```python
from phileas import FilterService, AbstractContextService
class RedisContextService(AbstractContextService):
def put(self, context: str, token: str, replacement: str) -> None:
# store in Redis
...
def get(self, context: str, token: str) -> str | None:
# retrieve from Redis, return None if not found
...
def contains(self, context: str, token: str) -> bool:
# check existence in Redis
...
service = FilterService(context_service=RedisContextService())
```
## API Reference
### `FilterService`
```python
from phileas.services.filter_service import FilterService
service = FilterService(context_service=None)
result = service.filter(policy, context, document_id, text)
```
#### Constructor Parameters
| Parameter | Type | Description |
|---|---|---|
| `context_service` | `AbstractContextService \| None` | Context service implementation to use for referential integrity. Defaults to `InMemoryContextService` when `None`. |
#### `filter()` Parameters
| Parameter | Type | Description |
|---|---|---|
| `policy` | `Policy` | The policy to apply |
| `context` | `str` | Named context that groups documents for referential integrity (e.g., a patient ID or session name) |
| `document_id` | `str` | A unique identifier for the document being filtered |
| `text` | `str` | The text to filter |
#### Returns `FilterResult`
| Attribute | Type | Description |
|---|---|---|
| `filtered_text` | `str` | The text with sensitive information replaced |
| `spans` | `List[Span]` | Metadata about each identified piece of sensitive information |
| `context` | `str` | The context passed to `filter()` |
| `document_id` | `str` | The document ID passed to `filter()` |
### `Span`
| Attribute | Type | Description |
|---|---|---|
| `character_start` | `int` | Start index of the span in the original text |
| `character_end` | `int` | End index of the span in the original text |
| `filter_type` | `str` | The type of PII identified (e.g., `"email-address"`) |
| `text` | `str` | The original text of the span |
| `replacement` | `str` | The replacement value |
| `confidence` | `float` | Confidence score (0.0–1.0) |
| `ignored` | `bool` | Whether this span was marked as ignored (not replaced) |
| `context` | `str` | The context |
### `Policy`
```python
from phileas.policy.policy import Policy
# From a dict
policy = Policy.from_dict({"name": "default", "identifiers": {...}})
# From a JSON string
policy = Policy.from_json('{"name": "default", ...}')
# To JSON
json_str = policy.to_json()
# To dict
d = policy.to_dict()
```
### `AbstractContextService`
Abstract base class for context service implementations. Subclass this to provide a custom backend.
```python
from phileas import AbstractContextService
class MyContextService(AbstractContextService):
def put(self, context: str, token: str, replacement: str) -> None: ...
def get(self, context: str, token: str) -> str | None: ...
def contains(self, context: str, token: str) -> bool: ...
```
#### Methods
| Method | Signature | Description |
|---|---|---|
| `put` | `(context, token, replacement) -> None` | Store a replacement value for a token under the given context |
| `get` | `(context, token) -> str \| None` | Return the stored replacement, or `None` if not found |
| `contains` | `(context, token) -> bool` | Return `True` if a replacement exists for the token in the given context |
### `InMemoryContextService`
Default implementation of `AbstractContextService` backed by a `dict[str, dict[str, str]]`. Suitable for single-process, in-memory use.
```python
from phileas import InMemoryContextService
ctx_svc = InMemoryContextService()
ctx_svc.put("my-context", "john@example.com", "EMAIL-001")
ctx_svc.get("my-context", "john@example.com") # "EMAIL-001"
ctx_svc.contains("my-context", "john@example.com") # True
```
## Examples
### Mask credit card numbers
```python
policy_dict = {
"name": "cc-mask",
"identifiers": {
"creditCard": {
"creditCardFilterStrategies": [{"strategy": "LAST_4"}]
}
}
}
policy = Policy.from_dict(policy_dict)
result = service.filter(policy, "ctx", "doc1", "Card: 4111111111111111")
print(result.filtered_text) # Card: ************1111
```
### Hash SSNs
```python
policy_dict = {
"name": "ssn-hash",
"identifiers": {
"ssn": {
"ssnFilterStrategies": [{"strategy": "HASH_SHA256_REPLACE"}]
}
}
}
```
### Disable a filter
```python
policy_dict = {
"name": "no-url",
"identifiers": {
"url": {"enabled": False}
}
}
```
## CLI
phileas ships a `phileas` command that performs redaction directly from the terminal.
### Usage
```
phileas -p POLICY_FILE -c CONTEXT (-t TEXT | -f FILE) [options]
```
| Argument | Description |
|---|---|
| `-p / --policy FILE` | Path to a policy file (JSON or YAML). |
| `-c / --context CONTEXT` | Context name for referential integrity. |
| `-t / --text TEXT` | Text to redact (mutually exclusive with `--file`). |
| `-f / --file FILE` | Path to a file to redact (mutually exclusive with `--text`). |
| `-d / --document-id ID` | Optional document identifier (auto-generated if omitted). |
| `-o / --output FILE` | Write redacted text to a file instead of stdout. |
| `--spans` | Print span metadata as JSON to stderr. |
| `--evaluate FILE` | Evaluate redaction quality against a JSON ground-truth file. Prints precision, recall, and F1 metrics to stdout. |
### Examples
Redact a string:
```bash
phileas -p policy.yaml -c my-context -t "Contact john@example.com or call 800-555-1234."
# Contact {{{REDACTED-email-address}}} or call {{{REDACTED-phone-number}}}.
```
Redact a file and write output to a new file:
```bash
phileas -p policy.yaml -c my-context -f report.txt -o report_redacted.txt
```
View span metadata for each detected item:
```bash
phileas -p policy.yaml -c my-context -t "Email john@example.com." --spans
```
### Evaluation Mode
Use `--evaluate FILE` to measure the redaction quality of a policy against a set of ground-truth annotations. Phileas runs the filter on the input text, compares the detected spans against the ground-truth spans, and prints precision, recall, and F1 metrics to stdout.
```bash
phileas -p policy.json -c my-context -t "Email john@example.com." --evaluate gt.json
```
The ground-truth file must be a JSON array of span objects, or a JSON object with a `"spans"` key. Each span must have `"start"` and `"end"` character positions; `"type"` is optional:
```json
[{"start": 6, "end": 22, "type": "email-address"}]
```
**Example output:**
```
Email {{{REDACTED-email-address}}}.
{
"truePositives": 1,
"falsePositives": 0,
"falseNegatives": 0,
"precision": 1.0,
"recall": 1.0,
"f1": 1.0
}
```
## Running Tests
```bash
pytest tests/ -v
```
## License
Copyright 2026 Philterd, LLC.
Licensed under the Apache License, Version 2.0. See [LICENSE](LICENSE) for details.
"Phileas" and "Philter" are registered trademarks of Philterd, LLC.
This project is a Python port of [Phileas](https://github.com/philterd/phileas), which is also Apache-2.0 licensed.