https://github.com/philterd/phileas-python

A library to deidentify and redact PII, PHI, and other sensitive information from text.
anonymize deidentification deidentify phi phileas philter pii redaction
Last synced: 4 months ago
Host: GitHub
URL: https://github.com/philterd/phileas-python
Owner: philterd
License: other
Created: 2026-02-27T20:58:38.000Z (5 months ago)
Default Branch: main
Last Pushed: 2026-02-28T17:29:16.000Z (5 months ago)
Last Synced: 2026-03-03T21:11:13.663Z (4 months ago)
Topics: anonymize, deidentification, deidentify, phi, phileas, philter, pii, redaction
Language: Python
Homepage: https://www.philterd.ai
Size: 3.01 MB
Stars: 1
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README

# Phileas (python)

A Python port of [Phileas (Java)](https://github.com/philterd/phileas) — a library to deidentify and redact PII, PHI, and other sensitive information from text.

* Check out the [documentation](https://philterd.github.io/phileas-python/) for details and code examples.

* Built by [Philterd](https://www.philterd.ai).

* Commercial support and consulting are available: [contact us](https://www.philterd.ai).

## Overview

Phileas analyzes text searching for sensitive information such as email addresses, phone numbers, SSNs, credit card numbers, and many other types of PII/PHI. When sensitive information is identified, Phileas can manipulate it in a variety of ways: the information can be redacted, masked, hashed, or replaced with a static value. The user defines how to handle each type of sensitive information through policies (YAML or JSON).

Other capabilities include referential integrity for redactions, conditional logic for redactions, and a CLI.

Phileas does not depend on external services (for example, it makes no calls to ChatGPT or similar APIs) and is intended to be lightweight and easy to use.

## Compatibility Notes

Note that this port of [Phileas](https://github.com/philterd/phileas) is not 1:1 with the Java version. There are some differences:

* This project includes a server that exposes redaction HTTP endpoints. For the Java version, the API server is [Philter](https://github.com/philterd/philter).

* This project includes support for policies in YAML as well as JSON.

* This project does not include all redaction strategies present in the Java version.

* This project includes a CLI.

* This project includes the ability to evaluate performance using precision and recall through a built-in evaluation tool.

* This project does not include support for PDF documents, which is present in the Java version.

## Installation

```bash
pip install phileas-redact
```

Or, to install in development mode from source:

```bash
git clone https://github.com/philterd/phileas-python.git
cd phileas-python
pip install -e ".[dev]"
```

## Quick Start

```python
from phileas.policy.policy import Policy
from phileas.services.filter_service import FilterService

# Define a policy as a Python dict (or load from YAML)
policy_dict = {
    "name": "my-policy",
    "identifiers": {
        "emailAddress": {
            "emailAddressFilterStrategies": [{
                "strategy": "REDACT",
                "redactionFormat": "{{{REDACTED-%t}}}"
            }]
        },
        "ssn": {
            "ssnFilterStrategies": [{
                "strategy": "REDACT",
                "redactionFormat": "{{{REDACTED-%t}}}"
            }]
        }
    }
}

policy = Policy.from_dict(policy_dict)
service = FilterService()

result = service.filter(
    policy=policy,
    context="my-context",
    document_id="doc-001",
    text="Contact john@example.com or call about SSN 123-45-6789."
)

print(result.filtered_text)
# Contact {{{REDACTED-email-address}}} or call about SSN {{{REDACTED-ssn}}}.

for span in result.spans:
    print(f"  [{span.filter_type}] '{span.text}' -> '{span.replacement}' at {span.character_start}:{span.character_end}")
```

## Supported PII / PHI Types

| Policy Key | Filter Type | Description |
|---|---|---|
| `age` | `age` | Age references (e.g., "35 years old", "aged 25") |
| `emailAddress` | `email-address` | Email addresses |
| `creditCard` | `credit-card` | Credit card numbers (Visa, MC, AmEx, Discover, etc.) |
| `ssn` | `ssn` | Social Security Numbers (SSNs) and TINs |
| `phoneNumber` | `phone-number` | US phone numbers |
| `ipAddress` | `ip-address` | IPv4 and IPv6 addresses |
| `url` | `url` | HTTP/HTTPS URLs |
| `zipCode` | `zip-code` | US ZIP codes (5-digit and ZIP+4) |
| `vin` | `vin` | Vehicle Identification Numbers |
| `bitcoinAddress` | `bitcoin-address` | Bitcoin addresses |
| `bankRoutingNumber` | `bank-routing-number` | US ABA bank routing numbers |
| `date` | `date` | Dates in common formats |
| `macAddress` | `mac-address` | Network MAC addresses |
| `currency` | `currency` | USD currency amounts |
| `streetAddress` | `street-address` | US street addresses |
| `trackingNumber` | `tracking-number` | UPS, FedEx, and USPS tracking numbers |
| `driversLicense` | `drivers-license` | US driver's license numbers |
| `ibanCode` | `iban-code` | International Bank Account Numbers (IBANs) |
| `passportNumber` | `passport-number` | US passport numbers |
| `patterns` | user-defined | Custom regex-based patterns (list of pattern filters) |

## Policies

A **policy** is a YAML or JSON document (or an equivalent Python dict) that defines what sensitive information to identify and how to handle it.

### Policy Structure

```yaml
name: my-policy
identifiers:
  emailAddress:
    enabled: true
    emailAddressFilterStrategies:
      - strategy: REDACT
        redactionFormat: "{{{REDACTED-%t}}}"
    ignored:
      - noreply@example.com
ignored:
  - safe-term
ignoredPatterns:
  - "\\d{3}-test-\\d{4}"
```
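When policies are generated programmatically, the same structure can be built as a Python dict and serialized to JSON with the standard library. The sketch below mirrors the YAML policy above; the output location is hypothetical, and the main point is how the regex backslash is written in each format.

```python
import json
import tempfile
from pathlib import Path

# Same structure as the YAML policy above, written as a Python dict.
policy_dict = {
    "name": "my-policy",
    "identifiers": {
        "emailAddress": {
            "enabled": True,
            "emailAddressFilterStrategies": [
                {"strategy": "REDACT", "redactionFormat": "{{{REDACTED-%t}}}"}
            ],
            "ignored": ["noreply@example.com"],
        }
    },
    "ignored": ["safe-term"],
    # A raw string keeps the regex readable; the value is \d{3}-test-\d{4}
    "ignoredPatterns": [r"\d{3}-test-\d{4}"],
}

# Write the policy as JSON. json.dumps escapes each backslash as \\ in the file,
# which matches the double-quoted YAML form shown above.
policy_path = Path(tempfile.gettempdir()) / "my-policy.json"  # hypothetical location
policy_path.write_text(json.dumps(policy_dict, indent=2), encoding="utf-8")

# Reading the file back yields the original dict, ready for Policy.from_dict().
loaded = json.loads(policy_path.read_text(encoding="utf-8"))
```

The resulting file could then be passed to `Policy.from_json` as a string or to the CLI via `-p`.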

### Filter Strategies

Each filter type supports one or more strategies that define what to do with the identified information:

| Strategy | Description | Example Output |
|---|---|---|
| `REDACT` | Replace with a redaction tag | `{{{REDACTED-email-address}}}` |
| `MASK` | Replace each character with `*` | `***@*******.***` |
| `STATIC_REPLACE` | Replace with a fixed string | `[REMOVED]` |
| `HASH_SHA256_REPLACE` | Replace with the SHA-256 hash | `a665a4592...` |
| `LAST_4` | Mask all but the last 4 characters | `****6789` |
| `SAME` | Leave the value unchanged (identify only) | `123-45-6789` |
| `TRUNCATE` | Keep leading or trailing characters | `john@***` |
| `ABBREVIATE` | Abbreviate the value | `J. S.` |
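To make the table concrete, here is a minimal standard-library sketch of what a few of these strategies do to a single value. These helper functions are hypothetical illustrations, not the library's implementation, and Phileas's exact output (for example, how many mask characters `LAST_4` emits) may differ.

```python
import hashlib


def redact(filter_type: str, fmt: str = "{{{REDACTED-%t}}}") -> str:
    """REDACT: substitute the filter type name for %t in the redaction format."""
    return fmt.replace("%t", filter_type)


def static_replace(replacement: str = "[REMOVED]") -> str:
    """STATIC_REPLACE: every match becomes the same fixed string."""
    return replacement


def hash_sha256(value: str) -> str:
    """HASH_SHA256_REPLACE: hex digest of the UTF-8 encoded value."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def last_4(value: str, mask_char: str = "*") -> str:
    """LAST_4 (assumed behavior): mask everything except the final four characters."""
    if len(value) <= 4:
        return value
    return mask_char * (len(value) - 4) + value[-4:]


print(redact("email-address"))     # {{{REDACTED-email-address}}}
print(static_replace())            # [REMOVED]
print(last_4("4111111111111111"))  # ************1111
print(len(hash_sha256("123-45-6789")))  # 64 hex characters
```

Hashing is deterministic, so the same input always yields the same digest; that property is what makes hashed values usable as stable pseudonyms.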

### Strategy Options

```yaml
strategy: REDACT
redactionFormat: "{{{REDACTED-%t}}}"
staticReplacement: "[REMOVED]"
maskCharacter: "*"
maskLength: SAME
truncateLeaveCharacters: 4
truncateDirection: LEADING
condition: ""
```

- `%t` in `redactionFormat` is replaced by the filter type name.

### Ignored Terms

You can specify terms that should never be redacted, either for the whole policy or for an individual filter:

```python
policy_dict = {
    "name": "my-policy",
    "identifiers": {
        "emailAddress": {
            "emailAddressFilterStrategies": [{"strategy": "REDACT"}],
            "ignored": ["noreply@internal.com"]
        }
    },
    "ignored": ["safe-global-term"],
    "ignoredPatterns": ["\\d{3}-555-\\d{4}"]
}
```

### Pattern-Based Filters

A policy can include a list of custom regex-based filters. Each pattern filter specifies a `pattern` (a regular expression) and an optional `label` used as the filter type in results. This is useful for identifying domain-specific PII that is not covered by the built-in filters.

```python
from phileas.policy.policy import Policy
from phileas.services.filter_service import FilterService

policy_dict = {
    "name": "my-policy",
    "identifiers": {
        "patterns": [
            {
                "pattern": "\\d{3}-\\d{3}-\\d{3}",
                "label": "custom-id",
                "patternFilterStrategies": [{"strategy": "REDACT"}]
            }
        ]
    }
}

policy = Policy.from_dict(policy_dict)
service = FilterService()

result = service.filter(policy, "ctx", "doc1", "ID: 123-456-789")
print(result.filtered_text)  # ID: {{{REDACTED-custom-id}}}
```

Multiple pattern filters can be included in the same policy:

```python
"patterns": [
    {"pattern": "\\d{3}-\\d{3}-\\d{3}", "label": "id-number"},
    {"pattern": "[A-Z]{2}\\d{6}", "label": "passport-number"}
]
```

#### Pattern Filter Options

| Field | Type | Description |
|---|---|---|
| `pattern` | `str` | Regular expression used to identify PII |
| `label` | `str` | Filter type label used in spans (defaults to `"pattern"`) |
| `patternFilterStrategies` | `list` | List of filter strategies (same as other filter types) |
| `ignored` | `list` | Terms that should not be redacted even if they match |
| `enabled` | `bool` | Whether the filter is active (default: `true`) |
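Before adding a custom pattern to a policy, it can help to check it against a few sample strings. The helper below is a hypothetical, standard-library sketch; it assumes the patterns follow Python `re` semantics, which this README does not state explicitly.

```python
import re


def check_pattern(pattern: str, should_match: list[str], should_not_match: list[str]) -> list[str]:
    """Return a list of problems; an empty list means the pattern behaved as expected."""
    try:
        compiled = re.compile(pattern)
    except re.error as exc:
        return [f"invalid regex: {exc}"]

    problems = []
    for sample in should_match:
        if compiled.search(sample) is None:
            problems.append(f"expected a match in {sample!r}")
    for sample in should_not_match:
        if compiled.search(sample) is not None:
            problems.append(f"unexpected match in {sample!r}")
    return problems


# The pattern from the example above finds the ID, but it also matches the
# first nine digits of a ten-digit phone number.
loose = check_pattern(r"\d{3}-\d{3}-\d{3}", ["ID: 123-456-789"], ["123-456-7890"])
print(loose)  # ["unexpected match in '123-456-7890'"]

# Word boundaries make the pattern stricter.
strict = check_pattern(r"\b\d{3}-\d{3}-\d{3}\b", ["ID: 123-456-789"], ["123-456-7890"])
print(strict)  # []
```

Anchoring custom patterns with `\b` is one way to keep them from firing inside longer numbers that a built-in filter would otherwise handle.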

## Contexts and Referential Integrity

Every call to `FilterService.filter()` takes a **context** name. The context is a logical grouping that ties multiple documents together — for example, all documents belonging to a single patient, user, or case.

Phileas uses the context to maintain **referential integrity**: once a PII token has been replaced, every subsequent occurrence of that same token in the same context receives the *identical* replacement. This ensures that redacted documents within a context remain internally consistent and can still be cross-referenced without revealing the underlying sensitive values.

### How it works

Phileas maintains a `ContextService` — a map of maps with the structure:

```
context_name → { token → replacement }
```

Before applying any replacement, `FilterService` checks whether the token already has a stored replacement for the current context:

- **Token found** — the stored replacement is used instead of generating a new one.

- **Token not found** — the newly generated replacement is stored and then applied.

The default implementation is `InMemoryContextService`, which stores mappings in memory for the lifetime of the `FilterService` instance.
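The lookup described above can be sketched with a plain nested dict. This is an illustration of the idea only: `resolve_replacement` and the numbered replacement generator are hypothetical names, not part of the Phileas API.

```python
import itertools
from typing import Callable

# context name -> {token -> replacement}
ContextStore = dict[str, dict[str, str]]


def resolve_replacement(
    store: ContextStore,
    context: str,
    token: str,
    generate: Callable[[str], str],
) -> str:
    """Return the stored replacement for a token, generating and storing one if needed."""
    replacements = store.setdefault(context, {})
    if token in replacements:
        # Token found: reuse the stored replacement.
        return replacements[token]
    # Token not found: generate a replacement, store it, then use it.
    replacement = generate(token)
    replacements[token] = replacement
    return replacement


counter = itertools.count(1)


def numbered(token: str) -> str:
    """Hypothetical generator that hands out PHONE-001, PHONE-002, ..."""
    return f"PHONE-{next(counter):03d}"


store: ContextStore = {}
first = resolve_replacement(store, "patient-records", "555-123-4567", numbered)
second = resolve_replacement(store, "patient-records", "555-123-4567", numbered)
other = resolve_replacement(store, "billing", "555-123-4567", numbered)
print(first, second, other)  # PHONE-001 PHONE-001 PHONE-002
```

The same token maps to the same replacement within `patient-records`, while a different context gets its own independent mapping.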

### Using the default in-memory context service

```python
from phileas import FilterService

service = FilterService()  # uses InMemoryContextService automatically

# `policy` is a Policy instance, for example the one built in the Quick Start.
# Both calls operate in the same context, so 555-123-4567 always gets
# the same replacement across documents.
result1 = service.filter(policy, "patient-records", "doc1", "Call 555-123-4567 for info.")
result2 = service.filter(policy, "patient-records", "doc2", "Patient called 555-123-4567 back.")
```

### Pre-seeding the context service

You can pre-populate the context service before filtering to force specific replacements:

```python
from phileas import FilterService, InMemoryContextService

ctx_svc = InMemoryContextService()
ctx_svc.put("patient-records", "john@example.com", "EMAIL-001")

service = FilterService(context_service=ctx_svc)
# john@example.com will always be replaced with EMAIL-001 in the "patient-records" context
```

### Providing a custom context service

Subclass `AbstractContextService` to integrate any external store (e.g. Redis, a database):

```python
from phileas import FilterService, AbstractContextService

class RedisContextService(AbstractContextService):

    def put(self, context: str, token: str, replacement: str) -> None:
        # store in Redis
        ...

    def get(self, context: str, token: str) -> str | None:
        # retrieve from Redis, return None if not found
        ...

    def contains(self, context: str, token: str) -> bool:
        # check existence in Redis
        ...

service = FilterService(context_service=RedisContextService())
```
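As a filled-in counterpart to the Redis skeleton, the sketch below backs the same three methods with SQLite from the standard library. It is written as a standalone class so it runs without Phileas installed; to plug it into `FilterService` it would presumably subclass `AbstractContextService` instead. The class name and table layout are assumptions made for illustration.

```python
import sqlite3
from typing import Optional


class SqliteContextStore:
    """Standalone sketch exposing the put/get/contains methods of a context service."""

    def __init__(self, path: str = ":memory:") -> None:
        self._conn = sqlite3.connect(path)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS replacements ("
            "context TEXT NOT NULL, "
            "token TEXT NOT NULL, "
            "replacement TEXT NOT NULL, "
            "PRIMARY KEY (context, token))"
        )

    def put(self, context: str, token: str, replacement: str) -> None:
        # The connection as a context manager commits on success.
        with self._conn:
            self._conn.execute(
                "INSERT OR REPLACE INTO replacements VALUES (?, ?, ?)",
                (context, token, replacement),
            )

    def get(self, context: str, token: str) -> Optional[str]:
        row = self._conn.execute(
            "SELECT replacement FROM replacements WHERE context = ? AND token = ?",
            (context, token),
        ).fetchone()
        return row[0] if row else None

    def contains(self, context: str, token: str) -> bool:
        return self.get(context, token) is not None


store = SqliteContextStore()
store.put("patient-records", "john@example.com", "EMAIL-001")
print(store.get("patient-records", "john@example.com"))  # EMAIL-001
print(store.contains("billing", "john@example.com"))     # False
```

Any persistent context store holds the original tokens alongside their replacements, so it contains the sensitive values themselves and needs to be protected accordingly.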

## API Reference

### `FilterService`

```python
from phileas.services.filter_service import FilterService

service = FilterService(context_service=None)
result = service.filter(policy, context, document_id, text)
```

#### Constructor Parameters

| Parameter | Type | Description |
|---|---|---|
| `context_service` | `AbstractContextService \| None` | Context service implementation to use for referential integrity. Defaults to `InMemoryContextService` when `None`. |

#### `filter()` Parameters

| Parameter | Type | Description |
|---|---|---|
| `policy` | `Policy` | The policy to apply |
| `context` | `str` | Named context that groups documents for referential integrity (e.g., a patient ID or session name) |
| `document_id` | `str` | A unique identifier for the document being filtered |
| `text` | `str` | The text to filter |

#### Returns `FilterResult`

| Attribute | Type | Description |
|---|---|---|
| `filtered_text` | `str` | The text with sensitive information replaced |
| `spans` | `List[Span]` | Metadata about each identified piece of sensitive information |
| `context` | `str` | The context passed to `filter()` |
| `document_id` | `str` | The document ID passed to `filter()` |

### `Span`

| Attribute | Type | Description |
|---|---|---|
| `character_start` | `int` | Start index of the span in the original text |
| `character_end` | `int` | End index of the span in the original text |
| `filter_type` | `str` | The type of PII identified (e.g., `"email-address"`) |
| `text` | `str` | The original text of the span |
| `replacement` | `str` | The replacement value |
| `confidence` | `float` | Confidence score (0.0–1.0) |
| `ignored` | `bool` | Whether this span was marked as ignored (not replaced) |
| `context` | `str` | The context |
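Because span offsets refer to the original text, the filtered text can be reconstructed from the original plus the spans. The sketch below shows the idea with a hypothetical `SpanLike` stand-in; it assumes `character_end` is exclusive (like a Python slice end) and that spans do not overlap, neither of which is confirmed here.

```python
from dataclasses import dataclass


@dataclass
class SpanLike:
    """Hypothetical stand-in carrying only the offset and replacement fields of a Span."""
    character_start: int
    character_end: int  # assumed exclusive, like a Python slice end
    replacement: str


def apply_spans(text: str, spans: list[SpanLike]) -> str:
    """Apply replacements from right to left so earlier offsets stay valid."""
    for span in sorted(spans, key=lambda s: s.character_start, reverse=True):
        text = text[:span.character_start] + span.replacement + text[span.character_end:]
    return text


def span_for(text: str, value: str, replacement: str) -> SpanLike:
    """Build a span for the first occurrence of value in text."""
    start = text.index(value)
    return SpanLike(start, start + len(value), replacement)


original = "Contact john@example.com or call about SSN 123-45-6789."
spans = [
    span_for(original, "john@example.com", "{{{REDACTED-email-address}}}"),
    span_for(original, "123-45-6789", "{{{REDACTED-ssn}}}"),
]
rebuilt = apply_spans(original, spans)
print(rebuilt)
# Contact {{{REDACTED-email-address}}} or call about SSN {{{REDACTED-ssn}}}.
```

Working from the end of the string backwards avoids having to adjust offsets after each replacement changes the text length.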

### `Policy`

```python
from phileas.policy.policy import Policy

# From a dict
policy = Policy.from_dict({"name": "default", "identifiers": {...}})

# From a JSON string
policy = Policy.from_json('{"name": "default", ...}')

# To JSON
json_str = policy.to_json()

# To dict
d = policy.to_dict()
```

### `AbstractContextService`

Abstract base class for context service implementations. Subclass this to provide a custom backend.

```python
from phileas import AbstractContextService

class MyContextService(AbstractContextService):
    def put(self, context: str, token: str, replacement: str) -> None: ...
    def get(self, context: str, token: str) -> str | None: ...
    def contains(self, context: str, token: str) -> bool: ...
```

#### Methods

| Method | Signature | Description |
|---|---|---|
| `put` | `(context, token, replacement) -> None` | Store a replacement value for a token under the given context |
| `get` | `(context, token) -> str \| None` | Return the stored replacement, or `None` if not found |
| `contains` | `(context, token) -> bool` | Return `True` if a replacement exists for the token in the given context |

### `InMemoryContextService`

Default implementation of `AbstractContextService` backed by a `dict[str, dict[str, str]]`. Suitable for single-process, in-memory use.

```python
from phileas import InMemoryContextService

ctx_svc = InMemoryContextService()
ctx_svc.put("my-context", "john@example.com", "EMAIL-001")
ctx_svc.get("my-context", "john@example.com")      # "EMAIL-001"
ctx_svc.contains("my-context", "john@example.com") # True
```

## Examples

### Mask credit card numbers

```python
from phileas.policy.policy import Policy
from phileas.services.filter_service import FilterService

policy_dict = {
    "name": "cc-mask",
    "identifiers": {
        "creditCard": {
            "creditCardFilterStrategies": [{"strategy": "LAST_4"}]
        }
    }
}

policy = Policy.from_dict(policy_dict)
service = FilterService()

result = service.filter(policy, "ctx", "doc1", "Card: 4111111111111111")
print(result.filtered_text)  # Card: ************1111
```

### Hash SSNs

```python
policy_dict = {
    "name": "ssn-hash",
    "identifiers": {
        "ssn": {
            "ssnFilterStrategies": [{"strategy": "HASH_SHA256_REPLACE"}]
        }
    }
}
```

### Disable a filter

```python
policy_dict = {
    "name": "no-url",
    "identifiers": {
        "url": {"enabled": False}
    }
}
```

## CLI

Phileas ships a `phileas` command that performs redaction directly from the terminal.

### Usage

```
phileas -p POLICY_FILE -c CONTEXT (-t TEXT | -f FILE) [options]
```

| Argument | Description |
|---|---|
| `-p / --policy FILE` | Path to a policy file (JSON or YAML). |
| `-c / --context CONTEXT` | Context name for referential integrity. |
| `-t / --text TEXT` | Text to redact (mutually exclusive with `--file`). |
| `-f / --file FILE` | Path to a file to redact (mutually exclusive with `--text`). |
| `-d / --document-id ID` | Optional document identifier (auto-generated if omitted). |
| `-o / --output FILE` | Write redacted text to a file instead of stdout. |
| `--spans` | Print span metadata as JSON to stderr. |
| `--evaluate FILE` | Evaluate redaction quality against a JSON ground-truth file. Prints precision, recall, and F1 metrics to stdout. |

### Examples

Redact a string:

```bash
phileas -p policy.yaml -c my-context -t "Contact john@example.com or call 800-555-1234."
# Contact {{{REDACTED-email-address}}} or call {{{REDACTED-phone-number}}}.
```

Redact a file and write output to a new file:

```bash
phileas -p policy.yaml -c my-context -f report.txt -o report_redacted.txt
```

View span metadata for each detected item:

```bash
phileas -p policy.yaml -c my-context -t "Email john@example.com." --spans
```
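The CLI can also be driven from a script. The sketch below builds the argument list from the flags documented above and only runs the command when `phileas` is found on `PATH`. The helper name `build_phileas_command` is hypothetical, and the file names are placeholders.

```python
import shutil
import subprocess
from typing import Optional


def build_phileas_command(
    policy: str,
    context: str,
    *,
    text: Optional[str] = None,
    file: Optional[str] = None,
    output: Optional[str] = None,
    spans: bool = False,
) -> list[str]:
    """Assemble the argv list; --text and --file are mutually exclusive."""
    if (text is None) == (file is None):
        raise ValueError("pass exactly one of text or file")
    cmd = ["phileas", "-p", policy, "-c", context]
    cmd += ["-t", text] if text is not None else ["-f", file]
    if output is not None:
        cmd += ["-o", output]
    if spans:
        cmd.append("--spans")
    return cmd


cmd = build_phileas_command(
    "policy.yaml", "my-context", text="Email john@example.com.", spans=True
)
print(cmd)

# Only invoke the CLI when it is actually installed.
if shutil.which("phileas"):
    completed = subprocess.run(cmd, capture_output=True, text=True, check=False)
    redacted_text = completed.stdout  # redacted text is written to stdout
    span_json = completed.stderr      # --spans writes span metadata to stderr
```

Passing the arguments as a list avoids shell quoting problems when the text contains spaces or quotes.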

### Evaluation Mode

Use `--evaluate FILE` to measure the redaction quality of a policy against a set of ground-truth annotations. Phileas runs the filter on the input text, compares the detected spans against the ground-truth spans, and prints precision, recall, and F1 metrics to stdout.

```bash
phileas -p policy.json -c my-context -t "Email john@example.com." --evaluate gt.json
```

The ground-truth file must be a JSON array of span objects, or a JSON object with a `"spans"` key. Each span must have `"start"` and `"end"` character positions; `"type"` is optional:

```json
[{"start": 6, "end": 22, "type": "email-address"}]
```
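Counting character offsets by hand is error-prone, so a small script can compute them. The helper below is a hypothetical sketch that locates the first occurrence of each annotated value and emits spans in the format shown above, with `end` equal to `start` plus the value's length, as in the example.

```python
import json


def ground_truth_spans(text: str, annotations: list[tuple[str, str]]) -> list[dict]:
    """Build span dicts from (value, type) pairs, using each value's first occurrence."""
    spans = []
    for value, span_type in annotations:
        start = text.find(value)
        if start == -1:
            raise ValueError(f"{value!r} not found in text")
        spans.append({"start": start, "end": start + len(value), "type": span_type})
    return spans


spans = ground_truth_spans(
    "Email john@example.com.",
    [("john@example.com", "email-address")],
)
print(json.dumps(spans))
# [{"start": 6, "end": 22, "type": "email-address"}]
```

A value that appears more than once in the text would need one annotation per occurrence; this sketch only finds the first.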

**Example output:**

```
Email {{{REDACTED-email-address}}}.
{
  "truePositives": 1,
  "falsePositives": 0,
  "falseNegatives": 0,
  "precision": 1.0,
  "recall": 1.0,
  "f1": 1.0
}
```
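For reference, the reported metrics follow the standard definitions of precision, recall, and F1. The sketch below computes them from sets of `(start, end)` offsets; it assumes a detected span counts as a true positive only when its offsets exactly match a ground-truth span, which may not be the matching rule the built-in tool uses.

```python
def evaluate(detected: set[tuple[int, int]], truth: set[tuple[int, int]]) -> dict:
    """Compute precision, recall, and F1 using exact (start, end) matching."""
    tp = len(detected & truth)  # found and expected
    fp = len(detected - truth)  # found but not expected
    fn = len(truth - detected)  # expected but missed

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0

    return {
        "truePositives": tp,
        "falsePositives": fp,
        "falseNegatives": fn,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


perfect = evaluate({(6, 22)}, {(6, 22)})
print(perfect["precision"], perfect["recall"], perfect["f1"])  # 1.0 1.0 1.0

# One correct span, one spurious detection, one missed span.
mixed = evaluate({(6, 22), (30, 35)}, {(6, 22), (40, 50)})
print(mixed["precision"], mixed["recall"], mixed["f1"])  # 0.5 0.5 0.5
```

Low precision points to over-redaction (false positives), while low recall means sensitive values are slipping through.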

## Running Tests

```bash
pytest tests/ -v
```

## License

Copyright 2026 Philterd, LLC.

Licensed under the Apache License, Version 2.0. See [LICENSE](LICENSE) for details.

"Phileas" and "Philter" are registered trademarks of Philterd, LLC.

This project is a Python port of [Phileas](https://github.com/philterd/phileas), which is also Apache-2.0 licensed.