https://github.com/daedalus/pii-safe


# pii-safe — Redact PII from text

[![PyPI](https://img.shields.io/pypi/v/pii-safe.svg)](https://pypi.org/project/pii-safe/)
[![Python](https://img.shields.io/pypi/pyversions/pii-safe.svg)](https://pypi.org/project/pii-safe/)
[![Ruff](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/ruff/main/assets/badge/v2.json)](https://github.com/astral-sh/ruff)
[![Ask DeepWiki](https://deepwiki.com/badge.svg)](https://deepwiki.com/daedalus/pii-safe)

Uses the [OpenAI Privacy Filter](https://openai.com/index/introducing-openai-privacy-filter/) model to detect and redact personally identifiable information (PII) from text.

## Why hash-based redaction?

Plain `[REDACTED]` placeholders discard all information about which PII values are identical. Replacing each value with `hash(salt | pii_data)` instead gives you:

- **Consistent identifiers**: The same PII always maps to the same hash, enabling cross-document correlation (e.g., "how many documents mention the same person?")
- **Checkable with the salt**: The hash itself is one-way, but anyone holding the salt can recompute the hash of a candidate value and confirm whether it matches a redacted identifier
- **Salt prevents rainbow table attacks**: Without a salt, hashes could be precomputed for common names/emails to reverse-identify PII from redacted text
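
The properties above can be illustrated with a minimal sketch. It assumes SHA-256 and a literal `|` separator between salt and value; the README does not state which hash function pii-safe uses, and `salted_hash` is a hypothetical helper, not part of the library.

```python
import hashlib


def salted_hash(salt: str, pii: str) -> str:
    """Hash `salt|pii` with SHA-256 (assumed; not necessarily what pii-safe does)."""
    return hashlib.sha256(f"{salt}|{pii}".encode("utf-8")).hexdigest()


# Consistent identifiers: the same salt and value always give the same digest.
a = salted_hash("my_secret_salt", "Dario Clavijo")
b = salted_hash("my_secret_salt", "Dario Clavijo")
print(a == b)  # True

# A different salt gives an unrelated digest, so precomputed tables do not transfer.
c = salted_hash("another_salt", "Dario Clavijo")
print(a == c)  # False

# "Reversing" here means checking a candidate value against a known digest.
candidate = "Dario Clavijo"
print(salted_hash("my_secret_salt", candidate) == a)  # True
```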

## Install

```bash
pip install pii-safe
```

## Usage

```python
from pii_safe import redact_text

text = "mi nombre es Dario Clavijo"  # Spanish: "my name is Dario Clavijo"
redacted = redact_text(text)
print(redacted)  # mi nombre es[REDACTED_<hash>], where <hash> is the salted hash of the detected span
```

### Salt for hashing

By default, a random 64-character salt is generated at startup, so the same PII hashes differently on each run. Specify a salt to get consistent hashes across runs:

```python
from pii_safe import redact_text, set_salt

# Option 1: Pass salt to redact_text
redacted = redact_text("mi nombre es Dario Clavijo", salt="my_secret_salt")

# Option 2: Set salt globally
set_salt("my_secret_salt")
redacted = redact_text("mi nombre es Dario Clavijo")
```
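
Because the default salt is random per run, hashes from separate runs only match if the salt is stored and reused. The following is a minimal sketch of one way to do that. The environment variable name `PII_SAFE_SALT` and the helper `load_salt` are hypothetical, and `secrets.token_hex(32)` is just one way to obtain a 64-character random string.

```python
import os
import secrets


def load_salt(env_var: str = "PII_SAFE_SALT") -> str:
    """Return the salt from the environment, or a fresh random one if unset."""
    salt = os.environ.get(env_var)
    if salt:
        return salt
    # 32 random bytes rendered as hex give a 64-character string.
    return secrets.token_hex(32)


salt = load_salt()
print(len(salt))
```

The returned value could then be passed as `salt=` to `redact_text` or to `set_salt`.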

### Using the Redacter class

```python
from pii_safe import Redacter

redacter = Redacter(salt="my_secret_salt")
result1 = redacter.redact("mi nombre es Dario Clavijo")
result2 = redacter.redact("el es Dario Clavijo")

# Same PII gets consistent hash within this instance
hash_map = redacter.get_hash_map()
print(hash_map)  # {' Dario Clavijo': '<hash>'}
```
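
The hash map makes the cross-document question from the "Why hash-based redaction?" section answerable without comparing raw values. A sketch, assuming one `Redacter` per document with all instances sharing one salt, and dictionaries of the PII-to-hash shape returned by `get_hash_map()`. The sample data and the `documents_per_hash` helper are made up for illustration.

```python
from collections import Counter


def documents_per_hash(hash_maps: list[dict[str, str]]) -> Counter[str]:
    """Count how many documents each hash appears in."""
    counts: Counter[str] = Counter()
    for hash_map in hash_maps:
        # set() so that a hash is counted at most once per document
        counts.update(set(hash_map.values()))
    return counts


# Illustrative data only: short fake hashes in place of real digests.
doc_maps = [
    {" Dario Clavijo": "a1b2"},
    {" Dario Clavijo": "a1b2", " Jane Doe": "c3d4"},
]
print(documents_per_hash(doc_maps))  # Counter({'a1b2': 2, 'c3d4': 1})
```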

## CLI

```bash
pii-safe input.txt
pii-safe input.txt -o output.txt
pii-safe input.txt --salt my_secret_salt
```
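
To redact several files so that their hashes stay comparable, the CLI can be driven from a small script that passes the same salt every time. This sketch assumes `-o` and `--salt` can be combined in one invocation, which the examples above do not show explicitly. `build_command` and `redact_directory` are hypothetical helpers.

```python
import subprocess
from pathlib import Path


def build_command(src: Path, dst: Path, salt: str) -> list[str]:
    """Build a pii-safe invocation from the flags shown above."""
    return ["pii-safe", str(src), "-o", str(dst), "--salt", salt]


def redact_directory(in_dir: Path, out_dir: Path, salt: str) -> None:
    """Redact every .txt file in in_dir into out_dir with one shared salt."""
    out_dir.mkdir(parents=True, exist_ok=True)
    for src in sorted(in_dir.glob("*.txt")):
        # check=True raises CalledProcessError if pii-safe exits non-zero
        subprocess.run(build_command(src, out_dir / src.name, salt), check=True)
```

Calling `redact_directory(Path("raw"), Path("redacted"), "my_secret_salt")` would then process each file in turn.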

## Development

```bash
git clone https://github.com/daedalus/pii-safe.git
cd pii-safe
pip install -e ".[test]"

# run tests
pytest

# format
ruff format src/ tests/

# lint
ruff check src/ tests/

# type check
mypy src/
```

## API

### `redact_text(text: str, salt: str | None = None) -> str`

Redacts PII from text using the openai/privacy-filter model.

### `set_salt(salt: str) -> None`

Set the salt for hashing PII in the default redacter.

### `class Redacter`

Context manager for consistent PII-to-hash mapping across calls.

- `__init__(salt: str | None = None)`: Initialize with optional salt
- `redact(text: str) -> str`: Redact PII from text
- `get_hash_map() -> dict[str, str]`: Get PII-to-hash mapping
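
To make the "consistent mapping across calls" behaviour concrete, here is a toy stand-in that mirrors the method names above. It is not the library's implementation: it swaps the openai/privacy-filter model for a simple email regex, and the SHA-256 hash, the `|` separator, and the truncated `[REDACTED_<hash>]` placeholder format are all assumptions.

```python
import hashlib
import re

# Deliberately simple pattern; the real package uses a model, not a regex.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


class ToyRedacter:
    """Stand-in with the same method names as Redacter; detects emails only."""

    def __init__(self, salt: str) -> None:
        self._salt = salt
        self._hash_map: dict[str, str] = {}

    def redact(self, text: str) -> str:
        def replace(match: re.Match[str]) -> str:
            value = match.group(0)
            digest = hashlib.sha256(f"{self._salt}|{value}".encode()).hexdigest()
            self._hash_map[value] = digest  # remembered across calls
            return f"[REDACTED_{digest[:8]}]"

        return EMAIL.sub(replace, text)

    def get_hash_map(self) -> dict[str, str]:
        return dict(self._hash_map)


toy = ToyRedacter(salt="my_secret_salt")
first = toy.redact("write to ana@example.com")
second = toy.redact("ana@example.com replied")
print(first.split()[-1] == second.split()[0])  # True: same value, same placeholder
print(len(toy.get_hash_map()))  # 1
```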