https://github.com/xz-dev/ai-gateway-filter
- Host: GitHub
- URL: https://github.com/xz-dev/ai-gateway-filter
- Owner: xz-dev
- Created: 2026-06-03T09:07:05.000Z (23 days ago)
- Default Branch: master
- Last Pushed: 2026-06-17T09:08:19.000Z (9 days ago)
- Last Synced: 2026-06-17T09:18:02.207Z (9 days ago)
- Language: Python
- Size: 276 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Privacy Gateway (Core Library)
`privacy-gateway` is a **pure Python library** for:
- reversible natural-language PII protection with self-describing encrypted tokens
- automatic restoration of those tokens without special HTTP headers
- prompt-injection phrase detection and decisioning
- streaming detection helper
- automatic sensitive image-region protection and restoration
It is intentionally not a gateway, HTTP server, or network service. JSON parsing,
field selection, routing, proxying, and request/response rewriting belong in the
embedding gateway/plugin. The APISIX example shows one way for a gateway to parse
JSON first and pass only relevant string values to this library.
## Install
```bash
uv sync
```
## Public API
```python
from privacy_gateway import PrivacyGatewayFilter
filter_ = PrivacyGatewayFilter(privacy_password="use-a-high-entropy-deployment-secret")
protected = filter_.protect_privacy_text(
"我叫张三,身份证是110101199001011234,邮箱是zhangsan@example.com。"
)
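# The raw email address should no longer appear in the protected text
# (assumes the call returns the protected string).
assert "zhangsan@example.com" not in protected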
```
The token body contains only crypto material: a per-token salt and the encrypted
value.
- `protect_privacy_text(content, privacy_password=None)` -> replaces detected PII spans with encrypted tokens.
- `restore_privacy_text(content, privacy_password=None)` -> decrypts all privacy tokens inside text.
- `protect_secret(content, privacy_password=None)` -> encrypts one caller-selected complete value as a token.
- `detect_pii(content)` -> returns detected PII spans.
- `process_inbound_privacy_text(content, privacy_password=None)` -> restores tokens, then checks prompt-injection phrases.
- `process_outbound_privacy_text(content, privacy_password=None)` -> checks prompt-injection phrases, then tokenizes detected PII.
Example:
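```python
from privacy_gateway import PrivacyGatewayFilter

filter_ = PrivacyGatewayFilter.from_settings()

# Inbound: restore tokens, then check for prompt-injection phrases.
inbound = filter_.process_inbound_privacy_text("hello ")
if inbound.error:
    ...  # handle the normalized error details
if inbound.decision.blocked:
    ...  # reject the request
plaintext_for_ai = inbound.content

# Outbound: check for prompt-injection phrases, then tokenize detected PII.
outbound = filter_.process_outbound_privacy_text("User 张三 can be reached at zhangsan@example.com")
protected_for_client = outbound.content
```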
`TextProcessingResult` contains `content`, `decision`, and optional normalized
`error` details. It never exposes the password/key used for encryption.
## Detection behavior
The library uses Presidio Analyzer backed by a prepared spaCy model. By default
it expects `en_core_web_sm` to be installed before startup and refuses to
download models at runtime. Prepare it with:
```bash
uv run python scripts/prepare_spacy_model.py en_core_web_sm
```
Presidio recognizers cover common English PII patterns such as:
- `EMAIL_ADDRESS`
- `PHONE_NUMBER`
- `CREDIT_CARD`
- `CRYPTO`
- `IBAN_CODE`
- `IP_ADDRESS`
- `LOCATION`
- `PERSON` when available from configured analyzers
- US identifiers such as `US_SSN`, `US_PASSPORT`, `US_DRIVER_LICENSE`, etc.
The library also adds deterministic rules for common Chinese/business text:
- Chinese mainland ID card numbers
- Chinese mobile numbers
- common Chinese name contexts such as `我叫张三` / `姓名是张三`
- common Chinese address contexts such as `住在北京市...` / `地址是...`
- password/secret contexts such as `password=...` / `密码是...`
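As a rough illustration of what deterministic rules of this kind can look like, the sketch below matches 18-character mainland ID numbers and 11-digit mobile numbers with plain regular expressions. The patterns, the `Span` type, and `find_spans` are hypothetical and are not taken from the library, whose actual rules may be stricter (for example, by validating checksums or requiring context words).

```python
import re
from typing import NamedTuple


class Span(NamedTuple):
    entity: str
    start: int
    end: int
    text: str


# Hypothetical patterns for illustration only; not the library's rules.
# The digit lookarounds keep a pattern from matching inside a longer number.
PATTERNS = {
    "CN_ID_CARD": re.compile(r"(?<!\d)\d{17}[\dXx](?!\d)"),
    "CN_MOBILE": re.compile(r"(?<!\d)1[3-9]\d{9}(?!\d)"),
}


def find_spans(text: str) -> list[Span]:
    """Return all pattern matches, sorted by start offset."""
    spans = [
        Span(entity, m.start(), m.end(), m.group())
        for entity, pattern in PATTERNS.items()
        for m in pattern.finditer(text)
    ]
    return sorted(spans, key=lambda s: s.start)
```

Calling `find_spans("身份证是110101199001011234,手机13812345678。")` yields one `CN_ID_CARD` span followed by one `CN_MOBILE` span. The offsets in each span are what a caller would need to replace the matched text with a token.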
No PII detector is perfect. Gateways that know a string is sensitive because of
its JSON field name should pass the complete field value to `protect_secret(...)`.
This keeps JSON/field logic outside the library while still using the same token
format and crypto.
## JSON and gateway integration
The core library does **not** parse or format JSON. For JSON APIs, the gateway
must:
1. parse JSON first (`json.loads` or framework equivalent),
2. walk the resulting object/list,
3. pass selected string values to `process_inbound_privacy_text`,
`process_outbound_privacy_text`, or `protect_secret`,
4. serialize JSON again.
This prevents unsafe raw-string rewriting of JSON and lets gateway code decide
which message/tool-call/AI-output fields are relevant.
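The four steps above can be sketched on the gateway side roughly as follows. The helper names (`transform_selected`, `rewrite_json_body`) and the idea of selecting fields by key name are assumptions made for illustration; they are not part of this library, and a real gateway may select fields by path or schema instead.

```python
import json
from typing import Any, Callable, Optional


def transform_selected(
    node: Any,
    transform: Callable[[str], str],
    selected_keys: set[str],
    key: Optional[str] = None,
) -> Any:
    """Apply `transform` to string values stored under one of `selected_keys`."""
    if isinstance(node, dict):
        return {
            k: transform_selected(v, transform, selected_keys, k)
            for k, v in node.items()
        }
    if isinstance(node, list):
        # List items inherit the key of the list that contains them.
        return [transform_selected(item, transform, selected_keys, key) for item in node]
    if isinstance(node, str) and key in selected_keys:
        return transform(node)
    return node  # numbers, booleans, None, and unselected strings pass through


def rewrite_json_body(
    raw_body: str, transform: Callable[[str], str], selected_keys: set[str]
) -> str:
    payload = json.loads(raw_body)                                     # 1. parse
    rewritten = transform_selected(payload, transform, selected_keys)  # 2-3. walk, transform
    return json.dumps(rewritten, ensure_ascii=False)                   # 4. serialize
```

In a gateway, `transform` could wrap `process_outbound_privacy_text` and return the result's `content`, after checking `error` and `decision.blocked`. Because only parsed string values are passed through, the JSON structure itself is never rewritten as raw text.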
## Image privacy APIs
Images are never protected by encrypting the whole image. The image API analyzes
the image for sensitive OCR/PII bounding boxes, protects only those pixel regions,
and leaves the rest of the image viewable.
Preferred image APIs:
- `protect_image(content, crypto_key)` -> detects sensitive regions in a base64 image and returns a base64 PNG with protected rectangles.
- `restore_image(content, crypto_key)` -> restores protected rectangles from the region cache or embedded fallback metadata.
Compatibility payload helpers also use this image behavior:
- `encrypt_payload("image", content, crypto_key)` protects detected regions; it does not encrypt the full image.
- `decrypt_payload("image", content, crypto_key)` restores protected regions.
- `restore_payload("image", content, crypto_key)` is an alias for `decrypt_payload`.
For each detected region, the library encrypts the full-quality crop and stores
it in a process-local LRU cache keyed by the encrypted crop's SHA-256 hash. The
cache keeps the newest 1000 region entries. The returned PNG embeds the region
hash and an encrypted low-resolution fallback crop. Restore behavior is:
1. hash cache hit -> restore the original full-quality region,
2. cache miss -> decrypt the embedded low-resolution fallback and scale it back
into place.
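The cache-then-fallback behavior can be sketched with the standard library alone. `RegionCache` and `choose_restore_source` below are hypothetical names, not the library's classes, and refreshing an entry's recency on lookup is an assumption about how the LRU policy works.

```python
import hashlib
from collections import OrderedDict
from typing import Optional


class RegionCache:
    """Process-local LRU cache keyed by the SHA-256 of the encrypted crop."""

    def __init__(self, max_entries: int = 1000) -> None:
        self._entries: "OrderedDict[str, bytes]" = OrderedDict()
        self._max_entries = max_entries

    def put(self, encrypted_crop: bytes) -> str:
        key = hashlib.sha256(encrypted_crop).hexdigest()
        self._entries[key] = encrypted_crop
        self._entries.move_to_end(key)         # mark as most recently used
        while len(self._entries) > self._max_entries:
            self._entries.popitem(last=False)  # evict the least recently used
        return key

    def get(self, key: str) -> Optional[bytes]:
        crop = self._entries.get(key)
        if crop is not None:
            self._entries.move_to_end(key)     # assumption: lookups refresh recency
        return crop


def choose_restore_source(
    cache: RegionCache, region_hash: str, embedded_fallback: bytes
) -> tuple[str, bytes]:
    """Prefer the cached full-quality crop; otherwise use the embedded fallback."""
    cached = cache.get(region_hash)
    if cached is not None:
        return "full_quality", cached
    return "low_res_fallback", embedded_fallback
```

Because the cache is process-local, a restore handled by a different process or after a restart would take the fallback path, which is why the low-resolution crop is embedded in the PNG itself.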
Image region detection uses `presidio-image-redactor`/OCR plus the same prepared
Presidio analyzer configuration as text PII detection. If OCR/region detection
fails, image protection fails closed with `ImageCryptoError` rather than silently
returning an unprotected image. If detection succeeds and finds no sensitive
regions, the original image base64 is returned unchanged.
Deployments must provide the OCR runtime expected by `presidio-image-redactor`
(for example Tesseract in container images) in addition to the prepared spaCy
model.
## Backward-compatible text crypto
Existing whole-text APIs remain available for older callers and tests:
- `encrypt_text(content, crypto_key=None)` -> `str`
- `decrypt_text(content, crypto_key=None)` -> `str`
- `encrypt_payload("text", content, crypto_key)`
- `decrypt_payload("text", content, crypto_key)`
- `restore_payload("text", content, crypto_key)`
Text crypto requires an AES key that is 16, 24, or 32 bytes long. New automatic
privacy flows should prefer token-based PII protection over whole-body payload
encryption.
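A caller can check the key length before handing a key to the legacy APIs. The helper below is a hypothetical sketch; it raises `ValueError` as a stand-in, since the conditions under which the library raises `TextCryptoKeyError` are not documented here.

```python
VALID_AES_KEY_LENGTHS = {16, 24, 32}  # AES-128, AES-192, AES-256


def validate_aes_key(key: bytes) -> bytes:
    """Hypothetical helper: reject keys whose byte length AES cannot use."""
    if len(key) not in VALID_AES_KEY_LENGTHS:
        raise ValueError(
            f"AES key must be 16, 24, or 32 bytes long, got {len(key)}"
        )
    return key
```

The length that matters is the byte length after encoding, not the character count: a two-character string such as `"密码"` occupies six bytes in UTF-8.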
## Filter / detection
- `check_text(text)` -> `FilterDecision`
- `stream_matcher(max_window=None).feed(chunk)` -> `FilterDecision`
`check_text` masks privacy-token bodies before prompt-injection checks,
so ciphertext is not misinterpreted as plaintext instructions.
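To show why a streaming matcher needs a window, here is a minimal sketch of sliding-window phrase detection. `SlidingPhraseMatcher` is a hypothetical stand-in, not the library's `stream_matcher`; it returns a plain list of matched phrases rather than a `FilterDecision`, and case-insensitive matching is an assumption.

```python
class SlidingPhraseMatcher:
    """Hypothetical stand-in for a streaming matcher; not the library's class."""

    def __init__(self, phrases: list[str], max_window: int = 4096) -> None:
        self._phrases = [p.lower() for p in phrases if p]
        longest = max((len(p) for p in self._phrases), default=0)
        # The window must hold at least the longest phrase to catch splits.
        self._max_window = max(max_window, longest, 1)
        self._buffer = ""

    def feed(self, chunk: str) -> list[str]:
        """Append a chunk and return the phrases present in the current window."""
        self._buffer = (self._buffer + chunk.lower())[-self._max_window:]
        return [p for p in self._phrases if p in self._buffer]
```

A phrase split across two chunks is still found, because the tail of the earlier chunk stays in the buffer:

```python
matcher = SlidingPhraseMatcher(["ignore previous instructions"], max_window=64)
matcher.feed("Please ignore prev")     # no match yet
matcher.feed("ious instructions now")  # the phrase is now inside the window
```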
## HTTP adapter primitives
`privacy_gateway.adapters.http` exposes pure data helpers for HTTP gateways
without importing FastAPI, Flask, APISIX, or any networking framework:
```python
from privacy_gateway.adapters.http import build_block_error
```
Legacy encrypted-header helpers are still exported for old integrations, but new
automatic token flows should not depend on them.
## Settings
The library reads these environment variables:
- `PRIVACY_GATEWAY_PASSWORD`: password for reversible privacy tokens.
- `PRIVACY_GATEWAY_CRYPTO_KEY`: optional legacy full-text AES key; also used as token password fallback when `PRIVACY_GATEWAY_PASSWORD` is not set.
- `PRIVACY_GATEWAY_PII_ENTITIES`: comma-separated Presidio entity types to enable.
- `PRIVACY_GATEWAY_SPACY_MODEL`: prepared spaCy model name/path, default `en_core_web_sm`.
- `PRIVACY_GATEWAY_REQUIRE_SPACY_MODEL`: require the model at startup, default `true`.
- `PRIVACY_GATEWAY_SENSITIVE_PHRASES`: comma-separated prompt-injection phrase list.
- `PRIVACY_GATEWAY_MAX_SENSITIVE_STREAM_WINDOW`: integer window size, default `4096`.
Create a configured filter from env:
```python
from privacy_gateway import PrivacyGatewayFilter
from privacy_gateway.config import get_settings
filter_ = PrivacyGatewayFilter.from_settings(get_settings())
```
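For readers who want to see how the variables above map to values, the following is a rough sketch of equivalent parsing. `SketchSettings` and `load_settings` are hypothetical and are not the library's `get_settings`; in particular, the set of strings accepted as "true" is an assumption.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


def _split_csv(value: str) -> tuple[str, ...]:
    """Split a comma-separated value, dropping empty items and whitespace."""
    return tuple(item.strip() for item in value.split(",") if item.strip())


@dataclass(frozen=True)
class SketchSettings:
    password: Optional[str]
    pii_entities: tuple[str, ...]
    spacy_model: str
    require_spacy_model: bool
    sensitive_phrases: tuple[str, ...]
    max_stream_window: int


def load_settings(env: Mapping[str, str] = os.environ) -> SketchSettings:
    require = env.get("PRIVACY_GATEWAY_REQUIRE_SPACY_MODEL", "true")
    return SketchSettings(
        # The legacy crypto key doubles as the token password fallback.
        password=env.get("PRIVACY_GATEWAY_PASSWORD")
        or env.get("PRIVACY_GATEWAY_CRYPTO_KEY"),
        pii_entities=_split_csv(env.get("PRIVACY_GATEWAY_PII_ENTITIES", "")),
        spacy_model=env.get("PRIVACY_GATEWAY_SPACY_MODEL", "en_core_web_sm"),
        # Assumption: these spellings count as true; anything else is false.
        require_spacy_model=require.strip().lower() in {"1", "true", "yes"},
        sensitive_phrases=_split_csv(env.get("PRIVACY_GATEWAY_SENSITIVE_PHRASES", "")),
        max_stream_window=int(
            env.get("PRIVACY_GATEWAY_MAX_SENSITIVE_STREAM_WINDOW", "4096")
        ),
    )
```

Passing a plain dictionary instead of `os.environ` makes the parsing easy to exercise in tests without touching the process environment.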
## Validation
Run behavior tests on core library APIs:
```bash
uv run python scripts/prepare_spacy_model.py en_core_web_sm
uv run behave
uv run python -m compileall -q src/privacy_gateway
```
To include the APISIX example files in syntax validation:
```bash
uv run python -m compileall -q src/privacy_gateway \
apisix-plugin-example/init \
apisix-plugin-example/privacy_proxy \
apisix-plugin-example/runner/apisix/plugins \
apisix-plugin-example/upstream \
apisix-plugin-example/tests
```
## Error classes
- `TextCryptoError`
- `TextCryptoKeyError`
- `ImageCryptoError`
- `UnsupportedPayloadTypeError`
- `PrivacyGatewayError`
## Notes
- This repository intentionally provides **library primitives only**.
- Gateway plugins should import and embed `PrivacyGatewayFilter`.
- Gateway plugins should own JSON parsing/field selection and use this library for natural-language string protection/restoration.