{"id":50271757,"url":"https://github.com/crp4222/privaite","last_synced_at":"2026-06-27T19:01:16.136Z","repository":{"id":356630606,"uuid":"1232100220","full_name":"crp4222/PrivAiTe","owner":"crp4222","description":"Privacy-first LLM proxy : automatically anonymizes PII before sending to any LLM provider, then de-anonymizes responses. Drop-in OpenAI-compatible API. Works with OpenWebUI, Ollama, OpenAI, Anthropic.","archived":false,"fork":false,"pushed_at":"2026-06-22T20:11:36.000Z","size":463,"stargazers_count":5,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-06-22T20:12:52.795Z","etag":null,"topics":["anonymization","gdpr","litellm","llm","openai","pii","privacy","proxy"],"latest_commit_sha":null,"homepage":"https://github.com/crp4222/PrivAiTe","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"bsd-3-clause","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/crp4222.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-05-07T15:32:39.000Z","updated_at":"2026-06-22T20:11:47.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/crp4222/PrivAiTe","commit_stats":null,"previous_names":["crp4222/privaite"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/crp4222/PrivAiTe","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/crp4222%2FPrivAiTe","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/crp4222%2FPrivAiTe/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/crp4222%2FPrivAiTe/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/crp4222%2FPrivAiTe/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/crp4222","download_url":"https://codeload.github.com/crp4222/PrivAiTe/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/crp4222%2FPrivAiTe/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34864431,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-27T02:00:06.362Z","response_time":126,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["anonymization","gdpr","litellm","llm","openai","pii","privacy","proxy"],"created_at":"2026-05-27T18:04:13.278Z","updated_at":"2026-06-27T19:01:16.130Z","avatar_url":"https://github.com/crp4222.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# PrivAiTe\n\n[![CI](https://github.com/crp4222/PrivAiTe/actions/workflows/ci.yml/badge.svg)](https://github.com/crp4222/PrivAiTe/actions)\n[![Python 3.11+](https://img.shields.io/badge/python-3.11+-blue.svg)](https://www.python.org/downloads/)\n[![License](https://img.shields.io/badge/license-BSD--3--Clause-green.svg)](LICENSE)\n\nKeep personal data out of your LLM calls. PrivAiTe is a local proxy that sits between your app and the model provider. It finds names, emails, phone numbers, cards, IBANs, secrets and more, swaps them for stand-ins before anything leaves your machine, and puts the real values back in the reply. It does this across message text, **tool-call arguments, and multimodal content**, which is where most tools stop looking. Detection runs on your machine and nothing phones home. By default it runs the full ONNX suite, so it also catches **secrets and passwords**, not just the easy regex entities. Point any OpenAI-compatible client at it.\n\n```\nYou type: \"Je m'appelle Marie Dupont, email marie@acme.com\"\nLLM sees: \"Je m'appelle \u003cPERSON_1\u003e, email \u003cEMAIL_ADDRESS_1\u003e\"\nLLM says: \"Bonjour \u003cPERSON_1\u003e, votre email \u003cEMAIL_ADDRESS_1\u003e est noté.\"\nYou  see: \"Bonjour Marie Dupont, votre email marie@acme.com est noté.\"\n```\n\nThis is local pseudonymization, not anonymization, and detection is best-effort rather than a guarantee. You remain the data controller. The [Threat model](#threat-model) spells out exactly what it protects against and what it does not.\n\n## How detection works\n\nPrivAiTe uses two detection engines that can run together or separately:\n\n### Presidio (Microsoft): regex + spaCy NER\n\nThe default engine. Handles structured PII through pattern matching and basic NER.\n\n| What it detects | How |\n|---|---|\n| Emails | Regex |\n| Phone numbers | Regex + international format validation |\n| Credit cards | Regex + Luhn checksum |\n| IBAN | Regex + checksum validation |\n| IP addresses | Regex |\n| US SSN | Regex + format validation |\n| Person names (capitalized, 2+ words) | spaCy NER, only kept if all words are capitalized |\n| Person names (lowercase or single word) | Contextual regex, only after \"je m'appelle X\", \"my name is X\", \"ich heiße X\", \"Nom: X\", etc. |\n| Dates (FR/DE) | Custom regex, \"15 mars 1987\", \"3. März 1990\" |\n\nPresidio is fast (~23ms/request) and produces zero false positives on code, news articles, and technical text. The tradeoff: it misses names that spaCy doesn't recognize (unusual names, single-word names without context) and doesn't detect secrets/passwords.\n\n### OpenAI Privacy Filter: contextual ML model\n\n[OpenAI's open-source PII model](https://openai.com/index/introducing-openai-privacy-filter/) (1.5B params, 50M active, Apache 2.0). Runs locally via ONNX Runtime (~800MB, no PyTorch needed).\n\n| What it adds over Presidio | How |\n|---|---|\n| Person names (any format, any case) | ML NER, understands context, not just capitalization |\n| Passwords and secrets | Detects \"SuperSecret2024!\", API keys like \"sk-proj-...\" |\n| Account numbers | Detects bank account numbers, policy numbers, etc. |\n| Dates (all languages) | ML-based, not limited to FR/DE regex |\n\nThe Privacy Filter is slower (~400ms/request) and occasionally flags technical identifiers as account numbers (e.g., \"CMD-2024-98765\"). It runs as a second pass alongside Presidio, which handles the regex-based entities while the Privacy Filter handles contextual NER.\n\n### Why two engines?\n\nNeither is perfect alone:\n- **Presidio alone** misses names that spaCy doesn't recognize, and can't detect secrets. But it has zero false positives.\n- **Privacy Filter alone** misses some names in credit/list formats, and doesn't have regex validators for IBAN/credit card checksums.\n- **Both together** cover each other's blind spots. Presidio handles structured formats with validation, the Privacy Filter handles context-dependent PII.\n\n## Presets\n\n`onnx` is the default. It runs the full suite and detects everything, including secrets and passwords. `light` is a faster, zero false-positive option for when you only care about classic PII.\n\n| Preset | What runs | Detection | False positives | Speed | Secrets |\n|--------|-----------|-----------|-----------------|-------|---------|\n| `onnx` (default) | Presidio + Privacy Filter | **100%** | ~7% | 400ms | **yes** |\n| `light` | Presidio only | 97% | **0%** | **23ms** | no |\n\n```yaml\npii:\n  preset: \"onnx\"    # Default. Detects everything including secrets. Downloads the model on first run.\n  # preset: \"light\" # Faster, zero false positives, classic PII only.\n```\n\nThe default install already includes onnxruntime and the tokenizer, so the `onnx` preset works out of the box. The model is downloaded the first time the proxy starts. The `ml` extra (the `standard` and `full` BERT presets) is the only one that adds torch.\n\n**When to use `onnx` (default):** You want maximum coverage. Secrets, passwords, API keys, account numbers, unusual names. Accept occasional false positives on technical identifiers.\n\n**When to use `light`:** You want zero disruption and the fastest path. Code, news, business text all pass through untouched. Only clearly identifiable PII (names, emails, phones, cards, IBANs) is anonymized.\n\nTwo other presets exist (`standard`, `full`) but are less useful in practice: they add BERT NER, which does not improve much over spaCy and pulls in PyTorch.\n\n## Benchmark\n\nTested on 61 documents across 5 languages (FR, EN, DE, ES, IT). Corporate letters, contracts, invoices, medical referrals, CVs, bank transfers, news articles, codebases. Mix of synthetic data (valid checksums) and real-world public report extracts.\n\n| | **light** | **onnx** |\n|---|---|---|\n| Detection | 96.7% (236/244) | **100% (244/244)** |\n| False positives | **0/14 (0%)** | 1/14 (7%) |\n| PERSON | 93% | **100%** |\n| EMAIL | 98% | **100%** |\n| PHONE | 100% | 100% |\n| IBAN | 100% | 100% |\n| CREDIT_CARD | 100% | 100% |\n| DATE | 100% | 100% |\n| SSN | 100% | 100% |\n| Secrets | no | **yes** |\n\nThe `light` misses are all PERSON entities: single-word names, long multi-part Spanish names, and names spaCy doesn't recognize. Regex entities are 100% on both presets.\n\nFull benchmark with all test data: [privaite-bench](https://github.com/crp4222/privaite-bench)\n\n## What's NOT detected by default\n\nThe default `onnx` preset does detect personal addresses (as `LOCATION`) and personal URLs (as `URL`) through the Privacy Filter model, and replaces them. What stays off by default are Presidio's broad recognizers for those types, because they cause heavy false positives:\n\n- **Generic place names (the Presidio LOCATION recognizer):** \"Paris\" or \"London\" on their own aren't PII, and spaCy flags ordinary words (\"Kubernetes\", \"Saturday\") as locations. The `onnx` preset keeps this recognizer off and relies on the model's context-aware address detection instead.\n- **The Presidio URL regex:** it matches code like `logging.getLogger` because `.ge` is a valid TLD. The `onnx` preset keeps it off, and the model still catches genuine personal URLs.\n\nOn the `light` preset (Presidio only), addresses and URLs are not detected. Secrets and passwords are detected only by the `onnx` preset. Any recognizer can be turned on in the YAML config.\n\n## Threat model\n\nPrivAiTe performs **local pseudonymization**, not guaranteed anonymization. Detection runs on your machine; the real ↔ placeholder mapping lives in memory only for the duration of a request and is dropped afterwards.\n\n**What it protects against:** the LLM provider storing, training on, or logging your raw PII. The provider receives placeholders (`\u003cPERSON_1\u003e`, …) for everything the detector catches, across message content, tool-call arguments, and multimodal text.\n\n**What it does NOT protect against:**\n\n- **PII the detector misses.** Detection is statistical and never 100% (see the [benchmark](https://github.com/crp4222/privaite-bench)). A name it doesn't recognize reaches the provider. The `onnx` preset has the best recall; treat the output as best-effort, not a guarantee.\n- **Re-identification from context.** Even with names replaced, the surrounding text can stay identifying (\"the CEO of `\u003cORG_1\u003e` who resigned in March\").\n- **A compromised local machine.** The mapping and raw text live in local memory; this is not a defense against a local attacker.\n- **The provider correlating** requests within a session.\n\nFor GDPR/HIPAA: treat this as pseudonymization + transfer minimization, not anonymization. If you need irreversible removal, use `method: \"redact\"` instead of `method: \"placeholder\"`.\n\n## Alternatives\n\nKeeping PII out of LLM calls is a crowded space, and PrivAiTe is not always the right pick. Based on each project's public docs as of June 2026:\n\n- LiteLLM has a built-in Presidio guardrail, the natural choice if you already run the LiteLLM proxy and want PII handling inline (there are a few open bugs around scrubbing requests and responses).\n- Managed/cloud options exist too, such as Microsoft PII Shield and [LangChain's gateway redaction](https://docs.langchain.com/langsmith/llm-gateway-redaction).\n\nWhere PrivAiTe differs: it anonymizes PII **inside tool-call arguments and multimodal content**, not just message text (LangChain's gateway docs, for instance, note that tool-call arguments are not scanned), it **restores** the original values in the response, and it ships a [reproducible benchmark](https://github.com/crp4222/privaite-bench). If your traffic is agentic or multimodal, that gap is the reason this exists.\n\n## Quick start\n\n### 1. Install\n\n```bash\npip install -e .\npython -m spacy download en_core_web_lg\npython -m spacy download fr_core_news_md\n```\n\nThe default `onnx` preset downloads its model the first time the proxy starts. Want the lighter, faster path with no model download? Set `preset: \"light\"` in your config.\n\n### 2. Configure\n\n```bash\ncp .env.example .env\ncp config/privaite.example.yaml config/privaite.yaml\n```\n\nEdit `.env` with your API keys and `config/privaite.yaml` with your LLM providers.\n\n### 3. Run\n\n```bash\npython -m privaite\n\n# Dev mode (auto-reload)\npython -m privaite --reload\n```\n\n### 4. Connect\n\nPoint any OpenAI-compatible client to `http://localhost:8400/v1` with your proxy API key. Ready-to-run client snippets (curl, Python, Node) are in [`examples/`](examples/).\n\n**OpenWebUI (Docker):** Admin → Settings → Connections → OpenAI API:\n- URL: `http://host.docker.internal:8400/v1`\n- Key: your `PRIVAITE_API_KEYS` value\n\nIf you would rather not run a separate proxy, there is also an in-process Open\nWebUI filter (see [Open WebUI filter](#open-webui-filter) below).\n\n## Docker\n\n```bash\ndocker compose up -d\n```\n\n## Open WebUI filter\n\n`integrations/openwebui/privaite_filter.py` is an Open WebUI Filter Function. It\nruns the engine inside Open WebUI, so it anonymizes the outgoing request and\nrestores PII in the reply without a separate proxy. It covers message text,\ntool-call arguments, and multimodal text.\n\nTo install it: Admin Panel → Functions → \"+\", paste the file, save, enable it,\nthen open its valves to pick the preset (`light` or `onnx`) and the languages.\nThe filter pulls Presidio and spaCy into Open WebUI and downloads the spaCy\nmodels on first use, so the first request after enabling it can be slow. Setup\nnotes are in [`integrations/openwebui/README.md`](integrations/openwebui/README.md).\n\n## Configuration\n\n### LLM providers\n\nAny [LiteLLM-supported provider](https://docs.litellm.ai/docs/providers) works:\n\n```yaml\nproviders:\n  - model_name: \"gpt-4o\"\n    litellm_params:\n      model: \"openai/gpt-4o\"\n      api_key: \"${OPENAI_API_KEY}\"\n\n  - model_name: \"local-llama\"\n    litellm_params:\n      model: \"ollama/llama3.1\"\n      api_base: \"http://localhost:11434\"\n```\n\n### Anonymization method\n\n```yaml\npii:\n  anonymization:\n    method: \"placeholder\"        # \u003cPERSON_1\u003e, \u003cEMAIL_ADDRESS_1\u003e (recommended)\n    # method: \"fake_replacement\" # Realistic fakes via Faker (Jean → Michel)\n    # method: \"redact\"           # [PERSON], [EMAIL_ADDRESS] (irreversible)\n    # method: \"mask\"             # ********\n```\n\n### Custom regex patterns\n\nAdd your own PII patterns without touching code:\n\n```yaml\npii:\n  custom_patterns:\n    - pattern: \"KD-\\\\d{6}\"\n      entity_type: \"CUSTOMER_ID\"\n    - pattern: \"REF-[A-Z]{3}-\\\\d+\"\n      entity_type: \"REFERENCE\"\n```\n\n### Languages\n\n7 languages supported with spaCy NER and contextual patterns: FR, EN, DE, ES, IT (benchmarked), plus PT and NL (best-effort, not yet in the benchmark).\n\n```yaml\npii:\n  detectors:\n    presidio:\n      languages: [\"fr\", \"en\"]  # Add \"de\", \"es\", etc.\n```\n\nEach language needs its spaCy model: `python -m spacy download de_core_news_md`\n\n## API\n\nOpenAI-compatible:\n\n| Endpoint | Description |\n|----------|-------------|\n| `POST /v1/chat/completions` | Chat (streaming + non-streaming) |\n| `POST /v1/completions` | Text completions |\n| `POST /v1/embeddings` | Embeddings (anonymized, no de-anonymization) |\n| `GET /v1/models` | List configured models |\n| `GET /health` | Health check |\n| `GET /ready` | Readiness check |\n| `GET /stats` | PII detection stats per session |\n\n### What gets anonymized\n\nPII is stripped from every field that carries user text to the provider:\n\n- `messages[].content`, whether a plain string or a multimodal list of parts (text parts are scrubbed, images and audio are left alone).\n- `tool_calls[].function.arguments` and the legacy `function_call.arguments`: parsed as JSON and scrubbed value by value, so object keys and the function name stay intact. Arguments that are not valid JSON are scrubbed as free text.\n- `/v1/completions` `prompt` and `/v1/embeddings` `input`, as a string or a list of strings.\n\nOn the way back, the original values are restored in `message.content` and in returned `tool_calls` (including the legacy `function_call`), in both non-streaming and streaming responses. Set `pii.passthrough.tool_calls: true` to forward tool-call arguments unchanged.\n\nFor a stricter posture, set `pii.strict: true`: any request whose content can't be inspected (a shape that is neither text nor a known media part) is rejected with `400` instead of being forwarded.\n\n## Known limitations\n\n- **Single-word names** from spaCy are dropped (too many false positives). Caught by contextual patterns (\"Nom: X\") or the `onnx` preset.\n- **Lowercase names** need intro patterns (\"je m'appelle X\"). The `onnx` preset catches them without patterns.\n- **Informal dates** (\"last Tuesday\", \"il y a deux ans\") are not detected.\n- **No policy gate**: all requests are forwarded after pseudonymization.\n\n## Development\n\n```bash\npip install -e \".[dev]\"\npython -m pytest tests/ -v\n```\n\n## License\n\nBSD 3-Clause. See [LICENSE](LICENSE).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcrp4222%2Fprivaite","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcrp4222%2Fprivaite","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcrp4222%2Fprivaite/lists"}