{"id":51022889,"url":"https://github.com/screenpipe/screenleak","last_synced_at":"2026-06-21T17:02:03.065Z","repository":{"id":357193449,"uuid":"1235793033","full_name":"screenpipe/screenleak","owner":"screenpipe","description":"Multi-modal benchmark for measuring sensitive-information disclosure in computer-use agents","archived":false,"fork":false,"pushed_at":"2026-06-02T00:26:49.000Z","size":2513,"stargazers_count":5,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-06-02T01:19:01.679Z","etag":null,"topics":["benchmark","computer-use","computer-use-agent","eval","evaluation"],"latest_commit_sha":null,"homepage":"https://screenpipe.github.io/screenleak/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/screenpipe.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":"THREAT_MODEL.md","audit":null,"citation":"CITATION.bib","codeowners":null,"security":"SECURITY.md","support":null,"governance":null,"roadmap":"ROADMAP.md","authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-05-11T16:56:33.000Z","updated_at":"2026-06-02T00:26:53.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/screenpipe/screenleak","commit_stats":null,"previous_names":["screenpipe/screenleak"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/screenpipe/screenleak","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/screenpipe%2Fscreenleak","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/screenpipe%2Fscreenleak/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/screenpipe%2Fscreenleak/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/screenpipe%2Fscreenleak/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/screenpipe","download_url":"https://codeload.github.com/screenpipe/screenleak/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/screenpipe%2Fscreenleak/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34618484,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-21T02:00:05.568Z","response_time":54,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["benchmark","computer-use","computer-use-agent","eval","evaluation"],"created_at":"2026-06-21T17:02:02.154Z","updated_at":"2026-06-21T17:02:03.054Z","avatar_url":"https://github.com/screenpipe.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# ScreenLeak\n\n[![License: Apache 2.0](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](LICENSE)\n[![Data License: CC BY 4.0](https://img.shields.io/badge/Data-CC_BY_4.0-lightgrey.svg)](LICENSE-DATA)\n[![ci](https://img.shields.io/badge/ci-pytest%20%2B%20ruff-black)](.github/workflows/ci.yml)\n[![python](https://img.shields.io/badge/python-3.10%2B-blue)](pyproject.toml)\n\n\u003e **A multi-modal benchmark measuring how well today's tools redact PII from screen telemetry, screenshots, and multi-step computer-use traces.**\n\nBlog: [screenpipe.github.io/screenleak](https://screenpipe.github.io/screenleak/) · Contact: `louis@screenpi.pe`\n\n**▶ [Try the live demo](https://screenpipe.github.io/screenleak/demo/)** — redact text and screenshots **in your browser** with the actual local models (`pii-redactor` + `rfdetr_v11`). Nothing is uploaded.\n\n## Headline — composite framework coverage (text · image · trace)\n\nEach adapter is scored on every surface where it operates. The composite is the mean across surfaces; the trace surface is the weakest link and caps every row.\n\n| Framework | Text\u003cbr\u003e(`v45_phase3`) | Image\u003cbr\u003e(`rfdetr_v11`) | Trace\u003cbr\u003e(`gpt5`) | **Composite** |\n|---|---:|---:|---:|---:|\n| HIPAA   | 91.8% | 98.8% | 76.0% | **88.9%** |\n| GDPR    | 90.2% | 98.8% | 68.0% | 85.7% |\n| CCPA    | 90.2% | 98.8% | 68.0% | 85.7% |\n| SOC 2   | 88.0% | 98.9% | 68.0% | 85.0% |\n| PCI DSS | 88.7% | 100.0% | 78.3% | **89.0%** |\n| DPDPA   | 91.6% | 98.8% | 72.0% | 87.5% |\n\nSingle shared label-subset dict ([`scoring/frameworks.py`](scoring/frameworks.py)) applies across all three surfaces. Numbers are zero-leak rates on the private val sets (422 text · 221 image · 25 trace) — re-runnable from the public sample with the same probes. Full unified breakdown: [`results/framework_coverage.md`](results/framework_coverage.md).\n\n## Three sub-benches\n\n| Bench | Question | Corpus |\n|---|---|---|\n| [`text/`](text/) | Find PII in window titles / AX nodes / OCR fragments | 422 cases · 13 labels · multilingual + adversarial splits |\n| [`image/`](image/) | Find pixel regions of PII in rendered screens | Synthetic screenshots · 9 app templates · pixel-precise gold |\n| [`trace/`](trace/) | Does the agent leak PII it observes inside a task? | 50 traces (25 train + 25 val) with injected PII |\n\nAll three use the same 13-label taxonomy ([`CATEGORIES.md`](CATEGORIES.md)).\n\n## Three failure modes — text adapters, full comparison\n\nZero-leak alone is half the picture. Local size + RAM + latency are the other half. The `v45_phase3` row is the only one strong on every dimension.\n\n| Adapter | Local | Zero-leak | Oversmash | macro-F1 | Size | RAM | p50 |\n|---|:---:|---:|---:|---:|---:|---:|---:|\n| `gemini` (gemini-3.1-pro) | ❌ | **91.0%** | 2.6% | 0.85 | — | — | 3 754 ms |\n| `gpt5` (gpt-5.5) | ❌ | 90.7% | 5.2% | 0.85 | — | — | 2 173 ms |\n| `claude` (claude-opus-4-7) | ❌ | 87.8% | 5.2% | 0.81 | — | — | 1 550 ms |\n| **`v45_phase3`** ⭐ | ✅ | **86.7%** | **0.0%** | 0.78 | **278 MB** | **1.1 GB** | **9 ms** |\n| `privacy_filter_ft_v6` | ✅ | 80.9% | 3.9% | 0.72 | ~ 1.4 GB | ~ 6 GB | 54 ms |\n| `opf_rs` / privacy_filter family | ✅ | 38 – 80% | 4 – 10% | 0.35 – 0.72 | ~ 1.4 GB | ~ 6 GB | 1 – 120 ms |\n| `gliner_pii` | ✅ | 62.6% | 79.2% | 0.44 | ~ 500 MB | ~ 1.5 GB | 104 ms |\n| `gcp_dlp` | ❌ | 37.7% | 11.7% | 0.24 | — | — | 84 ms |\n| `presidio` | ✅ | 35.4% | 22.1% | 0.20 | ~ 200 MB | ~ 400 MB | 6 ms |\n| `regex` | ✅ | 33.9% | 1.3% | 0.57 | \u003c 1 MB | ~ 30 MB | \u003c 1 ms |\n\n`v45_phase3` peak RSS measured via `cargo run --release --example v45_phase3_smoke --features onnx-cpu`. Other adapters' size / RAM from each model's card.\n\n## Findings\n\n**1. A 278 MB local model matches frontier APIs on text. Cloud DLP products don't.** On the 422-case text bench, Gemini / GPT-5.5 / Claude all score 87.8 – 91.0 % zero-leak. `v45_phase3` averages 90 % across compliance frameworks at 9 ms p50 on CPU. Google Cloud DLP (37.7 %) and Microsoft Presidio (35.4 %) barely beat a hand-rolled regex (33.9 %) — they were built for documents, not screen telemetry.\n\n**2. Frontier vision can't draw boxes. A small specialized detector can.** On 190 PII-bearing screenshots at IoU ≥ 0.30, every frontier vision API sits below 5 % zero-leak with CIs that overlap each other and the Tesseract+regex baseline. A locally fine-tuned RF-DETR-Nano (28 M params, ~109 MB, 512×512 input) hits **98.9 %** with a lower CI bound of 96.2 % — decisively separated.\n\n**3. Frontier APIs don't withhold PII when working.** On 25 multi-turn traces, the strongest (GPT-5.5 at 64.0 %, CI 44 – 80 %) still leaks at least one observed PII item in 36 % of traces. Gemini at 20 % leaks in 80 % of traces.\n\n**The pattern:** capability ≠ pixel-grounding ≠ disposition. A model nailing text detection at 91 % can still leak PII 80 % of the time when it observes that PII inside a task. See [`THREAT_MODEL.md`](THREAT_MODEL.md) for what counts as a leak, [`LIMITATIONS.md`](LIMITATIONS.md) for caveats (notably: `rfdetr`'s val is in-distribution with its training — held-out images, same source).\n\n## What's in this repo\n\n- **Scoring code** — `text/src/score.py`, `image/src/score.py`, `trace/src/score.py` + `trace/src/replay.py`. Shared framework probe at [`scoring/frameworks.py`](scoring/frameworks.py).\n- **Every adapter benchmarked** — Claude, GPT-5.5, Gemini, GCP DLP, Presidio, GLiNER, `privacy_filter` family, RF-DETR, regex baselines.\n- **Methodology / threat model / categories / limitations / citation.**\n- **Public sample per surface** so any adapter can be run end-to-end without the full corpus:\n  - `text/data/sample.jsonl` — 51 cases (incl. PHI / PCI / Art. 9 / multilingual tasters)\n  - `image/corpus/sample/` — 30 rendered screenshots + DOM-extracted gold bboxes\n  - `trace/data/injected_sample.jsonl` — 5 multi-turn computer-use traces\n\nThe **full val sets** and the data pipelines that produce them live in a private companion repo — to keep the leaderboard uncontaminated by training-on-the-bench and to preserve the moat. Researchers running serious evaluations can request access at `louis@screenpi.pe`.\n\n## Run an adapter on the sample\n\n```bash\nmake install\nexport ANTHROPIC_API_KEY=...  OPENAI_API_KEY=...  GOOGLE_API_KEY=...\n\nmake bench-text  ADAPTER=claude          # or: gpt5, gemini, v45_phase3, gcp_dlp, regex, …\nmake bench-image ADAPTER=rfdetr          # or: claude, gpt5, gemini, regex_ocr, …\nmake bench-trace ADAPTER=claude          # or: gpt5, gemini\n```\n\nHeadline leaderboard numbers are computed on the full private corpus; sample-corpus runs are for adapter validation and onboarding, not re-ranking.\n\n## Cite this\n\nSee [CITATION.bib](CITATION.bib).\n\n## Contact\n\n`louis@screenpi.pe`\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fscreenpipe%2Fscreenleak","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fscreenpipe%2Fscreenleak","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fscreenpipe%2Fscreenleak/lists"}