https://github.com/screenpipe/screenleak
Multi-modal benchmark for measuring sensitive-information disclosure in computer-use agents
benchmark computer-use computer-use-agent eval evaluation
- Host: GitHub
- URL: https://github.com/screenpipe/screenleak
- Owner: screenpipe
- License: apache-2.0
- Created: 2026-05-11T16:56:33.000Z (about 2 months ago)
- Default Branch: main
- Last Pushed: 2026-06-02T00:26:49.000Z (28 days ago)
- Last Synced: 2026-06-02T01:19:01.679Z (28 days ago)
- Topics: benchmark, computer-use, computer-use-agent, eval, evaluation
- Language: Python
- Homepage: https://screenpipe.github.io/screenleak/
- Size: 2.4 MB
- Stars: 5
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Threat model: THREAT_MODEL.md
- Citation: CITATION.bib
- Security: SECURITY.md
- Roadmap: ROADMAP.md
README
# ScreenLeak
> **A multi-modal benchmark measuring how well today's tools redact PII from screen telemetry, screenshots, and multi-step computer-use traces.**
Blog: [screenpipe.github.io/screenleak](https://screenpipe.github.io/screenleak/) · Contact: `louis@screenpi.pe`
**▶ [Try the live demo](https://screenpipe.github.io/screenleak/demo/)** — redact text and screenshots **in your browser** with the actual local models (`pii-redactor` + `rfdetr_v11`). Nothing is uploaded.
## Headline — composite framework coverage (text · image · trace)
Each adapter is scored on every surface where it operates. The composite is the unweighted mean of the three surface scores; the trace surface scores lowest in every row and pulls each composite down.
| Framework | Text (`v45_phase3`) | Image (`rfdetr_v11`) | Trace (`gpt5`) | **Composite** |
|---|---:|---:|---:|---:|
| HIPAA | 91.8% | 98.8% | 76.0% | **88.9%** |
| GDPR | 90.2% | 98.8% | 68.0% | 85.7% |
| CCPA | 90.2% | 98.8% | 68.0% | 85.7% |
| SOC 2 | 88.0% | 98.9% | 68.0% | 85.0% |
| PCI DSS | 88.7% | 100.0% | 78.3% | **89.0%** |
| DPDPA | 91.6% | 98.8% | 72.0% | 87.5% |
A single shared label-subset dict ([`scoring/frameworks.py`](scoring/frameworks.py)) applies across all three surfaces. Numbers are zero-leak rates on the private val sets (422 text · 221 image · 25 trace); the same probes can be re-run on the public sample. Full unified breakdown: [`results/framework_coverage.md`](results/framework_coverage.md).
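The composite column can be checked directly against the per-surface columns. The sketch below assumes the composite is the unweighted mean of the three surface rates, which is consistent with every row of the table; the names `SURFACE_RATES` and `composite` are illustrative and not taken from the repo.

```python
from statistics import mean

# Per-surface zero-leak rates (percent), copied from the headline table.
SURFACE_RATES = {
    "HIPAA":   {"text": 91.8, "image": 98.8,  "trace": 76.0},
    "GDPR":    {"text": 90.2, "image": 98.8,  "trace": 68.0},
    "CCPA":    {"text": 90.2, "image": 98.8,  "trace": 68.0},
    "SOC 2":   {"text": 88.0, "image": 98.9,  "trace": 68.0},
    "PCI DSS": {"text": 88.7, "image": 100.0, "trace": 78.3},
    "DPDPA":   {"text": 91.6, "image": 98.8,  "trace": 72.0},
}


def composite(rates: dict[str, float]) -> float:
    """Unweighted mean across surfaces, rounded to one decimal place."""
    return round(mean(rates.values()), 1)


composites = {framework: composite(r) for framework, r in SURFACE_RATES.items()}

for framework, value in composites.items():
    print(f"{framework:8s} {value:.1f}%")
```

Because the mean is unweighted, a low trace score moves the composite as much as a high image score does, even though the trace val set is far smaller than the other two.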
## Three sub-benches
| Bench | Question | Corpus |
|---|---|---|
| [`text/`](text/) | Find PII in window titles / AX nodes / OCR fragments | 422 cases · 13 labels · multilingual + adversarial splits |
| [`image/`](image/) | Find pixel regions of PII in rendered screens | Synthetic screenshots · 9 app templates · pixel-precise gold |
| [`trace/`](trace/) | Does the agent leak PII it observes inside a task? | 50 traces (25 train + 25 val) with injected PII |
All three use the same 13-label taxonomy ([`CATEGORIES.md`](CATEGORIES.md)).
## Three failure modes — text adapters, full comparison
Zero-leak alone is half the picture. Local size + RAM + latency are the other half. The `v45_phase3` row is the only one strong on every dimension.
| Adapter | Local | Zero-leak | Oversmash | macro-F1 | Size | RAM | p50 |
|---|:---:|---:|---:|---:|---:|---:|---:|
| `gemini` (gemini-3.1-pro) | ❌ | **91.0%** | 2.6% | 0.85 | — | — | 3 754 ms |
| `gpt5` (gpt-5.5) | ❌ | 90.7% | 5.2% | 0.85 | — | — | 2 173 ms |
| `claude` (claude-opus-4-7) | ❌ | 87.8% | 5.2% | 0.81 | — | — | 1 550 ms |
| **`v45_phase3`** ⭐ | ✅ | **86.7%** | **0.0%** | 0.78 | **278 MB** | **1.1 GB** | **9 ms** |
| `privacy_filter_ft_v6` | ✅ | 80.9% | 3.9% | 0.72 | ~ 1.4 GB | ~ 6 GB | 54 ms |
| `opf_rs` / privacy_filter family | ✅ | 38 – 80% | 4 – 10% | 0.35 – 0.72 | ~ 1.4 GB | ~ 6 GB | 1 – 120 ms |
| `gliner_pii` | ✅ | 62.6% | 79.2% | 0.44 | ~ 500 MB | ~ 1.5 GB | 104 ms |
| `gcp_dlp` | ❌ | 37.7% | 11.7% | 0.24 | — | — | 84 ms |
| `presidio` | ✅ | 35.4% | 22.1% | 0.20 | ~ 200 MB | ~ 400 MB | 6 ms |
| `regex` | ✅ | 33.9% | 1.3% | 0.57 | < 1 MB | ~ 30 MB | < 1 ms |
`v45_phase3` peak RSS measured via `cargo run --release --example v45_phase3_smoke --features onnx-cpu`. Other adapters' size / RAM from each model's card.
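To make the zero-leak column concrete, here is a minimal sketch of one plausible reading of the metric for the text bench. It assumes that PII is marked as half-open character spans and that a case counts as zero-leak only when every gold span is fully covered by the predicted redactions. These are assumptions for illustration; the authoritative definition is in `THREAT_MODEL.md` and `text/src/score.py`, which may differ. The names `is_covered` and `zero_leak_rate` are hypothetical.

```python
Span = tuple[int, int]  # half-open character offsets [start, end)


def is_covered(gold: Span, predicted: list[Span]) -> bool:
    """True if every character of `gold` lies inside the union of `predicted`."""
    start, end = gold
    cursor = start  # first character not yet known to be redacted
    for p_start, p_end in sorted(predicted):
        if p_start > cursor:
            break  # an unredacted gap begins at `cursor`
        cursor = max(cursor, p_end)
        if cursor >= end:
            return True
    return cursor >= end


def zero_leak_rate(cases: list[tuple[list[Span], list[Span]]]) -> float:
    """Fraction of cases in which no gold span is left even partly exposed.

    Each case is a (gold_spans, predicted_spans) pair.
    """
    if not cases:
        return 0.0
    clean = sum(
        all(is_covered(g, predicted) for g in gold) for gold, predicted in cases
    )
    return clean / len(cases)


example_cases = [
    ([(0, 5)], [(0, 5)]),            # exact match: no leak
    ([(0, 5), (10, 15)], [(0, 5)]),  # second gold span missed: leak
    ([(2, 8)], [(0, 4), (4, 10)]),   # covered by two adjacent spans: no leak
]
rate = zero_leak_rate(example_cases)
```

Under this reading the metric is all-or-nothing per case, which is why an adapter with a respectable macro-F1 can still post a low zero-leak rate: one missed span fails the whole case.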
## Findings
**1. A 278 MB local model matches frontier APIs on text. Cloud DLP products don't.** On the 422-case text bench, Gemini / GPT-5.5 / Claude all score 87.8 – 91.0 % zero-leak. `v45_phase3` scores 86.7 % overall, and about 90 % averaged across compliance frameworks, at 9 ms p50 on CPU. Google Cloud DLP (37.7 %) and Microsoft Presidio (35.4 %) barely beat a hand-rolled regex (33.9 %): they were built for documents, not screen telemetry.
**2. Frontier vision can't draw boxes. A small specialized detector can.** On 190 PII-bearing screenshots at IoU ≥ 0.30, every frontier vision API sits below 5 % zero-leak with CIs that overlap each other and the Tesseract+regex baseline. A locally fine-tuned RF-DETR-Nano (28 M params, ~109 MB, 512×512 input) hits **98.9 %** with a lower CI bound of 96.2 % — decisively separated.
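The IoU ≥ 0.30 criterion above is the standard intersection-over-union test between a predicted box and a gold box. A minimal sketch, assuming axis-aligned boxes in `(x_min, y_min, x_max, y_max)` pixel coordinates; the function names are illustrative, and how the repo pairs predictions with gold boxes is not shown here.

```python
Box = tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def area(box: Box) -> float:
    """Area of an axis-aligned box; degenerate boxes have zero area."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes, in [0, 1]."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    intersection = inter_w * inter_h
    union = area(a) + area(b) - intersection
    return intersection / union if union > 0 else 0.0


def is_match(gold: Box, predicted: Box, threshold: float = 0.30) -> bool:
    """True if the predicted box overlaps the gold box at or above the threshold."""
    return iou(gold, predicted) >= threshold


gold_box = (0.0, 0.0, 10.0, 10.0)
half_shifted = (5.0, 0.0, 15.0, 10.0)    # overlaps half of the gold box
mostly_shifted = (8.0, 0.0, 18.0, 10.0)  # overlaps a fifth of the gold box
```

A 0.30 threshold is fairly lenient: a same-sized box shifted by half its width still matches (IoU of one third), so the low scores reported for frontier vision APIs reflect boxes that are far off rather than slightly misaligned.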
**3. Frontier APIs don't withhold PII when working.** On 25 multi-turn traces, the strongest (GPT-5.5 at 64.0 %, CI 44 – 80 %) still leaks at least one observed PII item in 36 % of traces. Gemini at 20 % leaks in 80 % of traces.
**The pattern:** capability ≠ pixel-grounding ≠ disposition. A model nailing text detection at 91 % can still leak PII 80 % of the time when it observes that PII inside a task. See [`THREAT_MODEL.md`](THREAT_MODEL.md) for what counts as a leak, [`LIMITATIONS.md`](LIMITATIONS.md) for caveats (notably: `rfdetr`'s val is in-distribution with its training — held-out images, same source).
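The confidence intervals quoted in the findings are wide because the samples are small, especially the 25 traces. The README does not state how they are computed. As an illustration only, the sketch below uses the Wilson score interval, a common choice for a proportion on a small sample; the method, the 95 % level (`z = 1.96`), and the function name are all assumptions, so the output may not match the reported bounds exactly.

```python
from math import sqrt


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 is roughly 95 %)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    margin = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - margin, center + margin


# 64.0 % zero-leak on 25 traces corresponds to 16 clean traces.
low, high = wilson_interval(16, 25)
print(f"16/25 -> [{low:.1%}, {high:.1%}]")
```

With only 25 traces the interval spans roughly 35 percentage points, so differences of a few points between adapters on the trace bench should not be read as a ranking.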
## What's in this repo
- **Scoring code** — `text/src/score.py`, `image/src/score.py`, `trace/src/score.py` + `trace/src/replay.py`. Shared framework probe at [`scoring/frameworks.py`](scoring/frameworks.py).
- **Every adapter benchmarked** — Claude, GPT-5.5, Gemini, GCP DLP, Presidio, GLiNER, `privacy_filter` family, RF-DETR, regex baselines.
- **Methodology / threat model / categories / limitations / citation.**
- **Public sample per surface** so any adapter can be run end-to-end without the full corpus:
  - `text/data/sample.jsonl` — 51 cases (incl. PHI / PCI / Art. 9 / multilingual tasters)
  - `image/corpus/sample/` — 30 rendered screenshots + DOM-extracted gold bboxes
  - `trace/data/injected_sample.jsonl` — 5 multi-turn computer-use traces
The **full val sets** and the data pipelines that produce them live in a private companion repo — to keep the leaderboard uncontaminated by training-on-the-bench and to preserve the moat. Researchers running serious evaluations can request access at `louis@screenpi.pe`.
## Run an adapter on the sample
```bash
make install
export ANTHROPIC_API_KEY=... OPENAI_API_KEY=... GOOGLE_API_KEY=...
make bench-text ADAPTER=claude # or: gpt5, gemini, v45_phase3, gcp_dlp, regex, …
make bench-image ADAPTER=rfdetr # or: claude, gpt5, gemini, regex_ocr, …
make bench-trace ADAPTER=claude # or: gpt5, gemini
```
Headline leaderboard numbers are computed on the full private corpus; sample-corpus runs are for adapter validation and onboarding, not re-ranking.
## Cite this
See [CITATION.bib](CITATION.bib).
## Contact
`louis@screenpi.pe`