{"id":50241423,"url":"https://github.com/teamchong/textsift","last_synced_at":"2026-05-26T21:05:30.631Z","repository":{"id":354078062,"uuid":"1222043078","full_name":"teamchong/textsift","owner":"teamchong","description":"Local-first PII detection + redaction running openai/privacy-filter on-device. Same engine in browser (WebGPU), Node native (Metal/Vulkan/Dawn), CLI, pre-commit hook, and GitHub Action.","archived":false,"fork":false,"pushed_at":"2026-04-27T03:57:56.000Z","size":10525,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-04-27T04:06:23.919Z","etag":null,"topics":["dawn","github-action","local-first","metal","openai","pii","pre-commit-hook","privacy","privacy-filter","redaction","sarif","vulkan","wasm","webgpu"],"latest_commit_sha":null,"homepage":"https://teamchong.github.io/textsift/","language":"TypeScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/teamchong.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":"docs/roadmap.md","authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-04-27T01:46:04.000Z","updated_at":"2026-04-27T03:58:00.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/teamchong/textsift","commit_stats":null,"previous_names":["teamchong/textsift"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/teamchong/textsift","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/teamchong
%2Ftextsift","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/teamchong%2Ftextsift/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/teamchong%2Ftextsift/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/teamchong%2Ftextsift/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/teamchong","download_url":"https://codeload.github.com/teamchong/textsift/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/teamchong%2Ftextsift/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33538731,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"ssl_error","status_checked_at":"2026-05-26T15:22:15.568Z","response_time":63,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["dawn","github-action","local-first","metal","openai","pii","pre-commit-hook","privacy","privacy-filter","redaction","sarif","vulkan","wasm","webgpu"],"created_at":"2026-05-26T21:05:29.514Z","updated_at":"2026-05-26T21:05:30.625Z","avatar_url":"https://github.com/teamchong.png","language":"TypeScript","funding_links":[],"categories":[],"sub_categories":[],"readme":"# textsift\n\n\u003e **Personal learning project.** I built this to teach myself WebGPU 
compute shaders, Zig→WASM with SIMD intrinsics, and the o200k-style BPE tokenizer pipeline. The code works and the tests pass, but treat it as such — there's no SLA, no roadmap commitment, no team behind it. PRs and bug reports welcome; \"production support\" is not.\n\nPII detection and redaction that runs [openai/privacy-filter](https://huggingface.co/openai/privacy-filter) on the user's device. Per-platform GPU fast paths (Metal on macOS, Vulkan on Linux, Dawn on Windows, WebGPU in browsers); Zig + SIMD128 WASM as the no-GPU fallback. Apache 2.0.\n\n[**Docs**](https://teamchong.github.io/textsift/) · [**Quickstart**](https://teamchong.github.io/textsift/quickstart/) · [**Playground**](https://teamchong.github.io/textsift/playground/) · [**API**](https://teamchong.github.io/textsift/api/) · [**Architecture deck**](https://teamchong.github.io/textsift/intro.pdf)\n\n\u003e Architecture walkthrough — [open the deck](https://teamchong.github.io/textsift/intro.pdf)\n\n## What this is\n\nOne npm package, two entry points + a CLI:\n\n```sh\nnpm install textsift\n```\n\n```ts\n// Browser / Node-via-WASM — pure WebGPU + WASM, no native binary.\nimport { PrivacyFilter } from \"textsift/browser\";\n\n// Node native — auto-picks the platform's GPU fast path (Metal on macOS,\n// Vulkan on Linux, Dawn on Windows). 
Falls back to WASM if no GPU.\nimport { PrivacyFilter } from \"textsift\";\n```\n\n```sh\n# Same engine as a CLI — no install, no browser, no clipboard dance\necho \"Hi Alice, alice@example.com\" | npx textsift redact\nnpx textsift table customers.csv --header --mode synth \u003e clean.csv\nnpx textsift detect log.txt --jsonl | jq 'select(.label == \"private_email\")'\nTEXTSIFT_OFFLINE=1 npx textsift redact file.txt   # CI: fail if not pre-cached\nnpx textsift download                              # pre-warm in CI\nnpx textsift cache info                            # show cache location + size\n```\n\n```yaml\n# Or as a pre-commit hook — block commits that contain PII\n# .pre-commit-config.yaml\nrepos:\n  - repo: https://github.com/teamchong/textsift\n    rev: v0.1.0\n    hooks:\n      - id: textsift-pii-scan\n```\n\n```yaml\n# Or as a GitHub Action — block PRs that introduce PII; findings\n# show up inline + in the repo's Security tab via SARIF.\n# .github/workflows/pii.yml\n- uses: teamchong/textsift@v0.1.0\n  with:\n    sarif-output: textsift.sarif\n- uses: github/codeql-action/upload-sarif@v3\n  with: { sarif_file: textsift.sarif, category: textsift }\n```\n\nBundlers (Vite/Webpack/esbuild/etc.) resolve `textsift/browser` and never touch the native entry. Node code resolves `textsift` and gets the platform-native binding via `optionalDependencies`.\n\nThe model is OpenAI's; the value here is packaging:\n\n- A native o200k-style BPE tokenizer in pure TypeScript. If you're not already shipping `@huggingface/transformers` for other models, that's a real bundle-size win.\n- Per-platform native GPU backends — hand-written MSL on macOS, hand-written GLSL→SPIR-V on Linux, Tint→D3D12 on Windows — plus WGSL for browser WebGPU. All produce byte-identical span output.\n- A WASM CPU path (Zig + SIMD128) that loads `model_q4f16.onnx` directly. 
The transformers.js / ORT-Web stack can't load this model on CPU because ORT-Web's WASM bundle lacks `MatMulNBits` / `GatherBlockQuantized` — different runtimes (onnxruntime-node, web-llm, etc.) can in principle, but no JS ecosystem alternative ships out-of-the-box.\n- Persistent OPFS caching of the 770 MB model weights in browsers (filesystem cache in Node), configured by default.\n- Streaming overloads of `detect()` and `redact()` — pass an `AsyncIterable\u003cstring\u003e` to abort an LLM stream the moment a credit card / API key appears, render redacted text progressively as it arrives, or front a model gateway (Cloudflare Worker style) that has to forward chunk-by-chunk.\n- Custom rule engine (regex + match-fn) that merges with model spans. Built-in `\"secrets\"` preset covers JWT, GitHub PAT, AWS, Slack, OpenAI/Anthropic/Google/Stripe keys, and PEM private-key headers.\n\n## Use\n\n```ts\nimport { PrivacyFilter } from \"textsift/browser\";\n\nconst filter = await PrivacyFilter.create();\n\nconst result = await filter.redact(\n  \"Hi, my name is John Smith and my email is john@example.com.\",\n);\n// result.redactedText\n//   \"Hi, my name is [private_person] and my email is [private_email].\"\n\n// result.spans\n//   [ { label: \"private_person\", start: 15, end: 25, ... },\n//     { label: \"private_email\",  start: 42, end: 58, ... } ]\n```\n\nDetect-only:\n\n```ts\nconst { spans, containsPii } = await filter.detect(text);\n```\n\nStreaming detect / redact — abort an LLM stream when PII appears, render progressively, or proxy chunk-by-chunk. Same `detect()` / `redact()`, just pass an async source:\n\n```ts\nasync function* llmStream() {\n  for await (const chunk of await openai.chat.completions.create({ stream: true, ... })) {\n    yield chunk.choices[0]?.delta?.content ?? 
\"\";\n  }\n}\n\n// Detect — iterate spans as they become detectable\nconst det = filter.detect(llmStream());\nfor await (const span of det.spanStream) {\n  if (span.label === \"secret\" \u0026\u0026 span.confidence \u003e 0.9) abort();\n}\nconst detFinal = await det.result;\n\n// Redact — pipe redacted text downstream as it becomes safe to emit.\nconst red = filter.redact(llmStream());\nfor await (const piece of red.textStream) {\n  await downstreamWriter.write(piece);\n}\nconst redFinal = await red.result;\n```\n\nBuilt-in secrets preset:\n\n```ts\nconst filter = await PrivacyFilter.create({ presets: [\"secrets\"] });\n// Detects JWT, GitHub PAT, AWS access keys, Slack tokens + webhooks,\n// OpenAI/Anthropic/Google API keys, Stripe keys + webhook secrets,\n// npm tokens, PEM private-key headers. All severity \"block\".\n```\n\nFaker mode — emit realistic fakes instead of `[private_email]` markers (so downstream validators / templates / pipelines still see PII-shaped data):\n\n```ts\nimport { PrivacyFilter, markerPresets } from \"textsift\";\n\nconst filter = await PrivacyFilter.create({ markers: markerPresets.faker() });\nawait filter.redact(\"Hi Alice, email alice@example.com, phone +1-555-0123\");\n// → \"Hi Alice Anderson, email alice.anderson@example.com, phone +1-555-0100\"\n//   Same input text → same fake within the filter's lifetime\n//   (so \"Alice\" appearing twice yields \"Alice Anderson\" both times)\n```\n\nTabular data — classify which CSV / DB columns contain PII, or redact a whole table in one call:\n\n```ts\nconst rows = [\n  [\"id\", \"name\",         \"email\",             \"amount\"],\n  [\"1\",  \"Alice Carter\", \"alice@example.com\", \"100\"],\n  [\"2\",  \"Bob Davis\",    \"bob@example.com\",   \"250\"],\n];\n\n// Audit: which columns have PII?\nconst cols = await filter.classifyColumns(rows, { headerRow: true });\n// → [{ index:0, label:null }, { index:1, label:\"private_person\", confidence:1 },\n//    { index:2, 
label:\"private_email\", confidence:1 }, { index:3, label:null }]\n\n// Pipeline: redact in one of three modes\nconst safe = await filter.redactTable(rows, {\n  headerRow: true,\n  mode: \"synth\",   // \"redact\" | \"synth\" | \"drop_column\"\n});\n// mode \"synth\" gives you Tonic.ai-style fake-but-realistic output;\n// \"drop_column\" omits PII columns entirely; \"redact\" uses [label] markers.\n```\n\nBatch inputs, custom markers, per-category enabling — see the [API reference](https://teamchong.github.io/textsift/api/).\n\n## Measured numbers\n\nPer-forward latency, median of 5–10 runs, synthetic-weight bench at production model dimensions.\n\n**Browser (M3 Pro, Chromium 147):**\n\n| Input length | textsift (WebGPU) | textsift (WASM MT) | tjs (WebGPU) |\n|---|---:|---:|---:|\n| ~7 tokens | **8.9 ms** | 29.0 ms | 32.7 ms |\n| ~25 tokens | **11.8 ms** | 44.6 ms | 38.5 ms |\n| ~80 tokens | **22.0 ms** | 95.9 ms | 56.4 ms |\n\ntextsift WebGPU is 2.6–3.7× faster than transformers.js across every input length.\n\n**Node native — macOS (M2 Pro, Metal-direct):**\n\n| T   | textsift native | tjs CPU equivalent |\n|----:|----------------:|-------------------:|\n|  7  | **5.2 ms**      | ~30 ms             |\n| 32  | **10.8 ms**     | ~40 ms             |\n| 80  | **23.8 ms**     | ~95 ms             |\n\nHand-written MSL beats Tint's WGSL→MSL codegen by ~1.9× on the same hardware.\n\n**Node native — Linux (Intel Iris Xe, Vulkan-direct):**\n\n| T   | textsift native | ONNX Runtime Node CPU |\n|----:|----------------:|----------------------:|\n| 32  | **28 ms**       | ~800 ms (**28×** slower) |\n\nThe Linux story is the real differentiator: GPU-accelerated PII detection on Intel iGPU / AMD APU / non-NVIDIA hardware **without CUDA, without ROCm, without driver dance**. `npm install textsift` ships a vendored Vulkan-direct binary that talks to whatever Mesa-supported GPU is there.\n\n**Cold start:** we don't claim a speedup over transformers.js. 
See [benchmarks](https://teamchong.github.io/textsift/benchmarks/) for the rationale; the OPFS-vs-Cache-API gap is a storage choice, not an inference-engine one.\n\nThese numbers will look different on your hardware.\n\n## Repo layout (npm workspaces monorepo)\n\n```\npackages/\n  textsift/\n    src/\n      browser/         ← public API, viterbi, chunking, redaction, native BPE tokenizer\n      zig/             ← Zig kernels → WASM\n      c/               ← FMA shim for relaxed_simd\n      native/          ← Node-native backends (Metal / Vulkan / Dawn) + NAPI bindings\n        metal/         ← Mac: Obj-C bridge + hand-written MSL kernels\n        vulkan/        ← Linux: C bridge + hand-written GLSL → SPIR-V kernels\n        dawn/          ← Windows: Dawn C++ via Tint\n        shaders/       ← canonical WGSL kernels (single source of truth)\n      index.ts         ← Node native entry (auto-picks platform GPU + WASM fallback)\n    scripts/           ← inline-wasm.mjs, build-native.sh, serve-coi.py, etc.\ndocs-site/             ← Astro + Starlight docs site\ntests/browser/         ← Playwright tests\ntests/native/          ← Node native conformance + bench + integration tests\n.github/workflows/     ← test / release / bench across linux/darwin/windows\n```\n\n## Development\n\n```sh\nnpm install                # workspace bootstrap\nnpm run build              # zig → wasm, bundle, .d.ts\nnpm run typecheck          # strict, noUncheckedIndexedAccess on\nnpm run test               # all playwright tests\n```\n\n## Caveats\n\n`openai/privacy-filter` is a detection aid, not an anonymization guarantee. English-first (Japanese ~88% F1, other languages untested). 
Short text under-contextualizes.\n\nRead the [caveats page](https://teamchong.github.io/textsift/caveats/) and OpenAI's [model card](https://cdn.openai.com/pdf/c66281ed-b638-456a-8ce1-97e9f5264a90/OpenAI-Privacy-Filter-Model-Card.pdf) before treating output as compliance-safe.\n\n## License\n\nApache 2.0, matching the upstream model.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fteamchong%2Ftextsift","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fteamchong%2Ftextsift","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fteamchong%2Ftextsift/lists"}