{"id":50318646,"url":"https://github.com/suarezpm/apohara-synthex","last_synced_at":"2026-05-29T02:01:08.116Z","repository":{"id":360758285,"uuid":"1251580524","full_name":"SuarezPM/apohara-synthex","owner":"SuarezPM","description":"The evidence layer that lives inside Bright Data — scrape → classify → sign verifiable web intelligence. MCP companion to brightdata-mcp. Web Data UNLOCKED hackathon.","archived":false,"fork":false,"pushed_at":"2026-05-27T18:32:57.000Z","size":85,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-05-27T20:08:53.249Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"JavaScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/SuarezPM.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-05-27T18:06:55.000Z","updated_at":"2026-05-27T18:33:01.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/SuarezPM/apohara-synthex","commit_stats":null,"previous_names":["suarezpm/apohara-synthex"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/SuarezPM/apohara-synthex","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SuarezPM%2Fapohara-synthex","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SuarezPM%2Fapohara-synthex/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/
SuarezPM%2Fapohara-synthex/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SuarezPM%2Fapohara-synthex/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/SuarezPM","download_url":"https://codeload.github.com/SuarezPM/apohara-synthex/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SuarezPM%2Fapohara-synthex/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33633468,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-05-29T02:00:06.066Z","response_time":107,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2026-05-29T02:01:07.047Z","updated_at":"2026-05-29T02:01:08.097Z","avatar_url":"https://github.com/SuarezPM.png","language":"JavaScript","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cdiv align=\"center\"\u003e\n\n# ◆ Synthex\n\n### The evidence layer that lives inside Bright Data\n\n**Scrape it · Classify it · Prove it.**\nTurn the web your AI agents touch into classified intelligence, sealed with court-grade, verifiable evidence.\n\n![License](https://img.shields.io/badge/license-MIT-blue)\n![Tests](https://img.shields.io/badge/tests-suite%20green-brightgreen)\n![Defense 
Layers](https://img.shields.io/badge/pre--LLM_defense-28%20web--injection%20%2B%2078%20prompt--level-orange)\n![Token Saver](https://img.shields.io/badge/tokens--saved-deterministic%20pre--LLM%20block%20%2B%20dedup-9775fa)\n![Node](https://img.shields.io/badge/node-%E2%89%A520-339933?logo=node.js\u0026logoColor=white)\n![Runtime](https://img.shields.io/badge/100%25-JavaScript-f7df1e?logo=javascript\u0026logoColor=000)\n![MCP](https://img.shields.io/badge/MCP-companion-7c3aed)\n![Substrate](https://img.shields.io/badge/substrate-Bright%20Data-ff6b35)\n\n### ▶ [Live demo: synthex.apohara.dev](https://synthex.apohara.dev)\n\n📄 **[See a real Evidence Report → sample PDF](samples/synthex-evidence-report.pdf)** — 6 pages, sealed with HMAC-SHA256 + RFC 3161 (DigiCert), generated by the real pipeline ([regenerate](scripts/gen-sample-report.mjs)).\n\n[Live demo](https://synthex.apohara.dev) · [Sample report](samples/synthex-evidence-report.pdf) · [Quickstart](#-quickstart) · [Verify in 60s](#-verify-it-yourself-60-seconds) · [Architecture](#-architecture) · [Honesty](#-honesty)\n\n\u003csub\u003eWeb Data UNLOCKED Hackathon · Bright Data × lablab.ai · MIT\u003c/sub\u003e\n\n\u003c/div\u003e\n\n---\n\n\u003e **Your AI agents are scraping the live web right now.**\n\u003e Do you know what they found, what they classified, and what you can *prove*?\n\nSynthex is a **100% JavaScript MCP server** that wraps [`brightdata-mcp`](https://github.com/brightdata/brightdata-mcp) and turns raw web scraping into a defensible intelligence pipeline:\n**scrape → dedup \u0026 screen → classify (GTM · Finance · Security · Supply-chain) → remember → seal as verifiable evidence → react.**\n\n**For** AI Operations \u0026 Security teams running agents with web access that must account for *what those agents found and decided* — under EU AI Act / DORA.\n\n**The moat:** SIEMs and agent-observability tools watch the agent's *infrastructure*. 
Synthex sees — and **cryptographically signs** — the web *content* the agent touched. The signed Evidence Report is something no competitor ships.\n\n---\n\n## ◆ Architecture\n\n```\n Triggerware ─(react)─┐                                       ┌─(act)─► alert + webhook\n                      ▼                                       │\n   FETCH ─────► FORGE ──────────► CLASSIFY ─────► PROVE ─────► OBSERVE ─────► MEMORY\n   Bright Data  SHA-256 dedup +   AI/ML API       HMAC-SHA256  OpenTelemetry  Cognee (graph)\n   (5 APIs)     78-rule DJL +     (frontier LLM)  + RFC 3161   GenAI spans    + local store\n                28-rule prefilter   4 lenses ‖     TSA + PDF    (OTLP opt-in)  (opt-in / CLI)\n```\n\n| Stage | What it does |\n|------|--------------|\n| **FETCH** | Routes each target to the right Bright Data surface: **Web Unlocker** (MCP stdio + REST), **SERP API** (zone `serp_api1`), **Browser API** (Playwright `connectOverCDP`), **Web Scraper / Datasets API** (`datasets/v3/scrape`), and **Crawl API**. *No Bright Data, no data.* |\n| **FORGE** | SHA-256 dedup + **two-layer deterministic pre-LLM defense (106 rules)**. Layer 1 `prefilter.js` (28 rules): SSRF, prototype-pollution, MCP tool poisoning, indirect prompt-injection, BrowseSafe / VPI-Bench text vectors. Layer 2 `djl.js` (78 rules): prompt-injection, harm/PII bilingual EN+ES, jailbreak, SQLi/XSS, exfiltration, tool misuse, sector policy (HIPAA/PCI/EO-13526). Audit trail per-stage emitted in payload v2 `decisions[]` with policy_bundle sha. |\n| **CLASSIFY** | A frontier model via **AI/ML API** extracts structured signals under one lens — or all **four lenses in parallel** (`lens=\"all\"` → GTM + Finance + Security + Supply-chain). |\n| **PROVE** | Every report sealed with HMAC-SHA256 **and** an RFC 3161 timestamp from **DigiCert** — exportable as a **6-page downloadable PDF Evidence Report** (4-buyer framing: CISO · CFO · General Counsel · Broker) with a Synthex Risk Score 0–100. 
|\n| **OBSERVE** | Every stage emits OpenTelemetry GenAI spans (`gen_ai.client.operation.duration`, token usage, blocked count). OTLP export is opt-in; latencies stream into the UI over SSE. |\n| **MEMORY** | Local store for deltas + **Cognee** (OSS knowledge graph) — default in the local/CLI path, off on the public endpoint to control cost. |\n| **WATCH / REACT** | Always-on loop: detect change → run the pipeline → alert. No human in the loop. |\n\n---\n\n## ◆ Quickstart\n\n```bash\nnpm install\n\n# credentials live OUTSIDE the repo (never committed):\nexport BRIGHT_DATA_TOKEN=...    # Bright Data (promo: unlocked)\nexport AIML_API_KEY=...         # AI/ML API\nexport TRIGGERWARE_API_KEY=...  # Triggerware\n\nnpm test        # unit suite (network tests are opt-in)\nnpm run demo    # end-to-end Evidence Report + LIVE DigiCert seal\nSYNTHEX_TRACE=console npm run demo   # same, with per-stage OTel latencies printed\nnode server.js  # run as an MCP server (companion to brightdata-mcp)\n```\n\n**Web UI / Vercel:** `public/` + `api/` deploy as a static site + serverless functions\n(`vercel deploy`). The deployed `/api/analyze` runs the **full live pipeline** via the Bright\nData REST API; `/api/stream` pushes per-stage progress to the UI over **SSE** (cinematic\nstage view). Set `BRIGHT_DATA_TOKEN`, `WEB_UNLOCKER_ZONE`, `AIML_API_KEY`, `SYNTHEX_HMAC_KEY`\nin the project env (without them it falls back to a labeled cached demo). 
The public endpoint is\nguarded (SSRF block + per-IP rate-limit); Cognee memory stays off there to control cost.\n→ **[synthex.apohara.dev](https://synthex.apohara.dev)** (live · deployed on Vercel,\nalso reachable at `apohara-synthex.vercel.app`).\n\n---\n\n## ◆ Verify it yourself (60 seconds)\n\nDon't trust the claims — run them.\n\n```bash\nnpm test                                   # → full suite green (zero failing, opt-in live tests skipped)\nnpm run demo                               # → Evidence Report; verify → hash OK · HMAC OK · TSA OK\nnpm run bench:djl                          # → logs/djl-latency.json (p95\u003c5ms, p99 adv\u003c50ms)\nnode bin/decode-evidence.js \u003cevidence.json\u003e  # offline audit-trail inspector (verifies HMAC + TSA, prints decisions[])\n\n# Real, live, end-to-end (needs BRIGHT_DATA_TOKEN + AIML_API_KEY):\nnode scripts/check-pipeline-live.mjs \"https://en.wikipedia.org/wiki/Bright_Data\" all   # 4 lenses in parallel\n```\n\nOpt-in live checks (gated by env flags so the suite never fakes a pass): `AIML_LIVE=1` · `TRIGGERWARE_LIVE=1` · `COGNEE_LIVE=1`.\n\n---\n\n## ◆ Partners — each verified against the real service\n\n| Partner | Role in Synthex | Verified |\n|---|---|:--:|\n| **Bright Data — Web Unlocker** | FETCH (MCP stdio + REST) | ✅ live |\n| **Bright Data — SERP API** | FETCH (structured JSON, zone `serp_api1`) | ✅ live |\n| **Bright Data — Browser API** | FETCH (Playwright `connectOverCDP`, JS-heavy) | ✅ live (local/flag) |\n| **Bright Data — Web Scraper / Datasets** | FETCH (`datasets/v3/scrape`) | ✅ live |\n| **Bright Data — Crawl** | FETCH (multi-page via Web Unlocker) | ✅ live · native Crawl API opt-in |\n| **Bright Data — MCP** | FETCH substrate (`server.js` companion) | ✅ live |\n| **AI/ML API** | CLASSIFY brain (frontier model, extraction) | ✅ live classification |\n| **Cognee** | MEMORY knowledge graph (OSS, via its MCP) | ✅ tools `remember`/`recall` confirmed |\n| **Triggerware** | REACT (poll deltas → fire 
pipeline) | ✅ live API (`GET /triggers` 200) |\n\n**All 6 Bright Data surfaces verified LIVE with real code.** Crawl is a multi-page crawl over Web Unlocker; the native Crawl API stays opt-in via a Crawl `dataset_id`.\n\n---\n\n## ◆ Market \u0026 business\n\nSynthex doesn't claim a single tidy TAM — it sits at the **intersection** of three real markets, each sized by a named firm with very different scopes. We address a **wedge** of that intersection: a verifiable evidence + screening layer for the web content autonomous agents ingest, for teams accountable under EU AI Act / DORA. It is *not* the whole AI-agents market.\n\n| Adjacent market | Size \u0026 horizon | Source |\n|---|---|---|\n| AI agents | **$52.6B by 2030** → $231.9B by 2034 (CAGR 46.3%) | [MarketsandMarkets](https://www.marketsandmarkets.com/PressReleases/ai-agents.asp) · [Dimension Market Research](https://dimensionmarketresearch.com/report/ai-agents-market/) |\n| AI-driven web scraping | **$46.1B by 2035** (CAGR 19.9%) | [Market Research Future](https://www.marketresearchfuture.com/reports/ai-driven-web-scraping-market-24744) |\n| AI in observability | **$10.7B by 2033** (CAGR 22.5%) | [Market.us](https://market.us/report/ai-in-observability-market/) |\n\n\u003e Forecasts across firms differ widely because they define scope differently — we cite the firm and horizon for each rather than collapse them into one headline number. Synthex's serviceable slice is a subset of all three.\n\n### Pricing — *proposed* (not yet live revenue)\n\nEvery tier below is a **proposed** go-to-market model. 
Synthex has **no paying customers and no revenue today**; these are pricing hypotheses, not reported figures.\n\n| Tier | Proposed price | For |\n|---|---|---|\n| **OSS** | Free (MIT) | the full pipeline, self-hosted — what's in this repo |\n| **Pro** | ~$99/mo *(proposed)* | hosted endpoint, higher rate limits, retained Evidence Reports |\n| **Enterprise** | $2,500+/mo *(proposed)* | SSO, audit retention, on-prem TSA, EU AI Act / DORA evidence workflows |\n\n### Why us — the signed Evidence Report\n\n| Category | What they watch | What they can't ship |\n|---|---|---|\n| SIEM / log tools | the agent's *infrastructure* | proof of the web *content* the agent saw |\n| Agent-observability | traces, tokens, latency | a cryptographically sealed, court-grade report |\n| Scraping APIs | raw bytes | classification + screening + RFC 3161 evidence |\n| **Synthex** | the web content itself | — *the signed Evidence Report is the moat* |\n\n---\n\n## ◆ v0.6.0 — Watch \u0026 Prove\n\nThe chain-of-custody release. Every re-scrape of the same target now\nchains `previous_tsa_serial → current_tsa_serial`, with a normalized\ncontent diff and a 7th PDF page when a delta is present.\n\n- **`src/delta/`** — Delta Engine: `normalize → hash → diff → sealDeltaChain`.\n  35 unit + 1 integration tests.\n- **`HMAC_EXCLUDED_KEYS`** (`src/prove/evidence-report.js`) — cross-run\n  determinism: `kg_status`, `kg_latency_ms`, `surface_status` are normative\n  metadata, excluded from the HMAC bytestring so the chain never reports\n  a phantom change.\n- **`src/forge/pii-filter.js`** — 25-rule PII bundle (10 DJL-PII reused +\n  15 secrets-leak: AWS / GitHub / Stripe / JWT / etc.) 
gating Cognee ingest.\n- **Model tier selector** (`src/classify/tiers.js`) — `free` / `oss` / `paid`.\n  FREE labeled `free-low-quality` per `docs/v060-calibration.md` (50 % of\n  fixtures had Δseverity \u003e 1.5 vs DeepSeek baseline).\n- **Try it inline** in the `#live` section of [synthex.apohara.dev](https://synthex.apohara.dev) —\n  paste a URL, pick a lens + model tier (OSS / PAID / FREE), watch the 4\n  stages execute in real time against Bright Data, download the 6-7 page\n  signed PDF. No separate playground page needed — everything lives in\n  one REEF-style scroll.\n- **Live stress run** (2026-05-28): 500 URLs · 99.6 % success ·\n  $0.75 cost · 9.1 min wall clock — see `docs/v060-stress-report.md`.\n- **DigiCert TSA RTT baseline**: p95 385 ms — see `logs/digicert-rtt-baseline.json`.\n- **`docs/PRIOR_ART.md`** — reproducible directed-search queries proving\n  the \"no open-source combination of [scrape + diff + HMAC + RFC 3161 + KG]\n  found at 2026-05-28\" claim.\n- **`.kiro/specs/delta-engine.md`** — Kiro-native MCP spec for the Kiro\n  Challenge integration.\n\n---\n\n## ◆ Honesty\n\nThe pitch *is* honesty — so it applies to us too. **Canonical caveats live in [`docs/HONESTY.md`](docs/HONESTY.md)** — RFC-3161 verification scope (v0.7.0 M1), rate-limit posture, PII-gate placement, durability choices, and Risk-Score semantics. 
This section is the short list; the doc is the long list.\n\n- **Proven live:** Bright Data — Web Unlocker (MCP **and** REST), SERP API, Browser API, Web Scraper / Datasets API, native MCP server (`server.js`) · AI/ML classification (single + 4-lens parallel) · DigiCert RFC 3161 timestamp · downloadable 6-page PDF · Vercel deploy (`/api/analyze` live, end-to-end) · Triggerware API · Cognee MCP tools.\n- **Crawl is multi-page over Web Unlocker, not the native Crawl product.** All 6 Bright Data surfaces are verified live; we name the crawl honestly — the native Crawl API stays opt-in via a Crawl `dataset_id`.\n- **Risk Score is an internal estimate:** the PDF's Synthex Risk Score (0–100) is a deterministic heuristic computed from the report's own data, with the formula printed on the page. It is **NOT** a Munich Re rating or any third-party underwriting score.\n- **Opt-in (cost/credentials):** Cognee memory is default in the local/CLI path but **off** on the public endpoint; its `remember` ingest uses an LLM → behind `COGNEE_LIVE`. OTel OTLP export only runs if `OTEL_EXPORTER_OTLP_ENDPOINT` is set (otherwise spans are no-op / console-only). Network tests are env-gated so the suite never fabricates a pass.\n- **Two-layer defense scope:** Synthex runs **28 web-injection rules** (`src/forge/prefilter.js` — SSRF, prototype-pollution, MCP tool poisoning, indirect prompt-injection, BrowseSafe / VPI-Bench text vectors) **plus 78 prompt-level rules** (`src/forge/djl.js` — jailbreak, harm/PII bilingual EN+ES, SQLi/XSS, exfiltration, tool misuse, sector policy). Both layers are **deterministic regex heuristics** — *aligned with* the SkillFortify benchmark (arXiv 2603.00195), not a formal guarantee. They do **not** stop *visual* prompt injection (VPI in rendered screenshots/images) — a different threat model.\n- **Coverage on curated fixtures:** `test/djl.test.js` validates **78/78 fixtures pass identically** (78 positive + 78 negative = 156 assertions). 
This is measured coverage on curated examples, NOT a formal guarantee against every adversarial input. Divergences would land in [`docs/djl-parity-divergence.md`](docs/djl-parity-divergence.md) (currently empty).\n- **Effective coverage on synthetic corpus** (SC-11, `node scripts/measure-coverage.mjs`): on the 156 internal fixtures (78 positive + 78 negative — *synthetic*, not real Bright Data scraping), DJL fires on **50.0% of docs** and **100% of the 78 rules** fired at least once; prefilter fires on **9.0% of docs** (natural overlap with DJL on SQLi/XSS/exf vectors) and **39.3% of the 28 rules**. On real Bright Data corpus the split will differ — prefilter higher (HTML scraping), DJL lower (rules designed for prompts, not docs). Re-run with `node scripts/measure-coverage.mjs \u003cpath-to-docs.json\u003e` on your own corpus.\n- **HMAC canonicalization (schema_v2):** since v4 the HMAC sealing uses `canonicalize()` (JCS-like, `src/prove/canonicalize.js`) so the order of payload keys is irrelevant — an identical v2 payload produces the same HMAC no matter how it was built. The verifier auto-detects `schema_version` and verifies both v1 (legacy, JSON.stringify) and v2 (canonicalize) without flags. Global flag `EVIDENCE_SCHEMA_V2=0` forces sealer legacy (rollback demo only).\n- **Tokens saved (estimated, in the sealed payload):** Synthex emits `tokens_saved: {dedup_bytes, blocked_bytes, total_bytes, estimated_tokens, chars_per_token, note}` inside the v2 payload. The estimate uses **4 chars/token** as a conservative approximation — actual depends on the tokenizer (GPT-4 `cl100k_base` ~4.2, Claude ~3.8, multilingual CJK worse). The 3 mechanisms that contribute: (1) SHA-256 dedupe drops N-1 copies of identical content, (2) the 78-rule DJL blocks prompt-level attacks before classify, (3) the 28-rule prefilter blocks web-injection on top. 
Verify on any sealed report with `node bin/decode-evidence.js \u003cevidence.json\u003e` — the \"tokens saved\" line is right there in the summary.\n- **Endpoint guard is best-effort:** the public rate-limit is in-memory per warm instance (a hard multi-instance limit would need Vercel KV). The SSRF block filters the hostname (literal + obfuscated/IPv6 private ranges) but does **not** resolve DNS, so a public domain pointing at a private IP (DNS rebinding) would pass — low risk here because the scrape runs on Bright Data's *remote* proxy, not the function's network.\n- **Research grounding (cited, not implemented):** the parallel multi-lens design is grounded in **KVCOMM** (NeurIPS 2025); KV-cache memory is a stated future direction per **MemArt** (ICLR 2026). These are foundations we cite — not features Synthex ships.\n- **Prior art, not pipeline:** the **INV-15** invariant ([Context_Forge paper](https://doi.org/10.5281/zenodo.20277875)) ships as a module and is cited as prior art — it is *not* part of this scraping pipeline.\n- **Not claimed:** Synthex doesn't bypass any site's ToS — it uses Bright Data's compliant infrastructure. The timestamp proves *when* evidence existed, not the truth of its content.\n\n---\n\n\u003cdiv align=\"center\"\u003e\n\n**We didn't just use Bright Data — we improved it.**\nUpstream contribution: [`brightdata-mcp` PR #140](https://github.com/brightdata/brightdata-mcp/pull/140) (dedup + field filtering). See [`docs/CONTRIBUTION.md`](docs/CONTRIBUTION.md).\n\nMIT © 2026 Pablo M. Suárez · [Apohara]\n\n\u003c/div\u003e\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsuarezpm%2Fapohara-synthex","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsuarezpm%2Fapohara-synthex","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsuarezpm%2Fapohara-synthex/lists"}