https://github.com/suarezpm/apohara-synthex
The evidence layer that lives inside Bright Data — scrape → classify → sign verifiable web intelligence. MCP companion to brightdata-mcp. Web Data UNLOCKED hackathon.
- Host: GitHub
- URL: https://github.com/suarezpm/apohara-synthex
- Owner: SuarezPM
- License: mit
- Created: 2026-05-27T18:06:55.000Z (21 days ago)
- Default Branch: main
- Last Pushed: 2026-05-27T18:32:57.000Z (21 days ago)
- Last Synced: 2026-05-27T20:08:53.249Z (21 days ago)
- Language: JavaScript
- Size: 83 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# ◆ Synthex
### The evidence layer that lives inside Bright Data
**Scrape it · Classify it · Prove it.**
Turn the web your AI agents touch into classified intelligence, sealed with court-grade, verifiable evidence.








### ▶ [Live demo: synthex.apohara.dev](https://synthex.apohara.dev)
📄 **[See a real Evidence Report → sample PDF](samples/synthex-evidence-report.pdf)** — 6 pages, sealed with HMAC-SHA256 + RFC 3161 (DigiCert), generated by the real pipeline ([regenerate](scripts/gen-sample-report.mjs)).
[Live demo](https://synthex.apohara.dev) · [Sample report](samples/synthex-evidence-report.pdf) · [Quickstart](#-quickstart) · [Verify in 60s](#-verify-it-yourself-60-seconds) · [Architecture](#-architecture) · [Honesty](#-honesty)
Web Data UNLOCKED Hackathon · Bright Data × lablab.ai · MIT
---
> **Your AI agents are scraping the live web right now.**
> Do you know what they found, what they classified, and what you can *prove*?
Synthex is a **100% JavaScript MCP server** that wraps [`brightdata-mcp`](https://github.com/brightdata/brightdata-mcp) and turns raw web scraping into a defensible intelligence pipeline:
**scrape → dedup & screen → classify (GTM · Finance · Security · Supply-chain) → remember → seal as verifiable evidence → react.**
**For** AI Operations & Security teams running agents with web access that must account for *what those agents found and decided* — under EU AI Act / DORA.
**The moat:** SIEMs and agent-observability tools watch the agent's *infrastructure*. Synthex sees — and **cryptographically signs** — the web *content* the agent touched. The signed Evidence Report is something no competitor ships.
---
## ◆ Architecture
```
Triggerware ─(react)─┐ ┌─(act)─► alert + webhook
                     ▼ │
FETCH ─────► FORGE ──────────► CLASSIFY ─────► PROVE ─────► OBSERVE ─────► MEMORY
Bright Data  SHA-256 dedup +   AI/ML API       HMAC-SHA256  OpenTelemetry  Cognee (graph)
(5 APIs)     78-rule DJL +     (frontier LLM)  + RFC 3161   GenAI spans    + local store
             28-rule prefilter 4 lenses ‖      TSA + PDF    (OTLP opt-in)  (opt-in / CLI)
```
| Stage | What it does |
|------|--------------|
| **FETCH** | Routes each target to the right Bright Data surface: **Web Unlocker** (MCP stdio + REST), **SERP API** (zone `serp_api1`), **Browser API** (Playwright `connectOverCDP`), **Web Scraper / Datasets API** (`datasets/v3/scrape`), and **Crawl API**. *No Bright Data, no data.* |
| **FORGE** | SHA-256 dedup + **two-layer deterministic pre-LLM defense (106 rules)**. Layer 1 `prefilter.js` (28 rules): SSRF, prototype-pollution, MCP tool poisoning, indirect prompt-injection, BrowseSafe / VPI-Bench text vectors. Layer 2 `djl.js` (78 rules): prompt-injection, harm/PII bilingual EN+ES, jailbreak, SQLi/XSS, exfiltration, tool misuse, sector policy (HIPAA/PCI/EO-13526). Audit trail per-stage emitted in payload v2 `decisions[]` with policy_bundle sha. |
| **CLASSIFY** | A frontier model via **AI/ML API** extracts structured signals under one lens — or all **four lenses in parallel** (`lens="all"` → GTM + Finance + Security + Supply-chain). |
| **PROVE** | Every report sealed with HMAC-SHA256 **and** an RFC 3161 timestamp from **DigiCert** — exportable as a **6-page downloadable PDF Evidence Report** (4-buyer framing: CISO · CFO · General Counsel · Broker) with a Synthex Risk Score 0–100. |
| **OBSERVE** | Every stage emits OpenTelemetry GenAI spans (`gen_ai.client.operation.duration`, token usage, blocked count). OTLP export is opt-in; latencies stream into the UI over SSE. |
| **MEMORY** | Local store for deltas + **Cognee** (OSS knowledge graph) — default in the local/CLI path, off on the public endpoint to control cost. |
| **WATCH / REACT** | Always-on loop: detect change → run the pipeline → alert. No human in the loop. |
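
The FORGE row above can be sketched in a few lines. This is a minimal illustration, not the code in `prefilter.js` or `djl.js`: the two rules, the `forge()` function, and the shape of each `decisions` entry are assumptions made for the example.

```javascript
const { createHash } = require('node:crypto');

// Hypothetical rules for illustration only. The real bundles (28 + 78 rules)
// are not reproduced here.
const RULES = [
  { id: 'demo-prompt-injection', pattern: /ignore (all )?previous instructions/i },
  { id: 'demo-script-tag', pattern: /<script\b/i },
];

const sha256 = (text) => createHash('sha256').update(text, 'utf8').digest('hex');

// Drops exact duplicates by content hash, then screens the survivors against
// the rules. Returns the documents that may proceed plus an audit trail.
function forge(docs, rules = RULES) {
  const seen = new Set();
  const passed = [];
  const decisions = [];
  for (const doc of docs) {
    const hash = sha256(doc.text);
    if (seen.has(hash)) {
      decisions.push({ url: doc.url, hash, action: 'dedup' });
      continue;
    }
    seen.add(hash);
    const hit = rules.find((rule) => rule.pattern.test(doc.text));
    if (hit) {
      decisions.push({ url: doc.url, hash, action: 'block', rule: hit.id });
      continue;
    }
    decisions.push({ url: doc.url, hash, action: 'pass' });
    passed.push(doc);
  }
  return { passed, decisions };
}
```

Because both steps are deterministic, the same input always yields the same `decisions` trail, which is what makes the trail worth sealing.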
---
## ◆ Quickstart
```bash
npm install
# credentials live OUTSIDE the repo (never committed):
export BRIGHT_DATA_TOKEN=... # Bright Data (promo: unlocked)
export AIML_API_KEY=... # AI/ML API
export TRIGGERWARE_API_KEY=... # Triggerware
npm test # unit suite (network tests are opt-in)
npm run demo # end-to-end Evidence Report + LIVE DigiCert seal
SYNTHEX_TRACE=console npm run demo # same, with per-stage OTel latencies printed
node server.js # run as an MCP server (companion to brightdata-mcp)
```
**Web UI / Vercel:** `public/` + `api/` deploy as a static site + serverless functions
(`vercel deploy`). The deployed `/api/analyze` runs the **full live pipeline** via the Bright
Data REST API; `/api/stream` pushes per-stage progress to the UI over **SSE** (cinematic
stage view). Set `BRIGHT_DATA_TOKEN`, `WEB_UNLOCKER_ZONE`, `AIML_API_KEY`, `SYNTHEX_HMAC_KEY`
in the project env (without them it falls back to a labeled cached demo). The public endpoint is
guarded (SSRF block + per-IP rate-limit); Cognee memory stays off there to control cost.
→ **[synthex.apohara.dev](https://synthex.apohara.dev)** (live · deployed on Vercel,
also reachable at `apohara-synthex.vercel.app`).
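
A hostname-level SSRF block like the one guarding the public endpoint could look roughly like this. It is a sketch under assumptions: `isBlockedTarget()` and its helpers are hypothetical names, and the range list is illustrative rather than the project's actual filter. It leans on the WHATWG `URL` parser, which normalizes obfuscated IPv4 forms such as `0x7f.0.0.1`.

```javascript
const net = require('node:net');

// Loopback, private, link-local, and "this network" IPv4 ranges.
function isPrivateIPv4(ip) {
  const [a, b] = ip.split('.').map(Number);
  return (
    a === 0 || a === 10 || a === 127 ||
    (a === 172 && b >= 16 && b <= 31) ||
    (a === 192 && b === 168) ||
    (a === 169 && b === 254)
  );
}

// Loopback, unspecified, unique-local (fc00::/7), link-local (fe80::/10),
// and IPv4-mapped addresses (blocked conservatively).
function isPrivateIPv6(ip) {
  const lower = ip.toLowerCase();
  return (
    lower === '::1' || lower === '::' ||
    lower.startsWith('::ffff:') ||
    /^f[cd][0-9a-f]{2}:/.test(lower) ||
    /^fe[89ab][0-9a-f]:/.test(lower)
  );
}

// Returns true when the URL must not be fetched. Only the literal host is
// inspected; DNS is NOT resolved, so rebinding is out of scope here.
function isBlockedTarget(rawUrl) {
  let url;
  try {
    url = new URL(rawUrl);
  } catch {
    return true; // unparseable input is rejected
  }
  if (url.protocol !== 'http:' && url.protocol !== 'https:') return true;
  const host = url.hostname.replace(/^\[|\]$/g, ''); // strip IPv6 brackets
  if (host === 'localhost' || host.endsWith('.localhost')) return true;
  if (net.isIPv4(host)) return isPrivateIPv4(host);
  if (net.isIPv6(host)) return isPrivateIPv6(host);
  return false;
}
```

A check of this kind runs before any fetch is issued, so a rejected URL never reaches the scraping layer.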
---
## ◆ Verify it yourself (60 seconds)
Don't trust the claims — run them.
```bash
npm test # → full suite green (zero failing, opt-in live tests skipped)
npm run demo # → Evidence Report; verify → hash OK · HMAC OK · TSA OK
npm run bench:djl # → logs/djl-latency.json (p95<5ms, p99 adv<50ms)
node bin/decode-evidence.js # offline audit-trail inspector (verifies HMAC + TSA, prints decisions[])
# Real, live, end-to-end (needs BRIGHT_DATA_TOKEN + AIML_API_KEY):
node scripts/check-pipeline-live.mjs "https://en.wikipedia.org/wiki/Bright_Data" all # 4 lenses in parallel
```
Opt-in live checks (gated by env flags so the suite never fakes a pass): `AIML_LIVE=1` · `TRIGGERWARE_LIVE=1` · `COGNEE_LIVE=1`.
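
One way to gate a live check behind an env flag, so that it shows up as *skipped* rather than passing silently, is the `skip` option of Node's built-in test runner. The helper name `skipUnless()` and the test body are assumptions for illustration; the repo's own test files may be organized differently.

```javascript
const { test } = require('node:test');

// Returns false when the flag is set (run the test); otherwise a skip reason
// that the runner prints next to the skipped test.
function skipUnless(flag, env = process.env) {
  return env[flag] === '1' ? false : `set ${flag}=1 to run this live check`;
}

test('AI/ML API live classification', { skip: skipUnless('AIML_LIVE') }, async () => {
  // Placeholder body: a real live check would call the service here.
  if (!process.env.AIML_API_KEY) {
    throw new Error('AIML_API_KEY must be set for live runs');
  }
});
```

With the flag unset the runner lists the test as skipped, along with the reason, instead of counting it as a pass.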
---
## ◆ Partners — each verified against the real service
| Partner | Role in Synthex | Verified |
|---|---|:--:|
| **Bright Data — Web Unlocker** | FETCH (MCP stdio + REST) | ✅ live |
| **Bright Data — SERP API** | FETCH (structured JSON, zone `serp_api1`) | ✅ live |
| **Bright Data — Browser API** | FETCH (Playwright `connectOverCDP`, JS-heavy) | ✅ live (local/flag) |
| **Bright Data — Web Scraper / Datasets** | FETCH (`datasets/v3/scrape`) | ✅ live |
| **Bright Data — Crawl** | FETCH (multi-page via Web Unlocker) | ✅ live · native Crawl API opt-in |
| **Bright Data — MCP** | FETCH substrate (`server.js` companion) | ✅ live |
| **AI/ML API** | CLASSIFY brain (frontier model, extraction) | ✅ live classification |
| **Cognee** | MEMORY knowledge graph (OSS, via its MCP) | ✅ tools `remember`/`recall` confirmed |
| **Triggerware** | REACT (poll deltas → fire pipeline) | ✅ live API (`GET /triggers` 200) |
**All 6 Bright Data surfaces verified LIVE with real code.** Crawl is a multi-page crawl over Web Unlocker; the native Crawl API stays opt-in via a Crawl `dataset_id`.
---
## ◆ Market & business
Synthex doesn't claim a single tidy TAM — it sits at the **intersection** of three real markets, each sized by a named firm with very different scopes. We address a **wedge** of that intersection: a verifiable evidence + screening layer for the web content autonomous agents ingest, for teams accountable under EU AI Act / DORA. It is *not* the whole AI-agents market.
| Adjacent market | Size & horizon | Source |
|---|---|---|
| AI agents | **$52.6B by 2030** → $231.9B by 2034 (CAGR 46.3%) | [MarketsandMarkets](https://www.marketsandmarkets.com/PressReleases/ai-agents.asp) · [Dimension Market Research](https://dimensionmarketresearch.com/report/ai-agents-market/) |
| AI-driven web scraping | **$46.1B by 2035** (CAGR 19.9%) | [Market Research Future](https://www.marketresearchfuture.com/reports/ai-driven-web-scraping-market-24744) |
| AI in observability | **$10.7B by 2033** (CAGR 22.5%) | [Market.us](https://market.us/report/ai-in-observability-market/) |
> Forecasts across firms differ widely because they define scope differently — we cite the firm and horizon for each rather than collapse them into one headline number. Synthex's serviceable slice is a subset of all three.
### Pricing — *proposed* (not yet live revenue)
Every tier below is a **proposed** go-to-market model. Synthex has **no paying customers and no revenue today**; these are pricing hypotheses, not reported figures.
| Tier | Proposed price | For |
|---|---|---|
| **OSS** | Free (MIT) | the full pipeline, self-hosted — what's in this repo |
| **Pro** | ~$99/mo *(proposed)* | hosted endpoint, higher rate limits, retained Evidence Reports |
| **Enterprise** | $2,500+/mo *(proposed)* | SSO, audit retention, on-prem TSA, EU AI Act / DORA evidence workflows |
### Why us — the signed Evidence Report
| Category | What they watch | What they can't ship |
|---|---|---|
| SIEM / log tools | the agent's *infrastructure* | proof of the web *content* the agent saw |
| Agent-observability | traces, tokens, latency | a cryptographically sealed, court-grade report |
| Scraping APIs | raw bytes | classification + screening + RFC 3161 evidence |
| **Synthex** | the web content itself | — *the signed Evidence Report is the moat* |
---
## ◆ v0.6.0 — Watch & Prove
The chain-of-custody release. Every re-scrape of the same target now
chains `previous_tsa_serial → current_tsa_serial`, with a normalized
content diff and a 7th PDF page when a delta is present.
- **`src/delta/`** — Delta Engine: `normalize → hash → diff → sealDeltaChain`.
35 unit + 1 integration tests.
- **`HMAC_EXCLUDED_KEYS`** (`src/prove/evidence-report.js`): cross-run
  determinism. `kg_status`, `kg_latency_ms`, `surface_status` are non-normative
  run metadata, excluded from the HMAC bytestring so the chain never reports
  a phantom change.
- **`src/forge/pii-filter.js`** — 25-rule PII bundle (10 DJL-PII reused +
15 secrets-leak: AWS / GitHub / Stripe / JWT / etc.) gating Cognee ingest.
- **Model tier selector** (`src/classify/tiers.js`) — `free` / `oss` / `paid`.
FREE labeled `free-low-quality` per `docs/v060-calibration.md` (50 % of
fixtures had Δseverity > 1.5 vs DeepSeek baseline).
- **Try it inline** in the `#live` section of [synthex.apohara.dev](https://synthex.apohara.dev) —
paste a URL, pick a lens + model tier (OSS / PAID / FREE), watch the 4
stages execute in real time against Bright Data, download the 6-7 page
signed PDF. No separate playground page needed — everything lives in
one REEF-style scroll.
- **Live stress run** (2026-05-28): 500 URLs · 99.6 % success ·
$0.75 cost · 9.1 min wall clock — see `docs/v060-stress-report.md`.
- **DigiCert TSA RTT baseline**: p95 385 ms — see `logs/digicert-rtt-baseline.json`.
- **`docs/PRIOR_ART.md`** — reproducible directed-search queries proving
the "no open-source combination of [scrape + diff + HMAC + RFC 3161 + KG]
found at 2026-05-28" claim.
- **`.kiro/specs/delta-engine.md`** — Kiro-native MCP spec for the Kiro
Challenge integration.
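
The `normalize → hash → diff → sealDeltaChain` flow can be illustrated with a small sketch. Everything below is an assumption except the two field names `previous_tsa_serial` and `current_tsa_serial`: the whitespace normalization, the line-level diff, the `tsa_serial` property on each snapshot, and the `linkDelta()` name are hypothetical, and the real sealing step (HMAC + RFC 3161) is omitted.

```javascript
const { createHash } = require('node:crypto');

const sha256 = (text) => createHash('sha256').update(text, 'utf8').digest('hex');

// Assumed normalization: collapse whitespace per line and drop empty lines,
// so layout-only changes do not register as content changes.
function normalizeLines(text) {
  return text
    .split('\n')
    .map((line) => line.replace(/\s+/g, ' ').trim())
    .filter(Boolean);
}

// Links two snapshots of the same target into one chain entry.
function linkDelta(previous, current) {
  const before = normalizeLines(previous.content);
  const after = normalizeLines(current.content);
  const beforeSet = new Set(before);
  const afterSet = new Set(after);
  const previousHash = sha256(before.join('\n'));
  const currentHash = sha256(after.join('\n'));
  return {
    previous_tsa_serial: previous.tsa_serial,
    current_tsa_serial: current.tsa_serial,
    previous_hash: previousHash,
    current_hash: currentHash,
    changed: previousHash !== currentHash,
    added: after.filter((line) => !beforeSet.has(line)),
    removed: before.filter((line) => !afterSet.has(line)),
  };
}
```

Hashing the normalized text rather than the raw bytes is what keeps a re-scrape with different whitespace from showing up as a delta.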
---
## ◆ Honesty
The pitch *is* honesty — so it applies to us too. **Canonical caveats live in [`docs/HONESTY.md`](docs/HONESTY.md)** — RFC-3161 verification scope (v0.7.0 M1), rate-limit posture, PII-gate placement, durability choices, and Risk-Score semantics. This section is the short list; the doc is the long list.
- **Proven live:** Bright Data — Web Unlocker (MCP **and** REST), SERP API, Browser API, Web Scraper / Datasets API, native MCP server (`server.js`) · AI/ML classification (single + 4-lens parallel) · DigiCert RFC 3161 timestamp · downloadable 6-page PDF · Vercel deploy (`/api/analyze` live, end-to-end) · Triggerware API · Cognee MCP tools.
- **Crawl is multi-page over Web Unlocker, not the native Crawl product.** All 6 Bright Data surfaces are verified live; we name the crawl honestly — the native Crawl API stays opt-in via a Crawl `dataset_id`.
- **Risk Score is an internal estimate:** the PDF's Synthex Risk Score (0–100) is a deterministic heuristic computed from the report's own data, with the formula printed on the page. It is **NOT** a Munich Re rating or any third-party underwriting score.
- **Opt-in (cost/credentials):** Cognee memory is default in the local/CLI path but **off** on the public endpoint; its `remember` ingest uses an LLM → behind `COGNEE_LIVE`. OTel OTLP export only runs if `OTEL_EXPORTER_OTLP_ENDPOINT` is set (otherwise spans are no-op / console-only). Network tests are env-gated so the suite never fabricates a pass.
- **Two-layer defense scope:** Synthex runs **28 web-injection rules** (`src/forge/prefilter.js`: SSRF, prototype-pollution, MCP tool poisoning, indirect prompt-injection, BrowseSafe / VPI-Bench text vectors) **plus 78 prompt-level rules** (`src/forge/djl.js`: jailbreak, harm/PII bilingual EN+ES, SQLi/XSS, exfiltration, tool misuse, sector policy). Both layers are **deterministic regex heuristics**, *aligned with* the SkillFortify benchmark (arXiv 2603.00195), not a formal guarantee. They do **not** stop *visual* prompt injection (VPI in rendered screenshots/images), which is a different threat model.
- **Coverage on curated fixtures:** `test/djl.test.js` validates **78/78 fixtures pass identically** (78 positive + 78 negative = 156 assertions). This is measured coverage on curated examples, NOT a formal guarantee against every adversarial input. Divergences would land in [`docs/djl-parity-divergence.md`](docs/djl-parity-divergence.md) (currently empty).
- **Effective coverage on synthetic corpus** (SC-11, `node scripts/measure-coverage.mjs`): on the 156 internal fixtures (78 positive + 78 negative; *synthetic*, not real Bright Data scraping), DJL fires on **50.0% of docs** and **100% of the 78 rules** fired at least once; prefilter fires on **9.0% of docs** (natural overlap with DJL on SQLi/XSS/exfiltration vectors) and **39.3% of the 28 rules** fired at least once. On a real Bright Data corpus the split will differ: prefilter higher (HTML scraping), DJL lower (rules designed for prompts, not docs). Re-run with `node scripts/measure-coverage.mjs ` on your own corpus.
- **HMAC canonicalization (schema_v2):** since v4 the HMAC sealing uses `canonicalize()` (JCS-like, `src/prove/canonicalize.js`), so the order of payload keys is irrelevant: an identical v2 payload produces the same HMAC no matter how it was built. The verifier auto-detects `schema_version` and verifies both v1 (legacy, `JSON.stringify`) and v2 (`canonicalize`) without flags. The global flag `EVIDENCE_SCHEMA_V2=0` forces the legacy sealer (rollback demo only).
- **Tokens saved (estimated, in the sealed payload):** Synthex emits `tokens_saved: {dedup_bytes, blocked_bytes, total_bytes, estimated_tokens, chars_per_token, note}` inside the v2 payload. The estimate uses **4 chars/token** as a conservative approximation — actual depends on the tokenizer (GPT-4 `cl100k_base` ~4.2, Claude ~3.8, multilingual CJK worse). The 3 mechanisms that contribute: (1) SHA-256 dedupe drops N-1 copies of identical content, (2) the 78-rule DJL blocks prompt-level attacks before classify, (3) the 28-rule prefilter blocks web-injection on top. Verify on any sealed report with `node bin/decode-evidence.js ` — the "tokens saved" line is right there in the summary.
- **Endpoint guard is best-effort:** the public rate-limit is in-memory per warm instance (a hard multi-instance limit would need Vercel KV). The SSRF block filters the hostname (literal + obfuscated/IPv6 private ranges) but does **not** resolve DNS, so a public domain pointing at a private IP (DNS rebinding) would pass — low risk here because the scrape runs on Bright Data's *remote* proxy, not the function's network.
- **Research grounding (cited, not implemented):** the parallel multi-lens design is grounded in **KVCOMM** (NeurIPS 2025); KV-cache memory is a stated future direction per **MemArt** (ICLR 2026). These are foundations we cite — not features Synthex ships.
- **Prior art, not pipeline:** the **INV-15** invariant ([Context_Forge paper](https://doi.org/10.5281/zenodo.20277875)) ships as a module and is cited as prior art — it is *not* part of this scraping pipeline.
- **Not claimed:** Synthex doesn't bypass any site's ToS — it uses Bright Data's compliant infrastructure. The timestamp proves *when* evidence existed, not the truth of its content.
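
To make the canonicalization caveat concrete, here is a minimal sketch of key-order-independent HMAC sealing. It is a simplified JCS-like routine written for this example, not the contents of `src/prove/canonicalize.js`; the `seal()` and `verify()` names are assumptions, and the input is assumed to be plain JSON data.

```javascript
const { createHmac, timingSafeEqual } = require('node:crypto');

// Simplified JCS-like canonical form: object keys sorted recursively,
// arrays kept in order, primitives serialized with JSON.stringify.
function canonicalize(value) {
  if (Array.isArray(value)) {
    return `[${value.map(canonicalize).join(',')}]`;
  }
  if (value !== null && typeof value === 'object') {
    const body = Object.keys(value)
      .filter((key) => value[key] !== undefined)
      .sort()
      .map((key) => `${JSON.stringify(key)}:${canonicalize(value[key])}`)
      .join(',');
    return `{${body}}`;
  }
  return JSON.stringify(value ?? null);
}

// HMAC-SHA256 over the canonical bytes, hex-encoded.
function seal(payload, key) {
  return createHmac('sha256', key).update(canonicalize(payload), 'utf8').digest('hex');
}

// Constant-time comparison of the recomputed and the claimed HMAC.
function verify(payload, key, expectedHex) {
  const actual = Buffer.from(seal(payload, key), 'hex');
  const expected = Buffer.from(expectedHex, 'hex');
  return actual.length === expected.length && timingSafeEqual(actual, expected);
}
```

Two payloads with the same content but different key order produce the same seal, while sealing `JSON.stringify(payload)` directly would not, because `JSON.stringify` preserves insertion order.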
---
**We didn't just use Bright Data — we improved it.**
Upstream contribution: [`brightdata-mcp` PR #140](https://github.com/brightdata/brightdata-mcp/pull/140) (dedup + field filtering). See [`docs/CONTRIBUTION.md`](docs/CONTRIBUTION.md).
MIT © 2026 Pablo M. Suárez · [Apohara]