https://github.com/suarezpm/apohara-synthex

The evidence layer that lives inside Bright Data — scrape → classify → sign verifiable web intelligence. MCP companion to brightdata-mcp. Web Data UNLOCKED hackathon.
Host: GitHub
URL: https://github.com/suarezpm/apohara-synthex
Owner: SuarezPM
License: mit
Created: 2026-05-27T18:06:55.000Z (21 days ago)
Default Branch: main
Last Pushed: 2026-05-27T18:32:57.000Z (21 days ago)
Last Synced: 2026-05-27T20:08:53.249Z (21 days ago)
Language: JavaScript
Size: 83 KB
Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

# ◆ Synthex

### The evidence layer that lives inside Bright Data

**Scrape it · Classify it · Prove it.**

Turn the web your AI agents touch into classified intelligence, sealed with court-grade, verifiable evidence.

![License](https://img.shields.io/badge/license-MIT-blue)

![Tests](https://img.shields.io/badge/tests-suite%20green-brightgreen)

![Defense Layers](https://img.shields.io/badge/pre--LLM_defense-28%20web--injection%20%2B%2078%20prompt--level-orange)

![Token Saver](https://img.shields.io/badge/tokens--saved-deterministic%20pre--LLM%20block%20%2B%20dedup-9775fa)

![Node](https://img.shields.io/badge/node-%E2%89%A520-339933?logo=node.js&logoColor=white)

![Runtime](https://img.shields.io/badge/100%25-JavaScript-f7df1e?logo=javascript&logoColor=000)

![MCP](https://img.shields.io/badge/MCP-companion-7c3aed)

![Substrate](https://img.shields.io/badge/substrate-Bright%20Data-ff6b35)

### ▶ [Live demo: synthex.apohara.dev](https://synthex.apohara.dev)

📄 **[See a real Evidence Report → sample PDF](samples/synthex-evidence-report.pdf)** — 6 pages, sealed with HMAC-SHA256 + RFC 3161 (DigiCert), generated by the real pipeline ([regenerate](scripts/gen-sample-report.mjs)).

[Live demo](https://synthex.apohara.dev) · [Sample report](samples/synthex-evidence-report.pdf) · [Quickstart](#-quickstart) · [Verify in 60s](#-verify-it-yourself-60-seconds) · [Architecture](#-architecture) · [Honesty](#-honesty)

_Web Data UNLOCKED Hackathon · Bright Data × lablab.ai · MIT_



---

> **Your AI agents are scraping the live web right now.**

> Do you know what they found, what they classified, and what you can *prove*?

Synthex is a **100% JavaScript MCP server** that wraps [`brightdata-mcp`](https://github.com/brightdata/brightdata-mcp) and turns raw web scraping into a defensible intelligence pipeline:

**scrape → dedup & screen → classify (GTM · Finance · Security · Supply-chain) → remember → seal as verifiable evidence → react.**

**For** AI Operations & Security teams running agents with web access that must account for *what those agents found and decided* — under EU AI Act / DORA.

**The moat:** SIEMs and agent-observability tools watch the agent's *infrastructure*. Synthex sees — and **cryptographically signs** — the web *content* the agent touched. The signed Evidence Report is something no competitor ships.

---

## ◆ Architecture

```
 Triggerware ─(react)─┐                                       ┌─(act)─► alert + webhook
                      ▼                                       │
   FETCH ─────► FORGE ──────────► CLASSIFY ─────► PROVE ─────► OBSERVE ─────► MEMORY
   Bright Data  SHA-256 dedup +   AI/ML API       HMAC-SHA256  OpenTelemetry  Cognee (graph)
   (5 APIs)     78-rule DJL +     (frontier LLM)  + RFC 3161   GenAI spans    + local store
                28-rule prefilter   4 lenses ‖     TSA + PDF    (OTLP opt-in)  (opt-in / CLI)
```

| Stage | What it does |
|------|--------------|
| **FETCH** | Routes each target to the right Bright Data surface: **Web Unlocker** (MCP stdio + REST), **SERP API** (zone `serp_api1`), **Browser API** (Playwright `connectOverCDP`), **Web Scraper / Datasets API** (`datasets/v3/scrape`), and **Crawl API**. *No Bright Data, no data.* |
| **FORGE** | SHA-256 dedup + **two-layer deterministic pre-LLM defense (106 rules)**. Layer 1 `prefilter.js` (28 rules): SSRF, prototype-pollution, MCP tool poisoning, indirect prompt-injection, BrowseSafe / VPI-Bench text vectors. Layer 2 `djl.js` (78 rules): prompt-injection, harm/PII bilingual EN+ES, jailbreak, SQLi/XSS, exfiltration, tool misuse, sector policy (HIPAA/PCI/EO-13526). Each stage emits an audit trail in the v2 payload's `decisions[]`, together with the `policy_bundle` SHA. |
| **CLASSIFY** | A frontier model via **AI/ML API** extracts structured signals under one lens — or all **four lenses in parallel** (`lens="all"` → GTM + Finance + Security + Supply-chain). |
| **PROVE** | Every report sealed with HMAC-SHA256 **and** an RFC 3161 timestamp from **DigiCert** — exportable as a **6-page downloadable PDF Evidence Report** (4-buyer framing: CISO · CFO · General Counsel · Broker) with a Synthex Risk Score 0–100. |
| **OBSERVE** | Every stage emits OpenTelemetry GenAI spans (`gen_ai.client.operation.duration`, token usage, blocked count). OTLP export is opt-in; latencies stream into the UI over SSE. |
| **MEMORY** | Local store for deltas + **Cognee** (OSS knowledge graph) — default in the local/CLI path, off on the public endpoint to control cost. |
| **WATCH / REACT** | Always-on loop: detect change → run the pipeline → alert. No human in the loop. |
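
The SHA-256 dedup step in FORGE can be pictured with a minimal sketch like the one below. It is not the repository's implementation: `contentHash`, `dedupe`, and the `{ url, body }` document shape are hypothetical names chosen for illustration, and the only library used is Node's built-in `node:crypto`.

```javascript
const crypto = require('node:crypto');

// Hypothetical helper: hex SHA-256 digest of a document body.
function contentHash(text) {
  return crypto.createHash('sha256').update(text, 'utf8').digest('hex');
}

// Keep the first copy of each distinct body and count the bytes
// dropped as exact duplicates (assumed document shape: { url, body }).
function dedupe(docs) {
  const seen = new Set();
  const unique = [];
  let dedupBytes = 0;
  for (const doc of docs) {
    const hash = contentHash(doc.body);
    if (seen.has(hash)) {
      dedupBytes += Buffer.byteLength(doc.body, 'utf8');
      continue;
    }
    seen.add(hash);
    unique.push({ ...doc, sha256: hash });
  }
  return { unique, dedupBytes };
}
```

Hashing the full body only catches byte-identical copies; near-duplicates would need normalization before hashing, which this sketch leaves out.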

---

## ◆ Quickstart

```bash
npm install

# credentials live OUTSIDE the repo (never committed):
export BRIGHT_DATA_TOKEN=...    # Bright Data (promo: unlocked)
export AIML_API_KEY=...         # AI/ML API
export TRIGGERWARE_API_KEY=...  # Triggerware

npm test        # unit suite (network tests are opt-in)
npm run demo    # end-to-end Evidence Report + LIVE DigiCert seal
SYNTHEX_TRACE=console npm run demo   # same, with per-stage OTel latencies printed
node server.js  # run as an MCP server (companion to brightdata-mcp)
```

**Web UI / Vercel:** `public/` + `api/` deploy as a static site + serverless functions (`vercel deploy`). The deployed `/api/analyze` runs the **full live pipeline** via the Bright Data REST API; `/api/stream` pushes per-stage progress to the UI over **SSE** (cinematic stage view). Set `BRIGHT_DATA_TOKEN`, `WEB_UNLOCKER_ZONE`, `AIML_API_KEY`, `SYNTHEX_HMAC_KEY` in the project env (without them it falls back to a labeled cached demo). The public endpoint is guarded (SSRF block + per-IP rate-limit); Cognee memory stays off there to control cost.

→ **[synthex.apohara.dev](https://synthex.apohara.dev)** (live · deployed on Vercel, also reachable at `apohara-synthex.vercel.app`).
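
To make the endpoint guard concrete, here is a rough sketch of a hostname-level SSRF check plus an in-memory per-IP limiter. It is an illustration under assumptions, not the deployed code: `isBlockedTarget`, `isPrivateIPv4`, and `createRateLimiter` are hypothetical names, the private-range list is deliberately short, and IPv6 literals are simply rejected rather than classified.

```javascript
const net = require('node:net');

// Loopback, RFC 1918, link-local and "this network" IPv4 ranges.
function isPrivateIPv4(ip) {
  const [a, b] = ip.split('.').map(Number);
  return a === 0 || a === 10 || a === 127 ||
    (a === 172 && b >= 16 && b <= 31) ||
    (a === 192 && b === 168) ||
    (a === 169 && b === 254);
}

// Hostname-only check: no DNS resolution is performed.
function isBlockedTarget(rawUrl) {
  let url;
  try {
    url = new URL(rawUrl);
  } catch {
    return true; // unparseable input is rejected
  }
  if (url.protocol !== 'http:' && url.protocol !== 'https:') return true;
  const host = url.hostname;
  if (host === 'localhost' || host.endsWith('.localhost')) return true;
  if (host.startsWith('[')) return true; // IPv6 literal: rejected wholesale here
  if (net.isIP(host) === 4) return isPrivateIPv4(host);
  return false;
}

// Fixed-window limiter; state lives only in this process's memory.
function createRateLimiter({ limit, windowMs }) {
  const hits = new Map(); // ip -> { count, windowStart }
  return function allow(ip, now = Date.now()) {
    const entry = hits.get(ip);
    if (!entry || now - entry.windowStart >= windowMs) {
      hits.set(ip, { count: 1, windowStart: now });
      return true;
    }
    if (entry.count >= limit) return false;
    entry.count += 1;
    return true;
  };
}
```

Because the limiter's `Map` is per process, each warm serverless instance keeps its own counters, so the limit is best-effort across instances.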

---

## ◆ Verify it yourself (60 seconds)

Don't trust the claims — run them.

```bash
npm test                                   # → full suite green (zero failing, opt-in live tests skipped)
npm run demo                               # → Evidence Report; verify → hash OK · HMAC OK · TSA OK
npm run bench:djl                          # → logs/djl-latency.json (p95<5ms, p99 adv<50ms)
node bin/decode-evidence.js   # offline audit-trail inspector (verifies HMAC + TSA, prints decisions[])

# Real, live, end-to-end (needs BRIGHT_DATA_TOKEN + AIML_API_KEY):
node scripts/check-pipeline-live.mjs "https://en.wikipedia.org/wiki/Bright_Data" all   # 4 lenses in parallel
```

Opt-in live checks (gated by env flags so the suite never fakes a pass): `AIML_LIVE=1` · `TRIGGERWARE_LIVE=1` · `COGNEE_LIVE=1`.
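
One way to gate a network test behind a flag such as `AIML_LIVE=1` is the `skip` option of Node's built-in test runner. The sketch below assumes `node:test` is the runner, which this README does not state; `liveGate` and the test body are hypothetical.

```javascript
const test = require('node:test');

// Hypothetical helper: run only when the flag is exactly "1",
// otherwise report the test as skipped with a visible reason.
function liveGate(flag, env = process.env) {
  return env[flag] === '1'
    ? {}
    : { skip: `${flag}=1 not set; live check skipped` };
}

test('AI/ML API classification (live)', liveGate('AIML_LIVE'), async () => {
  // A real network call would go here; this placeholder only checks
  // that the credential named in the Quickstart is present.
  if (!process.env.AIML_API_KEY) {
    throw new Error('AIML_API_KEY must be set for live runs');
  }
});
```

A skipped test is reported as skipped rather than passed, which keeps a green suite from implying that live services were exercised.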

---

## ◆ Partners — each verified against the real service

| Partner | Role in Synthex | Verified |
|---|---|:--:|
| **Bright Data — Web Unlocker** | FETCH (MCP stdio + REST) | ✅ live |
| **Bright Data — SERP API** | FETCH (structured JSON, zone `serp_api1`) | ✅ live |
| **Bright Data — Browser API** | FETCH (Playwright `connectOverCDP`, JS-heavy) | ✅ live (local/flag) |
| **Bright Data — Web Scraper / Datasets** | FETCH (`datasets/v3/scrape`) | ✅ live |
| **Bright Data — Crawl** | FETCH (multi-page via Web Unlocker) | ✅ live · native Crawl API opt-in |
| **Bright Data — MCP** | FETCH substrate (`server.js` companion) | ✅ live |
| **AI/ML API** | CLASSIFY brain (frontier model, extraction) | ✅ live classification |
| **Cognee** | MEMORY knowledge graph (OSS, via its MCP) | ✅ tools `remember`/`recall` confirmed |
| **Triggerware** | REACT (poll deltas → fire pipeline) | ✅ live API (`GET /triggers` 200) |

**All 6 Bright Data surfaces verified LIVE with real code.** Crawl is a multi-page crawl over Web Unlocker; the native Crawl API stays opt-in via a Crawl `dataset_id`.

---

## ◆ Market & business

Synthex doesn't claim a single tidy TAM — it sits at the **intersection** of three real markets, each sized by a named firm with very different scopes. We address a **wedge** of that intersection: a verifiable evidence + screening layer for the web content autonomous agents ingest, for teams accountable under EU AI Act / DORA. It is *not* the whole AI-agents market.

| Adjacent market | Size & horizon | Source |
|---|---|---|
| AI agents | **$52.6B by 2030** → $231.9B by 2034 (CAGR 46.3%) | [MarketsandMarkets](https://www.marketsandmarkets.com/PressReleases/ai-agents.asp) · [Dimension Market Research](https://dimensionmarketresearch.com/report/ai-agents-market/) |
| AI-driven web scraping | **$46.1B by 2035** (CAGR 19.9%) | [Market Research Future](https://www.marketresearchfuture.com/reports/ai-driven-web-scraping-market-24744) |
| AI in observability | **$10.7B by 2033** (CAGR 22.5%) | [Market.us](https://market.us/report/ai-in-observability-market/) |

> Forecasts across firms differ widely because they define scope differently — we cite the firm and horizon for each rather than collapse them into one headline number. Synthex's serviceable slice is a subset of all three.

### Pricing — *proposed* (not yet live revenue)

Every tier below is a **proposed** go-to-market model. Synthex has **no paying customers and no revenue today**; these are pricing hypotheses, not reported figures.

| Tier | Proposed price | For |
|---|---|---|
| **OSS** | Free (MIT) | the full pipeline, self-hosted — what's in this repo |
| **Pro** | ~$99/mo *(proposed)* | hosted endpoint, higher rate limits, retained Evidence Reports |
| **Enterprise** | $2,500+/mo *(proposed)* | SSO, audit retention, on-prem TSA, EU AI Act / DORA evidence workflows |

### Why us — the signed Evidence Report

| Category | What they watch | What they can't ship |
|---|---|---|
| SIEM / log tools | the agent's *infrastructure* | proof of the web *content* the agent saw |
| Agent-observability | traces, tokens, latency | a cryptographically sealed, court-grade report |
| Scraping APIs | raw bytes | classification + screening + RFC 3161 evidence |
| **Synthex** | the web content itself | — *the signed Evidence Report is the moat* |

---

## ◆ v0.6.0 — Watch & Prove

The chain-of-custody release. Every re-scrape of the same target now chains `previous_tsa_serial → current_tsa_serial`, with a normalized content diff and a 7th PDF page when a delta is present.

- **`src/delta/`** — Delta Engine: `normalize → hash → diff → sealDeltaChain`. 35 unit + 1 integration tests.

- **`HMAC_EXCLUDED_KEYS`** (`src/prove/evidence-report.js`) — cross-run determinism: `kg_status`, `kg_latency_ms`, and `surface_status` are run-dependent (non-normative) metadata, excluded from the HMAC bytestring so the chain never reports a phantom change.

- **`src/forge/pii-filter.js`** — 25-rule PII bundle (10 DJL-PII reused + 15 secrets-leak: AWS / GitHub / Stripe / JWT / etc.) gating Cognee ingest.

- **Model tier selector** (`src/classify/tiers.js`) — `free` / `oss` / `paid`. The free tier is labeled `free-low-quality` per `docs/v060-calibration.md` (50% of fixtures had Δseverity > 1.5 vs the DeepSeek baseline).

- **Try it inline** in the `#live` section of [synthex.apohara.dev](https://synthex.apohara.dev) — paste a URL, pick a lens + model tier (OSS / PAID / FREE), watch the 4 stages execute in real time against Bright Data, download the 6–7 page signed PDF. No separate playground page needed — everything lives in one REEF-style scroll.

- **Live stress run** (2026-05-28): 500 URLs · 99.6% success · $0.75 cost · 9.1 min wall clock — see `docs/v060-stress-report.md`.

- **DigiCert TSA RTT baseline**: p95 385 ms — see `logs/digicert-rtt-baseline.json`.

- **`docs/PRIOR_ART.md`** — reproducible directed-search queries backing the claim "no open-source combination of [scrape + diff + HMAC + RFC 3161 + KG] found as of 2026-05-28".

- **`.kiro/specs/delta-engine.md`** — Kiro-native MCP spec for the Kiro Challenge integration.
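
The chain-of-custody idea above (normalize, hash, compare, then seal a link between two captures while ignoring run-dependent keys) can be sketched roughly as follows. This is not the Delta Engine's code: `sealDeltaLink`, `signablePayload`, the whitespace-only normalizer, and the `previous_hash` / `current_hash` / `changed` fields are hypothetical, and the "diff" is reduced to a changed/unchanged flag.

```javascript
const crypto = require('node:crypto');

// Keys the README lists as excluded from the HMAC bytestring.
const HMAC_EXCLUDED_KEYS = ['kg_status', 'kg_latency_ms', 'surface_status'];

// Assumed normalizer: collapse whitespace so cosmetic changes do not register.
function normalize(text) {
  return text.replace(/\s+/g, ' ').trim();
}

function sha256(text) {
  return crypto.createHash('sha256').update(text, 'utf8').digest('hex');
}

// Drop run-dependent keys before signing (top level only in this sketch).
function signablePayload(payload) {
  return Object.fromEntries(
    Object.entries(payload).filter(([key]) => !HMAC_EXCLUDED_KEYS.includes(key))
  );
}

function seal(payload, key) {
  const bytes = JSON.stringify(signablePayload(payload));
  return crypto.createHmac('sha256', key).update(bytes).digest('hex');
}

// Link the current capture to the previous one by TSA serial.
function sealDeltaLink(previous, current, key) {
  const previousHash = sha256(normalize(previous.content));
  const currentHash = sha256(normalize(current.content));
  const payload = {
    previous_tsa_serial: previous.tsa_serial,
    current_tsa_serial: current.tsa_serial,
    previous_hash: previousHash,
    current_hash: currentHash,
    changed: previousHash !== currentHash,
    kg_latency_ms: current.kg_latency_ms, // carried along, but never signed
  };
  return { payload, hmac: seal(payload, key) };
}
```

With this shape, two runs over unchanged content produce the same HMAC even when `kg_latency_ms` differs, which is the "no phantom change" property the release notes describe.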

---

## ◆ Honesty

The pitch *is* honesty — so it applies to us too. **Canonical caveats live in [`docs/HONESTY.md`](docs/HONESTY.md)** — RFC-3161 verification scope (v0.7.0 M1), rate-limit posture, PII-gate placement, durability choices, and Risk-Score semantics. This section is the short list; the doc is the long list.

- **Proven live:** Bright Data — Web Unlocker (MCP **and** REST), SERP API, Browser API, Web Scraper / Datasets API, native MCP server (`server.js`) · AI/ML classification (single + 4-lens parallel) · DigiCert RFC 3161 timestamp · downloadable 6-page PDF · Vercel deploy (`/api/analyze` live, end-to-end) · Triggerware API · Cognee MCP tools.

- **Crawl is multi-page over Web Unlocker, not the native Crawl product.** All 6 Bright Data surfaces are verified live; we name the crawl honestly — the native Crawl API stays opt-in via a Crawl `dataset_id`.

- **Risk Score is an internal estimate:** the PDF's Synthex Risk Score (0–100) is a deterministic heuristic computed from the report's own data, with the formula printed on the page. It is **NOT** a Munich Re rating or any third-party underwriting score.

- **Opt-in (cost/credentials):** Cognee memory is default in the local/CLI path but **off** on the public endpoint; its `remember` ingest uses an LLM → behind `COGNEE_LIVE`. OTel OTLP export only runs if `OTEL_EXPORTER_OTLP_ENDPOINT` is set (otherwise spans are no-op / console-only). Network tests are env-gated so the suite never fabricates a pass.

- **Two-layer defense scope:** Synthex runs **28 web-injection rules** (`src/forge/prefilter.js` — SSRF, prototype-pollution, MCP tool poisoning, indirect prompt-injection, BrowseSafe / VPI-Bench text vectors) **plus 78 prompt-level rules** (`src/forge/djl.js` — jailbreak, harm/PII bilingual EN+ES, SQLi/XSS, exfiltration, tool misuse, sector policy). Both layers are **deterministic regex heuristics** — *aligned with* the SkillFortify benchmark (arXiv 2603.00195), not a formal guarantee. They do **not** stop *visual* prompt injection (VPI in rendered screenshots/images) — a different threat model.

- **Coverage on curated fixtures:** `test/djl.test.js` validates **78/78 fixtures pass identically** (78 positive + 78 negative = 156 assertions). This is measured coverage on curated examples, NOT a formal guarantee against every adversarial input. Divergences would land in [`docs/djl-parity-divergence.md`](docs/djl-parity-divergence.md) (currently empty).

- **Effective coverage on synthetic corpus** (SC-11, `node scripts/measure-coverage.mjs`): on the 156 internal fixtures (78 positive + 78 negative — *synthetic*, not real Bright Data scraping), DJL fires on **50.0% of docs** and **100% of the 78 rules** fired at least once; prefilter fires on **9.0% of docs** (natural overlap with DJL on SQLi/XSS/exfiltration vectors) and **39.3% of the 28 rules** fired at least once. On a real Bright Data corpus the split will differ — prefilter higher (HTML scraping), DJL lower (rules designed for prompts, not docs). Re-run with `node scripts/measure-coverage.mjs ` on your own corpus.

- **HMAC canonicalization (schema_v2):** since v4 the HMAC sealing uses `canonicalize()` (JCS-like, `src/prove/canonicalize.js`) so the order of payload keys is irrelevant — an identical v2 payload produces the same HMAC no matter how it was built. The verifier auto-detects `schema_version` and verifies both v1 (legacy, JSON.stringify) and v2 (canonicalize) without flags. The global flag `EVIDENCE_SCHEMA_V2=0` forces the legacy sealer (rollback demo only).

- **Tokens saved (estimated, in the sealed payload):** Synthex emits `tokens_saved: {dedup_bytes, blocked_bytes, total_bytes, estimated_tokens, chars_per_token, note}` inside the v2 payload. The estimate uses **4 chars/token** as a conservative approximation — actual depends on the tokenizer (GPT-4 `cl100k_base` ~4.2, Claude ~3.8, multilingual CJK worse). The 3 mechanisms that contribute: (1) SHA-256 dedupe drops N-1 copies of identical content, (2) the 78-rule DJL blocks prompt-level attacks before classify, (3) the 28-rule prefilter blocks web-injection on top. Verify on any sealed report with `node bin/decode-evidence.js ` — the "tokens saved" line is right there in the summary.

- **Endpoint guard is best-effort:** the public rate-limit is in-memory per warm instance (a hard multi-instance limit would need Vercel KV). The SSRF block filters the hostname (literal + obfuscated/IPv6 private ranges) but does **not** resolve DNS, so a public domain pointing at a private IP (DNS rebinding) would pass — low risk here because the scrape runs on Bright Data's *remote* proxy, not the function's network.

- **Research grounding (cited, not implemented):** the parallel multi-lens design is grounded in **KVCOMM** (NeurIPS 2025); KV-cache memory is a stated future direction per **MemArt** (ICLR 2026). These are foundations we cite — not features Synthex ships.

- **Prior art, not pipeline:** the **INV-15** invariant ([Context_Forge paper](https://doi.org/10.5281/zenodo.20277875)) ships as a module and is cited as prior art — it is *not* part of this scraping pipeline.

- **Not claimed:** Synthex doesn't bypass any site's ToS — it uses Bright Data's compliant infrastructure. The timestamp proves *when* evidence existed, not the truth of its content.
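
The HMAC canonicalization caveat is easier to see in code. The sketch below shows why sorting keys before signing makes the seal independent of how the payload object was built. It is a simplified, JCS-like stand-in, not the contents of `src/prove/canonicalize.js`: the `seal` and `verify` helpers are hypothetical, and inputs are assumed to be plain JSON values (no `undefined`, functions, or cycles).

```javascript
const crypto = require('node:crypto');

// JCS-like canonical form: object keys sorted recursively, arrays keep order.
function canonicalize(value) {
  if (Array.isArray(value)) {
    return `[${value.map((item) => canonicalize(item)).join(',')}]`;
  }
  if (value !== null && typeof value === 'object') {
    const members = Object.keys(value)
      .sort()
      .map((key) => `${JSON.stringify(key)}:${canonicalize(value[key])}`);
    return `{${members.join(',')}}`;
  }
  return JSON.stringify(value); // strings, numbers, booleans, null
}

// HMAC-SHA256 over the canonical bytes, hex-encoded.
function seal(payload, key) {
  return crypto
    .createHmac('sha256', key)
    .update(canonicalize(payload))
    .digest('hex');
}

// Constant-time comparison of the recomputed seal against the stored one.
function verify(payload, key, expectedHex) {
  const actual = Buffer.from(seal(payload, key), 'hex');
  const expected = Buffer.from(expectedHex, 'hex');
  return actual.length === expected.length &&
    crypto.timingSafeEqual(actual, expected);
}

// Usage: key order does not affect the seal, content does.
const sealed = seal({ url: 'https://example.com', lens: 'security' }, 'demo-key');
console.log(verify({ lens: 'security', url: 'https://example.com' }, 'demo-key', sealed)); // true
console.log(verify({ lens: 'finance', url: 'https://example.com' }, 'demo-key', sealed));  // false
```

Signing `JSON.stringify(payload)` directly, as the legacy v1 path is described as doing, would make the seal depend on key insertion order, which is the problem canonicalization removes.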

---



**We didn't just use Bright Data — we improved it.**

Upstream contribution: [`brightdata-mcp` PR #140](https://github.com/brightdata/brightdata-mcp/pull/140) (dedup + field filtering). See [`docs/CONTRIBUTION.md`](docs/CONTRIBUTION.md).

MIT © 2026 Pablo M. Suárez · [Apohara]