{"id":49855472,"url":"https://github.com/forattini-dev/crawlex","last_synced_at":"2026-05-14T20:01:06.638Z","repository":{"id":354060486,"uuid":"1217195351","full_name":"forattini-dev/crawlex","owner":"forattini-dev","description":"The stealth crawler that actually looks like Chrome.","archived":false,"fork":false,"pushed_at":"2026-04-27T00:12:37.000Z","size":7186,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-04-27T01:14:11.177Z","etag":null,"topics":["crawler","stealth"],"latest_commit_sha":null,"homepage":"https://forattini-dev.github.io/crawlex/","language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/forattini-dev.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE-APACHE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":"NOTICE","maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-04-21T16:33:39.000Z","updated_at":"2026-04-27T00:12:42.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/forattini-dev/crawlex","commit_stats":null,"previous_names":["forattini-dev/crawlex"],"tags_count":4,"template":false,"template_full_name":null,"purl":"pkg:github/forattini-dev/crawlex","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/forattini-dev%2Fcrawlex","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/forattini-dev%2Fcrawlex/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/forattini-dev%2Fcrawlex/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/forattini-dev%2Fcrawlex/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/forattini-dev","download_url":"https://codeload.github.com/forattini-dev/crawlex/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/forattini-dev%2Fcrawlex/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33037097,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-13T13:14:54.681Z","status":"online","status_checked_at":"2026-05-14T02:00:06.663Z","response_time":57,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["crawler","stealth"],"created_at":"2026-05-14T20:00:29.487Z","updated_at":"2026-05-14T20:01:06.631Z","avatar_url":"https://github.com/forattini-dev.png","language":"Rust","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cdiv align=\"center\"\u003e\n\n# 🕸️ crawlex\n\n### **The stealth crawler that actually looks like Chrome.**\n\nTLS, HTTP/2, JS fingerprint — every byte indistinguishable from real Chrome 149.\u003cbr\u003e\nRust core • Node SDK • Lua hooks • cross-platform binaries.\n\n[![CI](https://github.com/forattini-dev/crawlex/actions/workflows/ci.yml/badge.svg)](https://github.com/forattini-dev/crawlex/actions/workflows/ci.yml)\n[![crates.io](https://img.shields.io/crates/v/crawlex.svg?logo=rust)](https://crates.io/crates/crawlex)\n[![npm](https://img.shields.io/npm/v/crawlex.svg?logo=npm)](https://www.npmjs.com/package/crawlex)\n[![docs](https://img.shields.io/badge/docs-docsify-success.svg)](https://forattini-dev.github.io/crawlex/)\n[![downloads](https://img.shields.io/crates/d/crawlex.svg)](https://crates.io/crates/crawlex)\n[![license](https://img.shields.io/badge/license-MIT%20%7C%20Apache--2.0-blue.svg)](#license)\n\n```bash\npnpm add -g crawlex \u0026\u0026 crawlex pages run --seed https://example.com --method render\n```\n\n[**Quickstart**](#-quickstart) · [**Features**](#-features) · [**Examples**](#-examples) · [**Docs**](https://forattini-dev.github.io/crawlex/) · [**Why crawlex**](#-why-crawlex)\n\n\u003c/div\u003e\n\n---\n\n## ⚡ Why crawlex\n\nStandard crawlers fail on the first Cloudflare wall. `crawlex` arrives the way **real Chrome** arrives — every fingerprint surface is identical, not approximated.\n\n\u003ctable\u003e\n\u003ctr\u003e\u003cth\u003eLayer\u003c/th\u003e\u003cth\u003eWhat we match — exactly, not approximately\u003c/th\u003e\u003c/tr\u003e\n\u003ctr\u003e\u003ctd\u003e🔐 \u003cstrong\u003eTLS ClientHello\u003c/strong\u003e\u003c/td\u003e\u003ctd\u003eExtension order, ALPS, GREASE values, \u003ccode\u003epermute_extensions\u003c/code\u003e, X25519MLKEM768, signature algorithms — verified against \u003ca href=\"https://tls.peet.ws\"\u003etls.peet.ws\u003c/a\u003e and \u003ca href=\"https://ja4db.com\"\u003eja4db.com\u003c/a\u003e oracles\u003c/td\u003e\u003c/tr\u003e\n\u003ctr\u003e\u003ctd\u003e🚦 \u003cstrong\u003eHTTP/2 frame\u003c/strong\u003e\u003c/td\u003e\u003ctd\u003ePseudo-header order \u003ccode\u003e:method :authority :scheme :path\u003c/code\u003e, SETTINGS frame parameters, WINDOW_UPDATE pattern — passes Akamai BMP signature checks\u003c/td\u003e\u003c/tr\u003e\n\u003ctr\u003e\u003ctd\u003e🎭 \u003cstrong\u003eJS fingerprint\u003c/strong\u003e\u003c/td\u003e\u003ctd\u003e29-section stealth shim: \u003ccode\u003enavigator\u003c/code\u003e, \u003ccode\u003echrome.*\u003c/code\u003e, permissions, plugins, screen, timezone, battery, WebGL (vendor / params / extensions), canvas (zero-preserving noise), AudioContext (FFT + offline render), \u003ccode\u003eFunction.prototype.toString\u003c/code\u003e proxy, WebGPU, \u003ccode\u003eperformance.memory\u003c/code\u003e, sensors, iframe, requestAnimationFrame throttle, \u003ccode\u003eperformance.now()\u003c/code\u003e 100µs grain, mediaDevices, fonts, WebRTC SDP/ICE/getStats scrub\u003c/td\u003e\u003c/tr\u003e\n\u003ctr\u003e\u003ctd\u003e🤖 \u003cstrong\u003eBehavior\u003c/strong\u003e\u003c/td\u003e\u003ctd\u003eMouse jitter, scroll cadence, dwell time, idle drift — coherent \u003ccode\u003emotion::\u003c/code\u003e profiles per persona\u003c/td\u003e\u003c/tr\u003e\n\u003ctr\u003e\u003ctd\u003e📦 \u003cstrong\u003eCatalog\u003c/strong\u003e\u003c/td\u003e\u003ctd\u003e30 Chrome stable × 30 Chromium × 20 Firefox × Edge × Safari fingerprints. Era-fallback resolution: ask for \u003ccode\u003echrome-149-linux\u003c/code\u003e, get the closest captured profile\u003c/td\u003e\u003c/tr\u003e\n\u003ctr\u003e\u003ctd\u003e🛠️ \u003cstrong\u003eWorker scope\u003c/strong\u003e\u003c/td\u003e\u003ctd\u003eSame shim auto-attached to dedicated / shared / service workers via CDP \u003ccode\u003eTarget.setAutoAttach\u003c/code\u003e — Camoufox port\u003c/td\u003e\u003c/tr\u003e\n\u003c/table\u003e\n\n→ Validated against [BrowserScan](https://browserscan.net), [CreepJS](https://abrahamjuliot.github.io/creepjs/), [Sannysoft](https://bot.sannysoft.com/), [tls.peet.ws](https://tls.peet.ws), [ja4db.com](https://ja4db.com).\n\n---\n\n## 🚀 Install\n\n```bash\n# npm — bundled binary download via postinstall\npnpm add -g crawlex\n\n# Rust — from source\ncargo install crawlex\n\n# Direct binary (linux x86_64/arm64, macOS x86_64/arm64, windows x86_64)\n# https://github.com/forattini-dev/crawlex/releases/latest\n```\n\n\u003e ⚠️ **Production crawls run locally**, never in CI. Datacenter IPs (GitHub Actions, AWS, Azure) are flagged instantly by every modern WAF.\n\n---\n\n## 🆕 Last 24h highlights\n\n- `1.0.4` release line is live across npm/crates/GitHub Releases, with docsify publishing through GitHub Pages.\n- JS/TS hooks now run through the SDK bridge, so `defineHooks()` can drive the same lifecycle decisions as embedded Rust hooks.\n- NDJSON events now carry richer artifacts, Web Vitals, per-fetch timings, crawl-attempt telemetry and crawl-resolution summaries.\n- `crawlex-mini` was hardened: CDP-only paths are gated cleanly in no-browser builds.\n- Large crawl efficiency grew: cache validation, prefetch discovery mode and best-first URL scoring are now available from CLI/config.\n- Render fallback grew: external CDP connection, GPU posture control, Shadow DOM flattening, overlay cleanup and last-resort fallback fetch are configurable.\n\n---\n\n## 🏃 Quickstart\n\n```bash\n# Stealth render with persona, sitemap discovery, NDJSON event stream\ncrawlex pages run \\\n  --seed https://target.com \\\n  --method render \\\n  --persona atlas \\\n  --max-depth 3 \\\n  --screenshot \\\n  --emit ndjson \u003e events.ndjson\n\n# Live tail what just happened\njq -c 'select(.event == \"fetch.completed\" or .event == \"render.completed\")' events.ndjson\n```\n\nThree integration paths, your pick:\n\n\u003ctable\u003e\n\u003ctr\u003e\u003cth\u003eCLI\u003c/th\u003e\u003cth\u003eNode SDK\u003c/th\u003e\u003cth\u003eEmbedded Rust\u003c/th\u003e\u003c/tr\u003e\n\u003ctr\u003e\u003ctd\u003e\n\n```bash\ncrawlex pages run \\\n  --seed https://...\\\n  --method render \\\n  --persona pixel \\\n  --emit ndjson\n```\n\nOne-shot crawls, scripted pipelines.\n\n\u003c/td\u003e\u003ctd\u003e\n\n```ts\nimport { crawl, defineHooks } from 'crawlex';\n\nfor await (const ev of crawl({\n  seeds: ['https://...'],\n  args: { method: 'render' },\n})) { ... }\n```\n\nProduction services with hook logic.\n\n\u003c/td\u003e\u003ctd\u003e\n\n```rust\nuse crawlex::{Crawler, Config};\nlet crawler = Crawler::new(\n    Config::builder().build()?\n)?;\ncrawler.run().await?;\n```\n\nIn-process embedding, zero IPC.\n\n\u003c/td\u003e\u003c/tr\u003e\n\u003c/table\u003e\n\n---\n\n## 🎨 Examples\n\n### 1. Hunt a SaaS product page with vitals + screenshot\n\n```ts\nimport { crawl } from 'crawlex';\n\nfor await (const ev of crawl({\n  seeds: ['https://stripe.com/pricing'],\n  args: {\n    method: 'render',\n    persona: 'atlas',                 // macOS Apple M1, Retina, en-US\n    screenshot: true,\n    screenshotMode: 'fullpage',\n    storage: 'filesystem',\n    storagePath: './out',\n    waitStrategy: '{\"NetworkIdle\":{\"idle_ms\":1500}}',\n  },\n})) {\n  if (!('event' in ev)) continue;\n  switch (ev.event) {\n    case 'render.completed':\n      console.log(`✅ ${ev.url} | LCP=${ev.data.vitals.largest_contentful_paint_ms}ms | CLS=${ev.data.vitals.cumulative_layout_shift}`);\n      break;\n    case 'artifact.saved':\n      if (ev.data.kind === 'screenshot.full_page')\n        console.log(`📸 → out/${ev.data.path}  (${(ev.data.size/1024).toFixed(0)}kB)`);\n      break;\n    case 'challenge.detected':\n      console.log(`🚧 ${ev.data.vendor} (${ev.data.level}) on ${ev.url}`);\n      break;\n  }\n}\n```\n\n### 2. Crawl an entire domain with proxy rotation + retry policy\n\n```ts\nimport { crawl, defineHooks } from 'crawlex';\n\nconst hooks = defineHooks({\n  // Rate-limit retry: 429/503 → re-enqueue (up to retry_max)\n  async onAfterFirstByte(ctx) {\n    if (ctx.response_status === 429 || ctx.response_status === 503) return 'retry';\n    return 'continue';\n  },\n  // Inject the canonical sitemap.xml for every host we touch\n  async onDiscovery(ctx) {\n    const host = new URL(ctx.url).host;\n    return {\n      decision: 'continue',\n      patch: { capturedUrls: [...ctx.captured_urls, `https://${host}/sitemap.xml`] },\n    };\n  },\n  // Tag the crawl with custom metadata that lands in user_data\n  async onJobStart(ctx) {\n    return {\n      decision: 'continue',\n      patch: { userData: { ...ctx.user_data, run_owner: 'qa-bot' } },\n    };\n  },\n});\n\nfor await (const ev of crawl({\n  seeds: ['https://target.com'],\n  args: {\n    method: 'auto',                   // policy engine picks http vs render\n    maxConcurrentHttp: 8,\n    maxConcurrentRender: 2,\n    maxDepth: 5,\n    crtsh: true,                      // certificate-transparency seeding\n    storage: 'sqlite',\n    storagePath: './crawl.db',\n    queue: 'sqlite',\n    queuePath: './crawl.db',\n    proxies: ['http://user:pass@proxy1:8080', 'http://user:pass@proxy2:8080'],\n    proxyStrategy: 'health-weighted',\n    proxyStickyPerHost: true,\n  },\n  hooks,\n  signal: AbortSignal.timeout(30 * 60_000),\n})) {\n  if (!('event' in ev)) continue;\n  if (ev.event === 'job.failed') console.error(`✗ ${ev.url} — ${ev.data.error}`);\n  if (ev.event === 'run.completed') console.log('done.');\n}\n```\n\n### 3. Embedded library with custom Rust hooks\n\n```rust\nuse crawlex::{Config, Crawler, queue::FetchMethod};\nuse crawlex::hooks::{HookDecision, HookRegistry};\nuse std::sync::atomic::{AtomicUsize, Ordering};\nuse std::sync::Arc;\n\n#[tokio::main]\nasync fn main() -\u003e crawlex::Result\u003c()\u003e {\n    let hooks = HookRegistry::new();\n    let pages_seen = Arc::new(AtomicUsize::new(0));\n\n    // Closure-captured counter — observe without intervening\n    let counter = pages_seen.clone();\n    hooks.on_response_body(move |_ctx| {\n        let c = counter.clone();\n        Box::pin(async move {\n            c.fetch_add(1, Ordering::Relaxed);\n            Ok(HookDecision::Continue)\n        })\n    });\n\n    // Domain-level deny list — short-circuit before fetch\n    hooks.on_before_each_request(|ctx| {\n        let url = ctx.url.clone();\n        Box::pin(async move {\n            if url.path().starts_with(\"/admin/\") { return Ok(HookDecision::Skip); }\n            Ok(HookDecision::Continue)\n        })\n    });\n\n    let config = Config::builder()\n        .max_concurrent_http(16)\n        .build()?;\n\n    let crawler = Crawler::new(config)?.with_hooks(hooks);\n    crawler.seed_with(\n        vec![\"https://target.com\".parse().unwrap()],\n        FetchMethod::HttpSpoof,\n    ).await?;\n    crawler.run().await?;\n\n    println!(\"Crawled {} pages\", pages_seen.load(Ordering::Relaxed));\n    Ok(())\n}\n```\n\n→ Full runnable example: [`examples/embedded_with_hooks.rs`](examples/embedded_with_hooks.rs)\n\n### 4. Pin a specific browser fingerprint from the catalog\n\n```bash\n# Browse 80+ ready-to-use fingerprints\ncrawlex stealth catalog list\ncrawlex stealth catalog list --filter chrome\ncrawlex stealth catalog show chrome-149-linux\n\n# Pin a precise version + OS\ncrawlex pages run --seed https://target.com \\\n  --profile chrome-149-linux\n\n# Era fallback: chromium-122 not captured? falls back to closest era + warns\ncrawlex pages run --seed https://target.com \\\n  --profile chromium-122-linux\n\n# Mobile persona (touch viewport, sec-ch-ua-mobile: ?1)\ncrawlex pages run --seed https://target.com \\\n  --method render --persona pixel\n```\n\n### 5. Inspect what your stealth stack actually emits\n\n```bash\n# Print active IdentityBundle + TLS profile summary\ncrawlex stealth inspect --profile chrome-149-linux\n\n# Verify ALPN/cipher/JA4 against built-in expectations\ncrawlex stealth test\n\n# Compare against tls.peet.ws / ja4db.com via the live oracle\ncrawlex stealth catalog show chrome-149-linux --json\n```\n\n### 6. Large crawl: validate cache, prefetch links, score the frontier\n\n```bash\ncrawlex pages run \\\n  --seed https://docs.example.com \\\n  --method auto \\\n  --queue sqlite --queue-path state/queue.db \\\n  --storage sqlite --storage-path state/crawl.db \\\n  --cache-validate \\\n  --cache-max-age-secs 86400 \\\n  --prefetch \\\n  --best-first \\\n  --score-keyword docs \\\n  --score-keyword api \\\n  --emit ndjson\n```\n\nThis mode is for discovery passes: reuse fresh cache rows, harvest links cheaply, and let higher-value URLs rise in the queue before expensive render passes.\n\n---\n\n## 🎯 Features\n\n\u003ctable\u003e\n\u003ctr\u003e\n\u003ctd width=\"50%\" valign=\"top\"\u003e\n\n### 🥷 Stealth core\n- 🔐 Chrome 149 TLS via BoringSSL fork\n- 🚦 H2 pseudo-header order patch\n- 🎭 29-section JS shim — full leak inventory covered\n- 🤖 Worker scope shim (dedicated / shared / SW)\n- 📦 80+ browser fingerprints from curl-impersonate + ja4db + tls.peet\n- 🌍 5 personas: `tux`, `office`, `gamer`, `atlas`, `pixel`\n- 🎬 Coherent `motion::` profiles (mouse / scroll / dwell)\n- 🕸️ WebRTC scrub (SDP, ICE, getStats — public-interface only)\n\n### 🔍 Discovery\n- 🗺️ Sitemap recursion + robots.txt parsing\n- 🔎 Certificate transparency (crt.sh)\n- 🌐 DNS records + RDAP + Wayback CDX\n- 📜 PWA manifest + service worker probes\n- 📂 `.well-known/*` enumeration\n- 🔬 Tech fingerprinting (Wappalyzer-class)\n- 🔌 JS endpoint extraction from runtime\n- 🛡️ security.txt parser\n- 🧬 Asset-ref classification (JS / CSS / image / API / nav)\n- ⚡ Prefetch mode for fast discovery-only passes\n- 🎯 Best-first URL scoring with keyword bonuses\n- 🔓 TCP port scan (opt-in, network-active)\n\n### 🛡️ Antibot policy engine\n- 🚧 Detect: Cloudflare, DataDome, PerimeterX, Akamai BMP, Imperva, hCaptcha, reCAPTCHA, Turnstile\n- 📊 Vendor telemetry observer (passive — sees outbound calls to known endpoints)\n- 🔄 Policy decisions: keep / drop / retry / scope-demote / proxy-rotate / give-up\n- 🧱 Unified block classifier with attempt-level crawl stats\n- 🪂 Fallback fetch command for last-resort HTML retrieval\n- 🎯 4 captcha solver adapters: in-house reCAPTCHA v3, 2captcha, anticaptcha, VLM\n\n\u003c/td\u003e\n\u003ctd width=\"50%\" valign=\"top\"\u003e\n\n### ⚙️ Pipeline\n- 🎯 Render pool — Chromium auto-fetch + isolated user-data dirs\n- 🔌 External CDP endpoint support for managed/browser-farm Chrome\n- 🌑 Shadow DOM flattening + overlay / consent-popup cleanup\n- 🖥️ GPU policy: compatibility mode or stealth-friendly GPU surfaces\n- 🔁 Persistent queue: in-memory / SQLite / Redis backends\n- 💾 Storage: filesystem / SQLite / memory — opt-in per concern (artifact, state, challenge, telemetry, intel)\n- 🧠 Smart cache validation: `ETag`, `Last-Modified`, `\u003chead\u003e` fingerprint\n- 🔄 Proxy rotator — health checks + sticky sessions + per-host affinity\n- 📊 Web Vitals + per-fetch network breakdown (DNS / TCP / TLS / TTFB / download)\n- 🎬 ScriptSpec runner — declarative `Plan` execution with assertions\n- 🔧 Frontier with dedupe + rate-limit + retry policies\n- 📐 Wait strategies: `Load`, `DOMContentLoaded`, `NetworkIdle`, `Selector`, `Fixed`\n\n### 📡 Observability\n- 📜 NDJSON event stream — versioned envelope (`v: 1`)\n- 🎬 21 event kinds covering full lifecycle\n- 🔬 Embedded `WebVitals` summary on `render.completed`\n- ⏱️ Per-request timings on `fetch.completed` (ALPN, cipher, TLS version)\n- 🧾 `crawl.attempted` / `crawl.resolved` telemetry for HTTP → render → fallback ladders\n- 📸 Artifact descriptors with on-disk path on the wire\n- 🪝 Hooks: 12 lifecycle points × 3 languages (Rust / JS / Lua)\n- 📊 Prometheus metrics endpoint\n\n### 🔌 Integrations\n- 📦 npm + crates.io + GitHub Releases\n- 🦀 Rust library — embed `Crawler` directly\n- 📘 TypeScript types — strict, full envelope coverage\n- 🔌 SDK `crawl()` async iterator\n- 🧩 SDK `defineHooks()` bridge for JS/TS lifecycle hooks\n- 📚 docsify docs site (GitHub Pages)\n- 🧪 390+ lib tests, 27 fpjs compliance, TLS catalog roundtrip suite\n- 🔐 Optional Lua hooks (`mlua`)\n- 🪶 Two binaries: `crawlex` (full) + `crawlex-mini` (HTTP-only, no Chromium)\n\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003c/table\u003e\n\n---\n\n## 📡 NDJSON event stream\n\nEvery run emits one JSON envelope per line on stdout. Versioned, stable, 21 kinds:\n\n```jsonl\n{\"v\":1,\"event\":\"run.started\",\"ts\":\"2026-04-26T19:42:00.000Z\",\"run_id\":42,\"data\":{\"policy_profile\":\"strict\",\"max_concurrent_http\":8,\"max_concurrent_render\":2}}\n{\"v\":1,\"event\":\"job.started\",\"run_id\":42,\"url\":\"https://target.com/\",\"data\":{\"job_id\":\"j_001\",\"method\":\"render\",\"depth\":0,\"priority\":0,\"attempts\":0}}\n{\"v\":1,\"event\":\"fetch.completed\",\"run_id\":42,\"url\":\"https://target.com/\",\"data\":{\"final_url\":\"https://target.com/\",\"status\":200,\"bytes\":98234,\"body_truncated\":false,\"dns_ms\":12,\"tcp_connect_ms\":18,\"tls_handshake_ms\":24,\"ttfb_ms\":142,\"download_ms\":83,\"total_ms\":280,\"alpn\":\"h2\",\"tls_version\":\"TLSv1.3\",\"cipher\":\"TLS_AES_128_GCM_SHA256\"}}\n{\"v\":1,\"event\":\"crawl.attempted\",\"run_id\":42,\"url\":\"https://target.com/\",\"data\":{\"crawl_id\":42,\"attempt_index\":1,\"engine\":\"http_spoof\",\"status\":403,\"blocked\":true,\"block_reason\":\"Cloudflare challenge form\"}}\n{\"v\":1,\"event\":\"render.completed\",\"run_id\":42,\"session_id\":\"sess_abc\",\"url\":\"https://target.com/\",\"data\":{\"final_url\":\"https://target.com/\",\"status\":200,\"manifest\":true,\"service_workers\":1,\"is_spa\":true,\"vitals\":{\"ttfb_ms\":142,\"first_contentful_paint_ms\":380.5,\"largest_contentful_paint_ms\":920.1,\"cumulative_layout_shift\":0.03,\"total_blocking_time_ms\":50.0,\"dom_nodes\":1842,\"js_heap_used_bytes\":12345678,\"resource_count\":45,\"total_transfer_bytes\":982341}}}\n{\"v\":1,\"event\":\"artifact.saved\",\"run_id\":42,\"url\":\"https://target.com/\",\"data\":{\"kind\":\"screenshot.full_page\",\"mime\":\"image/png\",\"size\":1234567,\"sha256\":\"a1b2c3...\",\"path\":\"artifacts/sess_abc/1714123456_screenshot_full_page_a1b2c3d4.png\"}}\n{\"v\":1,\"event\":\"challenge.detected\",\"run_id\":42,\"url\":\"https://protected.com/\",\"data\":{\"vendor\":\"cloudflare_turnstile\",\"level\":\"widget_present\"}}\n{\"v\":1,\"event\":\"decision.made\",\"run_id\":42,\"url\":\"https://protected.com/\",\"why\":\"render:js-challenge\",\"data\":{\"decision\":\"retry\",\"reason\":{\"code\":\"render:js-challenge\"}}}\n{\"v\":1,\"event\":\"crawl.resolved\",\"run_id\":42,\"url\":\"https://target.com/\",\"data\":{\"crawl_id\":42,\"attempts_count\":2,\"fallback_fetch_used\":false,\"resolved_by\":\"render\",\"success\":true}}\n{\"v\":1,\"event\":\"run.completed\",\"run_id\":42}\n```\n\n**Discriminator key:** `event` (snake_case) — TypeScript narrows via `switch (ev.event) { … }`. Fallback for malformed lines: `{ kind: 'raw', line }` so consumers can log/recover.\n\n---\n\n## 🪝 Hooks — 12 lifecycle points × 3 languages\n\n```\nbefore_each_request → after_dns → after_tls → after_first_byte → on_response_body\n   → after_load → after_idle → on_discovery → on_job_start → on_job_end\n   → on_error → on_robots_decision\n```\n\n| Language | API | Best for |\n|---|---|---|\n| **Rust** | `hooks.on_after_first_byte(closure)` — full `\u0026mut HookContext` access | Embedded library, latency-critical paths |\n| **JS / TS** | `defineHooks({...})` via SDK — IPC bridge, async closures | Production crawls, business logic |\n| **Lua** | `--hook-script foo.lua` — page-driving helpers (`page_click`, `page_eval`) | Ad-hoc scripts, no build step |\n\n**All three modes return the same decision:** `continue` / `skip` / `retry` / `abort`. Hooks can mutate `ctx.captured_urls`, inject extra URLs, write to `user_data` to communicate with downstream hooks, or override `robots_allowed`.\n\n---\n\n## 🎭 Personas — coherent identity bundles\n\nEach persona is a complete bundle — UA + Sec-CH-UA + screen + viewport + DPR + GPU + fonts + media-device counts + TLS profile + motion timings — so every signal **matches**. No mismatched UA + WebGL combo gives you away.\n\n| Codename | OS | GPU | Locale | Form factor |\n|---|---|---|---|---|\n| 🐧 `tux` | Linux | Intel UHD 630 | en-US | desktop 1920×1080 |\n| 🏢 `office` | Windows 10 | Intel UHD 620 | en-US | laptop 1920×1080 (DPR 1.25) |\n| 🎮 `gamer` | Windows 10 | NVIDIA GTX 1060 | pt-BR | desktop 1920×1080 |\n| 🍎 `atlas` | macOS | Apple M1 | en-US | retina 1440×900 (DPR 2.0) |\n| 📱 `pixel` | Android 14 | Adreno 640 | pt-BR | **mobile** 412×823 (DPR 2.625) |\n\n```bash\ncrawlex pages run --seed https://target.com --persona atlas    # macOS\ncrawlex pages run --seed https://target.com --persona pixel    # mobile\n```\n\n---\n\n## 🏗️ Architecture\n\n```mermaid\nflowchart LR\n  S[Seeds] --\u003e Q[Frontier\u003cbr/\u003e+ dedupe + rate-limit]\n  Q --\u003e P[Policy Engine]\n  P --\u003e C[Cache Validator\u003cbr/\u003eETag + Last-Modified + head fingerprint]\n  C --\u003e|fresh| ST[Storage\u003cbr/\u003e5 traits]\n  C --\u003e|stale| F[ImpersonateClient\u003cbr/\u003eBoringSSL + h2 patched]\n  P --\u003e|http| F\n  P --\u003e|render| R[RenderPool\u003cbr/\u003eChromium + stealth shim]\n  F --\u003e X[Extractor\u003cbr/\u003e+ Asset Refs]\n  R --\u003e X\n  X --\u003e D[Discovery\u003cbr/\u003ePipeline]\n  X --\u003e ST\n  D --\u003e Q\n  P --\u003e EV[NDJSON Events\u003cbr/\u003e21 kinds]\n  R --\u003e H1[Rust Hooks]\n  R --\u003e H2[JS Bridge]\n  R --\u003e H3[Lua Scripts]\n```\n\n**Module map:**\n- `impersonate/` — TLS catalog + BoringSSL connector + ALPS + GREASE\n- `render/` — Chromium pool + 29-section stealth shim + motion engine + ScriptSpec runner\n- `discovery/` — 17-stage pipeline (DNS, RDAP, sitemap, robots, crtsh, wayback, well-known, …)\n- `policy/` — pure engine: `decide_pre_fetch`, `decide_post_fetch`, `decide_post_error`, `decide_post_challenge`\n- `antibot/` — vendor classifier + 4 captcha solver adapters\n- `cache_validator/` — cache freshness by HTTP validators and head fingerprints\n- `storage/` — 5 concern-oriented traits (artifact / state / challenge / telemetry / intel)\n- `events/` — NDJSON envelope + sink (stdout / null / memory)\n- `hooks/` — registry + JS bridge + Lua host\n\n---\n\n## 🛠️ Tech stack\n\n| Layer | Implementation |\n|---|---|\n| TLS | `boring-sys` — BoringSSL fork with ALPS / permute_extensions / X25519MLKEM768 |\n| HTTP/2 | Vendored `h2` crate with pseudo-header order patch (`vendor/h2`) |\n| CDP | chromiumoxide-derived, embedded behind `cdp-backend` feature |\n| Async | tokio multi-thread |\n| Storage | rusqlite (SQLite WAL), DashMap (memory), filesystem layout |\n| Discovery | hickory-resolver (DNS), reqwest (RDAP), texting_robots (robots.txt) |\n| Lua | mlua 0.10 (optional, `lua-hooks` feature) |\n| SDK | Node 20+, CommonJS, zero runtime deps |\n\n**Two binaries** ship from one source tree:\n- `crawlex` — **full** build with HTTP impersonation + Chromium rendering + stealth shim + persistent queue\n- `crawlex-mini` — **HTTP-only** worker, no Chromium dependency, same CLI surface (browser-only flags return `Error::RenderDisabled`)\n\n---\n\n## 📊 Versus the alternatives\n\n| | crawlex | Playwright stealth | Puppeteer + plugins | curl-impersonate |\n|---|:-:|:-:|:-:|:-:|\n| TLS-perfect ClientHello | ✅ BoringSSL | ⚠️ relies on Chromium | ⚠️ relies on Chromium | ✅ |\n| H2 pseudo-header order | ✅ patched h2 | ⚠️ Chromium default | ⚠️ Chromium default | ❌ |\n| 29-section JS leak coverage | ✅ | ⚠️ partial | ⚠️ via plugins | ❌ no JS |\n| Worker-scope stealth | ✅ auto-attach | ⚠️ manual | ⚠️ manual | ❌ |\n| HTTP-only path (no browser) | ✅ `crawlex-mini` | ❌ | ❌ | ✅ |\n| Persistent queue + resume | ✅ SQLite/Redis | ❌ external | ❌ external | ❌ |\n| Discovery pipeline | ✅ 17 stages | ❌ | ❌ | ❌ |\n| Streaming NDJSON events | ✅ versioned | ❌ | ❌ | ❌ |\n| Rust embedding | ✅ | ❌ | ❌ | ⚠️ libcurl |\n| Single binary | ✅ | ❌ | ❌ | ✅ |\n\n---\n\n## 📚 Documentation\n\n- 🌐 **[forattini-dev.github.io/crawlex](https://forattini-dev.github.io/crawlex/)** — full docsify hub\n- 🏗️ [Architecture overview](https://forattini-dev.github.io/crawlex/#/architecture/00-overview)\n- 📖 [CLI reference](https://forattini-dev.github.io/crawlex/#/reference/cli)\n- ⚙️ [Config JSON schema](https://forattini-dev.github.io/crawlex/#/reference/config)\n- 📡 [NDJSON event envelope](https://forattini-dev.github.io/crawlex/#/reference/events)\n- 🎯 [Guides](https://forattini-dev.github.io/crawlex/#/guides/) — HTTP-only, rendered sessions, persistent runs\n- 🥷 [Stealth \u0026 proxies](https://forattini-dev.github.io/crawlex/#/features/proxy-stealth)\n\n---\n\n## 🤝 Contributing\n\n```bash\ngit clone https://github.com/forattini-dev/crawlex\ncd crawlex\n\n# Unit tests + offline shim compliance\ncargo test --lib                    # 390+ tests\ncargo test --test fpjs_compliance   # 27 cases\ncargo test --test tls_catalog_coverage --test tls_catalog_roundtrip\n\n# SDK tests\npnpm test                           # 21 node:test cases\n\n# Quality gates\ncargo fmt --check\ncargo clippy --all-features -- -D warnings\ncargo publish --dry-run --locked\n\n# Live integration tests (require system Chromium)\ncargo test --all-features --test stealth_runtime_live -- --ignored\ncargo test --all-features --test worker_shim_live -- --ignored\n```\n\nCI runs all of the above on every PR. Contributions welcome — issues, feature requests, and PRs all reviewed.\n\n---\n\n## 📄 License\n\nDual-licensed under **MIT OR Apache-2.0** at your option. SPDX: `MIT OR Apache-2.0`.\n\nThird-party attribution: see [`NOTICE`](NOTICE).\n\n---\n\n\u003cdiv align=\"center\"\u003e\n\n\u003csub\u003e**Built for crawlers who refuse to be detected.**\u003c/sub\u003e\n\n[Docs](https://forattini-dev.github.io/crawlex/) · [Releases](https://github.com/forattini-dev/crawlex/releases) · [Issues](https://github.com/forattini-dev/crawlex/issues) · [Discussions](https://github.com/forattini-dev/crawlex/discussions)\n\n\u003c/div\u003e\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fforattini-dev%2Fcrawlex","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fforattini-dev%2Fcrawlex","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fforattini-dev%2Fcrawlex/lists"}