# 🕸️ crawlex

### **The stealth crawler that actually looks like Chrome.**

TLS, HTTP/2, JS fingerprint: every byte indistinguishable from real Chrome 149.

Rust core • Node SDK • Lua hooks • cross-platform binaries.

[![CI](https://github.com/forattini-dev/crawlex/actions/workflows/ci.yml/badge.svg)](https://github.com/forattini-dev/crawlex/actions/workflows/ci.yml)
[![crates.io](https://img.shields.io/crates/v/crawlex.svg?logo=rust)](https://crates.io/crates/crawlex)
[![npm](https://img.shields.io/npm/v/crawlex.svg?logo=npm)](https://www.npmjs.com/package/crawlex)
[![docs](https://img.shields.io/badge/docs-docsify-success.svg)](https://forattini-dev.github.io/crawlex/)
[![downloads](https://img.shields.io/crates/d/crawlex.svg)](https://crates.io/crates/crawlex)
[![license](https://img.shields.io/badge/license-MIT%20%7C%20Apache--2.0-blue.svg)](#license)

```bash
pnpm add -g crawlex && crawlex pages run --seed https://example.com --method render
```

[**Quickstart**](#-quickstart) · [**Features**](#-features) · [**Examples**](#-examples) · [**Docs**](https://forattini-dev.github.io/crawlex/) · [**Why crawlex**](#-why-crawlex)

---

## ⚡ Why crawlex

Standard crawlers fail on the first Cloudflare wall. `crawlex` arrives the way **real Chrome** arrives: every fingerprint surface is identical, not approximated.

| Layer | What we match (exactly, not approximately) |
|---|---|
| 🔐 TLS ClientHello | Extension order, ALPS, GREASE values, permute_extensions, X25519MLKEM768, signature algorithms; verified against tls.peet.ws and ja4db.com oracles |
| 🚦 HTTP/2 frame | Pseudo-header order `:method :authority :scheme :path`, SETTINGS frame parameters, WINDOW_UPDATE pattern; passes Akamai BMP signature checks |
| 🎭 JS fingerprint | 29-section stealth shim: navigator, chrome.*, permissions, plugins, screen, timezone, battery, WebGL (vendor / params / extensions), canvas (zero-preserving noise), AudioContext (FFT + offline render), Function.prototype.toString proxy, WebGPU, performance.memory, sensors, iframe, requestAnimationFrame throttle, performance.now() 100µs grain, mediaDevices, fonts, WebRTC SDP/ICE/getStats scrub |
| 🤖 Behavior | Mouse jitter, scroll cadence, dwell time, idle drift; coherent `motion::` profiles per persona |
| 📦 Catalog | 30 Chrome stable × 30 Chromium × 20 Firefox × Edge × Safari fingerprints. Era-fallback resolution: ask for `chrome-149-linux`, get the closest captured profile |
| 🛠️ Worker scope | Same shim auto-attached to dedicated / shared / service workers via CDP `Target.setAutoAttach` (Camoufox port) |

→ Validated against [BrowserScan](https://browserscan.net), [CreepJS](https://abrahamjuliot.github.io/creepjs/), [Sannysoft](https://bot.sannysoft.com/), [tls.peet.ws](https://tls.peet.ws), [ja4db.com](https://ja4db.com).
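
The era-fallback idea from the Catalog row can be pictured as a nearest-version lookup. The following is a minimal sketch, not crawlex's actual resolver: the `CatalogEntry` shape, the `resolveProfile` function, and the tie-breaking rule are assumptions made for illustration. Only the `browser-version-os` naming pattern comes from the profile names used in this README.

```ts
// Hypothetical catalog entry; crawlex's real catalog format is not shown here.
interface CatalogEntry {
  browser: string; // e.g. "chrome"
  version: number; // major version, e.g. 149
  os: string;      // e.g. "linux"
}

interface Resolution {
  entry: CatalogEntry;
  exact: boolean; // false means a fallback happened and the caller may want to warn
}

// Parse names shaped like "chrome-149-linux".
function parseProfileName(name: string): CatalogEntry | null {
  const m = /^([a-z]+)-(\d+)-([a-z]+)$/.exec(name);
  if (!m) return null;
  const [, browser = '', version = '', os = ''] = m;
  return { browser, version: Number(version), os };
}

// Pick the captured profile with the same browser and OS whose version is closest.
function resolveProfile(name: string, catalog: CatalogEntry[]): Resolution | null {
  const want = parseProfileName(name);
  if (!want) return null;
  let best: CatalogEntry | null = null;
  for (const e of catalog) {
    if (e.browser !== want.browser || e.os !== want.os) continue;
    if (best === null) { best = e; continue; }
    const d = Math.abs(e.version - want.version);
    const bd = Math.abs(best.version - want.version);
    // Closest version wins; on a tie, prefer the older capture (an arbitrary choice).
    if (d < bd || (d === bd && e.version < best.version)) best = e;
  }
  return best ? { entry: best, exact: best.version === want.version } : null;
}
```

A caller would check `exact` on the result and log a warning when it is `false`, which mirrors the "falls back to closest era + warns" behavior described for the CLI.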

---

## 🚀 Install

```bash
# npm: bundled binary download via postinstall
pnpm add -g crawlex

# Rust: from source
cargo install crawlex
cargo install crawlex

# Direct binary (linux x86_64/arm64, macOS x86_64/arm64, windows x86_64)
# https://github.com/forattini-dev/crawlex/releases/latest
```

> ⚠️ **Production crawls run locally**, never in CI. Datacenter IPs (GitHub Actions, AWS, Azure) are flagged instantly by every modern WAF.

---

## 🆕 Last 24h highlights

- `1.0.4` release line is live across npm/crates/GitHub Releases, with docsify publishing through GitHub Pages.
- JS/TS hooks now run through the SDK bridge, so `defineHooks()` can drive the same lifecycle decisions as embedded Rust hooks.
- NDJSON events now carry richer artifacts, Web Vitals, per-fetch timings, crawl-attempt telemetry and crawl-resolution summaries.
- `crawlex-mini` was hardened: CDP-only paths are gated cleanly in no-browser builds.
- Large-crawl efficiency improved: cache validation, prefetch discovery mode and best-first URL scoring are now available from CLI/config.
- Render fallback expanded: external CDP connection, GPU posture control, Shadow DOM flattening, overlay cleanup and last-resort fallback fetch are configurable.

---

## 🏃 Quickstart

```bash
# Stealth render with persona, sitemap discovery, NDJSON event stream
crawlex pages run \
--seed https://target.com \
--method render \
--persona atlas \
--max-depth 3 \
--screenshot \
--emit ndjson > events.ndjson

# Live tail what just happened
jq -c 'select(.event == "fetch.completed" or .event == "render.completed")' events.ndjson
```

Three integration paths, your pick:

**CLI**: one-shot crawls, scripted pipelines.

```bash
crawlex pages run \
--seed https://... \
--method render \
--persona pixel \
--emit ndjson
```

**Node SDK**: production services with hook logic.

```ts
import { crawl, defineHooks } from 'crawlex';

for await (const ev of crawl({
  seeds: ['https://...'],
  args: { method: 'render' },
})) { /* handle each event */ }
```

**Embedded Rust**: in-process embedding, zero IPC.

```rust
use crawlex::{Crawler, Config};
let crawler = Crawler::new(
    Config::builder().build()?
)?;
crawler.run().await?;
```

---

## 🎨 Examples

### 1. Hunt a SaaS product page with vitals + screenshot

```ts
import { crawl } from 'crawlex';

for await (const ev of crawl({
  seeds: ['https://stripe.com/pricing'],
  args: {
    method: 'render',
    persona: 'atlas', // macOS Apple M1, Retina, en-US
    screenshot: true,
    screenshotMode: 'fullpage',
    storage: 'filesystem',
    storagePath: './out',
    waitStrategy: '{"NetworkIdle":{"idle_ms":1500}}',
  },
})) {
  if (!('event' in ev)) continue;
  switch (ev.event) {
    case 'render.completed':
      console.log(`✅ ${ev.url} | LCP=${ev.data.vitals.largest_contentful_paint_ms}ms | CLS=${ev.data.vitals.cumulative_layout_shift}`);
      break;
    case 'artifact.saved':
      if (ev.data.kind === 'screenshot.full_page')
        console.log(`📸 → out/${ev.data.path} (${(ev.data.size/1024).toFixed(0)}kB)`);
      break;
    case 'challenge.detected':
      console.log(`🚧 ${ev.data.vendor} (${ev.data.level}) on ${ev.url}`);
      break;
  }
}
```

### 2. Crawl an entire domain with proxy rotation + retry policy

```ts
import { crawl, defineHooks } from 'crawlex';

const hooks = defineHooks({
  // Rate-limit retry: 429/503 → re-enqueue (up to retry_max)
  async onAfterFirstByte(ctx) {
    if (ctx.response_status === 429 || ctx.response_status === 503) return 'retry';
    return 'continue';
  },
  // Inject the canonical sitemap.xml for every host we touch
  async onDiscovery(ctx) {
    const host = new URL(ctx.url).host;
    return {
      decision: 'continue',
      patch: { capturedUrls: [...ctx.captured_urls, `https://${host}/sitemap.xml`] },
    };
  },
  // Tag the crawl with custom metadata that lands in user_data
  async onJobStart(ctx) {
    return {
      decision: 'continue',
      patch: { userData: { ...ctx.user_data, run_owner: 'qa-bot' } },
    };
  },
});

for await (const ev of crawl({
  seeds: ['https://target.com'],
  args: {
    method: 'auto', // policy engine picks http vs render
    maxConcurrentHttp: 8,
    maxConcurrentRender: 2,
    maxDepth: 5,
    crtsh: true, // certificate-transparency seeding
    storage: 'sqlite',
    storagePath: './crawl.db',
    queue: 'sqlite',
    queuePath: './crawl.db',
    proxies: ['http://user:pass@proxy1:8080', 'http://user:pass@proxy2:8080'],
    proxyStrategy: 'health-weighted',
    proxyStickyPerHost: true,
  },
  hooks,
  signal: AbortSignal.timeout(30 * 60_000),
})) {
  if (!('event' in ev)) continue;
  if (ev.event === 'job.failed') console.error(`✗ ${ev.url} - ${ev.data.error}`);
  if (ev.event === 'run.completed') console.log('done.');
}
```
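
The `proxyStrategy: 'health-weighted'` and `proxyStickyPerHost: true` options above suggest a rotation scheme along the following lines. This is a sketch under stated assumptions, not crawlex's rotator: the `ProxyEntry` shape, the numeric `health` score, and the `StickyRotator` class are all hypothetical, and the real health model is not documented in this README.

```ts
// Hypothetical proxy record; `health` is an assumed non-negative score, higher is better.
interface ProxyEntry {
  url: string;
  health: number;
}

// Weighted random pick: selection probability is proportional to health.
// `rand` is injectable so the choice can be made deterministic in tests.
function pickWeighted(proxies: ProxyEntry[], rand: () => number = Math.random): ProxyEntry {
  if (proxies.length === 0) throw new Error('no proxies configured');
  const total = proxies.reduce((sum, p) => sum + p.health, 0);
  let threshold = rand() * total;
  for (const p of proxies) {
    threshold -= p.health;
    if (threshold < 0) return p;
  }
  // Floating-point edge case or all-zero health: fall back to the last proxy.
  return proxies[proxies.length - 1]!;
}

// Sticky-per-host: the first pick for a host is remembered and reused.
class StickyRotator {
  private readonly byHost = new Map<string, ProxyEntry>();
  private readonly proxies: ProxyEntry[];
  private readonly rand: () => number;

  constructor(proxies: ProxyEntry[], rand: () => number = Math.random) {
    this.proxies = proxies;
    this.rand = rand;
  }

  forHost(host: string): ProxyEntry {
    let proxy = this.byHost.get(host);
    if (!proxy) {
      proxy = pickWeighted(this.proxies, this.rand);
      this.byHost.set(host, proxy);
    }
    return proxy;
  }
}
```

Keeping one proxy per host means a target sees a stable exit IP for the whole session, while the weighting steers new hosts toward proxies that have been behaving well.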

### 3. Embedded library with custom Rust hooks

```rust
use crawlex::{Config, Crawler, queue::FetchMethod};
use crawlex::hooks::{HookDecision, HookRegistry};
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

#[tokio::main]
async fn main() -> crawlex::Result<()> {
    let hooks = HookRegistry::new();
    let pages_seen = Arc::new(AtomicUsize::new(0));

    // Closure-captured counter: observe without intervening
    let counter = pages_seen.clone();
    hooks.on_response_body(move |_ctx| {
        let c = counter.clone();
        Box::pin(async move {
            c.fetch_add(1, Ordering::Relaxed);
            Ok(HookDecision::Continue)
        })
    });

    // Domain-level deny list: short-circuit before fetch
    hooks.on_before_each_request(|ctx| {
        let url = ctx.url.clone();
        Box::pin(async move {
            if url.path().starts_with("/admin/") { return Ok(HookDecision::Skip); }
            Ok(HookDecision::Continue)
        })
    });

    let config = Config::builder()
        .max_concurrent_http(16)
        .build()?;

    let crawler = Crawler::new(config)?.with_hooks(hooks);
    crawler.seed_with(
        vec!["https://target.com".parse().unwrap()],
        FetchMethod::HttpSpoof,
    ).await?;
    crawler.run().await?;

    println!("Crawled {} pages", pages_seen.load(Ordering::Relaxed));
    Ok(())
}
```

→ Full runnable example: [`examples/embedded_with_hooks.rs`](examples/embedded_with_hooks.rs)

### 4. Pin a specific browser fingerprint from the catalog

```bash
# Browse 80+ ready-to-use fingerprints
crawlex stealth catalog list
crawlex stealth catalog list --filter chrome
crawlex stealth catalog show chrome-149-linux

# Pin a precise version + OS
crawlex pages run --seed https://target.com \
--profile chrome-149-linux

# Era fallback: chromium-122 not captured? falls back to closest era + warns
crawlex pages run --seed https://target.com \
--profile chromium-122-linux

# Mobile persona (touch viewport, sec-ch-ua-mobile: ?1)
crawlex pages run --seed https://target.com \
--method render --persona pixel
```

### 5. Inspect what your stealth stack actually emits

```bash
# Print active IdentityBundle + TLS profile summary
crawlex stealth inspect --profile chrome-149-linux

# Verify ALPN/cipher/JA4 against built-in expectations
crawlex stealth test

# Compare against tls.peet.ws / ja4db.com via the live oracle
crawlex stealth catalog show chrome-149-linux --json
```

### 6. Large crawl: validate cache, prefetch links, score the frontier

```bash
crawlex pages run \
--seed https://docs.example.com \
--method auto \
--queue sqlite --queue-path state/queue.db \
--storage sqlite --storage-path state/crawl.db \
--cache-validate \
--cache-max-age-secs 86400 \
--prefetch \
--best-first \
--score-keyword docs \
--score-keyword api \
--emit ndjson
```

This mode is for discovery passes: reuse fresh cache rows, harvest links cheaply, and let higher-value URLs rise in the queue before expensive render passes.
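
To make "higher-value URLs rise in the queue" concrete, here is a small sketch of keyword-based best-first ordering. It is an illustration only: the `FrontierItem` shape, the `scoreUrl` and `orderFrontier` names, and the weights (10 per keyword, 1 per depth level) are assumptions, not crawlex's scoring formula.

```ts
// Hypothetical frontier item; field names are illustrative, not crawlex's schema.
interface FrontierItem {
  url: string;
  depth: number;
}

// Score = a fixed bonus per matching keyword minus a small depth penalty.
// The weights (10 and 1) are assumptions for this sketch.
function scoreUrl(item: FrontierItem, keywords: string[]): number {
  const haystack = item.url.toLowerCase();
  let score = 0;
  for (const kw of keywords) {
    if (haystack.includes(kw.toLowerCase())) score += 10;
  }
  return score - item.depth;
}

// Best-first ordering: highest score first. Ties keep their original order
// because Array.prototype.sort is stable.
function orderFrontier(items: FrontierItem[], keywords: string[]): FrontierItem[] {
  return [...items].sort((a, b) => scoreUrl(b, keywords) - scoreUrl(a, keywords));
}
```

With keywords `docs` and `api`, a URL such as `https://docs.example.com/api/v1` matches both and moves ahead of pages that match only one, even if it sits one level deeper.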

---

## 🎯 Features

### 🥷 Stealth core
- 🔐 Chrome 149 TLS via BoringSSL fork
- 🚦 H2 pseudo-header order patch
- 🎭 29-section JS shim: full leak inventory covered
- 🤖 Worker scope shim (dedicated / shared / SW)
- 📦 80+ browser fingerprints from curl-impersonate + ja4db + tls.peet
- 🌍 5 personas: `tux`, `office`, `gamer`, `atlas`, `pixel`
- 🎬 Coherent `motion::` profiles (mouse / scroll / dwell)
- 🕸️ WebRTC scrub (SDP, ICE, getStats; public-interface only)

### 🔍 Discovery
- 🗺️ Sitemap recursion + robots.txt parsing
- 🔎 Certificate transparency (crt.sh)
- 🌐 DNS records + RDAP + Wayback CDX
- 📜 PWA manifest + service worker probes
- 📂 `.well-known/*` enumeration
- 🔬 Tech fingerprinting (Wappalyzer-class)
- 🔌 JS endpoint extraction from runtime
- 🛡️ security.txt parser
- 🧬 Asset-ref classification (JS / CSS / image / API / nav)
- ⚡ Prefetch mode for fast discovery-only passes
- 🎯 Best-first URL scoring with keyword bonuses
- 🔓 TCP port scan (opt-in, network-active)

### 🛡️ Antibot policy engine
- 🚧 Detect: Cloudflare, DataDome, PerimeterX, Akamai BMP, Imperva, hCaptcha, reCAPTCHA, Turnstile
- 📊 Vendor telemetry observer (passive: sees outbound calls to known endpoints)
- 🔄 Policy decisions: keep / drop / retry / scope-demote / proxy-rotate / give-up
- 🧱 Unified block classifier with attempt-level crawl stats
- 🪂 Fallback fetch command for last-resort HTML retrieval
- 🎯 4 captcha solver adapters: in-house reCAPTCHA v3, 2captcha, anticaptcha, VLM

### ⚙️ Pipeline
- 🎯 Render pool: Chromium auto-fetch + isolated user-data dirs
- 🔌 External CDP endpoint support for managed/browser-farm Chrome
- 🌑 Shadow DOM flattening + overlay / consent-popup cleanup
- 🖥️ GPU policy: compatibility mode or stealth-friendly GPU surfaces
- 🔁 Persistent queue: in-memory / SQLite / Redis backends
- 💾 Storage: filesystem / SQLite / memory, opt-in per concern (artifact, state, challenge, telemetry, intel)
- 🧠 Smart cache validation: `ETag`, `Last-Modified`, `<head>` fingerprint
- 🔄 Proxy rotator: health checks + sticky sessions + per-host affinity
- 📊 Web Vitals + per-fetch network breakdown (DNS / TCP / TLS / TTFB / download)
- 🎬 ScriptSpec runner: declarative `Plan` execution with assertions
- 🔧 Frontier with dedupe + rate-limit + retry policies
- 📐 Wait strategies: `Load`, `DOMContentLoaded`, `NetworkIdle`, `Selector`, `Fixed`

### 📡 Observability
- 📜 NDJSON event stream: versioned envelope (`v: 1`)
- 🎬 21 event kinds covering full lifecycle
- 🔬 Embedded `WebVitals` summary on `render.completed`
- ⏱️ Per-request timings on `fetch.completed` (ALPN, cipher, TLS version)
- 🧾 `crawl.attempted` / `crawl.resolved` telemetry for HTTP → render → fallback ladders
- 📸 Artifact descriptors with on-disk path on the wire
- 🪝 Hooks: 12 lifecycle points × 3 languages (Rust / JS / Lua)
- 📊 Prometheus metrics endpoint

### 🔌 Integrations
- 📦 npm + crates.io + GitHub Releases
- 🦀 Rust library: embed `Crawler` directly
- 📘 TypeScript types: strict, full envelope coverage
- 🔌 SDK `crawl()` async iterator
- 🧩 SDK `defineHooks()` bridge for JS/TS lifecycle hooks
- 📚 docsify docs site (GitHub Pages)
- 🧪 390+ lib tests, 27 fpjs compliance, TLS catalog roundtrip suite
- 🔁 Optional Lua hooks (`mlua`)
- 🪶 Two binaries: `crawlex` (full) + `crawlex-mini` (HTTP-only, no Chromium)

---

## 📡 NDJSON event stream

Every run emits one JSON envelope per line on stdout. Versioned, stable, 21 kinds:

```jsonl
{"v":1,"event":"run.started","ts":"2026-04-26T19:42:00.000Z","run_id":42,"data":{"policy_profile":"strict","max_concurrent_http":8,"max_concurrent_render":2}}
{"v":1,"event":"job.started","run_id":42,"url":"https://target.com/","data":{"job_id":"j_001","method":"render","depth":0,"priority":0,"attempts":0}}
{"v":1,"event":"fetch.completed","run_id":42,"url":"https://target.com/","data":{"final_url":"https://target.com/","status":200,"bytes":98234,"body_truncated":false,"dns_ms":12,"tcp_connect_ms":18,"tls_handshake_ms":24,"ttfb_ms":142,"download_ms":83,"total_ms":280,"alpn":"h2","tls_version":"TLSv1.3","cipher":"TLS_AES_128_GCM_SHA256"}}
{"v":1,"event":"crawl.attempted","run_id":42,"url":"https://target.com/","data":{"crawl_id":42,"attempt_index":1,"engine":"http_spoof","status":403,"blocked":true,"block_reason":"Cloudflare challenge form"}}
{"v":1,"event":"render.completed","run_id":42,"session_id":"sess_abc","url":"https://target.com/","data":{"final_url":"https://target.com/","status":200,"manifest":true,"service_workers":1,"is_spa":true,"vitals":{"ttfb_ms":142,"first_contentful_paint_ms":380.5,"largest_contentful_paint_ms":920.1,"cumulative_layout_shift":0.03,"total_blocking_time_ms":50.0,"dom_nodes":1842,"js_heap_used_bytes":12345678,"resource_count":45,"total_transfer_bytes":982341}}}
{"v":1,"event":"artifact.saved","run_id":42,"url":"https://target.com/","data":{"kind":"screenshot.full_page","mime":"image/png","size":1234567,"sha256":"a1b2c3...","path":"artifacts/sess_abc/1714123456_screenshot_full_page_a1b2c3d4.png"}}
{"v":1,"event":"challenge.detected","run_id":42,"url":"https://protected.com/","data":{"vendor":"cloudflare_turnstile","level":"widget_present"}}
{"v":1,"event":"decision.made","run_id":42,"url":"https://protected.com/","why":"render:js-challenge","data":{"decision":"retry","reason":{"code":"render:js-challenge"}}}
{"v":1,"event":"crawl.resolved","run_id":42,"url":"https://target.com/","data":{"crawl_id":42,"attempts_count":2,"fallback_fetch_used":false,"resolved_by":"render","success":true}}
{"v":1,"event":"run.completed","run_id":42}
```

**Discriminator key:** `event` (snake_case). TypeScript narrows via `switch (ev.event) { … }`. Fallback for malformed lines: `{ kind: 'raw', line }` so consumers can log/recover.
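
For consumers that read the stream without the SDK, the envelope-or-raw split can be reproduced in a few lines. This is a minimal sketch: the `Envelope` interface below lists only fields visible in the sample lines and leaves `data` untyped, and the `parseLine` and `parseNdjson` helpers are hypothetical names, not SDK exports.

```ts
// Minimal envelope shape based on the sample lines above; `data` is left loose.
interface Envelope {
  v: number;
  event: string;
  run_id?: number;
  url?: string;
  data?: Record<string, unknown>;
}

// Fallback for lines that are not valid envelopes.
interface RawLine {
  kind: 'raw';
  line: string;
}

// Parse one line; anything that is not JSON with a numeric `v` and a string `event`
// becomes a raw fallback instead of throwing.
function parseLine(line: string): Envelope | RawLine {
  try {
    const obj: unknown = JSON.parse(line);
    if (typeof obj === 'object' && obj !== null) {
      const rec = obj as Record<string, unknown>;
      if (typeof rec['v'] === 'number' && typeof rec['event'] === 'string') {
        return rec as unknown as Envelope;
      }
    }
  } catch {
    // Not JSON: fall through to the raw fallback.
  }
  return { kind: 'raw', line };
}

// Split a chunk of NDJSON text into parsed items, skipping blank lines.
function parseNdjson(text: string): Array<Envelope | RawLine> {
  return text.split('\n').filter(l => l.trim() !== '').map(parseLine);
}
```

Callers can then use the same `'event' in item` check shown in the SDK examples to separate envelopes from raw lines before switching on `item.event`.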

---

## 🪝 Hooks: 12 lifecycle points × 3 languages

```
before_each_request → after_dns → after_tls → after_first_byte → on_response_body
→ after_load → after_idle → on_discovery → on_job_start → on_job_end
→ on_error → on_robots_decision
```

| Language | API | Best for |
|---|---|---|
| **Rust** | `hooks.on_after_first_byte(closure)` with full `&mut HookContext` access | Embedded library, latency-critical paths |
| **JS / TS** | `defineHooks({...})` via SDK: IPC bridge, async closures | Production crawls, business logic |
| **Lua** | `--hook-script foo.lua` with page-driving helpers (`page_click`, `page_eval`) | Ad-hoc scripts, no build step |

**All three modes return the same decision:** `continue` / `skip` / `retry` / `abort`. Hooks can mutate `ctx.captured_urls`, inject extra URLs, write to `user_data` to communicate with downstream hooks, or override `robots_allowed`.

---

## 🎭 Personas: coherent identity bundles

Each persona is a complete bundle (UA + Sec-CH-UA + screen + viewport + DPR + GPU + fonts + media-device counts + TLS profile + motion timings), so every signal **matches**. No mismatched UA + WebGL combo gives you away.

| Codename | OS | GPU | Locale | Form factor |
|---|---|---|---|---|
| 🐧 `tux` | Linux | Intel UHD 630 | en-US | desktop 1920×1080 |
| 🏢 `office` | Windows 10 | Intel UHD 620 | en-US | laptop 1920×1080 (DPR 1.25) |
| 🎮 `gamer` | Windows 10 | NVIDIA GTX 1060 | pt-BR | desktop 1920×1080 |
| 🍎 `atlas` | macOS | Apple M1 | en-US | retina 1440×900 (DPR 2.0) |
| 📱 `pixel` | Android 14 | Adreno 640 | pt-BR | **mobile** 412×823 (DPR 2.625) |

```bash
crawlex pages run --seed https://target.com --persona atlas # macOS
crawlex pages run --seed https://target.com --persona pixel # mobile
```

---

## 🏗️ Architecture

```mermaid
flowchart LR
  S[Seeds] --> Q["Frontier<br/>+ dedupe + rate-limit"]
  Q --> P[Policy Engine]
  P --> C["Cache Validator<br/>ETag + Last-Modified + head fingerprint"]
  C -->|fresh| ST["Storage<br/>5 traits"]
  C -->|stale| F["ImpersonateClient<br/>BoringSSL + h2 patched"]
  P -->|http| F
  P -->|render| R["RenderPool<br/>Chromium + stealth shim"]
  F --> X["Extractor<br/>+ Asset Refs"]
  R --> X
  X --> D["Discovery<br/>Pipeline"]
  X --> ST
  D --> Q
  P --> EV["NDJSON Events<br/>21 kinds"]
  R --> H1[Rust Hooks]
  R --> H2[JS Bridge]
  R --> H3[Lua Scripts]
```

**Module map:**
- `impersonate/`: TLS catalog + BoringSSL connector + ALPS + GREASE
- `render/`: Chromium pool + 29-section stealth shim + motion engine + ScriptSpec runner
- `discovery/`: 17-stage pipeline (DNS, RDAP, sitemap, robots, crtsh, wayback, well-known, …)
- `policy/`: pure engine with `decide_pre_fetch`, `decide_post_fetch`, `decide_post_error`, `decide_post_challenge`
- `antibot/`: vendor classifier + 4 captcha solver adapters
- `cache_validator/`: cache freshness by HTTP validators and head fingerprints
- `storage/`: 5 concern-oriented traits (artifact / state / challenge / telemetry / intel)
- `events/`: NDJSON envelope + sink (stdout / null / memory)
- `hooks/`: registry + JS bridge + Lua host
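
The fresh/stale split handled by `cache_validator/` can be illustrated with standard HTTP validators. The sketch below is not the module's code: the `CacheRow` shape and the `planFetch` function are assumptions, and the head-fingerprint check is left out because its format is not described here. Only the `If-None-Match` / `If-Modified-Since` mechanics are standard HTTP.

```ts
// Hypothetical cache row; crawlex's storage schema is not documented in this README.
interface CacheRow {
  etag?: string;
  lastModified?: string; // HTTP-date string exactly as received
  fetchedAtMs: number;   // epoch milliseconds of the last successful fetch
}

type CacheAction = 'reuse' | 'revalidate' | 'refetch';

// Rows younger than maxAgeSecs are reused without any network round trip.
function isFresh(row: CacheRow, nowMs: number, maxAgeSecs: number): boolean {
  return nowMs - row.fetchedAtMs <= maxAgeSecs * 1000;
}

// Build conditional request headers from the stored validators.
function conditionalHeaders(row: CacheRow): Record<string, string> {
  const headers: Record<string, string> = {};
  if (row.etag) headers['If-None-Match'] = row.etag;
  if (row.lastModified) headers['If-Modified-Since'] = row.lastModified;
  return headers;
}

// Decide what to do for a URL given its cache row (if any).
function planFetch(row: CacheRow | undefined, nowMs: number, maxAgeSecs: number): CacheAction {
  if (!row) return 'refetch';
  if (isFresh(row, nowMs, maxAgeSecs)) return 'reuse';
  // Stale with validators: a 304 Not Modified response means the stored body is still valid.
  return row.etag || row.lastModified ? 'revalidate' : 'refetch';
}
```

In this model, `maxAgeSecs` plays the role of the `--cache-max-age-secs` flag from the large-crawl example, and a `revalidate` result means sending the request with `conditionalHeaders(row)` attached.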

---

## 🛠️ Tech stack

| Layer | Implementation |
|---|---|
| TLS | `boring-sys`: BoringSSL fork with ALPS / permute_extensions / X25519MLKEM768 |
| HTTP/2 | Vendored `h2` crate with pseudo-header order patch (`vendor/h2`) |
| CDP | chromiumoxide-derived, embedded behind `cdp-backend` feature |
| Async | tokio multi-thread |
| Storage | rusqlite (SQLite WAL), DashMap (memory), filesystem layout |
| Discovery | hickory-resolver (DNS), reqwest (RDAP), texting_robots (robots.txt) |
| Lua | mlua 0.10 (optional, `lua-hooks` feature) |
| SDK | Node 20+, CommonJS, zero runtime deps |

**Two binaries** ship from one source tree:
- `crawlex`: **full** build with HTTP impersonation + Chromium rendering + stealth shim + persistent queue
- `crawlex-mini`: **HTTP-only** worker, no Chromium dependency, same CLI surface (browser-only flags return `Error::RenderDisabled`)

---

## 📊 Versus the alternatives

| | crawlex | Playwright stealth | Puppeteer + plugins | curl-impersonate |
|---|:-:|:-:|:-:|:-:|
| TLS-perfect ClientHello | ✅ BoringSSL | ⚠️ relies on Chromium | ⚠️ relies on Chromium | ✅ |
| H2 pseudo-header order | ✅ patched h2 | ⚠️ Chromium default | ⚠️ Chromium default | ❌ |
| 29-section JS leak coverage | ✅ | ⚠️ partial | ⚠️ via plugins | ❌ no JS |
| Worker-scope stealth | ✅ auto-attach | ⚠️ manual | ⚠️ manual | ❌ |
| HTTP-only path (no browser) | ✅ `crawlex-mini` | ❌ | ❌ | ✅ |
| Persistent queue + resume | ✅ SQLite/Redis | ❌ external | ❌ external | ❌ |
| Discovery pipeline | ✅ 17 stages | ❌ | ❌ | ❌ |
| Streaming NDJSON events | ✅ versioned | ❌ | ❌ | ❌ |
| Rust embedding | ✅ | ❌ | ❌ | ⚠️ libcurl |
| Single binary | ✅ | ❌ | ❌ | ✅ |

---

## 📚 Documentation

- 🌐 **[forattini-dev.github.io/crawlex](https://forattini-dev.github.io/crawlex/)**: full docsify hub
- 🏗️ [Architecture overview](https://forattini-dev.github.io/crawlex/#/architecture/00-overview)
- 📖 [CLI reference](https://forattini-dev.github.io/crawlex/#/reference/cli)
- ⚙️ [Config JSON schema](https://forattini-dev.github.io/crawlex/#/reference/config)
- 📡 [NDJSON event envelope](https://forattini-dev.github.io/crawlex/#/reference/events)
- 🎯 [Guides](https://forattini-dev.github.io/crawlex/#/guides/): HTTP-only, rendered sessions, persistent runs
- 🥷 [Stealth & proxies](https://forattini-dev.github.io/crawlex/#/features/proxy-stealth)

---

## 🤝 Contributing

```bash
git clone https://github.com/forattini-dev/crawlex
cd crawlex

# Unit tests + offline shim compliance
cargo test --lib # 390+ tests
cargo test --test fpjs_compliance # 27 cases
cargo test --test tls_catalog_coverage --test tls_catalog_roundtrip

# SDK tests
pnpm test # 21 node:test cases

# Quality gates
cargo fmt --check
cargo clippy --all-features -- -D warnings
cargo publish --dry-run --locked

# Live integration tests (require system Chromium)
cargo test --all-features --test stealth_runtime_live -- --ignored
cargo test --all-features --test worker_shim_live -- --ignored
```

CI runs all of the above on every PR. Contributions are welcome: issues, feature requests, and PRs are all reviewed.

---

## 📄 License

Dual-licensed under **MIT OR Apache-2.0** at your option. SPDX: `MIT OR Apache-2.0`.

Third-party attribution: see [`NOTICE`](NOTICE).

---

**Built for crawlers who refuse to be detected.**

[Docs](https://forattini-dev.github.io/crawlex/) · [Releases](https://github.com/forattini-dev/crawlex/releases) · [Issues](https://github.com/forattini-dev/crawlex/issues) · [Discussions](https://github.com/forattini-dev/crawlex/discussions)