{"id":51146699,"url":"https://github.com/chrisabruce/scrapling-rs","last_synced_at":"2026-06-26T03:02:49.540Z","repository":{"id":351580525,"uuid":"1211534628","full_name":"chrisabruce/scrapling-rs","owner":"chrisabruce","description":"Adaptive web scraping, built in Rust. A high-performance port of Python Scrapling.","archived":false,"fork":false,"pushed_at":"2026-04-15T16:15:36.000Z","size":333,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-05-15T20:44:37.878Z","etag":null,"topics":["ai","ai-scraping","automation","crawler","crawling","crawling-rust","data","data-extraction","mcp","mcp-server","playwright","rust-lang","scraping","selectors","stealth","web-scraper","web-scraping","web-scraping-rust","webscraping","xpath"],"latest_commit_sha":null,"homepage":"","language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/chrisabruce.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-04-15T13:42:35.000Z","updated_at":"2026-04-15T16:15:40.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/chrisabruce/scrapling-rs","commit_stats":null,"previous_names":["chrisabruce/scrapling-rs"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/chrisabruce/scrapling-rs","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chrisabruce%2Fscrapling-rs","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chrisabruce%2Fscrapling-rs/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chrisabruce%2Fscrapling-rs/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chrisabruce%2Fscrapling-rs/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/chrisabruce","download_url":"https://codeload.github.com/chrisabruce/scrapling-rs/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chrisabruce%2Fscrapling-rs/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34801015,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-26T02:00:06.560Z","response_time":106,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai","ai-scraping","automation","crawler","crawling","crawling-rust","data","data-extraction","mcp","mcp-server","playwright","rust-lang","scraping","selectors","stealth","web-scraper","web-scraping","web-scraping-rust","webscraping","xpath"],"created_at":"2026-06-26T03:02:48.755Z","updated_at":"2026-06-26T03:02:49.517Z","avatar_url":"https://github.com/chrisabruce.png","language":"Rust","funding_links":[],"categories":[],"sub_categories":[],"readme":"# scrapling-rs\n\nThe Rust port of [Scrapling](https://github.com/D4Vinci/Scrapling), a web scraping framework that actually handles the messy reality of modern websites. Built for speed, built for stealth, built to keep working when sites change their HTML.\n\nIf you've used the Python version, you already know the API. If you haven't, here's the short version: Scrapling finds elements even after a website redesigns, impersonates real browsers so anti-bot systems can't tell you apart, and does it all fast enough to crawl thousands of pages concurrently.\n\nThis Rust port takes everything that makes Scrapling good and removes the performance ceiling. No GIL. No garbage collector. Native async. Single binary deployment.\n\n## What makes this different\n\nMost scraping libraries break the moment a website changes a CSS class or moves a div. Scrapling doesn't. It saves a structural fingerprint of every element you care about and uses a 12-factor similarity algorithm to find it again, even when the surrounding HTML looks completely different. That's the adaptive engine, and it's the reason people use Scrapling over everything else.\n\nThe other big thing: real browser fingerprint impersonation. Not just setting a User-Agent header. Full TLS fingerprint emulation (JA3/JA4, HTTP/2 settings, cipher order) through 135+ browser profiles so anti-bot systems see Chrome, Firefox, or Safari instead of a Rust HTTP client.\n\n## Features\n\n**HTML parsing and selection**\n- Fast DOM parsing via html5ever with CSS selector support, including `::text` and `::attr()` pseudo-elements\n- Full DOM navigation: parent, children, siblings, ancestors, descendants\n- Find elements by text content, regex patterns, or compound filters\n- Auto-generate unique CSS and XPath selectors for any element\n\n**Adaptive element relocation**\n- 12-factor structural similarity scoring (tag, text, attributes, path, parent, siblings, and more)\n- Survives DOM restructuring, class renames, ID changes, and wrapper element additions\n- SQLite-backed fingerprint storage across scraping sessions\n\n**HTTP fetching with browser impersonation**\n- 135+ browser emulation profiles (Chrome, Firefox, Safari, Edge, Opera, OkHttp) via wreq\n- TLS fingerprint impersonation (JA3/JA4/HTTP2 APERT)\n- Proxy rotation with pluggable strategies\n- Automatic retry with configurable backoff\n- Stealth headers with Google referer injection\n\n**Browser automation**\n- Playwright-based headless browser control\n- 99 Chromium stealth flags for anti-detection\n- Cloudflare Turnstile solver (non-interactive, managed, interactive, embedded challenges)\n- Resource and ad blocking (3,527 domain blocklist)\n- Network interception with domain suffix matching\n\n**Spider framework**\n- Concurrent crawler with configurable parallelism\n- Request deduplication via SHA-1 fingerprinting\n- Robots.txt compliance with crawl-delay support\n- Checkpoint/resume for long-running crawls\n- Development mode with response caching\n\n**Extras**\n- CLI for quick extraction jobs\n- MCP server for AI agent integration\n- Python bindings via PyO3\n- Curl command parser (paste from DevTools, get a request)\n- HTML to Markdown and plain text conversion\n\n## Quick start\n\n```rust\nuse scrapling::selector::Selector;\n\nfn main() {\n    let html = r#\"\n        \u003chtml\u003e\u003cbody\u003e\n            \u003ch1 class=\"title\"\u003eHello, Scrapling!\u003c/h1\u003e\n            \u003cdiv class=\"products\"\u003e\n                \u003cdiv class=\"product\" data-id=\"1\"\u003e\u003cspan class=\"price\"\u003e$10.99\u003c/span\u003e\u003c/div\u003e\n                \u003cdiv class=\"product\" data-id=\"2\"\u003e\u003cspan class=\"price\"\u003e$24.99\u003c/span\u003e\u003c/div\u003e\n            \u003c/div\u003e\n        \u003c/body\u003e\u003c/html\u003e\n    \"#;\n\n    let page = Selector::from_html(html);\n\n    // CSS selectors with pseudo-elements\n    let prices = page.css(\".price::text\");\n    for price in prices.iter() {\n        println!(\"{}\", price.text());\n    }\n\n    // Extract structured data\n    for product in page.css(\".product\").iter() {\n        let id = \u0026product.attrib()[\"data-id\"];\n        let price = product.css(\".price\").first().unwrap().text();\n        println!(\"Product {id}: {price}\");\n    }\n\n    // Find elements by text\n    let matches = page.find_by_text(\"$10\", true, false, false);\n    println!(\"Found {} elements containing '$10'\", matches.len());\n}\n```\n\n## HTTP fetching with impersonation\n\n```rust\nuse scrapling_fetch::{Fetcher, FetcherConfig, Impersonate};\n\n#[tokio::main]\nasync fn main() -\u003e Result\u003c(), Box\u003cdyn std::error::Error\u003e\u003e {\n    let fetcher = Fetcher::with_config(FetcherConfig {\n        impersonate: Impersonate::Single(\"chrome\".into()),\n        stealthy_headers: true,\n        ..Default::default()\n    });\n\n    let response = fetcher.get(\"https://example.com\", None).await?;\n    println!(\"Status: {}\", response.status);\n\n    // Response has full CSS selector support\n    let title = response.css(\"title::text\");\n    println!(\"Title: {}\", title.first().unwrap().text());\n\n    // Convert to markdown\n    println!(\"{}\", response.to_markdown());\n\n    Ok(())\n}\n```\n\n## Adaptive relocation\n\n```rust\nuse scrapling::selector::Selector;\nuse scrapling::storage::sqlite::SqliteStorage;\n\nfn main() {\n    let storage = SqliteStorage::new(\":memory:\", Some(\"https://example.com\")).unwrap();\n\n    // Save a fingerprint from the original page\n    let page = Selector::from_html(r#\"\u003cdiv id=\"price\" class=\"amount\"\u003e$42.99\u003c/div\u003e\"#);\n    page.css_adaptive(\"#price\", \u0026storage, false, true, Some(\"price\"), 0.0);\n\n    // Website redesigns, the ID is gone, class changed\n    let new_page = Selector::from_html(r#\"\u003cspan class=\"cost\" data-type=\"price\"\u003e$42.99\u003c/span\u003e\"#);\n\n    // Normal selector fails\n    assert!(new_page.css(\"#price\").is_empty());\n\n    // Adaptive finds it by structural similarity\n    let found = new_page.css_adaptive(\"#price\", \u0026storage, true, false, Some(\"price\"), 0.0);\n    assert!(!found.is_empty());\n}\n```\n\n## Project structure\n\n```\nscrapling-rs/\n├── crates/\n│   ├── scrapling/          Core: HTML parsing, selectors, adaptive engine\n│   ├── scrapling-fetch/    HTTP client with TLS impersonation (wreq)\n│   ├── scrapling-browser/  Playwright browser automation + stealth\n│   ├── scrapling-spider/   Concurrent crawler framework\n│   ├── scrapling-cli/      Command-line interface\n│   ├── scrapling-mcp/      MCP server for AI agents\n│   └── scrapling-python/   PyO3 Python bindings\n├── examples/               13 runnable examples\n├── fuzz/                   Fuzz testing targets\n└── .github/workflows/      CI (fmt, clippy, test)\n```\n\n## Installation\n\nAdd the crates you need:\n\n```toml\n[dependencies]\nscrapling = \"0.1\"                           # Core parsing + adaptive\nscrapling-fetch = \"0.1\"                     # HTTP fetching\nscrapling-browser = \"0.1\"                   # Browser automation\nscrapling-spider = \"0.1\"                    # Crawler framework\n```\n\n## Examples\n\nRun any of the 13 included examples:\n\n```bash\ncargo run -p scrapling-examples --example 01_parse_html\ncargo run -p scrapling-examples --example 07_adaptive\ncargo run -p scrapling-examples --example 09_http_fetch\n```\n\n## Status\n\nThis is a complete port. 279 tests passing, zero clippy warnings.\n\n| Component | Status |\n|-----------|--------|\n| HTML parsing, DOM traversal, CSS/XPath selectors | Complete |\n| Adaptive element relocation with SQLite storage | Complete |\n| HTTP fetcher with 135+ browser profiles | Complete |\n| Playwright browser automation + Cloudflare solver | Complete |\n| Spider framework with checkpointing + robots.txt | Complete |\n| CLI, MCP server, Python bindings | Complete |\n\n## Minimum Rust version\n\n1.85 or later.\n\n## Credits\n\nThis project is a Rust port of [Scrapling](https://github.com/D4Vinci/Scrapling) by [Karim Shoair](https://github.com/D4Vinci). The original architecture, API design, adaptive algorithms, and anti-detection strategies all come from the Python project. This port exists because those ideas deserved native performance.\n\n## License\n\nMIT\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fchrisabruce%2Fscrapling-rs","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fchrisabruce%2Fscrapling-rs","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fchrisabruce%2Fscrapling-rs/lists"}