https://github.com/spider-rs/spider
Web crawler and scraper for Rust
https://github.com/spider-rs/spider
automation crawler headless-chrome indexer rust scraping spider
Last synced: 2 months ago
JSON representation
Web crawler and scraper for Rust
- Host: GitHub
- URL: https://github.com/spider-rs/spider
- Owner: spider-rs
- License: mit
- Created: 2018-01-07T18:49:20.000Z (over 8 years ago)
- Default Branch: main
- Last Pushed: 2026-02-01T13:18:25.000Z (4 months ago)
- Last Synced: 2026-02-01T13:51:00.457Z (4 months ago)
- Topics: automation, crawler, headless-chrome, indexer, rust, scraping, spider
- Language: Rust
- Homepage: https://spider.cloud
- Size: 7.97 MB
- Stars: 2,216
- Watchers: 17
- Forks: 178
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
Awesome Lists containing this project
- awesome-scrapers - Spider - 1000x faster than Python alternatives. | (🤖 AI-Powered Scraping)
- awesome-rust-list - Spider - rs/spider?style=social"/> : Spider: The fastest web crawler and indexer. [docs.rs/spider/](https://docs.rs/spider/latest/spider/) (Web Crawler)
- awesome-crawler - spider - The fastest web crawler and indexer. (Rust)
- awesome-ai-agents - spider-rs/spider - spider-rs/spider is a web crawler and scraper library for Rust, designed for high-performance and extensive web data extraction. (GUI & Computer Control AI Agents / Browser & Web Automation)
- awesome-rust-ai-libraries - spider - Rust-based web crawler/spider for dumping entire websites, useful for data collection in LLM training or RAG systems. ([Read more](/details/spider.md)) `Open Source` `Web Crawler` `Data Collection` (Tooling)
README
# Spider
[](https://github.com/spider-rs/spider/actions)
[](https://crates.io/crates/spider)
[](https://crates.io/crates/spider)
[](https://docs.rs/spider)
[](https://opensource.org/licenses/MIT)
[](https://discord.spider.cloud)
[Website](https://spider.cloud) |
[Guides](https://spider.cloud/guides) |
[API Docs](https://docs.rs/spider/latest/spider) |
[Examples](./examples/) |
[Discord](https://discord.spider.cloud)
A high-performance web crawler and scraper for Rust. [200-1000x faster](#benchmarks) than popular alternatives, with HTTP, headless Chrome, and WebDriver rendering in a single library.
- **Crawl 100k+ pages in minutes** on a single machine. [See benchmarks.](#benchmarks)
- **HTTP, Chrome CDP, WebDriver, and [AI automation](./spider_agent/)** in one dependency.
- **Production-ready** with caching, proxy rotation, anti-bot bypass, and [distributed crawling](./spider_worker/). [Feature-gated](https://doc.rust-lang.org/cargo/reference/features.html) so you only compile what you use.
## Quick Start
### Command Line
```bash
cargo install spider_cli
spider --url https://example.com
```
### Rust
```toml
[dependencies]
spider = "2"
```
```rust
use spider::tokio;
use spider::website::Website;
#[tokio::main]
async fn main() {
let mut website = Website::new("https://example.com");
website.crawl().await;
println!("Pages found: {}", website.get_links().len());
}
```
### Streaming
Process each page the moment it's crawled, not after:
```rust
use spider::tokio;
use spider::website::Website;
#[tokio::main]
async fn main() {
let mut website = Website::new("https://example.com");
let mut rx = website.subscribe(0).unwrap();
tokio::spawn(async move {
while let Ok(page) = rx.recv().await {
println!("- {}", page.get_url());
}
});
website.crawl().await;
website.unsubscribe();
}
```
### Headless Chrome
Add one feature flag to render JavaScript-heavy pages:
```toml
[dependencies]
spider = { version = "2", features = ["chrome"] }
```
```rust
use spider::features::chrome_common::RequestInterceptConfiguration;
use spider::website::Website;
#[tokio::main]
async fn main() {
let mut website = Website::new("https://example.com")
.with_chrome_intercept(RequestInterceptConfiguration::new(true))
.with_stealth(true)
.build()
.unwrap();
website.crawl().await;
}
```
> Also supports [WebDriver](./examples/webdriver.rs) (Selenium Grid, remote browsers) and [AI-driven automation](./spider_agent/). See [examples](./examples/) for more.
## Benchmarks
Crawling 185 pages ([source](./benches/BENCHMARKS.md), 10 samples averaged):
**Apple M1 Max** (10-core, 64 GB RAM):
| Crawler | Language | Time | vs Spider |
|---------|----------|-----:|----------:|
| **spider** | **Rust** | **73 ms** | **baseline** |
| node-crawler | JavaScript | 15 s | 205x slower |
| colly | Go | 32 s | 438x slower |
| wget | C | 70 s | 959x slower |
**Linux** (2-core, 7 GB RAM):
| Crawler | Language | Time | vs Spider |
|---------|----------|-----:|----------:|
| **spider** | **Rust** | **50 ms** | **baseline** |
| node-crawler | JavaScript | 3.4 s | 68x slower |
| colly | Go | 30 s | 600x slower |
| wget | C | 60 s | 1200x slower |
The gap grows with site size. Spider handles 100k+ pages in minutes where other crawlers take hours. This comes from Rust's async runtime ([tokio](https://tokio.rs)), lock-free data structures, and optional [io_uring](https://en.wikipedia.org/wiki/Io_uring) on Linux. [Full details](./benches/BENCHMARKS.md)
## Why Spider?
Most crawlers force a choice between fast HTTP-only or slow-but-flexible browser automation. Spider supports both, and you can mix them in the same crawl.
**Supports HTTP, Chrome, and WebDriver.** Switch rendering modes with a feature flag. Use HTTP for speed, Chrome CDP for JavaScript-heavy pages, and WebDriver for Selenium Grid or cross-browser testing.
**Built for production.** Caching (memory, disk, hybrid), proxy rotation, anti-bot fingerprinting, ad blocking, depth budgets, cron scheduling, and distributed workers. All of this has been hardened through [Spider Cloud](https://spider.cloud).
**AI automation included.** [spider_agent](./spider_agent/) adds multimodal LLM-driven automation: navigate pages, fill forms, solve challenges, and extract structured data with OpenAI or any compatible API.
## Features
Crawling
- Concurrent and streaming crawls with backpressure
- [Decentralized crawling](./spider_worker/) for horizontal scaling
- Caching: memory, disk (SQLite), or [hybrid Chrome cache](./examples/cache_chrome_hybrid.rs)
- Proxy support with rotation
- Cron job scheduling
- Depth budgeting, blacklisting, whitelisting
- Smart mode that auto-detects JS-rendered content and upgrades to Chrome
Browser Automation
- [Chrome DevTools Protocol](https://github.com/spider-rs/chromey): headless or headed, stealth mode, screenshots, request interception
- [WebDriver](./examples/webdriver.rs): Selenium Grid, remote browsers, cross-browser testing
- AI-powered challenge solving (deterministic + [Chrome built-in AI](https://developer.chrome.com/docs/ai/prompt-api))
- [Anti-bot fingerprinting](https://github.com/spider-rs/spider_fingerprint), [ad blocking](https://github.com/spider-rs/spider_network_blocker), [firewall](https://github.com/spider-rs/spider_firewall)
Data Processing
- [HTML transformations](https://github.com/spider-rs/spider_transformations) (Markdown, text, structured extraction)
- CSS/XPath scraping with [spider_utils](./spider_utils/README.md#CSS_Scraping)
- [OpenAI](./examples/openai.rs) and [Gemini](./examples/gemini.rs) integration for content analysis
AI Agent
- [spider_agent](./spider_agent/): concurrent-safe multimodal web automation agent
- Multiple LLM providers (OpenAI, any OpenAI-compatible API, Chrome built-in AI)
- Web research with search providers (Serper, Brave, Bing, Tavily)
- 110 built-in automation skills for web challenges
## Spider Cloud
For managed proxy rotation, anti-bot bypass, and CAPTCHA handling, [Spider Cloud](https://spider.cloud) plugs in with one line:
```rust
let mut website = Website::new("https://protected-site.com")
.with_spider_cloud("your-api-key") // enable with features = ["spider_cloud"]
.build()
.unwrap();
```
| Mode | Strategy | Best For |
|------|----------|----------|
| **Proxy** (default) | All traffic through Spider Cloud proxy | General crawling with IP rotation |
| **Smart** (recommended) | Proxy + auto-fallback on bot detection | Production (speed + reliability) |
| **Fallback** | Direct first, API on failure | Cost-efficient, most sites work without help |
| **Unblocker** | All requests through unblocker | Aggressive bot protection |
> Free credits on signup. [Get started at spider.cloud](https://spider.cloud)
### Spider Browser Cloud
Connect to a remote Rust-based browser via CDP over WebSocket for automation, scraping, and AI extraction:
```rust
use spider::configuration::SpiderBrowserConfig;
// Simple — just an API key
let mut website = Website::new("https://example.com")
.with_spider_browser("your-api-key") // features = ["spider_cloud", "chrome"]
.build()
.unwrap();
// Full config — stealth, country targeting, custom options
let browser_cfg = SpiderBrowserConfig::new("your-api-key")
.with_stealth(true)
.with_country("us");
let mut website = Website::new("https://example.com")
.with_spider_browser_config(browser_cfg)
.build()
.unwrap();
```
WebSocket endpoint: `wss://browser.spider.cloud/v1/browser` — supports CDP and WebDriver BiDi protocols.
### Parallel Backends
Race alternative browser engines alongside the primary crawl. The best HTML response wins — higher reliability and coverage for JS-heavy pages.
```rust
use spider::configuration::{BackendEndpoint, BackendEngine, ParallelBackendsConfig};
let mut website = Website::new("https://example.com");
// Race a secondary browser engine alongside the primary crawl.
website.configuration.parallel_backends = Some(ParallelBackendsConfig {
backends: vec![BackendEndpoint {
engine: BackendEngine::default(),
endpoint: Some("ws://127.0.0.1:9222".to_string()),
binary_path: None,
protocol: None,
proxy: None, // inherits from website proxies config
}],
grace_period_ms: 500, // wait up to 500ms for a better result
fast_accept_threshold: 80, // accept immediately if quality >= 80
..Default::default()
});
website.crawl().await;
```
## Get Spider
| Package | Language | Install |
|---------|----------|---------|
| [spider](https://crates.io/crates/spider) | Rust | `cargo add spider` |
| [spider_cli](./spider_cli/) | CLI | `cargo install spider_cli` |
| [spider-nodejs](https://github.com/spider-rs/spider-nodejs) | Node.js | `npm i @spider-rs/spider-rs` |
| [spider-py](https://github.com/spider-rs/spider-py) | Python | `pip install spider_rs` |
| [spider_agent](./spider_agent/) | Rust | `cargo add spider --features agent` |
| [spider_mcp](./spider_mcp/) | MCP | `cargo install spider_mcp` |
### MCP Server
Use Spider as tools in Claude Code, Claude Desktop, or any MCP client:
```bash
cargo install spider_mcp
```
```json
{ "mcpServers": { "spider": { "command": "spider-mcp" } } }
```
Then ask: *"Scrape https://example.com as markdown"* or *"Crawl https://example.com up to 5 pages"*
### Cloud and Remote
| Package | Description |
|---------|-------------|
| [Spider Cloud](https://spider.cloud) | Managed crawling infrastructure, no setup needed |
| [spider-clients](https://github.com/spider-rs/spider-clients) | SDKs for Spider Cloud in multiple languages |
| [spider-browser](https://github.com/spider-rs/spider-browser) | Remote access to Spider's Rust browser |
## Resources
- [64 examples](./examples/) covering crawling, Chrome, WebDriver, AI, caching, and more
- [API documentation](https://docs.rs/spider/latest/spider)
- [Benchmarks](./benches/BENCHMARKS.md)
- [Changelog](CHANGELOG.md)
## Contributing
Contributions welcome. See [CONTRIBUTING.md](CONTRIBUTING.md) for setup and guidelines.
Spider has been actively developed for the past 4 years. Join the [Discord](https://discord.spider.cloud) for questions and discussion.
## License
[MIT](LICENSE)