https://github.com/bitingsnakes/silkworm

Async web scraping framework on top of Rust. Works with Free-threaded Python (`PYTHON_GIL=0`).
Host: GitHub
URL: https://github.com/bitingsnakes/silkworm
Owner: BitingSnakes
License: mit
Created: 2025-12-08T15:38:01.000Z (7 months ago)
Default Branch: main
Last Pushed: 2026-03-31T14:57:00.000Z (3 months ago)
Last Synced: 2026-04-02T09:33:26.303Z (3 months ago)
Topics: free-threaded-python, rust, scraping, web-scraping
Language: Python
Homepage:
Size: 1.62 MB
Stars: 52
Watchers: 0
Forks: 1
Open Issues: 2
Metadata Files:
- Readme: README.md
- License: LICENSE
- Agents: AGENTS.md

# silkworm-rs

[![PyPI - Version](https://img.shields.io/pypi/v/silkworm-rs)](https://pypi.org/project/silkworm-rs/)

[![Tests](https://github.com/BitingSnakes/silkworm/actions/workflows/tests.yml/badge.svg)](https://github.com/BitingSnakes/silkworm/actions/workflows/tests.yml)

[![PyPI Downloads](https://static.pepy.tech/personalized-badge/silkworm-rs?period=total&units=INTERNATIONAL_SYSTEM&left_color=BLACK&right_color=GREEN&left_text=downloads)](https://pepy.tech/projects/silkworm-rs)

Async-first web scraping framework built on [wreq](https://github.com/0x676e67/wreq-python) (HTTP with browser impersonation) and [scraper-rs](https://github.com/RustedBytes/scraper-rs) (fast HTML parsing). Silkworm gives you a minimal Spider/Request/Response model, middlewares, and pipelines so you can script quick scrapes or build larger crawlers without boilerplate.

> **NEW**: Use [silkworm-mcp](https://github.com/BitingSnakes/silkworm-mcp) to build scrapers.

## Features

- Async engine with configurable concurrency, bounded queue backpressure (defaults to `concurrency * 10`), and per-request timeouts.

- wreq-powered HTTP client: browser impersonation, redirect following with loop detection, query merging, and proxy support via `request.meta["proxy"]`.

- Optional OnionLink client integration for scraping Tor v3 `.onion` sites without routing through wreq.

- Optional Servo rendering via `ServoFetchClient` for JavaScript-rendered pages without changing the default HTTP client.

- Typed spiders and callbacks that can return items or `Request` objects; `HTMLResponse` ships helper methods plus `Response.follow` to reuse callbacks.

- Middlewares: User-Agent rotation/default, proxy rotation, retry with exponential backoff + optional sleep codes, flexible delays (fixed/random/custom), `SkipNonHTMLMiddleware` to drop non-HTML callbacks, and `CloudflareCrawlMiddleware` for Browser Rendering crawl jobs.

- Pipelines: JSON Lines, SQLite, XML (nested data preserved), and CSV (flattens dicts and lists) out of the box.

- Structured logging via `logly` (`SILKWORM_LOG_LEVEL=DEBUG`), plus periodic/final crawl statistics (requests/sec, queue size, memory, seen URLs).

## Installation

From PyPI with pip:

```bash
pip install silkworm-rs
```

From PyPI with uv (recommended for faster installs):

```bash
uv pip install silkworm-rs
# or if using uv's project management:
uv add silkworm-rs
```

From source:

```bash
uv venv  # install uv from https://docs.astral.sh/uv/getting-started/ if needed
source .venv/bin/activate  # Windows: .venv\Scripts\activate
uv pip install -e .
```

Targets Python 3.13+; dependencies are pinned in `pyproject.toml`.

## Quick start

Define a spider by subclassing `Spider`, implementing `parse`, and yielding items or follow-up `Request` objects. This example writes quotes to `data/quotes.jl` and enables basic user agent, retry, and non-HTML filtering middlewares.

```python
from silkworm import HTMLResponse, Response, Spider, run_spider
from silkworm.middlewares import (
    RetryMiddleware,
    SkipNonHTMLMiddleware,
    UserAgentMiddleware,
)
from silkworm.pipelines import JsonLinesPipeline


class QuotesSpider(Spider):
    name = "quotes"
    start_urls = ("https://quotes.toscrape.com/",)

    async def parse(self, response: Response):
        if not isinstance(response, HTMLResponse):
            return

        html = response
        for quote in await html.select(".quote"):
            text_el = await quote.select_first(".text")
            author_el = await quote.select_first(".author")
            if text_el is None or author_el is None:
                continue
            tags = await quote.select(".tag")
            yield {
                "text": text_el.text,
                "author": author_el.text,
                "tags": [t.text for t in tags],
            }

        if next_link := await html.select_first("li.next > a"):
            yield html.follow(next_link.attr("href"), callback=self.parse)


if __name__ == "__main__":
    run_spider(
        QuotesSpider,
        request_middlewares=[UserAgentMiddleware()],
        response_middlewares=[
            SkipNonHTMLMiddleware(),
            RetryMiddleware(max_times=3, sleep_http_codes=[429, 503]),
        ],
        item_pipelines=[JsonLinesPipeline("data/quotes.jl")],
        concurrency=16,
        request_timeout=10,
        log_stats_interval=30,
    )
```

`run_spider`/`crawl` knobs:

- `concurrency`: number of concurrent HTTP requests; default 16.

- `max_pending_requests`: queue bound to avoid unbounded memory use (defaults to `concurrency * 10`).

- `request_timeout`: per-request timeout (seconds).

- `keep_alive`: reuse HTTP connections when supported by the underlying client (sends `Connection: keep-alive`).

- `http_client`: use a custom client instance such as `OnionLinkClient(...)` or `ServoFetchClient(...)` instead of the default wreq-backed client.

- `html_max_size_bytes`: limit HTML parsed into `AsyncDocument` to avoid huge payloads.

- `log_stats_interval`: seconds between periodic stats logs; final stats are always emitted.

- `request_middlewares` / `response_middlewares` / `item_pipelines`: plug-ins run on every request/response/item.

- use `run_spider_rsloop(...)` instead of `run_spider(...)` to run under rsloop (requires `pip install silkworm-rs[rsloop]`).

- use `run_spider_uvloop(...)` instead of `run_spider(...)` to run under uvloop (requires `pip install silkworm-rs[uvloop]`).

- use `run_spider_winloop(...)` instead of `run_spider(...)` to run under winloop on Windows (requires `pip install silkworm-rs[winloop]`).
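To make the `concurrency` and `max_pending_requests` knobs more concrete, here is a minimal sketch in plain `asyncio`. It is not Silkworm's engine: `crawl_sketch` and `fake_fetch` are hypothetical names, and no network access happens. It only illustrates how a bounded queue applies backpressure to the producer while a fixed number of workers drain it.

```python
import asyncio


async def fake_fetch(url: str) -> str:
    """Stand-in for an HTTP request (hypothetical; no network access)."""
    await asyncio.sleep(0.01)
    return f"fetched {url}"


async def crawl_sketch(urls: list[str], concurrency: int = 4) -> list[str]:
    # Bounded queue: put() waits once `concurrency * 10` requests are pending,
    # so the producer cannot outrun the workers and grow memory without limit.
    queue: asyncio.Queue = asyncio.Queue(maxsize=concurrency * 10)
    results: list[str] = []

    async def worker() -> None:
        while True:
            url = await queue.get()
            try:
                results.append(await fake_fetch(url))
            finally:
                queue.task_done()

    # `concurrency` workers means at most that many fetches are in flight.
    workers = [asyncio.create_task(worker()) for _ in range(concurrency)]
    for url in urls:
        await queue.put(url)  # waits while the queue is full (backpressure)
    await queue.join()  # wait until every queued URL has been processed
    for task in workers:
        task.cancel()
    await asyncio.gather(*workers, return_exceptions=True)
    return results


if __name__ == "__main__":
    pages = [f"https://example.com/page/{i}" for i in range(100)]
    print(len(asyncio.run(crawl_sketch(pages))))
```

Raising `concurrency` increases parallel fetches; the queue bound scales with it, which mirrors the documented `concurrency * 10` default.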

## Built-in middlewares and pipelines

```python
from silkworm import run_spider
from silkworm.middlewares import (
    CloudflareCrawlMiddleware,
    DelayMiddleware,
    ProxyMiddleware,
    RetryMiddleware,
    SkipNonHTMLMiddleware,
    UserAgentMiddleware,
)
from silkworm.pipelines import (
    CallbackPipeline,  # invoke a custom callback function on each item
    CSVPipeline,
    JsonLinesPipeline,
    MsgPackPipeline,  # requires: pip install silkworm-rs[msgpack]
    SQLitePipeline,
    XMLPipeline,
    TaskiqPipeline,  # requires: pip install silkworm-rs[taskiq]
    PolarsPipeline,  # requires: pip install silkworm-rs[polars]
    ExcelPipeline,  # requires: pip install silkworm-rs[excel]
    YAMLPipeline,  # requires: pip install silkworm-rs[yaml]
    AvroPipeline,  # requires: pip install silkworm-rs[avro]
    ElasticsearchPipeline,  # requires: pip install silkworm-rs[elasticsearch]
    MongoDBPipeline,  # requires: pip install silkworm-rs[mongodb]
    MySQLPipeline,  # requires: pip install silkworm-rs[mysql]
    PostgreSQLPipeline,  # requires: pip install silkworm-rs[postgresql]
    S3JsonLinesPipeline,  # requires: pip install silkworm-rs[s3]
    VortexPipeline,  # requires: pip install silkworm-rs[vortex]
    WebhookPipeline,  # sends items to webhook endpoints using wreq
    GoogleSheetsPipeline,  # requires: pip install silkworm-rs[gsheets]
    SnowflakePipeline,  # requires: pip install silkworm-rs[snowflake]
    FTPPipeline,  # requires: pip install silkworm-rs[ftp]
    SFTPPipeline,  # requires: pip install silkworm-rs[sftp]
    CassandraPipeline,  # requires: pip install silkworm-rs[cassandra]
    CouchDBPipeline,  # requires: pip install silkworm-rs[couchdb]
    DynamoDBPipeline,  # requires: pip install silkworm-rs[dynamodb]
    DuckDBPipeline,  # requires: pip install silkworm-rs[duckdb]
)

# QuotesSpider is the spider defined in the quick start.
run_spider(
    QuotesSpider,
    request_middlewares=[
        UserAgentMiddleware(),  # rotate/custom user agent
        DelayMiddleware(min_delay=0.3, max_delay=1.2),  # polite throttling
        # ProxyMiddleware with round-robin selection (default)
        # ProxyMiddleware(proxies=["http://user:pass@proxy1:8080", "http://proxy2:8080"]),
        # ProxyMiddleware with random selection
        # ProxyMiddleware(proxies=["http://proxy1:8080", "http://proxy2:8080"], random_selection=True),
        # ProxyMiddleware from file with random selection
        # ProxyMiddleware(proxy_file="proxies.txt", random_selection=True),
    ],
    response_middlewares=[
        RetryMiddleware(max_times=3, sleep_http_codes=[403, 429]),  # backoff + retry
        SkipNonHTMLMiddleware(),  # drop callbacks for images/APIs/etc
    ],
    item_pipelines=[
        JsonLinesPipeline("data/quotes.jl"),
        SQLitePipeline("data/quotes.db", table="quotes"),
        XMLPipeline("data/quotes.xml", root_element="quotes", item_element="quote"),
        CSVPipeline("data/quotes.csv", fieldnames=["author", "text", "tags"]),
        MsgPackPipeline("data/quotes.msgpack"),
    ],
)
```

- `DelayMiddleware` strategies: `delay=1.0` (fixed), `min_delay/max_delay` (random), or `delay_func` (custom).

- `ProxyMiddleware` supports three modes:

  - **Round-robin (default)**: `ProxyMiddleware(proxies=["http://proxy1:8080", "http://proxy2:8080"])` cycles through proxies in order.

  - **Random selection**: `ProxyMiddleware(proxies=["http://proxy1:8080", "http://proxy2:8080"], random_selection=True)` randomly selects a proxy for each request.

  - **From file**: `ProxyMiddleware(proxy_file="proxies.txt")` loads proxies from a file (one proxy per line, blank lines ignored). Combine with `random_selection=True` for random selection from the file.

- `RetryMiddleware` backs off with `asyncio.sleep`; any status in `sleep_http_codes` is retried even if not in `retry_http_codes`.

- `SkipNonHTMLMiddleware` checks `Content-Type` and optionally sniffs the body (`sniff_bytes`) to avoid running HTML callbacks on binary/API responses.

- `CloudflareCrawlMiddleware` is opt-in per request via `request.meta["cloudflare_crawl"]`; it submits a Cloudflare Browser Rendering crawl job, polls until completion, and hands your callback a synthetic JSON `Response` with the final API payload.

- `JsonLinesPipeline` writes items to a local JSON Lines file and, when `opendal` is installed, appends asynchronously via the filesystem backend (`use_opendal=False` to stick to a regular file handle).

- `CSVPipeline` flattens nested dicts (e.g., `{"user": {"name": "Alice"}}` -> `user_name`) and joins lists with commas; `XMLPipeline` preserves nesting.

- `MsgPackPipeline` writes items in binary MessagePack format using [ormsgpack](https://github.com/aviramha/ormsgpack) for fast and compact serialization (requires `pip install silkworm-rs[msgpack]`).

- `TaskiqPipeline` sends items to a [Taskiq](https://taskiq-python.github.io/) queue for distributed processing (requires `pip install silkworm-rs[taskiq]`).

- `PolarsPipeline` writes items to a Parquet file using Polars for efficient columnar storage (requires `pip install silkworm-rs[polars]`).

- `ExcelPipeline` writes items to an Excel .xlsx file (requires `pip install silkworm-rs[excel]`).

- `YAMLPipeline` writes items to a YAML file (requires `pip install silkworm-rs[yaml]`).

- `AvroPipeline` writes items to an Avro file with optional schema (requires `pip install silkworm-rs[avro]`).

- `ElasticsearchPipeline` sends items to an Elasticsearch index (requires `pip install silkworm-rs[elasticsearch]`).

- `MongoDBPipeline` sends items to a MongoDB collection (requires `pip install silkworm-rs[mongodb]`).

- `MySQLPipeline` sends items to a MySQL database table as JSON (requires `pip install silkworm-rs[mysql]`).

- `PostgreSQLPipeline` sends items to a PostgreSQL database table as JSONB (requires `pip install silkworm-rs[postgresql]`).

- `S3JsonLinesPipeline` writes items to AWS S3 in JSON Lines format using async OpenDAL (requires `pip install silkworm-rs[s3]`).

- `VortexPipeline` writes items to a [Vortex](https://github.com/spiraldb/vortex) file for high-performance columnar storage with 100x faster random access and 10-20x faster scans compared to Parquet (requires `pip install silkworm-rs[vortex]`).

- `WebhookPipeline` sends items to webhook endpoints via HTTP POST/PUT using wreq (same HTTP client as the spider) with support for batching and custom headers.

- `GoogleSheetsPipeline` appends items to Google Sheets with automatic flattening of nested data structures (requires `pip install silkworm-rs[gsheets]` and service account credentials).

- `SnowflakePipeline` sends items to Snowflake data warehouse tables as JSON (requires `pip install silkworm-rs[snowflake]`).

- `FTPPipeline` writes items to an FTP server in JSON Lines format (requires `pip install silkworm-rs[ftp]`).

- `SFTPPipeline` writes items to an SFTP server in JSON Lines format with support for password or key-based authentication (requires `pip install silkworm-rs[sftp]`).

- `CassandraPipeline` sends items to Apache Cassandra database tables (requires `pip install silkworm-rs[cassandra]`).

- `CouchDBPipeline` sends items to CouchDB databases as documents (requires `pip install silkworm-rs[couchdb]`).

- `DynamoDBPipeline` sends items to AWS DynamoDB tables with automatic table creation (requires `pip install silkworm-rs[dynamodb]`).

- `DuckDBPipeline` sends items to a DuckDB database table as JSON (requires `pip install silkworm-rs[duckdb]`).

- `CallbackPipeline` invokes a custom callback function (sync or async) on each item, enabling inline processing logic without creating a full pipeline class. See "Using CallbackPipeline for custom processing" for an example.
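The flattening described for `CSVPipeline` can be pictured with a small standalone sketch. This is not the pipeline's source: `flatten_item` is a hypothetical helper, and the underscore separator and comma join are taken from the description in the list above.

```python
from typing import Any


def flatten_item(item: dict[str, Any], parent: str = "", sep: str = "_") -> dict[str, Any]:
    """Flatten nested dicts into one level and join lists with commas."""
    flat: dict[str, Any] = {}
    for key, value in item.items():
        # Prefix nested keys with the parent key, e.g. user + name -> user_name.
        name = f"{parent}{sep}{key}" if parent else key
        if isinstance(value, dict):
            flat.update(flatten_item(value, name, sep))
        elif isinstance(value, (list, tuple)):
            flat[name] = ",".join(str(v) for v in value)
        else:
            flat[name] = value
    return flat


if __name__ == "__main__":
    row = flatten_item({"user": {"name": "Alice"}, "tags": ["life", "love"]})
    print(row)
```

A flat row like this maps directly onto CSV columns, which is why nested data survives in `XMLPipeline` output but is collapsed in CSV output.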

## Using CallbackPipeline for custom processing

Process items with custom callback functions without creating a full pipeline class:

```python
from silkworm import run_spider
from silkworm.pipelines import CallbackPipeline


# Sync callback
def print_item(item, spider):
    print(f"[{spider.name}] {item}")
    return item


# Async callback
async def validate_item(item, spider):
    # Could do async operations like database checks
    if len(item.get("text", "")) < 10:
        print("Warning: Short text in item")
    return item


# Modifying callback
def enrich_item(item, spider):
    item["spider_name"] = spider.name
    item["processed"] = True
    return item


# QuotesSpider is the spider defined in the quick start.
run_spider(
    QuotesSpider,
    item_pipelines=[
        CallbackPipeline(callback=print_item),
        CallbackPipeline(callback=validate_item),
        CallbackPipeline(callback=enrich_item),
    ],
)
```

Callbacks receive `(item, spider)` and should return the processed item; returning `None` passes the original item through unchanged.
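The callback contract can be sketched without Silkworm installed. The helper below is hypothetical (`apply_callback` is not a Silkworm function) and only illustrates the documented behaviour: sync and async callbacks are both accepted, and a `None` result keeps the original item.

```python
import asyncio
import inspect
from typing import Any, Callable


async def apply_callback(callback: Callable[[Any, Any], Any], item: Any, spider: Any) -> Any:
    """Call a sync or async callback; fall back to the original item on None."""
    result = callback(item, spider)
    if inspect.isawaitable(result):
        # Async callbacks return a coroutine that still has to be awaited.
        result = await result
    return item if result is None else result


def log_only(item, spider):
    # Returns None implicitly, so the original item passes through unchanged.
    print(f"saw {item}")


async def add_flag(item, spider):
    return {**item, "processed": True}


if __name__ == "__main__":
    async def demo():
        item = {"text": "hello"}
        item = await apply_callback(log_only, item, spider=None)
        item = await apply_callback(add_flag, item, spider=None)
        print(item)

    asyncio.run(demo())
```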

## Streaming items to a queue with TaskiqPipeline

Stream scraped items to a [Taskiq](https://taskiq-python.github.io/) queue for distributed processing:

```python
from taskiq import InMemoryBroker

from silkworm import run_spider
from silkworm.pipelines import TaskiqPipeline

broker = InMemoryBroker()


@broker.task
async def process_item(item):
    # Your item processing logic here
    print(f"Processing: {item}")
    # Save to database, send to another service, etc.


pipeline = TaskiqPipeline(broker, task=process_item)

# MySpider is a placeholder for your own Spider subclass.
run_spider(MySpider, item_pipelines=[pipeline])
```

This enables distributed processing, retries, rate limiting, and other Taskiq features. See `examples/taskiq_quotes_spider.py` for a complete example.

## Handling non-HTML responses

Keep crawls cheap when URLs mix HTML and binaries/APIs:

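```python
from silkworm import run_spider
from silkworm.middlewares import SkipNonHTMLMiddleware

run_spider(
    MySpider,  # placeholder for your own Spider subclass
    response_middlewares=[SkipNonHTMLMiddleware(sniff_bytes=1024)],
    # Tighten HTML parsing size (bytes) to avoid loading huge bodies into scraper-rs
    html_max_size_bytes=1_000_000,
)
```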
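`SkipNonHTMLMiddleware` is described as checking `Content-Type` and optionally sniffing the body. The sketch below shows one way such a check could look. It is not the middleware's implementation: `looks_like_html`, the marker list, and the choice to sniff only when the header is missing are all assumptions made for illustration.

```python
from typing import Optional

HTML_TYPES = ("text/html", "application/xhtml+xml")
HTML_MARKERS = (b"<!doctype html", b"<html", b"<head", b"<body")


def looks_like_html(content_type: Optional[str], body: bytes, sniff_bytes: int = 1024) -> bool:
    """Return True when a response is probably HTML (illustrative heuristic)."""
    if content_type:
        # Drop parameters such as "; charset=utf-8" before comparing.
        media_type = content_type.split(";", 1)[0].strip().lower()
        return media_type in HTML_TYPES
    # No Content-Type header: inspect only the first `sniff_bytes` bytes.
    head = body[:sniff_bytes].lstrip().lower()
    return any(marker in head for marker in HTML_MARKERS)


if __name__ == "__main__":
    print(looks_like_html("text/html; charset=utf-8", b""))
    print(looks_like_html(None, b"\x89PNG\r\n\x1a\n"))
```

Capping the sniff at a small prefix keeps the check cheap even when the body is a large binary download.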

## Performance optimization with rsloop

For improved async performance, enable rsloop as a drop-in replacement for asyncio's event loop:

```bash
pip install silkworm-rs[rsloop]
# or with uv:
uv pip install silkworm-rs[rsloop]
```

Then call `run_spider_rsloop` (same signature as `run_spider`):

```python
from silkworm import run_spider_rsloop

run_spider_rsloop(
    QuotesSpider,
    concurrency=32,
)
```

## Performance optimization with uvloop

For improved async performance, enable uvloop (a fast, drop-in replacement for asyncio's event loop):

```bash
pip install silkworm-rs[uvloop]
# or with uv:
uv pip install silkworm-rs[uvloop]
```

Then call `run_spider_uvloop` (same signature as `run_spider`):

```python
from silkworm import run_spider_uvloop

run_spider_uvloop(
    QuotesSpider,
    concurrency=32,
)
```

uvloop can provide a 2-4x performance improvement for I/O-bound workloads.

## Performance optimization with winloop (Windows)

For Windows users who want improved async performance, enable winloop (a Windows-compatible alternative to uvloop):

```bash
pip install silkworm-rs[winloop]
# or with uv:
uv pip install silkworm-rs[winloop]
```

Then call `run_spider_winloop` (same signature as `run_spider`):

```python
from silkworm import run_spider_winloop

run_spider_winloop(
    QuotesSpider,
    concurrency=32,
)
```

winloop provides significant performance improvements on Windows, similar to what uvloop offers on Unix-like systems.

## Running spiders with trio

If you prefer trio over asyncio, you can use `run_spider_trio` instead of `run_spider`:

```bash
pip install silkworm-rs[trio]
# or with uv:
uv pip install silkworm-rs[trio]
```

Then use `run_spider_trio`:

```python
from silkworm import run_spider_trio

run_spider_trio(
    QuotesSpider,
    concurrency=16,
    request_timeout=10,
)
```

This runs your spider with trio as the async backend via the trio-asyncio compatibility layer.

## JavaScript rendering with Servo

For pages that need JavaScript execution but do not require driving an external browser process, install the optional Servo renderer and pass `ServoFetchClient` as the spider HTTP client.

Install a `servofetch` wheel from the releases page: https://github.com/RustedBytes/servofetch-py/releases

```python
from silkworm import HTMLResponse, Response, ServoFetchClient, Spider, run_spider


class RenderedSpider(Spider):
    name = "rendered"
    start_urls = ("https://example.com/",)

    async def parse(self, response: Response):
        if isinstance(response, HTMLResponse):
            title = await response.select_first("title")
            yield {"title": title.text if title else ""}


run_spider(RenderedSpider, http_client=ServoFetchClient(settle_ms=500))
```

Per-request render options live in `Request.meta`: `servo_javascript`, `servo_settle_ms`, `servo_user_agent`, `servo_screenshot`, and `servo_full_page`. `Request.timeout` overrides the client timeout for that request.

`ServoFetchClient` embeds Servo through `servofetch`; the existing CDP client connects to an external Lightpanda/Chrome-compatible browser over WebSocket. Use the default wreq client when pages do not need client-side rendering.

## JavaScript rendering with Lightpanda (CDP)

For pages that require JavaScript execution, you can use Lightpanda (or any CDP-compatible browser) instead of the standard HTTP client. This uses the Chrome DevTools Protocol (CDP) to control a browser.

### Installation

```bash
pip install silkworm-rs[cdp]
# or with uv:
uv pip install silkworm-rs[cdp]
```

### Starting Lightpanda

```bash
lightpanda --remote-debugging-port=9222
```

Or use Chrome/Chromium:

```bash
chromium --remote-debugging-port=9222 --headless
```

### Using CDP in your spider

There are two ways to use CDP: the convenience API or custom spider integration.

#### Convenience API (simple one-off fetches)

```python
import asyncio

from silkworm import fetch_html_cdp


async def main():
    # Fetch HTML with JavaScript rendering
    text, doc = await fetch_html_cdp(
        "https://example.com",
        ws_endpoint="ws://127.0.0.1:9222",
        timeout=30.0,
    )

    # Extract data from rendered page (selectors are awaited, as in the other examples)
    title = await doc.select_first("title")
    print(title.text if title else "No title")


asyncio.run(main())
```

#### Full Spider Integration

```python
from silkworm import HTMLResponse, Request, Response, Spider
from silkworm.cdp import CDPClient


class LightpandaSpider(Spider):
    name = "lightpanda"
    start_urls = ("https://example.com/",)

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self._cdp_client = None

    async def start_requests(self):
        # Connect to CDP endpoint
        self._cdp_client = CDPClient(
            ws_endpoint="ws://127.0.0.1:9222",
            timeout=30.0,
        )
        await self._cdp_client.connect()

        for url in self.start_urls:
            yield Request(url=url, callback=self.parse)

    async def parse(self, response: Response):
        if not isinstance(response, HTMLResponse):
            return

        # Extract links from JavaScript-rendered page
        for link in await response.select("a"):
            href = link.attr("href")
            if href:
                yield {"url": href}

    async def close(self):
        if self._cdp_client:
            await self._cdp_client.close()
```

See `examples/lightpanda_simple.py` and `examples/lightpanda_spider.py` for complete working examples.

**Note:** CDP support is experimental. For production use, consider using dedicated browser automation tools or the standard HTTP client when JavaScript rendering is not required.

## Onion services with OnionLink

For Tor v3 `.onion` sites, install the optional OnionLink extra and pass `OnionLinkClient` as the spider HTTP client:

```bash
pip install "silkworm-rs[onionlink]"
```

```python
from silkworm import HTMLResponse, OnionLinkClient, Response, Spider, run_spider


class OnionSpider(Spider):
    name = "onion"
    start_urls = ("http://exampleexampleexampleexampleexampleexampleexampleexampleexampleexample.onion/",)

    async def parse(self, response: Response):
        if isinstance(response, HTMLResponse):
            title = await response.select_first("title")
            yield {"title": title.text if title else ""}


run_spider(
    OnionSpider,
    http_client=OnionLinkClient(concurrency=4, timeout=30),
)
```

`OnionLinkClient` supports Silkworm `Request` headers, `params`, body/data, JSON payloads, redirects, HTML detection, and `request.meta["redirect_times"]`. Override OnionLink's response byte cap per request with `request.meta["onionlink_response_limit"]`.

## Logging and crawl statistics

- Structured logs via `logly`; set `SILKWORM_LOG_LEVEL=DEBUG` for verbose request/response/middleware output.

- Periodic statistics with `log_stats_interval`; final stats always include elapsed time, queue size, requests/sec, seen URLs, items scraped, errors, and memory MB.

## Limitations

- By default, HTTP fetches are wreq-based and do not execute JavaScript; pages that require client-side rendering can use the optional Servo or CDP integrations (see the "JavaScript rendering with Servo" and "JavaScript rendering with Lightpanda (CDP)" sections) or external browser automation tools. Tor v3 `.onion` sites can use the optional OnionLink integration.

- Request deduplication keys only on `Request.url`; query params, HTTP method, and body are ignored, so same-URL requests with different params/data are dropped unless you set `dont_filter=True` or make the URL unique yourself.

- HTML parsing auto-detects encoding (BOM, HTTP headers/meta, charset detection fallback) but still enforces a `html_max_size_bytes`/`doc_max_size_bytes` cap (default 5 MB) in `scraper-rs` selectors, so very large pages may need a higher limit or preprocessing.

- Several pipelines buffer all items in memory until close (PolarsPipeline, ExcelPipeline, YAMLPipeline, AvroPipeline, VortexPipeline, S3JsonLinesPipeline, FTPPipeline, SFTPPipeline), which can bloat RAM on long crawls; prefer streaming pipelines like JsonLines/CSV/SQLite for high-volume runs.

- Many destination pipelines rely on optional extras; CassandraPipeline is disabled on Windows because `cassandra-driver` depends on libev there.
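Because request deduplication keys only on the URL, one way to "make the URL unique yourself" is to encode the distinguishing parameters into the URL before building the request. The sketch below uses only the standard library; `url_with_params` is a hypothetical helper, not part of Silkworm.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def url_with_params(url: str, params: dict[str, str]) -> str:
    """Merge `params` into the URL's query string in a stable order."""
    parts = urlsplit(url)
    # Note: building a dict collapses repeated query keys to the last value.
    query = dict(parse_qsl(parts.query, keep_blank_values=True))
    query.update(params)
    # Sorting keeps the result deterministic, so identical requests still
    # map to the same URL and remain deduplicated.
    new_query = urlencode(sorted(query.items()))
    return urlunsplit((parts.scheme, parts.netloc, parts.path, new_query, parts.fragment))


if __name__ == "__main__":
    print(url_with_params("https://example.com/search?q=rust", {"page": "2"}))
```

Requests that differ only in their parameters then get distinct URLs, while `dont_filter=True` remains the simpler option when a request should bypass deduplication entirely.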

## Examples

- `python examples/quotes_spider.py` → `data/quotes.jl`

- `python examples/quotes_spider_trio.py` → `data/quotes_trio.jl` (demonstrates trio backend)

- `python examples/quotes_spider_winloop.py` → `data/quotes_winloop.jl` (demonstrates winloop backend for Windows)

- `python examples/hackernews_spider.py --pages 5` → `data/hackernews.jl`

- `python examples/lobsters_spider.py --pages 2` → `data/lobsters.jl`

- `python examples/url_titles_spider.py --urls-file data/url_titles.jl --output data/titles.jl` (includes `SkipNonHTMLMiddleware` and stricter HTML size limits)

- `python examples/exception_handling_spider.py` → `data/exception_handling.jl` (demonstrates `process_exception` and request `errback`)

- `SILKWORM_LOG_LEVEL=DEBUG python examples/logging_controls_demo.py --mode noisy` then `--mode quiet` → demonstrates noisy pipeline/URL logging and the quieter `EngineLogger` + pipeline `log_level=None` setup

- `python examples/export_formats_demo.py --pages 2` → JSONL, XML, and CSV outputs in `data/`

- `python examples/taskiq_quotes_spider.py --pages 2` → demonstrates TaskiqPipeline for queue-based processing

- `python examples/sitemap_spider.py --sitemap-url https://example.com/sitemap.xml --pages 50` → `data/sitemap_meta.jl` (extracts meta tags and Open Graph data from sitemap URLs)

- `python examples/lightpanda_simple.py` → demonstrates CDP/Lightpanda for JavaScript rendering (requires `pip install silkworm-rs[cdp]` and running Lightpanda)

- `python examples/lightpanda_spider.py` → full spider example using CDP/Lightpanda

## Convenience API

For one-off fetches without a full spider:

### Standard HTTP fetch

```python
import asyncio

from silkworm import fetch_html


async def main():
    text, doc = await fetch_html("https://example.com")
    title = await doc.select_first("title")
    print(title.text if title else "No title")


asyncio.run(main())
```

### CDP-based fetch (with JavaScript rendering)

```python
import asyncio

from silkworm import fetch_html_cdp


async def main():
    # Requires Lightpanda/Chrome running with CDP enabled
    text, doc = await fetch_html_cdp("https://example.com")
    title = await doc.select_first("title")
    print(title.text if title else "No title")


asyncio.run(main())
```

## Contributing

Pull requests and issues are welcome. To set up a dev environment, install [uv](https://docs.astral.sh/uv/getting-started/), create a Python 3.13 virtualenv, and sync dev dependencies:

```bash
uv venv --python python3.13
uv sync --group dev
```

Run the checks before opening a PR:

```bash
just fmt && just lint && just typecheck && just test
```

## Acknowledgements

Silkworm is built on top of excellent open-source projects:

- [wreq](https://github.com/0x676e67/wreq-python) - HTTP client with browser impersonation capabilities

- [onionlink](https://github.com/RustedBytes/onionlink-rs) - Tor v3 onion-service client

- [servofetch](https://github.com/RustedBytes/servofetch-py) - Bindings to the Servo browser

- [scraper-rs](https://github.com/RustedBytes/scraper-rs) - Fast HTML parsing library

- [logly](https://github.com/muhammad-fiaz/logly) - Structured logging

- [rxml](https://github.com/nephi-dev/rxml) - XML parsing and writing

We are grateful to the maintainers and contributors of these projects for their work.

## License

MIT License. See `LICENSE` for details.