{"id":36953845,"url":"https://github.com/bitingsnakes/silkworm","last_synced_at":"2026-05-15T23:11:21.812Z","repository":{"id":328034217,"uuid":"1112414186","full_name":"BitingSnakes/silkworm","owner":"BitingSnakes","description":"Async web scraping framework on top of Rust. Works with Free-threaded Python (`PYTHON_GIL=0`).","archived":false,"fork":false,"pushed_at":"2026-03-31T14:57:00.000Z","size":1695,"stargazers_count":52,"open_issues_count":2,"forks_count":1,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-04-02T09:33:26.303Z","etag":null,"topics":["free-threaded-python","rust","scraping","web-scraping"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/BitingSnakes.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":"AGENTS.md","dco":null,"cla":null}},"created_at":"2025-12-08T15:38:01.000Z","updated_at":"2026-03-31T09:09:24.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/BitingSnakes/silkworm","commit_stats":null,"previous_names":["bitingsnakes/silkworm"],"tags_count":58,"template":false,"template_full_name":null,"purl":"pkg:github/BitingSnakes/silkworm","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/BitingSnakes%2Fsilkworm","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/BitingSnakes%2Fsilkworm/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/BitingSnakes%2Fsilkworm/rele
ases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/BitingSnakes%2Fsilkworm/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/BitingSnakes","download_url":"https://codeload.github.com/BitingSnakes/silkworm/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/BitingSnakes%2Fsilkworm/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32571492,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-03T06:36:36.687Z","status":"ssl_error","status_checked_at":"2026-05-03T06:36:09.306Z","response_time":103,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["free-threaded-python","rust","scraping","web-scraping"],"created_at":"2026-01-13T12:54:03.639Z","updated_at":"2026-05-15T23:11:21.805Z","avatar_url":"https://github.com/BitingSnakes.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# silkworm-rs\n\n[![PyPI - Version](https://img.shields.io/pypi/v/silkworm-rs)](https://pypi.org/project/silkworm-rs/)\n[![Tests](https://github.com/BitingSnakes/silkworm/actions/workflows/tests.yml/badge.svg)](https://github.com/BitingSnakes/silkworm/actions/workflows/tests.yml)\n[![PyPI 
Downloads](https://static.pepy.tech/personalized-badge/silkworm-rs?period=total\u0026units=INTERNATIONAL_SYSTEM\u0026left_color=BLACK\u0026right_color=GREEN\u0026left_text=downloads)](https://pepy.tech/projects/silkworm-rs)\n\nAsync-first web scraping framework built on [wreq](https://github.com/0x676e67/wreq-python) (HTTP with browser impersonation) and [scraper-rs](https://github.com/RustedBytes/scraper-rs) (fast HTML parsing). Silkworm gives you a minimal Spider/Request/Response model, middlewares, and pipelines so you can script quick scrapes or build larger crawlers without boilerplate.\n\n\u003e **NEW**: Use [silkworm-mcp](https://github.com/BitingSnakes/silkworm-mcp) to build scrapers.\n\n## Features\n- Async engine with configurable concurrency, bounded queue backpressure (defaults to `concurrency * 10`), and per-request timeouts.\n- wreq-powered HTTP client: browser impersonation, redirect following with loop detection, query merging, and proxy support via `request.meta[\"proxy\"]`.\n- Optional OnionLink client integration for scraping Tor v3 `.onion` sites without routing through wreq.\n- Optional Servo rendering via `ServoFetchClient` for JavaScript-rendered pages without changing the default HTTP client.\n- Typed spiders and callbacks that can return items or `Request` objects; `HTMLResponse` ships helper methods plus `Response.follow` to reuse callbacks.\n- Middlewares: User-Agent rotation/default, proxy rotation, retry with exponential backoff + optional sleep codes, flexible delays (fixed/random/custom), `SkipNonHTMLMiddleware` to drop non-HTML callbacks, and `CloudflareCrawlMiddleware` for Browser Rendering crawl jobs.\n- Pipelines: JSON Lines, SQLite, XML (nested data preserved), and CSV (flattens dicts and lists) out of the box.\n- Structured logging via `logly` (`SILKWORM_LOG_LEVEL=DEBUG`), plus periodic/final crawl statistics (requests/sec, queue size, memory, seen URLs).\n\n## Installation\n\nFrom PyPI with pip:\n\n```bash\npip install 
silkworm-rs\n```\n\nFrom PyPI with uv (recommended for faster installs):\n\n```bash\nuv pip install silkworm-rs\n# or if using uv's project management:\nuv add silkworm-rs\n```\n\nFrom source:\n\n```bash\nuv venv  # install uv from https://docs.astral.sh/uv/getting-started/ if needed\nsource .venv/bin/activate  # Windows: .venv\\Scripts\\activate\nuv pip install -e .\n```\n\nTargets Python 3.13+; dependencies are pinned in `pyproject.toml`.\n\n## Quick start\nDefine a spider by subclassing `Spider`, implementing `parse`, and yielding items or follow-up `Request` objects. This example writes quotes to `data/quotes.jl` and enables basic user agent, retry, and non-HTML filtering middlewares.\n\n```python\nfrom silkworm import HTMLResponse, Response, Spider, run_spider\nfrom silkworm.middlewares import (\n    RetryMiddleware,\n    SkipNonHTMLMiddleware,\n    UserAgentMiddleware,\n)\nfrom silkworm.pipelines import JsonLinesPipeline\n\n\nclass QuotesSpider(Spider):\n    name = \"quotes\"\n    start_urls = (\"https://quotes.toscrape.com/\",)\n\n    async def parse(self, response: Response):\n        if not isinstance(response, HTMLResponse):\n            return\n\n        html = response\n        for quote in await html.select(\".quote\"):\n            text_el = await quote.select_first(\".text\")\n            author_el = await quote.select_first(\".author\")\n            if text_el is None or author_el is None:\n                continue\n            tags = await quote.select(\".tag\")\n            yield {\n                \"text\": text_el.text,\n                \"author\": author_el.text,\n                \"tags\": [t.text for t in tags],\n            }\n\n        if next_link := await html.select_first(\"li.next \u003e a\"):\n            yield html.follow(next_link.attr(\"href\"), callback=self.parse)\n\n\nif __name__ == \"__main__\":\n    run_spider(\n        QuotesSpider,\n        request_middlewares=[UserAgentMiddleware()],\n        response_middlewares=[\n          
  SkipNonHTMLMiddleware(),\n            RetryMiddleware(max_times=3, sleep_http_codes=[429, 503]),\n        ],\n        item_pipelines=[JsonLinesPipeline(\"data/quotes.jl\")],\n        concurrency=16,\n        request_timeout=10,\n        log_stats_interval=30,\n    )\n```\n\n`run_spider`/`crawl` knobs:\n- `concurrency`: number of concurrent HTTP requests; default 16.\n- `max_pending_requests`: queue bound to avoid unbounded memory use (defaults to `concurrency * 10`).\n- `request_timeout`: per-request timeout (seconds).\n- `keep_alive`: reuse HTTP connections when supported by the underlying client (sends `Connection: keep-alive`).\n- `http_client`: use a custom client instance such as `OnionLinkClient(...)` or `ServoFetchClient(...)` instead of the default wreq-backed client.\n- `html_max_size_bytes`: limit HTML parsed into `AsyncDocument` to avoid huge payloads.\n- `log_stats_interval`: seconds between periodic stats logs; final stats are always emitted.\n- `request_middlewares` / `response_middlewares` / `item_pipelines`: plug-ins run on every request/response/item.\n- use `run_spider_rsloop(...)` instead of `run_spider(...)` to run under rsloop (requires `pip install silkworm-rs[rsloop]`).\n- use `run_spider_uvloop(...)` instead of `run_spider(...)` to run under uvloop (requires `pip install silkworm-rs[uvloop]`).\n- use `run_spider_winloop(...)` instead of `run_spider(...)` to run under winloop on Windows (requires `pip install silkworm-rs[winloop]`).\n\n## Built-in middlewares and pipelines\n\n```python\nfrom silkworm.middlewares import (\n    CloudflareCrawlMiddleware,\n    DelayMiddleware,\n    ProxyMiddleware,\n    RetryMiddleware,\n    SkipNonHTMLMiddleware,\n    UserAgentMiddleware,\n)\nfrom silkworm.pipelines import (\n    CallbackPipeline,  # invoke a custom callback function on each item\n    CSVPipeline,\n    JsonLinesPipeline,\n    MsgPackPipeline,  # requires: pip install silkworm-rs[msgpack]\n    SQLitePipeline,\n    XMLPipeline,\n    
TaskiqPipeline,  # requires: pip install silkworm-rs[taskiq]\n    PolarsPipeline,  # requires: pip install silkworm-rs[polars]\n    ExcelPipeline,  # requires: pip install silkworm-rs[excel]\n    YAMLPipeline,  # requires: pip install silkworm-rs[yaml]\n    AvroPipeline,  # requires: pip install silkworm-rs[avro]\n    ElasticsearchPipeline,  # requires: pip install silkworm-rs[elasticsearch]\n    MongoDBPipeline,  # requires: pip install silkworm-rs[mongodb]\n    MySQLPipeline,  # requires: pip install silkworm-rs[mysql]\n    PostgreSQLPipeline,  # requires: pip install silkworm-rs[postgresql]\n    S3JsonLinesPipeline,  # requires: pip install silkworm-rs[s3]\n    VortexPipeline,  # requires: pip install silkworm-rs[vortex]\n    WebhookPipeline,  # sends items to webhook endpoints using wreq\n    GoogleSheetsPipeline,  # requires: pip install silkworm-rs[gsheets]\n    SnowflakePipeline,  # requires: pip install silkworm-rs[snowflake]\n    FTPPipeline,  # requires: pip install silkworm-rs[ftp]\n    SFTPPipeline,  # requires: pip install silkworm-rs[sftp]\n    CassandraPipeline,  # requires: pip install silkworm-rs[cassandra]\n    CouchDBPipeline,  # requires: pip install silkworm-rs[couchdb]\n    DynamoDBPipeline,  # requires: pip install silkworm-rs[dynamodb]\n    DuckDBPipeline,  # requires: pip install silkworm-rs[duckdb]\n)\n\nrun_spider(\n    QuotesSpider,\n    request_middlewares=[\n        UserAgentMiddleware(),  # rotate/custom user agent\n        DelayMiddleware(min_delay=0.3, max_delay=1.2),  # polite throttling\n        # ProxyMiddleware with round-robin selection (default)\n        # ProxyMiddleware(proxies=[\"http://user:pass@proxy1:8080\", \"http://proxy2:8080\"]),\n        # ProxyMiddleware with random selection\n        # ProxyMiddleware(proxies=[\"http://proxy1:8080\", \"http://proxy2:8080\"], random_selection=True),\n        # ProxyMiddleware from file with random selection\n        # ProxyMiddleware(proxy_file=\"proxies.txt\", 
random_selection=True),\n    ],\n    response_middlewares=[\n        RetryMiddleware(max_times=3, sleep_http_codes=[403, 429]),  # backoff + retry\n        SkipNonHTMLMiddleware(),  # drop callbacks for images/APIs/etc\n    ],\n    item_pipelines=[\n        JsonLinesPipeline(\"data/quotes.jl\"),\n        SQLitePipeline(\"data/quotes.db\", table=\"quotes\"),\n        XMLPipeline(\"data/quotes.xml\", root_element=\"quotes\", item_element=\"quote\"),\n        CSVPipeline(\"data/quotes.csv\", fieldnames=[\"author\", \"text\", \"tags\"]),\n        MsgPackPipeline(\"data/quotes.msgpack\"),\n    ],\n)\n```\n\n- `DelayMiddleware` strategies: `delay=1.0` (fixed), `min_delay/max_delay` (random), or `delay_func` (custom).\n- `ProxyMiddleware` supports three modes:\n  - **Round-robin (default)**: `ProxyMiddleware(proxies=[\"http://proxy1:8080\", \"http://proxy2:8080\"])` cycles through proxies in order.\n  - **Random selection**: `ProxyMiddleware(proxies=[\"http://proxy1:8080\", \"http://proxy2:8080\"], random_selection=True)` randomly selects a proxy for each request.\n  - **From file**: `ProxyMiddleware(proxy_file=\"proxies.txt\")` loads proxies from a file (one proxy per line, blank lines ignored). 
Combine with `random_selection=True` for random selection from the file.\n- `RetryMiddleware` backs off with `asyncio.sleep`; any status in `sleep_http_codes` is retried even if not in `retry_http_codes`.\n- `SkipNonHTMLMiddleware` checks `Content-Type` and optionally sniffs the body (`sniff_bytes`) to avoid running HTML callbacks on binary/API responses.\n- `CloudflareCrawlMiddleware` is opt-in per request via `request.meta[\"cloudflare_crawl\"]`; it submits a Cloudflare Browser Rendering crawl job, polls until completion, and hands your callback a synthetic JSON `Response` with the final API payload.\n- `JsonLinesPipeline` writes items to a local JSON Lines file and, when `opendal` is installed, appends asynchronously via the filesystem backend (`use_opendal=False` to stick to a regular file handle).\n- `CSVPipeline` flattens nested dicts (e.g., `{\"user\": {\"name\": \"Alice\"}}` -\u003e `user_name`) and joins lists with commas; `XMLPipeline` preserves nesting.\n- `MsgPackPipeline` writes items in binary MessagePack format using [ormsgpack](https://github.com/aviramha/ormsgpack) for fast and compact serialization (requires `pip install silkworm-rs[msgpack]`).\n- `TaskiqPipeline` sends items to a [Taskiq](https://taskiq-python.github.io/) queue for distributed processing (requires `pip install silkworm-rs[taskiq]`).\n- `PolarsPipeline` writes items to a Parquet file using Polars for efficient columnar storage (requires `pip install silkworm-rs[polars]`).\n- `ExcelPipeline` writes items to an Excel .xlsx file (requires `pip install silkworm-rs[excel]`).\n- `YAMLPipeline` writes items to a YAML file (requires `pip install silkworm-rs[yaml]`).\n- `AvroPipeline` writes items to an Avro file with optional schema (requires `pip install silkworm-rs[avro]`).\n- `ElasticsearchPipeline` sends items to an Elasticsearch index (requires `pip install silkworm-rs[elasticsearch]`).\n- `MongoDBPipeline` sends items to a MongoDB collection (requires `pip install 
silkworm-rs[mongodb]`).\n- `MySQLPipeline` sends items to a MySQL database table as JSON (requires `pip install silkworm-rs[mysql]`).\n- `PostgreSQLPipeline` sends items to a PostgreSQL database table as JSONB (requires `pip install silkworm-rs[postgresql]`).\n- `S3JsonLinesPipeline` writes items to AWS S3 in JSON Lines format using async OpenDAL (requires `pip install silkworm-rs[s3]`).\n- `VortexPipeline` writes items to a [Vortex](https://github.com/spiraldb/vortex) file for high-performance columnar storage with 100x faster random access and 10-20x faster scans compared to Parquet (requires `pip install silkworm-rs[vortex]`).\n- `WebhookPipeline` sends items to webhook endpoints via HTTP POST/PUT using wreq (same HTTP client as the spider) with support for batching and custom headers.\n- `GoogleSheetsPipeline` appends items to Google Sheets with automatic flattening of nested data structures (requires `pip install silkworm-rs[gsheets]` and service account credentials).\n- `SnowflakePipeline` sends items to Snowflake data warehouse tables as JSON (requires `pip install silkworm-rs[snowflake]`).\n- `FTPPipeline` writes items to an FTP server in JSON Lines format (requires `pip install silkworm-rs[ftp]`).\n- `SFTPPipeline` writes items to an SFTP server in JSON Lines format with support for password or key-based authentication (requires `pip install silkworm-rs[sftp]`).\n- `CassandraPipeline` sends items to Apache Cassandra database tables (requires `pip install silkworm-rs[cassandra]`).\n- `CouchDBPipeline` sends items to CouchDB databases as documents (requires `pip install silkworm-rs[couchdb]`).\n- `DynamoDBPipeline` sends items to AWS DynamoDB tables with automatic table creation (requires `pip install silkworm-rs[dynamodb]`).\n- `DuckDBPipeline` sends items to a DuckDB database table as JSON (requires `pip install silkworm-rs[duckdb]`).\n- `CallbackPipeline` invokes a custom callback function (sync or async) on each item, enabling inline processing logic 
without creating a full pipeline class. See example below.\n\n## Using CallbackPipeline for custom processing\nProcess items with custom callback functions without creating a full pipeline class:\n\n```python\nfrom silkworm.pipelines import CallbackPipeline\n\n# Sync callback\ndef print_item(item, spider):\n    print(f\"[{spider.name}] {item}\")\n    return item\n\n# Async callback\nasync def validate_item(item, spider):\n    # Could do async operations like database checks\n    if len(item.get(\"text\", \"\")) \u003c 10:\n        print(f\"Warning: Short text in item\")\n    return item\n\n# Modifying callback\ndef enrich_item(item, spider):\n    item[\"spider_name\"] = spider.name\n    item[\"processed\"] = True\n    return item\n\nrun_spider(\n    QuotesSpider,\n    item_pipelines=[\n        CallbackPipeline(callback=print_item),\n        CallbackPipeline(callback=validate_item),\n        CallbackPipeline(callback=enrich_item),\n    ],\n)\n```\n\nCallbacks receive `(item, spider)` and should return the processed item (or `None` to return the original item unchanged).\n\n## Streaming items to a queue with TaskiqPipeline\nStream scraped items to a [Taskiq](https://taskiq-python.github.io/) queue for distributed processing:\n\n```python\nfrom taskiq import InMemoryBroker\nfrom silkworm.pipelines import TaskiqPipeline\n\nbroker = InMemoryBroker()\n\n@broker.task\nasync def process_item(item):\n    # Your item processing logic here\n    print(f\"Processing: {item}\")\n    # Save to database, send to another service, etc.\n\npipeline = TaskiqPipeline(broker, task=process_item)\nrun_spider(MySpider, item_pipelines=[pipeline])\n```\n\nThis enables distributed processing, retries, rate limiting, and other Taskiq features. 
See `examples/taskiq_quotes_spider.py` for a complete example.\n\n## Handling non-HTML responses\nKeep crawls cheap when URLs mix HTML and binaries/APIs:\n\n```python\nresponse_middlewares=[SkipNonHTMLMiddleware(sniff_bytes=1024)]\n# Tighten HTML parsing size (bytes) to avoid loading huge bodies into scraper-rs\nrun_spider(MySpider, html_max_size_bytes=1_000_000)\n```\n\n## Performance optimization with rsloop\nFor improved async performance, enable rsloop as a drop-in replacement for asyncio's event loop:\n\n```bash\npip install silkworm-rs[rsloop]\n# or with uv:\nuv pip install silkworm-rs[rsloop]\n```\n\nThen call `run_spider_rsloop` (same signature as `run_spider`):\n\n```python\nfrom silkworm import run_spider_rsloop\n\nrun_spider_rsloop(\n    QuotesSpider,\n    concurrency=32,\n)\n```\n\n## Performance optimization with uvloop\nFor improved async performance, enable uvloop (a fast, drop-in replacement for asyncio's event loop):\n\n```bash\npip install silkworm-rs[uvloop]\n# or with uv:\nuv pip install silkworm-rs[uvloop]\n```\n\nThen call `run_spider_uvloop` (same signature as `run_spider`):\n\n```python\nfrom silkworm import run_spider_uvloop\n\nrun_spider_uvloop(\n    QuotesSpider,\n    concurrency=32,\n)\n```\n\nuvloop can provide 2-4x performance improvement for I/O-bound workloads.\n\n## Performance optimization with winloop (Windows)\nFor Windows users who want improved async performance, enable winloop (a Windows-compatible alternative to uvloop):\n\n```bash\npip install silkworm-rs[winloop]\n# or with uv:\nuv pip install silkworm-rs[winloop]\n```\n\nThen call `run_spider_winloop` (same signature as `run_spider`):\n\n```python\nfrom silkworm import run_spider_winloop\n\nrun_spider_winloop(\n    QuotesSpider,\n    concurrency=32,\n)\n```\n\nwinloop provides significant performance improvements on Windows, similar to what uvloop offers on Unix-like systems.\n\n## Running spiders with trio\nIf you prefer trio over asyncio, you can use `run_spider_trio` 
instead of `run_spider`:\n\n```bash\npip install silkworm-rs[trio]\n# or with uv:\nuv pip install silkworm-rs[trio]\n```\n\nThen use `run_spider_trio`:\n\n```python\nfrom silkworm import run_spider_trio\n\nrun_spider_trio(\n    QuotesSpider,\n    concurrency=16,\n    request_timeout=10,\n)\n```\n\nThis runs your spider using trio as the async backend via trio-asyncio compatibility layer.\n\n## JavaScript rendering with Servo\nFor pages that need JavaScript execution but do not require driving an external browser process, install the optional Servo renderer and pass `ServoFetchClient` as the spider HTTP client.\n\nInstall a wheel from this page: https://github.com/RustedBytes/servofetch-py/releases\n\n```python\nfrom silkworm import HTMLResponse, Response, ServoFetchClient, Spider, run_spider\n\n\nclass RenderedSpider(Spider):\n    name = \"rendered\"\n    start_urls = (\"https://example.com/\",)\n\n    async def parse(self, response: Response):\n        if isinstance(response, HTMLResponse):\n            title = await response.select_first(\"title\")\n            yield {\"title\": title.text if title else \"\"}\n\n\nrun_spider(RenderedSpider, http_client=ServoFetchClient(settle_ms=500))\n```\n\nPer-request render options live in `Request.meta`: `servo_javascript`, `servo_settle_ms`, `servo_user_agent`, `servo_screenshot`, and `servo_full_page`. `Request.timeout` overrides the client timeout for that request.\n\n`ServoFetchClient` embeds Servo through `servofetch`; the existing CDP client connects to an external Lightpanda/Chrome-compatible browser over WebSocket. Use the default wreq client when pages do not need client-side rendering.\n\n## JavaScript rendering with Lightpanda (CDP)\nFor pages that require JavaScript execution, you can use Lightpanda (or any CDP-compatible browser) instead of the standard HTTP client. 
This uses the Chrome DevTools Protocol (CDP) to control a browser.\n\n### Installation\n```bash\npip install silkworm-rs[cdp]\n# or with uv:\nuv pip install silkworm-rs[cdp]\n```\n\n### Starting Lightpanda\n```bash\nlightpanda --remote-debugging-port=9222\n```\n\nOr use Chrome/Chromium:\n```bash\nchromium --remote-debugging-port=9222 --headless\n```\n\n### Using CDP in your spider\nThere are two ways to use CDP: the convenience API or custom spider integration.\n\n#### Convenience API (simple one-off fetches)\n```python\nimport asyncio\nfrom silkworm import fetch_html_cdp\n\nasync def main():\n    # Fetch HTML with JavaScript rendering\n    text, doc = await fetch_html_cdp(\n        \"https://example.com\",\n        ws_endpoint=\"ws://127.0.0.1:9222\",\n        timeout=30.0\n    )\n    \n    # Extract data from rendered page\n    title = await doc.select_first(\"title\")\n    print(title.text if title else \"No title\")\n\nasyncio.run(main())\n```\n\n#### Full Spider Integration\n```python\nfrom silkworm import HTMLResponse, Request, Response, Spider\nfrom silkworm.cdp import CDPClient\n\nclass LightpandaSpider(Spider):\n    name = \"lightpanda\"\n    start_urls = (\"https://example.com/\",)\n\n    def __init__(self, **kwargs):\n        super().__init__(**kwargs)\n        self._cdp_client = None\n\n    async def start_requests(self):\n        # Connect to CDP endpoint\n        self._cdp_client = CDPClient(\n            ws_endpoint=\"ws://127.0.0.1:9222\",\n            timeout=30.0\n        )\n        await self._cdp_client.connect()\n        \n        for url in self.start_urls:\n            yield Request(url=url, callback=self.parse)\n\n    async def parse(self, response: Response):\n        if not isinstance(response, HTMLResponse):\n            return\n        \n        # Extract links from JavaScript-rendered page\n        for link in await response.select(\"a\"):\n            href = link.attr(\"href\")\n            if href:\n                yield {\"url\": href}\n\n 
   async def close(self):\n        if self._cdp_client:\n            await self._cdp_client.close()\n```\n\nSee `examples/lightpanda_simple.py` and `examples/lightpanda_spider.py` for complete working examples.\n\n**Note:** CDP support is experimental. For production use, consider using dedicated browser automation tools or the standard HTTP client when JavaScript rendering is not required.\n\n## Onion services with OnionLink\nFor Tor v3 `.onion` sites, install the optional OnionLink extra and pass `OnionLinkClient` as the spider HTTP client:\n\n```bash\npip install \"silkworm-rs[onionlink]\"\n```\n\n```python\nfrom silkworm import HTMLResponse, OnionLinkClient, Response, Spider, run_spider\n\n\nclass OnionSpider(Spider):\n    name = \"onion\"\n    start_urls = (\"http://exampleexampleexampleexampleexampleexampleexampleexampleexampleexample.onion/\",)\n\n    async def parse(self, response: Response):\n        if isinstance(response, HTMLResponse):\n            title = await response.select_first(\"title\")\n            yield {\"title\": title.text if title else \"\"}\n\n\nrun_spider(\n    OnionSpider,\n    http_client=OnionLinkClient(concurrency=4, timeout=30),\n)\n```\n\n`OnionLinkClient` supports Silkworm `Request` headers, `params`, body/data, JSON payloads, redirects, HTML detection, and `request.meta[\"redirect_times\"]`. 
Override OnionLink's response byte cap per request with `request.meta[\"onionlink_response_limit\"]`.\n\n## Logging and crawl statistics\n- Structured logs via `logly`; set `SILKWORM_LOG_LEVEL=DEBUG` for verbose request/response/middleware output.\n- Periodic statistics with `log_stats_interval`; final stats always include elapsed time, queue size, requests/sec, seen URLs, items scraped, errors, and memory MB.\n\n## Limitations\n- By default, HTTP fetches are wreq-based without JavaScript execution; pages requiring client-side rendering can use the optional CDP integration (see \"JavaScript rendering with Lightpanda\" section) or external browser automation tools. Tor v3 `.onion` sites can use the optional OnionLink integration.\n- Request deduplication keys only on `Request.url`; query params, HTTP method, and body are ignored, so same-URL requests with different params/data are dropped unless you set `dont_filter=True` or make the URL unique yourself.\n- HTML parsing auto-detects encoding (BOM, HTTP headers/meta, charset detection fallback) but still enforces a `html_max_size_bytes`/`doc_max_size_bytes` cap (default 5 MB) in `scraper-rs` selectors, so very large pages may need a higher limit or preprocessing.\n- Several pipelines buffer all items in memory until close (PolarsPipeline, ExcelPipeline, YAMLPipeline, AvroPipeline, VortexPipeline, S3JsonLinesPipeline, FTPPipeline, SFTPPipeline), which can bloat RAM on long crawls; prefer streaming pipelines like JsonLines/CSV/SQLite for high-volume runs.\n- Many destination pipelines rely on optional extras; CassandraPipeline is disabled on Windows because `cassandra-driver` depends on libev there.\n\n## Examples\n- `python examples/quotes_spider.py` → `data/quotes.jl`\n- `python examples/quotes_spider_trio.py` → `data/quotes_trio.jl` (demonstrates trio backend)\n- `python examples/quotes_spider_winloop.py` → `data/quotes_winloop.jl` (demonstrates winloop backend for Windows)\n- `python examples/hackernews_spider.py 
--pages 5` → `data/hackernews.jl`\n- `python examples/lobsters_spider.py --pages 2` → `data/lobsters.jl`\n- `python examples/url_titles_spider.py --urls-file data/url_titles.jl --output data/titles.jl` (includes `SkipNonHTMLMiddleware` and stricter HTML size limits)\n- `python examples/exception_handling_spider.py` → `data/exception_handling.jl` (demonstrates `process_exception` and request `errback`)\n- `SILKWORM_LOG_LEVEL=DEBUG python examples/logging_controls_demo.py --mode noisy` then `--mode quiet` → demonstrates noisy pipeline/URL logging and the quieter `EngineLogger` + pipeline `log_level=None` setup\n- `python examples/export_formats_demo.py --pages 2` → JSONL, XML, and CSV outputs in `data/`\n- `python examples/taskiq_quotes_spider.py --pages 2` → demonstrates TaskiqPipeline for queue-based processing\n- `python examples/sitemap_spider.py --sitemap-url https://example.com/sitemap.xml --pages 50` → `data/sitemap_meta.jl` (extracts meta tags and Open Graph data from sitemap URLs)\n- `python examples/lightpanda_simple.py` → demonstrates CDP/Lightpanda for JavaScript rendering (requires `pip install silkworm-rs[cdp]` and running Lightpanda)\n- `python examples/lightpanda_spider.py` → full spider example using CDP/Lightpanda\n\n## Convenience API\nFor one-off fetches without a full spider:\n\n### Standard HTTP fetch\n```python\nimport asyncio\nfrom silkworm import fetch_html\n\nasync def main():\n    text, doc = await fetch_html(\"https://example.com\")\n    title = await doc.select_first(\"title\")\n    print(title.text if title else \"No title\")\n\nasyncio.run(main())\n```\n\n### CDP-based fetch (with JavaScript rendering)\n```python\nimport asyncio\nfrom silkworm import fetch_html_cdp\n\nasync def main():\n    # Requires Lightpanda/Chrome running with CDP enabled\n    text, doc = await fetch_html_cdp(\"https://example.com\")\n    title = await doc.select_first(\"title\")\n    print(title.text if title else \"No title\")\n\nasyncio.run(main())\n```\n\n## 
Contributing\nPull requests and issues are welcome. To set up a dev environment, install [uv](https://docs.astral.sh/uv/getting-started/), create a Python 3.13 virtualenv, and sync dev dependencies:\n\n```bash\nuv venv --python python3.13\nuv sync --group dev\n```\n\nRun the checks before opening a PR:\n\n```bash\njust fmt \u0026\u0026 just lint \u0026\u0026 just typecheck \u0026\u0026 just test\n```\n\n## Acknowledgements\nSilkworm is built on top of excellent open-source projects:\n\n- [wreq](https://github.com/0x676e67/wreq-python) - HTTP client with browser impersonation capabilities\n- [onionlink](https://github.com/RustedBytes/onionlink-rs) - Tor v3 onion-service client\n- [servofetch](https://github.com/RustedBytes/servofetch-py) - Bindings to the Servo browser\n- [scraper-rs](https://github.com/RustedBytes/scraper-rs) - Fast HTML parsing library\n- [logly](https://github.com/muhammad-fiaz/logly) - Structured logging\n- [rxml](https://github.com/nephi-dev/rxml) - XML parsing and writing\n\nWe are grateful to the maintainers and contributors of these projects for their work.\n\n## License\nMIT License. See `LICENSE` for details.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbitingsnakes%2Fsilkworm","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbitingsnakes%2Fsilkworm","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbitingsnakes%2Fsilkworm/lists"}