https://github.com/fawadss1/scrapy-stealth

Stealthy Crawling. Maximum Results. A pluggable anti-bot and stealth framework for Scrapy.
Host: GitHub
URL: https://github.com/fawadss1/scrapy-stealth
Owner: fawadss1
License: mit
Created: 2026-04-23T06:46:06.000Z (3 months ago)
Default Branch: master
Last Pushed: 2026-05-18T06:50:18.000Z (2 months ago)
Last Synced: 2026-05-18T08:38:47.634Z (2 months ago)
Topics: anti-bot, cloudflare-bypass, framework, proxy-rotation, scraping-python, scrapy
Language: Python
Homepage: https://pypi.org/project/scrapy-stealth
Size: 292 KB
Stars: 6
Watchers: 0
Forks: 1
Open Issues: 1
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Security: SECURITY.md
# scrapy-stealth


Stealthy Crawling. Maximum Results.


A pluggable anti-bot and stealth framework for Scrapy.


[![PyPI version](https://img.shields.io/pypi/v/scrapy-stealth?color=blue)](https://pypi.org/project/scrapy-stealth/)

[![Python versions](https://img.shields.io/pypi/pyversions/scrapy-stealth)](https://pypi.org/project/scrapy-stealth/)

[![Downloads](https://static.pepy.tech/badge/scrapy-stealth)](https://pepy.tech/project/scrapy-stealth)

[![GitHub release](https://img.shields.io/github/v/release/fawadss1/scrapy-stealth)](https://github.com/fawadss1/scrapy-stealth/releases)

[![License: MIT](https://img.shields.io/badge/license-MIT-green)](https://github.com/fawadss1/scrapy-stealth/blob/master/LICENSE)

[![Changelog](https://img.shields.io/badge/changelog-releases-informational)](https://github.com/fawadss1/scrapy-stealth/releases)

`scrapy-stealth` extends Scrapy with browser impersonation, proxy rotation, fingerprint cycling, and intelligent retry strategies, designed for large-scale, production-grade crawling.

---

## 🧠 Why scrapy-stealth?

Scrapy is fast and powerful, but modern websites use advanced anti-bot protections such as:

* TLS fingerprinting

* Browser behavior detection

* Rate limiting and IP blocking

`scrapy-stealth` helps by adding:

* 🧬 Browser-level impersonation (TLS + HTTP/2 fingerprints)

* 🔁 Smarter retry strategies

* 🌐 Proxy and fingerprint rotation

* 🛡️ Anti-bot detection

### Result

* Higher success rate

* Lower proxy cost

* More stable crawls

---

## 📊 Comparison

| Feature                      | scrapy-stealth | scrapy-impersonate | scrapy-playwright | scrapy-splash | Scrapy (default) |
|------------------------------|:--------------:|:------------------:|:-----------------:|:-------------:|:----------------:|
| TLS fingerprint spoofing     |       ✅        |         ✅          |         ❌         |       ❌       |        ❌         |
| HTTP/2 support               |       ✅        |         ✅          |         ✅         |       ❌       |        ❌         |
| Browser impersonation        |       ✅        |         ✅          |    ⚠️ partial     |       ❌       |        ❌         |
| Proxy rotation (built-in)    |       ✅        |         ❌          |         ❌         |       ❌       |        ❌         |
| Fingerprint rotation         |       ✅        |         ❌          |         ❌         |       ❌       |        ❌         |
| Anti-bot detection           |       ✅        |         ❌          |         ❌         |       ❌       |        ❌         |
| Smart retry logic            |       ✅        |         ❌          |         ❌         |       ❌       |        ❌         |
| Per-request engine switching |       ✅        |         ❌          |         ❌         |       ❌       |        ❌         |
| Headless browser required    |  ⚠️ optional   |         ❌          |         ✅         |       ✅       |        ❌         |
| JavaScript rendering         |       ✅        |         ❌          |         ✅         |       ✅       |        ❌         |
| Screenshot / snapshot        |       ✅        |         ❌          |         ✅         |       ✅       |        ❌         |
| Native Scrapy integration    |       ✅        |         ✅          |         ✅         |       ✅       |        ✅         |
| Memory footprint             |     🟢 Low     |       🟢 Low       |      🔴 High      |    🔴 High    |      🟢 Low      |

> ⚠️ `scrapy-playwright` passes real browser TLS but does not spoof fingerprint profiles like `scrapy-stealth` does.

> `scrapy-impersonate` provides TLS/HTTP2 impersonation via `curl_cffi` but lacks built-in rotation, detection, or per-request engine switching.

> JavaScript rendering is available via the optional `browser` driver — use it selectively for pages that require a full browser.

---

## ✨ Features

* 🔌 Pluggable engine system (`scrapy`, `stealth`)

* 🧠 Per-request engine selection via `request.meta`

* 🌐 Proxy support and rotation

* 🧬 Browser fingerprint rotation

* 🔁 Smart retry logic

* 🛡️ Anti-bot detection (status + content-based, Cloudflare, Akamai)

* ⚡  Thread-safe async integration

* 🖥️ Real-browser engine (CDP) for JS-heavy pages

* 📸 Built-in snapshot decorator (`scrapy_stealth.decorators.snapshot`)

---

## 📦 Installation

```bash
pip install scrapy-stealth
```

> Requires Python 3.11+ and Scrapy 2.12 or later (2.x series).

---

## ⚙️ Setup

### Option 1 — Global (`settings.py`)

```python
# 1. Enable the middleware
DOWNLOADER_MIDDLEWARES = {
    "scrapy_stealth.StealthDownloaderMiddleware": 950,
}

# 2. (Optional) Route ALL requests through stealth automatically; no meta needed per request
STEALTH_ENABLED = True
STEALTH_DRIVER  = "turbo"   # "basic" (default), "turbo", or "browser"

# 3. (Optional) Proxy list for automatic rotation
#    Used when rotate_proxy=True (per-request) or when STEALTH_ENABLED=True with rotate_proxy
#    Supported schemes: http, https, socks4, socks5
STEALTH_PROXIES = [
    "http://proxy1:8080",
    "http://proxy2:8080",
    "http://user:pass@proxy3:8080",  # with authentication
    "socks5://proxy4:1080",
]
```

### Option 2 — Per-spider (`custom_settings`)

Configure the middleware and all stealth settings directly on the spider — no changes to `settings.py` required.

```python
import scrapy


class MySpider(scrapy.Spider):
    name = "example"

    custom_settings = {
        "DOWNLOADER_MIDDLEWARES": {
            "scrapy_stealth.StealthDownloaderMiddleware": 950,
        },
        "STEALTH_ENABLED": True,
        "STEALTH_DRIVER": "turbo",
        "STEALTH_PROXIES": [
            "http://proxy1:8080",
            "http://user:pass@proxy2:8080",
            "socks5://proxy3:1080",
        ],
    }
```

> Proxies are validated at startup — invalid format or unsupported scheme raises `ValueError` immediately.
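
The exact checks the middleware runs are not documented here, so the following is only a rough sketch of what startup validation of this kind could look like. `validate_proxy` and `SUPPORTED_SCHEMES` are hypothetical names, and requiring both a host and a port is an assumption rather than the library's confirmed rule.

```python
from urllib.parse import urlsplit

# Schemes listed as supported in the settings example above.
SUPPORTED_SCHEMES = {"http", "https", "socks4", "socks5"}


def validate_proxy(url: str) -> str:
    """Return the URL unchanged if it looks usable, else raise ValueError.

    Hypothetical helper for illustration; not part of scrapy-stealth.
    """
    parts = urlsplit(url)
    if parts.scheme not in SUPPORTED_SCHEMES:
        raise ValueError(f"unsupported proxy scheme in {url!r}")
    # Assumption: a proxy entry must carry both a host and a port.
    if not parts.hostname or parts.port is None:
        raise ValueError(f"proxy must include host and port: {url!r}")
    return url


proxies = [
    "http://proxy1:8080",
    "http://user:pass@proxy3:8080",
    "socks5://proxy4:1080",
]
checked = [validate_proxy(p) for p in proxies]
```

Running a check like this on your own list before handing it to `STEALTH_PROXIES` can surface typos earlier than a failed crawl start.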

---

## 🚀 Quick Start

**Option A — Per-request** (stealth only on specific requests):

```python
yield scrapy.Request(
    url="https://example.com",
    meta={"stealth": {}},
)
```

**Option B — Global mode** (stealth on every request automatically):

```python
# settings.py or custom_settings
STEALTH_ENABLED = True
STEALTH_DRIVER  = "turbo"
```

```python
# No meta needed: all requests go through stealth
yield scrapy.Request(url="https://example.com")

# Opt out for a specific request
yield scrapy.Request(url="https://api.internal/health", meta={"stealth": False})
```
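
The routing rules in Options A and B can be summarised as a small decision function. This is a sketch of the documented behaviour, not the middleware's actual code: `uses_stealth` is a hypothetical name, and the sketch ignores `config.DEFAULT_ENGINE` for brevity.

```python
def uses_stealth(meta: dict, stealth_enabled: bool) -> bool:
    """Decide whether a request would be routed through the stealth engine.

    Hypothetical helper mirroring the rules described above.
    """
    value = meta.get("stealth")
    if value is False:            # explicit opt-out always wins
        return False
    if isinstance(value, dict):   # explicit opt-in, even with an empty dict
        return True
    return stealth_enabled        # key absent: fall back to the global switch


# Option A: per-request opt-in while the global switch is off
per_request = uses_stealth({"stealth": {}}, stealth_enabled=False)

# Option B: global mode, with one request opting out
global_default = uses_stealth({}, stealth_enabled=True)
opted_out = uses_stealth({"stealth": False}, stealth_enabled=True)
```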

---

## 🔧 Global Configuration

Customise package-wide defaults via the shared `config` instance.

All settings must be applied **at module level**, before the spider class. The engine client is created at middleware initialisation, so changes inside `start_requests` or `parse` will have no effect.

```python
# myspider.py
import scrapy
from scrapy_stealth.config import config

config.DEFAULT_ENGINE  = "stealth"      # "scrapy" (native) or "stealth" (browser impersonation)
config.DEFAULT_PROFILE = "chrome_147"   # browser profile when meta["stealth"]["profile"] is not set
config.DEFAULT_TIMEOUT = 30             # stealth request timeout in seconds
config.STEALTH_DRIVER  = "turbo"        # "basic" (default), "turbo", or "browser"
config.HTTP2           = True           # False for servers that only support HTTP/1.1
config.BLOCK_CODES    |= {407}          # extend blocked status codes (|= keeps defaults)
config.BLOCK_KEYWORDS.append("banned")  # extend blocked body-text patterns
config.BROWSER_HEADLESS = True          # browser driver: headless mode (False = visible window, more stealthy)
config.BROWSER_SETTLE_S = 4.0           # browser driver: seconds to wait after navigation for JS to finish


class MySpider(scrapy.Spider):
    name = "example"
    ...
```

```python
# ❌ wrong: too late, the engine client is already created
class MySpider(scrapy.Spider):
    def start_requests(self):
        config.HTTP2 = False  # has no effect
        ...
```

You can also read any value programmatically:

```python
config.get("DEFAULT_ENGINE")          # "scrapy" (the default, unless overridden)
config.get("MISSING_KEY", "default")  # "default"
```

| Attribute          | Type             | Default                           | Description                                                                                                  |
|--------------------|------------------|-----------------------------------|--------------------------------------------------------------------------------------------------------------|
| `DEFAULT_ENGINE`   | `str`            | `"scrapy"`                        | Engine used when `request.meta["stealth"]` key is absent                                                     |
| `DEFAULT_PROFILE`  | `str`            | `"chrome_147"`                    | Browser profile used when none is specified                                                                  |
| `DEFAULT_TIMEOUT`  | `int`            | `30`                              | Request timeout in seconds                                                                                   |
| `STEALTH_DRIVER`   | `str`            | `"basic"`                         | Default driver: `"basic"`, `"turbo"`, or `"browser"`. Also readable from Scrapy settings as `STEALTH_DRIVER` |
| `HTTP2`            | `bool`           | `True`                            | HTTP/2 mode; overridable per-request via `meta["stealth"]["http2"]`                                          |
| `BLOCK_CODES`      | `frozenset[int]` | `{403, 429, 503}`                 | HTTP status codes considered blocked                                                                         |
| `BLOCK_KEYWORDS`   | `list[str]`      | `["captcha", "access denied", …]` | Body-text patterns considered blocked                                                                        |
| `BROWSER_HEADLESS` | `bool`           | `True`                            | Browser driver: headless mode (`False` = visible window, more stealthy)                                      |
| `BROWSER_SETTLE_S` | `float`          | `4.0`                             | Browser driver: seconds to wait after navigation for JS to finish rendering                                  |

For one-off overrides on a single request, set `meta["stealth"]["driver"]` or `meta["stealth"]["http2"]` (see Per-Request Configuration below).
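
The `|=` form used for `BLOCK_CODES` in the configuration example matters because the attribute is a `frozenset`. The snippet below uses plain Python values (not the real `config` object) to show that `|=` rebinds the name to a new, extended set, whereas plain assignment would discard the defaults.

```python
DEFAULT_BLOCK_CODES = frozenset({403, 429, 503})  # defaults from the table above

block_codes = DEFAULT_BLOCK_CODES
block_codes |= {407}  # builds a new frozenset containing the defaults plus 407

replaced = frozenset({407})  # plain assignment: 403, 429 and 503 are lost
```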

---

## ⚙️ Per-Request Configuration

All options are passed via `request.meta["stealth"]`.

The presence of `meta["stealth"]` (a dict, which may be empty) activates the stealth engine. Omit the key to fall back to `config.DEFAULT_ENGINE`, which is `"scrapy"` by default.

When `STEALTH_ENABLED = True`, all requests are stealth by default — pass `meta={"stealth": False}` to opt out for a specific request.

```python
yield scrapy.Request(
    url,
    meta={
        "stealth": {
            "driver": "turbo",
            "profile": "chrome_147",
            "proxy": "http://user:pass@proxy:8080",
            "stealth_timeout": 60,
            "http2": True,
            "rotate_proxy": True,
            "rotate_profile": True,
        }
    },
)
```

| Key               | Type    | Description                                                                                                     |
|-------------------|---------|-----------------------------------------------------------------------------------------------------------------|
| `driver`          | `str`   | `"basic"`, `"turbo"`, or `"browser"`; overrides `config.STEALTH_DRIVER` per-request                             |
| `profile`         | `str`   | Browser profile (e.g. `"chrome_147"`, `"safari_ios_18_1_1"`)                                                    |
| `proxy`           | `str`   | Explicit proxy URL                                                                                              |
| `stealth_timeout` | `int`   | Per-request timeout in seconds (overrides default 30s)                                                          |
| `http2`           | `bool`  | `True` = HTTP/2, `False` = HTTP/1.1 (overrides `config.HTTP2` for this request)                                 |
| `rotate_proxy`    | `bool`  | Auto-pick a proxy from `STEALTH_PROXIES`                                                                        |
| `rotate_profile`  | `bool`  | Auto-pick a random browser profile                                                                              |
| `headless`        | `bool`  | Browser driver only: `True` = headless, `False` = visible window (more stealthy)                                |
| `settle`          | `float` | Browser driver only: seconds to wait for JS after navigation (default `4.0`)                                    |
| `snapshot`        | `bool`  | Browser driver only: capture a PNG snapshot; result available as `response.meta["snapshot_content"]` (`bytes`)  |
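
To make the override order concrete, here is a minimal sketch of per-request keys taking precedence over package defaults. `DEFAULTS` and `effective_options` are hypothetical names, the default values are copied from the tables in this README, and the real middleware may resolve options differently.

```python
# Package-wide defaults as documented above (driver, HTTP/2 mode, timeout in seconds).
DEFAULTS = {"driver": "basic", "http2": True, "stealth_timeout": 30}


def effective_options(stealth_meta: dict) -> dict:
    """Merge per-request options over the defaults; per-request keys win."""
    return {**DEFAULTS, **stealth_meta}


options = effective_options({"driver": "turbo", "stealth_timeout": 60})
```

Keys that the request does not set, such as `http2` here, keep their default value.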

---

## 🖥️ Browser Engine

For sites protected by Cloudflare JS challenges or heavy JavaScript rendering, use the `browser` driver.

It runs a real Chrome instance via the DevTools Protocol (no WebDriver), keeping one persistent browser and opening a new tab per request.

**Per-request (most common):**

```python
yield scrapy.Request(
    url,
    meta={
        "stealth": {
            "driver": "browser",
            "headless": False,   # visible window, harder to detect (default: True)
            "settle": 4.0,       # seconds to wait for JS after page load
        }
    },
)
```

**Heavy Cloudflare sites — increase settle time:**

```python
meta={"stealth": {"driver": "browser", "headless": False, "settle": 12}}
```

**Global default (all stealth requests use browser engine):**

```python
from scrapy_stealth.config import config

config.STEALTH_DRIVER   = "browser"
config.BROWSER_HEADLESS = False   # more stealthy
config.BROWSER_SETTLE_S = 6.0     # longer wait for JS
```

> **Performance note**: the browser engine is slower than `basic`/`turbo` (~5-15s per page vs <2s).
> Use it selectively: route only JS-protected URLs to `"browser"` and keep everything else on `"turbo"`.
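
One way to apply that advice is to pick the driver from the URL before building the request. The sketch below is only an illustration: `JS_PROTECTED` and `stealth_meta_for` are hypothetical names, and the patterns are placeholders for whatever URLs need a real browser on your target site.

```python
import re

# Hypothetical patterns for URLs known to need a real browser.
JS_PROTECTED = [
    re.compile(r"^https://shop\.example\.com/"),
    re.compile(r"/checkout"),
]


def stealth_meta_for(url: str) -> dict:
    """Build request meta that sends only JS-protected URLs to the browser driver."""
    if any(pattern.search(url) for pattern in JS_PROTECTED):
        return {"stealth": {"driver": "browser", "settle": 4.0}}
    return {"stealth": {"driver": "turbo"}}


heavy = stealth_meta_for("https://shop.example.com/item/1")
light = stealth_meta_for("https://example.com/about")
```

The returned dict can be passed directly as `meta=` when constructing a `scrapy.Request`.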

---

## 📸 Screenshots

Capture a PNG screenshot of any page rendered by the `browser` driver and save it to disk.

### Enable on the request

```python
yield scrapy.Request(
    url,
    meta={
        "stealth": {
            "driver": "browser",
            "snapshot": True,
        }
    },
    callback=self.parse,
)
```

The raw PNG bytes are available at `response.meta["snapshot_content"]` inside your callback.

### Auto-save with `snapshot` decorator

```python
import scrapy
from scrapy_stealth.decorators import snapshot


class MySpider(scrapy.Spider):
    name = "example"

    # Three usage variants; distinct method names so that none overrides another
    @snapshot
    def parse(self, response): ...

    @snapshot(path="stealth_shots/page.png")
    def parse_fixed_path(self, response): ...

    @snapshot(path=lambda r: r.url.split("/")[-1] + ".png")
    def parse_dynamic_path(self, response): ...
```

> **Note:** Requires `driver="browser"` and `snapshot=True` in the request meta.
> Logs an error if no snapshot data is found in the response.
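
The lambda form above yields an empty file name for URLs that end in `/`. A named callable can be more robust. The following is a sketch under assumptions: `snapshot_path` is a hypothetical helper, it only needs an object with a `.url` attribute, and whether the decorator creates missing directories is not stated here.

```python
import hashlib
from types import SimpleNamespace
from urllib.parse import urlsplit


def snapshot_path(response) -> str:
    """Derive a unique, filesystem-friendly PNG path from the response URL."""
    parts = urlsplit(response.url)
    # Flatten the URL path; fall back to "index" for the site root.
    slug = parts.path.strip("/").replace("/", "_") or "index"
    # A short hash keeps URLs that differ only in their query string apart.
    digest = hashlib.sha1(response.url.encode("utf-8")).hexdigest()[:8]
    return f"stealth_shots/{parts.hostname}_{slug}_{digest}.png"


# Stand-in for a Scrapy response, used only to exercise the helper.
root_path = snapshot_path(SimpleNamespace(url="https://example.com/"))
deep_path = snapshot_path(SimpleNamespace(url="https://example.com/a/b"))
```

Since the decorator is shown accepting a callable for `path`, a function like this could presumably be passed as `@snapshot(path=snapshot_path)`.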

### Custom handling (without the built-in helper)

The screenshot is just `bytes` in `response.meta["snapshot_content"]` — do anything you like with it:

```python
def parse(self, response):
    shot: bytes | None = response.meta.get("snapshot_content")
    if shot is None:
        return  # screenshot was not requested or capture failed

    # Save manually
    with open("page.png", "wb") as f:
        f.write(shot)

    # Pass to a pipeline via item
    yield {"url": response.url, "screenshot": shot}
```

---

## 🔁 Automatic Rotation

```python
yield scrapy.Request(
    url,
    meta={
        "stealth": {
            "rotate_proxy": True,
            "rotate_profile": True,
        }
    },
)
```

---

## 🧩 Strategies

### Proxy Rotation

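```python
import scrapy
from scrapy_stealth.strategies.proxy import ProxyRotator

# Create the rotator once at module level so it is shared across requests
proxy_rotator = ProxyRotator([
    "http://proxy1:8080",
    "http://proxy2:8080",
])


class MySpider(scrapy.Spider):
    name = "example"

    def start_requests(self):
        yield scrapy.Request(
            "https://example.com",
            meta={"stealth": {"proxy": proxy_rotator.get()}},
        )
```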
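
How `ProxyRotator` chooses the next proxy is not described in this README. As a rough mental model only, a round-robin rotator with the same `get()` shape could look like the sketch below. `RoundRobinRotator` is a hypothetical stand-in, and the library may well use random or weighted selection instead.

```python
from itertools import cycle


class RoundRobinRotator:
    """Hypothetical stand-in that hands out proxies in a fixed repeating order."""

    def __init__(self, proxies):
        proxies = list(proxies)
        if not proxies:
            raise ValueError("at least one proxy is required")
        self._pool = cycle(proxies)

    def get(self) -> str:
        # Each call advances to the next proxy and wraps around at the end.
        return next(self._pool)


rotator = RoundRobinRotator(["http://proxy1:8080", "http://proxy2:8080"])
picked = [rotator.get() for _ in range(3)]
```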

---

### Fingerprint Rotation

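```python
import scrapy
from scrapy_stealth.strategies.fingerprint import ProfileRotator

# Create the rotator once at module level so it is shared across requests
fp = ProfileRotator()


class MySpider(scrapy.Spider):
    name = "example"

    def start_requests(self):
        yield scrapy.Request(
            "https://example.com",
            meta={"stealth": {"profile": fp.get()}},
        )
```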

---

### Intelligent Retry

```python
from scrapy_stealth.strategies.retry import RetryHandler

retry = RetryHandler()


def parse(self, response):
    if retry.should_retry(response):
        yield retry.build(response.request)
        return  # skip parsing the blocked response
```

---

## 🛡️ Anti-Bot Detection

```python
from scrapy_stealth.detectors.antibot import AntiBotDetector

detector = AntiBotDetector()


def parse(self, response):
    if detector.is_blocked(response):
        self.logger.warning("Blocked: %s", response.url)
```
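
The feature list describes detection as status-based and content-based. A minimal sketch of that idea, using the documented defaults for `BLOCK_CODES` and the two `BLOCK_KEYWORDS` entries that this README spells out, might look as follows. `looks_blocked` is a hypothetical function, and the real `AntiBotDetector` (which also mentions Cloudflare and Akamai) is likely more involved.

```python
BLOCK_CODES = frozenset({403, 429, 503})       # documented defaults
BLOCK_KEYWORDS = ["captcha", "access denied"]  # the README elides the remaining entries


def looks_blocked(status: int, body: str) -> bool:
    """Return True if the status code or the body text suggests a block page."""
    if status in BLOCK_CODES:
        return True
    text = body.lower()  # match keywords case-insensitively
    return any(keyword in text for keyword in BLOCK_KEYWORDS)


rate_limited = looks_blocked(429, "")
soft_block = looks_blocked(200, "<h1>Access Denied</h1>")
normal_page = looks_blocked(200, "<title>Welcome</title>")
```

The second case shows why the content check matters: some sites return a block page with a `200` status.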

---

## 📊 Example

```python
import scrapy


class ExampleSpider(scrapy.Spider):
    name = "example"

    def start_requests(self):
        yield scrapy.Request(
            "https://example.com",
            meta={
                "stealth": {
                    "rotate_proxy": True,
                    "rotate_profile": True,
                }
            },
        )

    def parse(self, response):
        yield {
            "title": response.css("title::text").get(),
            "url": response.url,
        }
```

---

## ⚡ Performance Insight

Using stealth selectively:

* ⚡ Faster crawling (Scrapy for simple pages)

* 💰 Lower proxy cost

* 🛡️ Better success rate on protected pages

---

## 📜 Changelog

See [CHANGELOG.md](https://github.com/fawadss1/scrapy-stealth/blob/master/CHANGELOG.md) for a full history of changes, or browse [GitHub Releases](https://github.com/fawadss1/scrapy-stealth/releases).

---

## 🤝 Contributing

See [CONTRIBUTING.md](https://github.com/fawadss1/scrapy-stealth/blob/master/CONTRIBUTING.md) for guidelines on how to contribute.

---

## 📄 License

This project is licensed under the **MIT License** — free to use, modify, and distribute.

See [LICENSE](https://github.com/fawadss1/scrapy-stealth/blob/master/LICENSE) for the full text.