https://github.com/scrapy-plugins/scrapy-headless

Host: GitHub
URL: https://github.com/scrapy-plugins/scrapy-headless
Owner: scrapy-plugins
License: bsd-3-clause
Created: 2020-02-03T13:37:14.000Z (over 6 years ago)
Default Branch: master
Last Pushed: 2021-04-28T07:45:51.000Z (about 5 years ago)
Last Synced: 2025-04-15T09:07:04.715Z (about 1 year ago)
Language: Python
Size: 21.5 KB
Stars: 29
Watchers: 3
Forks: 8
Open Issues: 4
Metadata Files:
- Readme: README.md
- License: LICENSE

# Scrapy Headless

This plugin makes it easier to use Scrapy with headless browsers. At the moment it only works with Selenium Grid as a driver.

## Installation

For now the project is hosted in a private Bitbucket repository, so install it from there:

```
pip install scrapy-headless
```

## Usage

You will first need a Selenium Grid server running. You can find some examples at: https://github.com/SeleniumHQ/docker-selenium/wiki/Getting-Started-with-Docker-Compose

The easiest way is to use docker-compose. Here is an example docker-compose.yml file:

```yml
selenium-hub:
  image: selenium/hub
  ports:
    - 4444:4444
chrome:
  image: selenium/node-chrome
  links:
    - selenium-hub:hub
  environment:
    - HUB_PORT_4444_TCP_ADDR=hub
    - GRID_TIMEOUT=180 # The default timeout of 30s might be too low for Selenium
  volumes:
    - /dev/shm:/dev/shm
```

Then start the grid:

```
$ docker-compose up -d
```

If you want more browser instances, scale the `chrome` service:

```
$ docker-compose up -d --scale chrome=3 # For 3 browsers
```

In Scrapy, you will need to update your settings, for example:

```py

from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

SELENIUM_GRID_URL = 'http://localhost:4444/wd/hub'  # Example for local grid with docker-compose
SELENIUM_NODES = 1  # Number of nodes (browsers) you are running on your grid
SELENIUM_CAPABILITIES = DesiredCapabilities.CHROME  # Example for Chrome

# You also need to change the default download handlers, like so:
DOWNLOAD_HANDLERS = {
    "http": "scrapy_headless.HeadlessDownloadHandler",
    "https": "scrapy_headless.HeadlessDownloadHandler",
}

```
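Before starting a crawl it can help to confirm that the grid behind `SELENIUM_GRID_URL` is actually up. The following is a minimal sketch using only the standard library, not part of this plugin. It assumes the grid exposes a `status` endpoint under the configured URL and answers with a JSON body containing `value.ready`; the helper names `grid_status_url`, `is_ready` and `grid_is_ready` are hypothetical.

```python
import json
from urllib.error import URLError
from urllib.request import urlopen


def grid_status_url(grid_url):
    """Build the status endpoint URL from a grid URL such as SELENIUM_GRID_URL."""
    return grid_url.rstrip("/") + "/status"


def is_ready(status_payload):
    """Return True if a decoded status payload reports the grid as ready.

    Assumes a payload shaped like {"value": {"ready": true, ...}}.
    """
    value = status_payload.get("value")
    if not isinstance(value, dict):
        return False
    return bool(value.get("ready", False))


def grid_is_ready(grid_url, timeout=5.0):
    """Query the grid once; any connection or decoding problem counts as not ready."""
    try:
        with urlopen(grid_status_url(grid_url), timeout=timeout) as response:
            payload = json.loads(response.read().decode("utf-8"))
    except (URLError, OSError, ValueError):
        return False
    return isinstance(payload, dict) and is_ready(payload)
```

With the docker-compose setup above, calling `grid_is_ready("http://localhost:4444/wd/hub")` should return `True` once the hub has started, provided the grid version in use reports readiness in this shape.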

You may also set a proxy for your Selenium requests:

```py
SELENIUM_PROXY = 'http://proxy.url:port'
```

Now all you need to do in your spider is use `HeadlessRequest` instead of Scrapy's `Request` for the requests you want handled by Selenium, for example:

```py
from scrapy import Spider
from scrapy_headless import HeadlessRequest


class SomeSpider(Spider):
    ...

    def some_parser(self, response):
        ...
        yield HeadlessRequest(some_url, callback=self.other_parser)
```

If you need to do something with the driver after it has loaded the URL, you may also set a `driver_callback`:

```py
from scrapy import Spider
from scrapy_headless import HeadlessRequest


class SomeSpider(Spider):
    ...

    def some_parser(self, response):
        ...
        yield HeadlessRequest(some_url, callback=self.other_parser, driver_callback=self.process_webdriver)

    def process_webdriver(self, driver):
        ...
```
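As an illustration of what a `driver_callback` body might do, here is a minimal sketch that scrolls the page so lazily loaded content has a chance to render. It assumes the `driver` argument is a regular Selenium WebDriver, so `execute_script` is available. The mixin name `ScrollingCallbackMixin` and the `scroll_times` attribute are hypothetical. The README does not say whether the callback's return value is used, so this sketch returns nothing.

```python
# JavaScript snippet that scrolls the current page to its bottom.
SCROLL_TO_BOTTOM = "window.scrollTo(0, document.body.scrollHeight);"


class ScrollingCallbackMixin:
    """Hypothetical mixin providing a process_webdriver method for a spider."""

    # How many times to scroll; an illustrative default, tune per site.
    scroll_times = 3

    def process_webdriver(self, driver):
        # Run the scroll script repeatedly through the WebDriver.
        for _ in range(self.scroll_times):
            driver.execute_script(SCROLL_TO_BOTTOM)
```

A spider could inherit from this mixin alongside `Spider` and pass `driver_callback=self.process_webdriver` as shown above. Real pages often need a wait between scrolls before new content appears, which this sketch leaves out.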

## Future

Ideally this download handler should be able to use any of the following:

- [x] Selenium Grid

- [ ] Selenium (without grid)

- [ ] Pyppeteer
