https://github.com/scrapy-plugins/scrapy-headless
https://github.com/scrapy-plugins/scrapy-headless
Last synced: about 1 year ago
JSON representation
- Host: GitHub
- URL: https://github.com/scrapy-plugins/scrapy-headless
- Owner: scrapy-plugins
- License: bsd-3-clause
- Created: 2020-02-03T13:37:14.000Z (over 6 years ago)
- Default Branch: master
- Last Pushed: 2021-04-28T07:45:51.000Z (about 5 years ago)
- Last Synced: 2025-04-15T09:07:04.715Z (about 1 year ago)
- Language: Python
- Size: 21.5 KB
- Stars: 29
- Watchers: 3
- Forks: 8
- Open Issues: 4
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# Scrapy Headless
This is a plugin to make it easier to use scrapy with headless browsers, at the moment it only works with selenium grid as a driver.
## Installation
For now the project is in a private bit bucket repo, so install it from there:
```
pip install scrapy-headless
```
## Usage
You will first need to have a selenium grid server running, you may find some examples on: https://github.com/SeleniumHQ/docker-selenium/wiki/Getting-Started-with-Docker-Compose
The easiest way is by using docker-compose, here is a example docker-compose.yml file:
```yml
selenium-hub:
image: selenium/hub
ports:
- 4444:4444
chrome:
image: selenium/node-chrome
links:
- selenium-hub:hub
environment:
- HUB_PORT_4444_TCP_ADDR=hub
- GRID_TIMEOUT=180 # Default timeout is 30s might be low for Selenium
volumes:
- /dev/shm:/dev/shm
```
And just,
```
$ docker-compose up -d
```
And, if you want more browser instances
```
$ docker-compose up -d --scale chrome=3 # For 3 browsers
```
On scrapy you will need to update your settings, for example:
```py
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
SELENIUM_GRID_URL = 'http://localhost:4444/wb/hub' # Example for local grid with docker-compose
SELENIUM_NODES = 1 # Number of nodes(browsers) you are running on your grid
SELENIUM_CAPABILITIES = DesiredCapabilities.CHROME # Example for Chrome
# You need also to change the default download handlers, like so:
DOWNLOAD_HANDLERS = {
"http": "scrapy_selenium.SeleniumDownloadHandler",
"https": "scrapy_selenium.SeleniumDownloadHandler",
}
```
You may also set a proxy for your selenium requests:
```py
SELENIUM_PROXY = 'http://proxy.url:port'
```
Now all you need to do, is on your spider, for the requests you want handled by selenium use `HeadlessRequest` instead of scrapy's Request, for example:
```py
from scrapy import Spider
from scrapy_headless import HeadlessRequest
class SomeSpider(Spider):
...
def some_parser(self, response):
...
yield HeadlessRequest(some_url, callback=self.other_parser)
```
If you need to do something with the driver after getting the url you may also set a `driver_callback`:
```py
from scrapy import Spider
from scrapy_headless import HeadlessRequest
class SomeSpider(Spider):
...
def some_parser(self, response):
...
yield HeadlessRequest(some_url, callback=self.other_parser, driver_callback=self.process_webdriver)
def process_webdriver(self, driver):
...
```
## Future
Ideally this download handler should be able to use any of the following:
- [x] Selenium Grid
- [ ] Selenium (without grid)
- [ ] Pyppeteer