Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/clemfromspace/scrapy-selenium

Scrapy middleware to handle javascript pages using selenium
- Host: GitHub
- URL: https://github.com/clemfromspace/scrapy-selenium
- Owner: clemfromspace
- License: WTFPL
- Created: 2018-02-11T08:53:10.000Z (over 6 years ago)
- Default Branch: develop
- Last Pushed: 2024-07-08T09:45:54.000Z (4 months ago)
- Last Synced: 2024-09-29T02:01:28.968Z (about 1 month ago)
- Topics: crawling, scrapy, selenium
- Language: Python
- Homepage:
- Size: 29.3 KB
- Stars: 918
- Watchers: 21
- Forks: 347
- Open Issues: 78
- Metadata Files:
  - Readme: README.md
Awesome Lists containing this project
- awesome - scrapy-selenium - Scrapy middleware to handle javascript pages using selenium (Scrapy Middleware)
README
# Scrapy with selenium
[![PyPI](https://img.shields.io/pypi/v/scrapy-selenium.svg)](https://pypi.python.org/pypi/scrapy-selenium) [![Build Status](https://travis-ci.org/clemfromspace/scrapy-selenium.svg?branch=master)](https://travis-ci.org/clemfromspace/scrapy-selenium) [![Test Coverage](https://api.codeclimate.com/v1/badges/5c737098dc38a835ff96/test_coverage)](https://codeclimate.com/github/clemfromspace/scrapy-selenium/test_coverage) [![Maintainability](https://api.codeclimate.com/v1/badges/5c737098dc38a835ff96/maintainability)](https://codeclimate.com/github/clemfromspace/scrapy-selenium/maintainability)

Scrapy middleware to handle javascript pages using selenium.
## Installation
```
$ pip install scrapy-selenium
```
You should use **python>=3.6**.
You will also need one of the Selenium [compatible browsers](http://www.seleniumhq.org/about/platforms.jsp).
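You will also need the matching driver binary on your `PATH`. A quick sanity check before configuring Scrapy (a minimal sketch, not part of scrapy-selenium; it assumes Firefox's geckodriver):

```python
# Sketch: verify the driver binary is discoverable before wiring up
# the settings (assumes geckodriver; use 'chromedriver' for Chrome).
from shutil import which

driver_path = which('geckodriver')
if driver_path is None:
    raise SystemExit('geckodriver not found on PATH')
print(driver_path)
```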
## Configuration

1. Add the browser to use, the path to the driver executable, and the arguments to pass to the executable to the scrapy settings:
```python
from shutil import which

SELENIUM_DRIVER_NAME = 'firefox'
SELENIUM_DRIVER_EXECUTABLE_PATH = which('geckodriver')
SELENIUM_DRIVER_ARGUMENTS = ['-headless']  # '--headless' if using chrome instead of firefox
```

Optionally, set the path to the browser executable:
```python
SELENIUM_BROWSER_EXECUTABLE_PATH = which('firefox')
```

In order to use a remote Selenium driver (for example, a standalone Selenium server or Grid), specify `SELENIUM_COMMAND_EXECUTOR` instead of `SELENIUM_DRIVER_EXECUTABLE_PATH`:
```python
SELENIUM_COMMAND_EXECUTOR = 'http://localhost:4444/wd/hub'
```

2. Add the `SeleniumMiddleware` to the downloader middlewares:
```python
DOWNLOADER_MIDDLEWARES = {
    'scrapy_selenium.SeleniumMiddleware': 800
}
```
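Taken together, a minimal `settings.py` for a local Firefox setup might look like this (a sketch; the values simply mirror the two configuration steps above and assume geckodriver is on the `PATH`):

```python
# settings.py (sketch; values mirror the configuration steps above)
from shutil import which

SELENIUM_DRIVER_NAME = 'firefox'
SELENIUM_DRIVER_EXECUTABLE_PATH = which('geckodriver')
SELENIUM_DRIVER_ARGUMENTS = ['-headless']

DOWNLOADER_MIDDLEWARES = {
    'scrapy_selenium.SeleniumMiddleware': 800
}
```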
## Usage
Use the `scrapy_selenium.SeleniumRequest` instead of the scrapy built-in `Request` like below:
```python
from scrapy_selenium import SeleniumRequest

yield SeleniumRequest(url=url, callback=self.parse_result)
```
The request will be handled by selenium, and its `meta` will gain an additional key named `driver`, containing the selenium driver that processed the request:
```python
def parse_result(self, response):
    print(response.request.meta['driver'].title)
```
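Putting the pieces together, a complete minimal spider could look like this (a sketch; the spider name and start URL are illustrative placeholders):

```python
import scrapy
from scrapy_selenium import SeleniumRequest


class ExampleSpider(scrapy.Spider):
    # Sketch: 'example' and the URL below are placeholders.
    name = 'example'

    def start_requests(self):
        yield SeleniumRequest(url='https://example.com', callback=self.parse_result)

    def parse_result(self, response):
        # The driver that rendered the page is exposed on the request meta.
        print(response.request.meta['driver'].title)
```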
For more information about the available driver methods and attributes, refer to the [selenium python documentation](http://selenium-python.readthedocs.io/api.html#module-selenium.webdriver.remote.webdriver).

The `selector` response attribute works as usual (but contains the HTML processed by the selenium driver):
```python
def parse_result(self, response):
    print(response.selector.xpath('//title/text()').get())
```
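Since the selector is built from the driver-rendered HTML, the other Scrapy selector APIs apply too; for example, the same title via CSS selectors (a sketch):

```python
def parse_result(self, response):
    # Same rendered HTML, queried with CSS; .get() returns the first match.
    print(response.css('title::text').get())
```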
### Additional arguments

The `scrapy_selenium.SeleniumRequest` accepts four additional arguments:

#### `wait_time` / `wait_until`
When used, selenium will perform an [Explicit wait](http://selenium-python.readthedocs.io/waits.html#explicit-waits) before returning the response to the spider.
```python
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

yield SeleniumRequest(
    url=url,
    callback=self.parse_result,
    wait_time=10,
    wait_until=EC.element_to_be_clickable((By.ID, 'someid'))
)
```
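The bundled expected conditions are just callables taking the driver, so a plain callable should also work (an assumption based on Selenium's `WebDriverWait.until` contract, which `wait_until` is handed to):

```python
# Sketch (assumption: wait_until may be any callable accepting the driver,
# per Selenium's WebDriverWait.until contract).
yield SeleniumRequest(
    url=url,
    callback=self.parse_result,
    wait_time=10,
    wait_until=lambda driver: driver.execute_script(
        'return document.readyState') == 'complete',
)
```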
#### `screenshot`

When used, selenium will take a screenshot of the page, and the binary data of the captured .png will be added to the response `meta`:
```python
yield SeleniumRequest(
    url=url,
    callback=self.parse_result,
    screenshot=True
)

def parse_result(self, response):
    with open('image.png', 'wb') as image_file:
        image_file.write(response.meta['screenshot'])
```

#### `script`
When used, selenium will execute custom JavaScript code.
```python
yield SeleniumRequest(
    url=url,
    callback=self.parse_result,
    script='window.scrollTo(0, document.body.scrollHeight);',
)
```
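These arguments can also be combined; for example, waiting for the page before scrolling it (a sketch using only the arguments documented above; it assumes the explicit wait completes before the script runs):

```python
# Sketch combining wait_until and script (assumption: the explicit wait
# finishes before the script is executed).
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

yield SeleniumRequest(
    url=url,
    callback=self.parse_result,
    wait_time=10,
    wait_until=EC.presence_of_element_located((By.TAG_NAME, 'body')),
    script='window.scrollTo(0, document.body.scrollHeight);',
)
```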