https://github.com/povilasb/scrapy-html-storage
Scrapy downloader middleware that stores response HTMLs to disk.
- Host: GitHub
- URL: https://github.com/povilasb/scrapy-html-storage
- Owner: povilasb
- License: mit
- Created: 2016-03-29T13:41:35.000Z (about 9 years ago)
- Default Branch: master
- Last Pushed: 2024-05-14T22:15:13.000Z (about 1 year ago)
- Last Synced: 2025-03-16T00:14:24.543Z (3 months ago)
- Topics: middleware, python, scrapy
- Language: Python
- Size: 16.6 KB
- Stars: 18
- Watchers: 1
- Forks: 2
- Open Issues: 2
Metadata Files:
- Readme: README.rst
- Changelog: CHANGELOG.rst
- License: LICENSE.txt
README

=====
About
=====

.. image:: https://travis-ci.org/povilasb/scrapy-html-storage.svg?branch=master
.. image:: https://coveralls.io/repos/github/povilasb/scrapy-html-storage/badge.svg?branch=master
    :target: https://coveralls.io/github/povilasb/scrapy-html-storage?branch=master

This is Scrapy downloader middleware that stores response HTMLs to disk.

Usage
=====

Enable the downloader middleware, e.g. by specifying it in `settings.py`::

    DOWNLOADER_MIDDLEWARES = {
        'scrapy_html_storage.HtmlStorageMiddleware': 10,
    }

By default, no responses are saved to disk.
You must select the requests whose response HTML will be saved by setting
`save_html` to `True` in the request `meta`::

    def parse(self, response):
        """Processes start urls.

        Args:
            response (HtmlResponse): scrapy HTML response object.
        """
        yield scrapy.Request(
            'http://target.com',
            callback=self.parse_target,
            meta={
                'save_html': True,
            }
        )

The file path where HTML will be stored is resolved with the spider method
`response_html_path`. E.g.::

    class TargetSpider(scrapy.Spider):
        def response_html_path(self, request):
            """
            Args:
                request (scrapy.http.request.Request): request that produced the
                    response.
            """
            return 'html/last_response.html'

Configuration
=============

HTML storage downloader middleware supports the following options:

* **gzip_output** (bool) - if True, HTML output will be stored in gzip format.
  Default is False.
* **save_html_on_status** (list) - if not empty, only responses with the listed
  status codes have their HTML saved. If the list is empty or not provided,
  responses with any status code are saved.

Sample::

    HTML_STORAGE = {
        'gzip_output': True,
        'save_html_on_status': [200, 202]
    }
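
The `response_html_path` example above always returns the same file name, so each saved response would overwrite the previous one. The following is a minimal sketch of one way to give every request its own path and to read a stored file back whether or not `gzip_output` was enabled. The names `PerUrlHtmlPathMixin`, `html_dir` and `read_stored_html` are hypothetical and not part of this package. The sketch also assumes the stored file is UTF-8 encoded and, when compressed, is a standard gzip stream.

```python
import gzip
import hashlib
from urllib.parse import urlparse


class PerUrlHtmlPathMixin:
    """Hypothetical mixin (not part of scrapy-html-storage).

    Provides a `response_html_path` that returns a distinct path per URL.
    """

    html_dir = 'html'  # assumed output directory, not a middleware option

    def response_html_path(self, request):
        """Build a path of the form 'html/<host>/<sha1 of url>.html'.

        Args:
            request: any object with a `url` attribute, such as a Scrapy
                request.
        """
        host = urlparse(request.url).hostname or 'unknown-host'
        digest = hashlib.sha1(request.url.encode('utf-8')).hexdigest()
        return f'{self.html_dir}/{host}/{digest}.html'


def read_stored_html(path, encoding='utf-8'):
    """Read a stored file, decompressing it if it starts with gzip magic bytes.

    The encoding is an assumption; adjust it to match the scraped site.
    """
    with open(path, 'rb') as f:
        raw = f.read()
    if raw[:2] == b'\x1f\x8b':  # gzip magic number
        raw = gzip.decompress(raw)
    return raw.decode(encoding)
```

A spider could pick this up by listing the mixin before `scrapy.Spider` in its base classes. Hashing the full URL keeps the file name free of characters that are awkward in paths, at the cost of names that are not human readable.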