https://github.com/povilasb/scrapy-html-storage

Scrapy downloader middleware that stores response HTMLs to disk.

middleware python scrapy


Host: GitHub
URL: https://github.com/povilasb/scrapy-html-storage
Owner: povilasb
License: mit
Created: 2016-03-29T13:41:35.000Z (about 9 years ago)
Default Branch: master
Last Pushed: 2024-05-14T22:15:13.000Z (about 1 year ago)
Last Synced: 2025-03-16T00:14:24.543Z (3 months ago)
Topics: middleware, python, scrapy
Language: Python
Size: 16.6 KB
Stars: 18
Watchers: 1
Forks: 2
Open Issues: 2
Metadata Files:
- Readme: README.rst
- Changelog: CHANGELOG.rst
- License: LICENSE.txt

README

=====
About
=====

.. image:: https://travis-ci.org/povilasb/scrapy-html-storage.svg?branch=master

.. image:: https://coveralls.io/repos/github/povilasb/scrapy-html-storage/badge.svg?branch=master
    :target: https://coveralls.io/github/povilasb/scrapy-html-storage?branch=master

This is a Scrapy downloader middleware that stores response HTML to disk.

Usage
=====

Enable the middleware, e.g. by adding it to ``settings.py``::

    DOWNLOADER_MIDDLEWARES = {
        'scrapy_html_storage.HtmlStorageMiddleware': 10,
    }

By default, no responses are saved to disk.

You must mark each request whose response HTML should be saved by setting
``save_html`` to ``True`` in the request ``meta``::

    def parse(self, response):
        """Processes start urls.

        Args:
            response (HtmlResponse): scrapy HTML response object.
        """
        yield scrapy.Request(
            'http://target.com',
            callback=self.parse_target,
            meta={
                'save_html': True,
            }
        )

The file path where the HTML is stored is resolved by the spider method
``response_html_path``. E.g.::

    class TargetSpider(scrapy.Spider):
        def response_html_path(self, request):
            """
            Args:
                request (scrapy.http.request.Request): request that produced the
                    response.
            """
            return 'html/last_response.html'
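
Because the path above is fixed, every saved response ends up in the same
file. One way to give each request its own file is sketched below. The
``UrlHashedPathMixin`` name, the ``html/`` directory layout and the use of
SHA-1 are illustrative assumptions, not part of the middleware::

    import hashlib
    from urllib.parse import urlparse


    class UrlHashedPathMixin:
        """Hypothetical mixin: maps each request URL to its own file path."""

        html_dir = 'html'  # assumed output directory, not a middleware setting

        def response_html_path(self, request):
            # Group files by host; replace ':' so a port number stays
            # filesystem-friendly.
            host = urlparse(request.url).netloc.replace(':', '_') or 'unknown-host'
            # Name the file after a hash of the full URL, so distinct URLs
            # get distinct files (barring hash collisions).
            digest = hashlib.sha1(request.url.encode('utf-8')).hexdigest()
            return '{}/{}/{}.html'.format(self.html_dir, host, digest)

A spider could then be declared as
``class TargetSpider(UrlHashedPathMixin, scrapy.Spider)``.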

Configuration
=============

The HTML storage downloader middleware supports the following options:

* **gzip_output** (bool) - if True, HTML output will be stored in gzip format.
  Default is False.
* **save_html_on_status** (list) - if not empty, only responses whose status
  code is in this list are saved. If the list is empty or not provided,
  responses with any status code are saved.
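
The ``save_html_on_status`` rule can be sketched as a small predicate. This
only illustrates the behaviour described above; ``status_allowed`` is a
hypothetical helper, not the middleware's actual code::

    def status_allowed(status, storage_settings):
        """Return True if a response with this status code may be saved."""
        allowed = storage_settings.get('save_html_on_status') or []
        # An empty or missing list places no restriction on status codes.
        return not allowed or status in allowed

With ``{'save_html_on_status': [200, 202]}`` a 200 response passes and a 404
does not; with an empty dict both pass.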

Sample::

    HTML_STORAGE = {
        'gzip_output': True,
        'save_html_on_status': [200, 202]
    }
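
With ``gzip_output`` enabled, saved pages have to be decompressed before they
can be inspected. A minimal sketch using only the standard library follows.
It assumes the stored files are ordinary gzip streams containing UTF-8 text,
and ``read_stored_html`` is a hypothetical helper name::

    import gzip


    def read_stored_html(path, gzipped=True, encoding='utf-8'):
        """Return the text of a stored HTML file, decompressing if needed."""
        opener = gzip.open if gzipped else open
        # Mode 'rt' makes gzip.open decode bytes to text, like built-in open.
        with opener(path, 'rt', encoding=encoding) as html_file:
            return html_file.read()

Pass ``gzipped=False`` for files written while ``gzip_output`` was off.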
