https://github.com/scrapy-plugins/scrapy-pagestorage
A scrapy extension to store requests and responses information in storage service
- Host: GitHub
- URL: https://github.com/scrapy-plugins/scrapy-pagestorage
- Owner: scrapy-plugins
- License: bsd-3-clause
- Created: 2016-01-13T12:46:04.000Z (almost 9 years ago)
- Default Branch: master
- Last Pushed: 2022-03-11T11:46:32.000Z (almost 3 years ago)
- Last Synced: 2024-11-01T01:16:13.752Z (about 2 months ago)
- Language: Python
- Size: 41 KB
- Stars: 26
- Watchers: 6
- Forks: 6
- Open Issues: 2
Metadata Files:
- Readme: README.rst
- License: LICENSE
==================
scrapy-pagestorage
==================

.. image:: https://img.shields.io/pypi/v/scrapy-pagestorage.svg
   :target: https://pypi.python.org/pypi/scrapy-pagestorage
   :alt: PyPI Version

.. image:: https://img.shields.io/pypi/pyversions/scrapy-pagestorage.svg
   :target: https://pypi.python.org/pypi/scrapy-pagestorage
   :alt: Python Versions

.. image:: https://github.com/scrapy-plugins/scrapy-pagestorage/actions/workflows/tests.yml/badge.svg
   :target: https://github.com/scrapy-plugins/scrapy-pagestorage/actions/workflows/tests.yml
   :alt: Build Status

.. image:: https://img.shields.io/codecov/c/github/scrapy-plugins/scrapy-pagestorage/master.svg
   :target: https://codecov.io/github/scrapy-plugins/scrapy-pagestorage
   :alt: Coverage report

A Scrapy extension that stores request and response information in a storage service.

Installation
============

You can install scrapy-pagestorage using pip::

    pip install scrapy-pagestorage

You can then enable the middleware in your ``settings.py``::

    SPIDER_MIDDLEWARES = {
        ...
        'scrapy_pagestorage.PageStorageMiddleware': 900
    }

How to use it
=============

Enable the extension through ``settings.py``::

    PAGE_STORAGE_ENABLED = True
    PAGE_STORAGE_ON_ERROR_ENABLED = True

Configure the extension through ``settings.py``::

    PAGE_STORAGE_MODE = "VERSIONED_CACHE"
    PAGE_STORAGE_LIMIT = 100
    PAGE_STORAGE_ON_ERROR_LIMIT = 100
    PAGE_STORAGE_TRIM_HTML = True

The extension is auto-enabled for Portia spiders (``SHUB_SPIDER_TYPE=portia``).

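
The settings documented here can also be gathered in one place. The following is a minimal sketch and not part of scrapy-pagestorage: ``page_storage_settings`` is a hypothetical helper that builds a plain dictionary from the documented setting names and leaves ``PAGE_STORAGE_MODE`` unset unless the versioned cache store is requested::

    def page_storage_settings(versioned=False, limit=100, error_limit=100,
                              trim_html=True):
        """Build a dict of PAGE_STORAGE_* settings (hypothetical helper)."""
        settings = {
            # Store visited pages and pages that produced errors.
            "PAGE_STORAGE_ENABLED": True,
            "PAGE_STORAGE_ON_ERROR_ENABLED": True,
            # Caps on how many pages of each kind are stored.
            "PAGE_STORAGE_LIMIT": limit,
            "PAGE_STORAGE_ON_ERROR_LIMIT": error_limit,
            # Strip leading and trailing whitespace from stored HTML.
            "PAGE_STORAGE_TRIM_HTML": trim_html,
        }
        if versioned:
            # Omitting the key keeps the documented default of None.
            settings["PAGE_STORAGE_MODE"] = "VERSIONED_CACHE"
        return settings


    if __name__ == "__main__":
        for name, value in sorted(page_storage_settings(versioned=True).items()):
            print(f"{name} = {value!r}")

A dictionary like this could, for example, be assigned to a spider's ``custom_settings`` attribute, assuming the middleware itself is already enabled in ``SPIDER_MIDDLEWARES``.
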
Settings
========

PAGE_STORAGE_MODE
-----------------

Default: ``None``

A string that specifies whether the extension stores information in the cache store or in the
versioned cache store (set ``PAGE_STORAGE_MODE="VERSIONED_CACHE"`` to use the versioned one).

PAGE_STORAGE_LIMIT
------------------

An integer that limits the number of visited pages to store.

PAGE_STORAGE_ON_ERROR_LIMIT
---------------------------

An integer that limits the number of page errors to store.

PAGE_STORAGE_TRIM_HTML
----------------------

Default: ``False``

Remove whitespace from the start and end of the HTML to reduce file size.
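
To illustrate how a storage limit and HTML trimming could interact, here is a minimal sketch. It is not the extension's implementation: ``LimitedPageStore`` is a hypothetical in-memory stand-in for the storage service, and skipping pages once the limit is reached is an assumption about the behaviour::

    class LimitedPageStore:
        """Hypothetical in-memory store with a page limit and optional trimming."""

        def __init__(self, limit, trim_html=False):
            self.limit = limit          # plays the role of PAGE_STORAGE_LIMIT
            self.trim_html = trim_html  # plays the role of PAGE_STORAGE_TRIM_HTML
            self.pages = []

        def save(self, url, body):
            """Store a page; return False if the limit was already reached."""
            if len(self.pages) >= self.limit:
                return False
            if self.trim_html:
                # Remove whitespace from the start and end of the HTML only.
                body = body.strip()
            self.pages.append((url, body))
            return True


    if __name__ == "__main__":
        store = LimitedPageStore(limit=2, trim_html=True)
        for url in ("https://example.com/a", "https://example.com/b",
                    "https://example.com/c"):
            stored = store.save(url, "\n  <html><body>hi</body></html>  \n")
            print(url, "stored" if stored else "skipped")

Only leading and trailing whitespace is removed in this sketch; whitespace inside the document is left untouched, which matches the description of ``PAGE_STORAGE_TRIM_HTML`` above.
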