https://github.com/scrapy-plugins/scrapy-streamitem
Scrapy support for working with streamcorpus Stream Items.
https://github.com/scrapy-plugins/scrapy-streamitem
Last synced: about 1 year ago
JSON representation
Scrapy support for working with streamcorpus Stream Items.
- Host: GitHub
- URL: https://github.com/scrapy-plugins/scrapy-streamitem
- Owner: scrapy-plugins
- Created: 2014-12-23T15:01:22.000Z (over 11 years ago)
- Default Branch: master
- Last Pushed: 2015-01-07T16:26:59.000Z (over 11 years ago)
- Last Synced: 2025-04-07T08:02:05.231Z (about 1 year ago)
- Language: Python
- Size: 840 KB
- Stars: 11
- Watchers: 56
- Forks: 4
- Open Issues: 0
-
Metadata Files:
- Readme: README.rst
Awesome Lists containing this project
README
=================
scrapy-streamitem
=================
.. image:: https://badge.fury.io/py/scrapy-streamitem.png
:target: http://badge.fury.io/py/scrapy-streamitem
.. image:: https://api.travis-ci.org/scrapinghub/scrapy-streamitem.png?branch=master
:target: http://travis-ci.org/scrapinghub/scrapy-streamitem
Overview
========
Scrapy support for working with streamcorpus_ StreamItems_.
Includes the following:
- **StreamItem**: Scrapy Stream Item definition. ``streamitem.items.StreamItem``
- **StreamItemLoader**: Scrapy Itemloader for ``StreamItem``. ``streamitem.loaders.StreamItemLoader``
- **StreamItemExporter**: Scrapy ItemExporter to .sc file. ``streamitem.exporters.StreamItemExporter``
- **StreamItemFileFeedStorage**: Scrapy FileFeedStorage to handle .sc files. ``streamitem.storages.StreamItemFileFeedStorage``
Stream Items
============
Scrapy Stream Item will be populated from response with the following fields:
- **url**: A string containing the URL of the response.
- **body**: A string containing the body of this Response.
- **source_url**: If response has been redirected, a string containing the URL of the original page. Defaults to None.
- **redirect_urls**: If response has been redirected, a list containing the URLs of all the redirected pages, including the current one. Defaults to None.
- **http_status**: An integer representing the HTTP status of the response. Example: 200, 404.
- **content_type**: A string containing the Content-Type HTTP header of the response.
- **response_size**: An integer representing the response body size in bytes.
- **metadata**: A dict containing arbitrary metadata for this page.
If items are exported they will generate streamcorpus StreamItem_ items filling the following fields:
- **abs_url**: item.url
- **source_url**: item.source_url
- **body.raw**: item.body
- **body.media_type**: item.content_type
- **body.language.code**: item.metadata.language_code
- **body.language.name**: item.metadata.language_name
- **source_metadata['redirect_urls']**: item.redirect_urls
- **source_metadata['response_size']**: item.response_size
- **source_metadata**: will be filled with all fields in item.metadata
How to use it
=============
An example of use from a spider::
def parse_page(self, response):
loader = StreamItemLoader(item=StreamItem(), response=response)
return loader.load_item()
Settings for exporting::
FEED_URI = ".exports/streamitems.sc"
FEED_FORMAT = "streamcorpus"
FEED_EXPORTERS = {
'streamcorpus': 'scrapylib.streamitem.exporters.StreamItemExporter',
}
FEED_STORAGES = {
'': 'scrapylib.streamitem.storages.StreamItemFileFeedStorage',
}
You can also add additional info to your item using the ``metadata`` field.
For example from a Item pipeline::
def process_item(self, item, spider):
item['metadata']['my_custom_field'] = 'whatever'
return item
Requirements
============
* Scrapy_ >= 0.22.0
* streamcorpus_
Install
=======
using pypi::
pip install scrapy-streamitem
.. _streamcorpus: https://github.com/trec-kba/streamcorpus
.. _StreamItem: http://streamcorpus.org/sphinx-docs/streamcorpus.html#stream-items
.. _StreamItems: http://streamcorpus.org/sphinx-docs/streamcorpus.html#stream-items
.. _Scrapy: https://github.com/scrapinghub/scrapy