
==================
scrapy-magicfields
==================

.. image:: https://travis-ci.org/scrapy-plugins/scrapy-magicfields.svg?branch=master
   :target: https://travis-ci.org/scrapy-plugins/scrapy-magicfields

.. image:: https://codecov.io/gh/scrapy-plugins/scrapy-magicfields/branch/master/graph/badge.svg
   :target: https://codecov.io/gh/scrapy-plugins/scrapy-magicfields

This is a Scrapy spider middleware to add extra fields to items,
based on the configuration settings ``MAGIC_FIELDS`` and ``MAGIC_FIELDS_OVERRIDE``.

Installation
============

Install scrapy-magicfields using ``pip``::

    $ pip install scrapy-magicfields

Configuration
=============

1. Add ``MagicFieldsMiddleware`` to ``SPIDER_MIDDLEWARES``
   in your ``settings.py`` file::

       SPIDER_MIDDLEWARES = {
           'scrapy_magicfields.MagicFieldsMiddleware': 100,
       }

   Here, priority ``100`` is just an example.
   Set its value depending on the other middlewares you already have enabled.

2. Enable the middleware using ``MAGIC_FIELDS`` (and optionally ``MAGIC_FIELDS_OVERRIDE``)
   in your ``settings.py``.
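
Put together, the two steps might look like the following sketch of a ``settings.py``.
The priority ``100`` is the example value from step 1, and the field name
``scraped_at`` is a hypothetical choice made only for this illustration.

```python
# settings.py (sketch): both configuration steps together.

# Step 1: register the middleware. 100 is only an example priority.
SPIDER_MIDDLEWARES = {
    "scrapy_magicfields.MagicFieldsMiddleware": 100,
}

# Step 2: define at least one magic field.
# "scraped_at" is a hypothetical destination field name.
MAGIC_FIELDS = {
    "scraped_at": "$time",
}
```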

Usage
=====

Both settings ``MAGIC_FIELDS`` and ``MAGIC_FIELDS_OVERRIDE`` are dicts:

* the keys are the destination field names,
* their value is a string which accepts **magic variables**,
  identified by a starting ``$`` (dollar sign),
  which will be substituted by a corresponding value at runtime.

Some magic variables also accept arguments, which are specified after the magic name,
using a ``:`` (colon) as separator.

You can set project-global magics with ``MAGIC_FIELDS``,
and tune them for a specific spider using ``MAGIC_FIELDS_OVERRIDE``.
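
One way to picture how the two settings relate is the sketch below.
It assumes that override entries are layered on top of the project-wide ones,
so an override wins when both dicts define the same field; that rule is an assumption
of this example. The field names and values are hypothetical. In a Scrapy project,
a per-spider value would typically be supplied through the spider's ``custom_settings`` attribute.

```python
# Project-wide magic fields, as they might appear in settings.py.
MAGIC_FIELDS = {
    "timestamp": "$time",
    "spider": "$spider:name",
}

# Adjustments for one spider (hypothetical values).
MAGIC_FIELDS_OVERRIDE = {
    "timestamp": "$isotime",        # replaces the project-wide definition
    "source_url": "$response:url",  # adds a field only this spider needs
}

# Assumed combination rule: start from the project-wide dict and let
# override entries win on key collisions.
effective = {**MAGIC_FIELDS, **MAGIC_FIELDS_OVERRIDE}

print(effective)
```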

If there is more than one argument, they must be separated by ``,`` (comma).
So the generic magic format is::

    $<magic name>[:arg1,arg2,...]
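
To make the format concrete, here is a minimal sketch of how such a string could be
split into a magic name and its arguments. It is not the middleware's own parser:
the pattern ``MAGIC_RE`` and the helper ``parse_magics`` are hypothetical names,
and the naive comma split would need more care for a regular expression argument
that itself contains a comma.

```python
import re

# Hypothetical pattern for illustration: "$", a name, then an optional
# ":"-prefixed argument string that runs until the next whitespace.
MAGIC_RE = re.compile(r"\$(\w+)(?::(\S+))?")


def parse_magics(template):
    """Return a list of (name, [args]) pairs found in a template string."""
    found = []
    for name, raw_args in MAGIC_RE.findall(template):
        # Naive comma split: enough for simple arguments.
        args = raw_args.split(",") if raw_args else []
        found.append((name, args))
    return found


print(parse_magics("item scraped at $time"))
print(parse_magics("$spider:name"))
print(parse_magics("$field:url,r'item_no=(\\d+)'"))
```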

Supported magic variables
-------------------------

``$time``
    the UTC timestamp at which the item was scraped, in format ``'%Y-%m-%d %H:%M:%S'``.

``$unixtime``
    the unixtime (number of seconds since the Epoch, i.e. ``time.time()``)
    at which the item was scraped.

``$isotime``
    the UTC timestamp at which the item was scraped, in format ``'%Y-%m-%dT%H:%M:%S'``.

``$spider``
    must be followed by an argument,
    which is the name of an attribute of the spider (like an argument passed to it).

``$env``
    the value of an environment variable.
    It accepts as argument the name of the variable.

``$jobid``
    the job id (shortcut for ``$env:SCRAPY_JOB``).

``$jobtime``
    the UTC timestamp at which the job started, in format ``'%Y-%m-%d %H:%M:%S'``.

``$response``
    Access to some response properties.

    ``$response:url``
        The URL from which the item was extracted.

    ``$response:status``
        Response HTTP status.

    ``$response:headers``
        Response HTTP headers.

``$setting``
    Access the given Scrapy setting. It accepts one argument: the name of the setting.

``$field``
    Copies the value of one field to another.
    Its argument is the source field.
    Effects are unpredictable if the source field is itself filled
    using magic fields.
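
For reference, the format strings listed for the time-related magics can be tried out
with the standard library alone. The sketch below uses a fixed, hypothetical UTC instant
so that the results are reproducible; it only illustrates the formats and does not
call the middleware.

```python
from datetime import datetime, timezone

# A fixed UTC instant (hypothetical) so the results are reproducible.
scraped_at = datetime(2024, 1, 2, 3, 4, 5, tzinfo=timezone.utc)

# Format documented for $time and $jobtime.
time_value = scraped_at.strftime("%Y-%m-%d %H:%M:%S")

# Format documented for $isotime.
isotime_value = scraped_at.strftime("%Y-%m-%dT%H:%M:%S")

# Seconds since the Epoch, the same kind of value time.time() returns.
unixtime_value = scraped_at.timestamp()

print(time_value)     # 2024-01-02 03:04:05
print(isotime_value)  # 2024-01-02T03:04:05
print(unixtime_value)
```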

Examples
--------

The following configuration will add two fields to each scraped item:

- ``'timestamp'``, which will be filled with the string ``'item scraped at <scraped timestamp>'``,
- and ``'spider'``, which will contain the spider name

::

    MAGIC_FIELDS = {
        "timestamp": "item scraped at $time",
        "spider": "$spider:name"
    }

The following configuration will copy the url to the field sku::

    MAGIC_FIELDS = {
        "sku": "$field:url"
    }

Magics also accept a regular expression argument, which lets you extract
and assign only part of the value generated by the magic.
You have to specify it using the ``r''`` notation.

Let's pretend that the urls of your items look like ``'http://www.example.com/product.html?item_no=345'``
and you want to assign to the ``sku`` field only the item number.

The following example, similar to the previous one but with a second regular expression argument,
will do the task::

    MAGIC_FIELDS = {
        # The backslash is doubled so that Python passes "\d" through unchanged.
        "sku": "$field:url,r'item_no=(\\d+)'"
    }
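
To check what the pattern captures before putting it in your settings, you can try it
with Python's ``re`` module. This sketch assumes the middleware keeps the captured group
of the first match; it shows only the regular expression at work, not the middleware itself.

```python
import re

url = "http://www.example.com/product.html?item_no=345"

# Same pattern as in the MAGIC_FIELDS example above.
pattern = re.compile(r"item_no=(\d+)")

match = pattern.search(url)
# Keep only the captured group; fall back to None when nothing matches.
sku = match.group(1) if match else None

print(sku)  # 345
```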