Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/scrapy-plugins/scrapy-magicfields
Scrapy middleware to add extra fields to items, like timestamp, response fields, spider attributes etc.
- Host: GitHub
- URL: https://github.com/scrapy-plugins/scrapy-magicfields
- Owner: scrapy-plugins
- License: bsd-3-clause
- Created: 2016-06-29T13:05:56.000Z (over 8 years ago)
- Default Branch: master
- Last Pushed: 2022-03-16T01:17:57.000Z (over 2 years ago)
- Last Synced: 2024-09-21T16:29:25.072Z (about 2 months ago)
- Language: Python
- Size: 14.6 KB
- Stars: 56
- Watchers: 10
- Forks: 7
- Open Issues: 0
Metadata Files:
- Readme: README.rst
- Changelog: CHANGES.rst
- License: LICENSE
Awesome Lists containing this project
- awesome - scrapy-magicfields - Scrapy middleware to add extra fields to items, like timestamp, response fields, spider attributes etc. (Scrapy Middleware)
- awesome-scrapy - scrapy-magicfields
README
==================
scrapy-magicfields
==================

.. image:: https://travis-ci.org/scrapy-plugins/scrapy-magicfields.svg?branch=master
   :target: https://travis-ci.org/scrapy-plugins/scrapy-magicfields

.. image:: https://codecov.io/gh/scrapy-plugins/scrapy-magicfields/branch/master/graph/badge.svg
   :target: https://codecov.io/gh/scrapy-plugins/scrapy-magicfields

This is a Scrapy spider middleware to add extra fields to items,
based on the configuration settings ``MAGIC_FIELDS`` and ``MAGIC_FIELDS_OVERRIDE``.

Installation
============

Install scrapy-magicfields using ``pip``::

    $ pip install scrapy-magicfields
Configuration
=============

1. Add MagicFieldsMiddleware by including it in ``SPIDER_MIDDLEWARES``
   in your ``settings.py`` file::

       SPIDER_MIDDLEWARES = {
           'scrapy_magicfields.MagicFieldsMiddleware': 100,
       }

   Here, priority ``100`` is just an example.
   Set its value depending on other middlewares you may have enabled already.

2. Enable the middleware using ``MAGIC_FIELDS`` (and optionally ``MAGIC_FIELDS_OVERRIDE``)
   in your ``settings.py``, as shown in the sketch below.
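Putting both steps together, a minimal ``settings.py`` sketch might look like this
(the magic field shown is only an illustration; see the Usage section below for the
full syntax)::

    # settings.py -- illustrative project configuration
    SPIDER_MIDDLEWARES = {
        'scrapy_magicfields.MagicFieldsMiddleware': 100,
    }

    # Adds a UTC scrape timestamp to every item (example field name and value)
    MAGIC_FIELDS = {
        "timestamp": "item scraped at $time",
    }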
Usage
=====

Both settings ``MAGIC_FIELDS`` and ``MAGIC_FIELDS_OVERRIDE`` are dicts:

* the keys are the destination field names,
* their values are strings which accept **magic variables**,
  identified by a starting ``$`` (dollar sign),
  which will be substituted by a corresponding value at runtime.

Some magic variables also accept arguments, which are specified after the magic name,
using a ``:`` (colon) as separator.

You can set project-global magics with ``MAGIC_FIELDS``,
and tune them for a specific spider using ``MAGIC_FIELDS_OVERRIDE``.

If there is more than one argument, the arguments must be separated by ``,`` (comma).
So the generic magic format is::

    $<magic name>[:arg1,arg2,...]
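As a hedged sketch of that per-spider tuning, assuming the override dict is supplied
through Scrapy's standard ``custom_settings`` mechanism (check the plugin's docs for
exactly how the two dicts are merged), a spider might look like::

    # myspider.py -- hypothetical spider overriding one project-wide magic field
    import scrapy


    class MySpider(scrapy.Spider):
        name = "myspider"
        start_urls = ["http://www.example.com/"]

        # Overrides the project-level "timestamp" magic for this spider only
        custom_settings = {
            "MAGIC_FIELDS_OVERRIDE": {
                "timestamp": "scraped by $spider:name at $time",
            },
        }

        def parse(self, response):
            yield {"url": response.url}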
Supported magic variables
-------------------------

``$time``
    the UTC timestamp at which the item was scraped, in format ``'%Y-%m-%d %H:%M:%S'``.

``$unixtime``
    the unixtime (number of seconds since the Epoch, i.e. ``time.time()``)
    at which the item was scraped.

``$isotime``
    the UTC timestamp at which the item was scraped, in format ``'%Y-%m-%dT%H:%M:%S'``.

``$spider``
    must be followed by an argument,
    which is the name of an attribute of the spider (like an argument passed to it).

``$env``
    the value of an environment variable.
    It accepts as argument the name of the variable.

``$jobid``
    the job id (shortcut for ``$env:SCRAPY_JOB``).

``$jobtime``
    the UTC timestamp at which the job started, in format ``'%Y-%m-%d %H:%M:%S'``.

``$response``
    access to some response properties:

    ``$response:url``
        the URL from which the item was extracted.

    ``$response:status``
        the response HTTP status.

    ``$response:headers``
        the response HTTP headers.

``$setting``
    access to the given Scrapy setting. It accepts one argument: the name of the setting.

``$field``
    allows to copy the value of one field to another.
    Its argument is the source field.
    Effects are unpredictable if you use as source a field that is filled
    using magic fields.
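As an illustrative combination of several of these variables (the destination field
names on the left are arbitrary, chosen only for this sketch)::

    MAGIC_FIELDS = {
        "url": "$response:url",
        "status": "$response:status",
        "job_started": "$jobtime",
        "bot_name": "$setting:BOT_NAME",
    }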
Examples
--------

The following configuration will add two fields to each scraped item:

- ``'timestamp'``, which will be filled with the string ``'item scraped at '`` followed by the scrape time,
- and ``'spider'``, which will contain the spider name::
    MAGIC_FIELDS = {
        "timestamp": "item scraped at $time",
        "spider": "$spider:name"
    }
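For instance, for a hypothetical spider named ``example`` scraping an item with a
single ``name`` field, the resulting item might look roughly like this (illustrative
values, not actual plugin output)::

    {
        "name": "Some product",
        "timestamp": "item scraped at 2016-06-29 13:05:56",
        "spider": "example"
    }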
The following configuration will copy the url to the field ``sku``::

    MAGIC_FIELDS = {
        "sku": "$field:url"
    }
Magics also accept a regular expression argument which allows you to extract
and assign only part of the value generated by the magic.
You have to specify it using the ``r''`` notation.

Let's pretend that the URLs of your items look like
``'http://www.example.com/product.html?item_no=345'``
and you want to assign only the item number to the ``sku`` field.
The following example, similar to the previous one but with a second regular expression argument,
will do the task::

    MAGIC_FIELDS = {
        "sku": "$field:url,r'item_no=(\d+)'"
    }
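With that configuration, and assuming the item already carries the ``url`` field shown
above, the captured group would be assigned to ``sku``, giving an item roughly like
(illustrative values)::

    {
        "url": "http://www.example.com/product.html?item_no=345",
        "sku": "345"
    }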