https://github.com/scrapy-plugins/scrapy-bigml

Scrapy pipeline for writing items to BigML datasets

Host: GitHub
URL: https://github.com/scrapy-plugins/scrapy-bigml
Owner: scrapy-plugins
Created: 2015-11-12T20:24:15.000Z (over 10 years ago)
Default Branch: master
Last Pushed: 2015-11-17T02:34:01.000Z (over 10 years ago)
Last Synced: 2025-04-07T07:52:36.690Z (about 1 year ago)
Language: Python
Size: 6.84 KB
Stars: 4
Watchers: 4
Forks: 3
Open Issues: 0
Metadata Files:
- Readme: README.rst

============
scrapy-bigml
============

scrapy-bigml facilitates creating BigML sources and datasets from Scrapy
crawls. It can be used either as a feed storage or as a pipeline.

BigML configuration
===================

Credentials
-----------

For both usage methods (feed storage or pipeline), you need to supply your
BigML credentials. You can do this either by setting them as environment
variables::

    # in shell
    export BIGML_USERNAME=your_username
    export BIGML_API_KEY=your_api_key

Or by supplying them as Scrapy settings::

    BIGML_USERNAME = 'your_username'
    BIGML_API_KEY = 'your_api_key'

If you use scrapy-bigml as a feed storage, you can also provide them by adding
them to your feed URI::

    FEED_URI = 'bigml://your_username:your_api_key@your_source_name'
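
To make the layout of such a URI concrete, here is a minimal sketch that splits
it with Python's standard ``urllib.parse``. The helper name
``split_bigml_feed_uri`` is hypothetical and is not part of scrapy-bigml; how
the package itself parses the URI is not described in this README.

```python
from urllib.parse import urlsplit


def split_bigml_feed_uri(uri):
    """Return (username, api_key, source_name) from a bigml:// feed URI.

    Hypothetical helper for illustration only; not scrapy-bigml's own code.
    username and api_key are None when the URI carries no credentials.
    """
    parts = urlsplit(uri)
    if parts.scheme != "bigml":
        raise ValueError("expected a bigml:// URI, got %r" % uri)
    # Take the source name from netloc rather than .hostname, because
    # .hostname lowercases its value.
    source_name = parts.netloc.rpartition("@")[2]
    return parts.username, parts.password, source_name


print(split_bigml_feed_uri("bigml://your_username:your_api_key@your_source_name"))
```

A URI without the ``username:api_key@`` part yields ``None`` for both
credentials, in which case they would have to come from the environment
variables or settings shown above.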

Development mode
----------------

During development, you probably want to enable BigML's dev mode::

    BIGML_DEVMODE = True

Usage as feed storage
=====================

scrapy-bigml can be used as a storage backend on top of Scrapy's feed exports.
To use it, adjust your Scrapy settings: set the feed format to either ``csv``
(preferred) or ``json``, enable the ``bigml`` feed storage, and provide a
feed URI containing the name you wish to use for your BigML source::

    FEED_FORMAT = 'csv'
    FEED_STORAGES = {'bigml': 'scrapy_bigml.BigMLFeedStorage'}
    FEED_URI = 'bigml://your_source_name'

A spider with example configuration can be found in
``example_spider_feedstorage.py``.
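
The three feed storage settings above can also be assembled in code, for
instance to reuse them across several spiders. The following is a minimal
sketch, not taken from the example spider; the function name
``bigml_feed_settings`` is hypothetical, and only the setting names and values
come from this README.

```python
def bigml_feed_settings(source_name, feed_format="csv"):
    """Build the Scrapy settings that route a feed export to BigML.

    Hypothetical helper: only the setting keys and the storage class path
    are taken from the README.
    """
    if feed_format not in ("csv", "json"):
        raise ValueError("feed format must be 'csv' or 'json'")
    return {
        "FEED_FORMAT": feed_format,
        # Register the bigml:// URI scheme with the storage backend.
        "FEED_STORAGES": {"bigml": "scrapy_bigml.BigMLFeedStorage"},
        "FEED_URI": "bigml://%s" % source_name,
    }


print(bigml_feed_settings("your_source_name"))
```

The returned dictionary could, for example, be assigned to a spider's
``custom_settings`` class attribute so that the configuration applies to that
spider only.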

Usage as pipeline
=================

If you wish to use scrapy-bigml as a pipeline, all you need to do is enable the
pipeline::

    ITEM_PIPELINES = {'scrapy_bigml.BigMLPipeline': 500}

You should also set a name for your BigML source (otherwise scrapy-bigml
defaults to "Scrapy")::

    BIGML_SOURCE_NAME = 'Your source name'

A spider with example configuration can be found in
``example_spider_pipeline.py``.
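
For readers unfamiliar with item pipelines, the sketch below shows the general
shape of one: Scrapy calls ``open_spider``, ``process_item`` and
``close_spider`` on the enabled class. This is not the implementation of
``scrapy_bigml.BigMLPipeline``; the class ``CsvBufferPipeline`` is a
hypothetical stand-in that only collects items and renders them as CSV, and it
makes no calls to BigML.

```python
import csv
import io


class CsvBufferPipeline:
    """Hypothetical stand-in illustrating Scrapy's item pipeline hooks."""

    def open_spider(self, spider):
        # Called once when the crawl starts.
        self.rows = []

    def process_item(self, item, spider):
        # Called for every scraped item; must return the item so that
        # later pipelines still receive it.
        self.rows.append(dict(item))
        return item

    def close_spider(self, spider):
        # Called once when the crawl ends: render all rows as CSV text.
        # A real pipeline would hand this payload to its backend here.
        fieldnames = sorted({key for row in self.rows for key in row})
        buffer = io.StringIO()
        writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
        writer.writeheader()
        writer.writerows(self.rows)
        self.csv_payload = buffer.getvalue()


pipeline = CsvBufferPipeline()
pipeline.open_spider(spider=None)
pipeline.process_item({"title": "a", "price": 1}, spider=None)
pipeline.close_spider(spider=None)
print(pipeline.csv_payload)
```

The number in ``ITEM_PIPELINES`` (500 above) is the order in which Scrapy runs
enabled pipelines, with lower values running first.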
