https://github.com/scrapy-plugins/scrapy-bigml
Scrapy pipeline for writing items to BigML datasets
- Host: GitHub
- URL: https://github.com/scrapy-plugins/scrapy-bigml
- Owner: scrapy-plugins
- Created: 2015-11-12T20:24:15.000Z (over 10 years ago)
- Default Branch: master
- Last Pushed: 2015-11-17T02:34:01.000Z (over 10 years ago)
- Last Synced: 2025-04-07T07:52:36.690Z (about 1 year ago)
- Language: Python
- Size: 6.84 KB
- Stars: 4
- Watchers: 4
- Forks: 3
- Open Issues: 0
Metadata Files:
- Readme: README.rst
============
scrapy-bigml
============
scrapy-bigml facilitates creating `BigML <https://bigml.com/>`_ sources and
datasets from `Scrapy <https://scrapy.org/>`_ crawls. It can be used either as
a feed storage or as a pipeline.
BigML configuration
===================

Credentials
-----------
For both usage methods (feed storage or pipeline), you need to supply your
BigML credentials. You can do this either by supplying them as environment
variables::

    # in shell
    export BIGML_USERNAME=your_username
    export BIGML_API_KEY=your_api_key

Or by supplying them as Scrapy settings::

    BIGML_USERNAME = 'your_username'
    BIGML_API_KEY = 'your_api_key'

If you use scrapy-bigml as a feed storage, you can also provide them by adding
them to your feed URI::

    FEED_URI = 'bigml://your_username:your_api_key@your_source_name'
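
To illustrate how the parts of such a URI fit together, the sketch below
splits it using only the standard library. ``parse_bigml_uri`` is a
hypothetical helper written for this example; it is not part of scrapy-bigml,
and the extension's own parsing may differ::

    from urllib.parse import urlparse


    def parse_bigml_uri(uri):
        """Split a bigml:// feed URI into credentials and source name.

        Hypothetical helper for illustration only.
        """
        parsed = urlparse(uri)
        if parsed.scheme != 'bigml':
            raise ValueError('not a bigml:// URI: %r' % uri)
        # Use netloc rather than hostname: hostname is lowercased by
        # urlparse, which would alter a mixed-case source name.
        credentials, _, source_name = parsed.netloc.rpartition('@')
        username, _, api_key = credentials.partition(':')
        return {
            'username': username or None,
            'api_key': api_key or None,
            'source_name': source_name,
        }

When the URI carries no credentials (``bigml://your_source_name``), the helper
returns ``None`` for both ``username`` and ``api_key``, which corresponds to
the case where they come from the environment or the Scrapy settings instead.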
Development mode
----------------
During development, you probably want to enable BigML's dev mode::

    BIGML_DEVMODE = True
Usage as feed storage
=====================
scrapy-bigml can be used as a storage backend on top of Scrapy's `feed exports
<https://docs.scrapy.org/en/latest/topics/feed-exports.html>`_. To use it,
adjust your Scrapy settings: set the feed format to either ``csv`` (preferred)
or ``json``, enable the ``bigml`` feed storage, and provide a corresponding
feed URI with the name you wish to use for your BigML source::

    FEED_FORMAT = 'csv'
    FEED_STORAGES = {'bigml': 'scrapy_bigml.BigMLFeedStorage'}
    FEED_URI = 'bigml://your_source_name'
A spider with example configuration can be found in
``example_spider_feedstorage.py``.
Usage as pipeline
=================
If you wish to use scrapy-bigml as a pipeline, all you need to do is enable the
pipeline::

    ITEM_PIPELINES = {'scrapy_bigml.BigMLPipeline': 500}
You should also set a name for your BigML source (if not, scrapy-bigml will
default to "Scrapy")::

    BIGML_SOURCE_NAME = 'Your source name'
A spider with example configuration can be found in
``example_spider_pipeline.py``.
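
For a side-by-side view of the two usage modes, the sketch below builds the
settings documented above as a plain dictionary, which could for instance be
assigned to a spider's ``custom_settings``. ``bigml_settings`` is a
hypothetical helper invented for this example and is not provided by
scrapy-bigml; only the setting names and values are taken from this README::

    def bigml_settings(mode, source_name='Scrapy', devmode=True):
        """Return the Scrapy settings for one scrapy-bigml usage mode.

        Hypothetical helper: ``mode`` is 'feed' or 'pipeline'.
        """
        settings = {'BIGML_DEVMODE': devmode}
        if mode == 'feed':
            # Feed storage: the source name is carried by the feed URI.
            settings.update({
                'FEED_FORMAT': 'csv',
                'FEED_STORAGES': {'bigml': 'scrapy_bigml.BigMLFeedStorage'},
                'FEED_URI': 'bigml://' + source_name,
            })
        elif mode == 'pipeline':
            # Pipeline: the source name has its own setting.
            settings.update({
                'ITEM_PIPELINES': {'scrapy_bigml.BigMLPipeline': 500},
                'BIGML_SOURCE_NAME': source_name,
            })
        else:
            raise ValueError("mode must be 'feed' or 'pipeline'")
        return settings

Credentials are deliberately left out of the dictionary here, on the
assumption that they are supplied through the environment variables described
under Credentials.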