https://github.com/rmax/scrapydo

ScrapyDo
========

Crochet_-based blocking API for Scrapy_.

This module provides helper functions to run Scrapy_ in a blocking fashion. See
the ``scrapydo-overview.ipynb`` notebook for a quick overview of this module.

Installation
============

Using ``pip``::

    pip install scrapydo

Usage
=====

The function ``scrapydo.setup`` must be called once to initialize the reactor.

Example:

.. code:: python

    import scrapydo
    from scrapy import Request

    # Initialize the reactor; call this once before anything else.
    scrapydo.setup()

    scrapydo.default_settings.update({
        'LOG_LEVEL': 'DEBUG',
        'CLOSESPIDER_PAGECOUNT': 10,
    })

    # Enable logging display.
    import logging
    logging.basicConfig(level=logging.DEBUG)

    # Fetch a single URL.
    response = scrapydo.fetch("http://example.com")

    # Crawl a URL with the given callback.
    def parse_page(response):
        yield {
            'title': response.css('title').extract(),
            'url': response.url,
        }
        # Extract each href as a string before joining it with the page URL.
        for href in response.css('a::attr(href)').extract():
            url = response.urljoin(href)
            yield Request(url, callback=parse_page)

    items = scrapydo.crawl('http://example.com', parse_page)

    # Run an existing spider class (MySpider is assumed to be defined elsewhere).
    spider_args = {'foo': 'bar'}
    items = scrapydo.run_spider(MySpider, **spider_args)
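
The callback passed to ``scrapydo.crawl`` is a generator that yields both scraped items and follow-up requests. The toy crawl loop below is a minimal sketch of that contract using only the standard library. It is not how ScrapyDo or Scrapy is implemented: ``PAGES``, ``ToyRequest``, ``ToyResponse`` and ``toy_crawl`` are hypothetical names invented for illustration, and the "web" is an in-memory dictionary.

```python
from collections import deque
from urllib.parse import urljoin

# Hypothetical in-memory "web": URL -> (title, list of hrefs on that page).
PAGES = {
    "http://example.com/": ("Home", ["/a", "/b"]),
    "http://example.com/a": ("Page A", ["/b"]),
    "http://example.com/b": ("Page B", []),
}


class ToyRequest:
    """Stand-in for a request: a URL plus the callback that parses it."""

    def __init__(self, url, callback):
        self.url = url
        self.callback = callback


class ToyResponse:
    """Stand-in for a response, looked up in PAGES instead of the network."""

    def __init__(self, url):
        self.url = url
        self.title, self.links = PAGES[url]

    def urljoin(self, href):
        # Resolve a possibly relative href against this page's URL.
        return urljoin(self.url, href)


def parse_page(response):
    # Yield one item for the page, then one request per link.
    yield {"title": response.title, "url": response.url}
    for href in response.links:
        yield ToyRequest(response.urljoin(href), callback=parse_page)


def toy_crawl(url, callback):
    """Run callbacks breadth-first, collect items, skip URLs already seen."""
    items, seen = [], {url}
    queue = deque([ToyRequest(url, callback)])
    while queue:
        request = queue.popleft()
        for result in request.callback(ToyResponse(request.url)):
            if isinstance(result, ToyRequest):
                if result.url not in seen:
                    seen.add(result.url)
                    queue.append(result)
            else:
                items.append(result)
    return items


items = toy_crawl("http://example.com/", parse_page)
```

Here ``items`` ends up as a plain list of dictionaries, one per visited page, which mirrors the "returns the scraped items" behaviour described for ``scrapydo.crawl``.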

Available Functions
===================

``scrapydo.setup()``
    Initializes the reactor.

``scrapydo.fetch(url, spider_cls=DefaultSpider, capture_items=True, return_crawler=False, settings=None, timeout=DEFAULT_TIMEOUT)``
    Fetches a URL and returns the response.

``scrapydo.crawl(url, callback, spider_cls=DefaultSpider, capture_items=True, return_crawler=False, settings=None, timeout=DEFAULT_TIMEOUT)``
    Crawls a URL with the given callback and returns the scraped items.

``scrapydo.run_spider(spider_cls, capture_items=True, return_crawler=False, settings=None, timeout=DEFAULT_TIMEOUT, **kwargs)``
    Runs a spider and returns the scraped items.

``highlight(code, lexer='html', formatter='html', output_wrapper=None)``
    Highlights the given code using pygments. This function is suitable for use in an IPython notebook.

.. _Scrapy: http://scrapy.org
.. _Crochet: https://github.com/itamarst/crochet