https://github.com/rmax/scrapydo

Crochet-based blocking API for Scrapy.

Host: GitHub
URL: https://github.com/rmax/scrapydo
Owner: rmax
License: mit
Created: 2015-07-27T03:34:19.000Z (almost 11 years ago)
Default Branch: master
Last Pushed: 2017-02-24T03:19:26.000Z (over 9 years ago)
Last Synced: 2025-05-18T08:07:28.900Z (about 1 year ago)
Language: Jupyter Notebook
Size: 112 KB
Stars: 46
Watchers: 4
Forks: 11
Open Issues: 4
Metadata Files:
- Readme: README.rst
- Changelog: HISTORY.rst
- License: LICENSE


ScrapyDo
========

Crochet_-based blocking API for Scrapy_.

This module provides helper functions to run Scrapy_ in a blocking fashion. See
the ``scrapydo-overview.ipynb`` notebook for a quick overview of this module.

Installation
============

Using ``pip``::

  pip install scrapydo

Usage
=====

The function ``scrapydo.setup`` must be called once to initialize the reactor.

Example:

.. code:: python

    import logging

    import scrapydo
    from scrapy import Request

    # Initialize the reactor; this must happen once before any other call.
    scrapydo.setup()

    scrapydo.default_settings.update({
        'LOG_LEVEL': 'DEBUG',
        'CLOSESPIDER_PAGECOUNT': 10,
    })

    # Enable logging display.
    logging.basicConfig(level=logging.DEBUG)

    # Fetch a single URL.
    response = scrapydo.fetch("http://example.com")

    # Crawl a URL with the given callback.
    def parse_page(response):
        yield {
            'title': response.css('title').extract(),
            'url': response.url,
        }
        for href in response.css('a::attr(href)'):
            # ``href`` is a selector; extract the string before joining.
            url = response.urljoin(href.extract())
            yield Request(url, callback=parse_page)

    items = scrapydo.crawl('http://example.com', parse_page)

    # Run an existing spider class.
    # MySpider is assumed to be a scrapy.Spider subclass defined elsewhere.
    spider_args = {'foo': 'bar'}
    items = scrapydo.run_spider(MySpider, **spider_args)
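
The blocking style shown above can be pictured with a small standard-library
analogy. The sketch below is not scrapydo's implementation: it uses ``asyncio``
in place of Twisted and Crochet, and the names ``setup``, ``run_blocking``, and
``fake_fetch`` are hypothetical. It starts an event loop in a background thread
once, then lets ordinary synchronous code submit work and wait for the result
with a timeout.

.. code:: python

    import asyncio
    import threading

    _loop = None  # event loop owned by the background thread


    def setup():
        """Start the background event loop once; later calls do nothing."""
        global _loop
        if _loop is not None:
            return
        _loop = asyncio.new_event_loop()
        thread = threading.Thread(target=_loop.run_forever, daemon=True)
        thread.start()


    def run_blocking(coro, timeout=5.0):
        """Run a coroutine on the background loop and block until it finishes."""
        if _loop is None:
            raise RuntimeError("call setup() first")
        future = asyncio.run_coroutine_threadsafe(coro, _loop)
        # Raises concurrent.futures.TimeoutError if the work takes too long.
        return future.result(timeout=timeout)


    async def fake_fetch(url):
        """Stand-in for a network request (hypothetical; performs no I/O)."""
        await asyncio.sleep(0.01)
        return {"url": url, "status": 200}


    setup()
    setup()  # calling it again is harmless in this sketch
    result = run_blocking(fake_fetch("http://example.com"))

In this sketch, calling ``run_blocking`` before ``setup`` raises
``RuntimeError``, which is one way to picture why the reactor has to be
initialized before any of the blocking helpers are used.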

Available Functions
===================

``scrapydo.setup()``
    Initializes the reactor.

``scrapydo.fetch(url, spider_cls=DefaultSpider, capture_items=True, return_crawler=False, settings=None, timeout=DEFAULT_TIMEOUT)``
    Fetches a URL and returns the response.

``scrapydo.crawl(url, callback, spider_cls=DefaultSpider, capture_items=True, return_crawler=False, settings=None, timeout=DEFAULT_TIMEOUT)``
    Crawls a URL with the given callback and returns the scraped items.

``scrapydo.run_spider(spider_cls, capture_items=True, return_crawler=False, settings=None, timeout=DEFAULT_TIMEOUT, **kwargs)``
    Runs a spider and returns the scraped items.

``highlight(code, lexer='html', formatter='html', output_wrapper=None)``
    Highlights the given code using Pygments. This function is suitable for use in an IPython notebook.

.. _Scrapy: http://scrapy.org

.. _Crochet: https://github.com/itamarst/crochet
