https://github.com/rmax/scrapydo
Crochet-based blocking API for Scrapy.
- Host: GitHub
- URL: https://github.com/rmax/scrapydo
- Owner: rmax
- License: mit
- Created: 2015-07-27T03:34:19.000Z (almost 11 years ago)
- Default Branch: master
- Last Pushed: 2017-02-24T03:19:26.000Z (over 9 years ago)
- Last Synced: 2025-05-18T08:07:28.900Z (about 1 year ago)
- Language: Jupyter Notebook
- Size: 112 KB
- Stars: 46
- Watchers: 4
- Forks: 11
- Open Issues: 4
Metadata Files:
- Readme: README.rst
- Changelog: HISTORY.rst
- License: LICENSE
ScrapyDo
========
Crochet_-based blocking API for Scrapy_.
This module provides helper functions to run Scrapy_ in a blocking fashion. See
the ``scrapydo-overview.ipynb`` notebook for a quick overview of this module.
Installation
============
Using ``pip``::

    pip install scrapydo

Usage
=====
The function ``scrapydo.setup`` must be called once to initialize the reactor.
Example:
.. code:: python

    import logging

    import scrapydo
    from scrapy import Request

    scrapydo.setup()
    scrapydo.default_settings.update({
        'LOG_LEVEL': 'DEBUG',
        'CLOSESPIDER_PAGECOUNT': 10,
    })

    # Enable logging display.
    logging.basicConfig(level=logging.DEBUG)

    # Fetch a single URL.
    response = scrapydo.fetch("http://example.com")

    # Crawl a URL with the given callback.
    def parse_page(response):
        yield {
            'title': response.css('title').extract(),
            'url': response.url,
        }
        # Extract the href strings before joining them with the page URL.
        for href in response.css('a::attr(href)').extract():
            url = response.urljoin(href)
            yield Request(url, callback=parse_page)

    items = scrapydo.crawl('http://example.com', parse_page)

    # Run an existing spider class (MySpider stands for your own spider).
    spider_args = {'foo': 'bar'}
    items = scrapydo.run_spider(MySpider, **spider_args)

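
The blocking style shown above depends on Crochet_, which runs the Twisted
reactor in a background thread so that the calling thread can wait for a
result. The sketch below illustrates that general pattern with the standard
library's ``asyncio`` instead of Twisted. It is an analogy only, not the code
of ScrapyDo or Crochet, and the names ``blocking_call`` and ``fake_fetch``
are hypothetical.

.. code:: python

    import asyncio
    import threading

    _loop = None  # background event loop, created once by setup()


    def setup():
        """Start the event loop in a daemon thread; later calls do nothing."""
        global _loop
        if _loop is None:
            _loop = asyncio.new_event_loop()
            threading.Thread(target=_loop.run_forever, daemon=True).start()


    def blocking_call(coro, timeout=5.0):
        """Schedule a coroutine on the background loop and wait for its result."""
        if _loop is None:
            raise RuntimeError("call setup() first")
        future = asyncio.run_coroutine_threadsafe(coro, _loop)
        # Blocks the calling thread; raises a timeout error if it takes too long.
        return future.result(timeout)


    async def fake_fetch(url):
        """Hypothetical stand-in for an asynchronous download."""
        await asyncio.sleep(0.01)
        return {"url": url, "status": 200}


    setup()
    result = blocking_call(fake_fetch("http://example.com"))
    print(result["status"])

The one-time ``setup()`` call plays the same role here as ``scrapydo.setup()``:
the loop has to be running in the background before any blocking call can
hand work to it.
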
Available Functions
===================
``scrapydo.setup()``
    Initializes the reactor.

``scrapydo.fetch(url, spider_cls=DefaultSpider, capture_items=True, return_crawler=False, settings=None, timeout=DEFAULT_TIMEOUT)``
    Fetches a URL and returns the response.

``scrapydo.crawl(url, callback, spider_cls=DefaultSpider, capture_items=True, return_crawler=False, settings=None, timeout=DEFAULT_TIMEOUT)``
    Crawls a URL with the given callback and returns the scraped items.

``scrapydo.run_spider(spider_cls, capture_items=True, return_crawler=False, settings=None, timeout=DEFAULT_TIMEOUT, **kwargs)``
    Runs a spider and returns the scraped items.

``highlight(code, lexer='html', formatter='html', output_wrapper=None)``
    Highlights the given code using Pygments. This function is suitable for use in an IPython notebook.

.. _Scrapy: http://scrapy.org
.. _Crochet: https://github.com/itamarst/crochet