https://github.com/jadbin/xpaw

Async web scraping framework
https://github.com/jadbin/xpaw

async crawler spider

Last synced: 5 months ago
JSON representation

Async web scraping framework

Host: GitHub
URL: https://github.com/jadbin/xpaw
Owner: jadbin
License: apache-2.0
Created: 2017-05-01T03:05:25.000Z (about 9 years ago)
Default Branch: master
Last Pushed: 2020-08-05T16:44:19.000Z (almost 6 years ago)
Last Synced: 2025-11-29T01:25:13.661Z (7 months ago)
Topics: async, crawler, spider
Language: Python
Homepage: http://xpaw.readthedocs.io
Size: 772 KB
Stars: 6
Watchers: 0
Forks: 1
Open Issues: 0
Metadata Files:
- Readme: README.rst
- License: LICENSE

Awesome Lists containing this project

README

          ====

xpaw

====

.. image:: https://travis-ci.org/jadbin/xpaw.svg?branch=master

    :target: https://travis-ci.org/jadbin/xpaw

.. image:: https://coveralls.io/repos/jadbin/xpaw/badge.svg?branch=master

    :target: https://coveralls.io/github/jadbin/xpaw?branch=master

.. image:: https://img.shields.io/badge/license-Apache 2-blue.svg

    :target: https://github.com/jadbin/xpaw/blob/master/LICENSE

Key Features

============

- A web scraping framework used to crawl web pages

- Data extraction tools used to extract structured data from web pages

Spider Example

==============

以下是我们的一个爬虫类示例，其作用为爬取 `百度新闻 `_ 的热点要闻:

.. code-block:: python

    from xpaw import Spider, HttpRequest, Selector, run_spider

    class BaiduNewsSpider(Spider):

        def start_requests(self):

            yield HttpRequest("http://news.baidu.com/", callback=self.parse)

        def parse(self, response):

            selector = Selector(response.text)

            hot = selector.css("div.hotnews a").text

            self.log("Hot News:")

            for i in range(len(hot)):

                self.log("%s: %s", i + 1, hot[i])

    if __name__ == '__main__':

        run_spider(BaiduNewsSpider)

在爬虫类中我们定义了一些方法：

- ``start_requests``: 返回爬虫初始请求。

- ``parse``: 处理请求得到的页面，这里借助 ``Selector`` 及CSS Selector语法提取到了我们所需的数据。

Documentation

=============

http://xpaw.readthedocs.io/

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/jadbin/xpaw

Awesome Lists containing this project

README