https://github.com/jadbin/xpaw
Async web scraping framework
- Host: GitHub
- URL: https://github.com/jadbin/xpaw
- Owner: jadbin
- License: apache-2.0
- Created: 2017-05-01T03:05:25.000Z (about 9 years ago)
- Default Branch: master
- Last Pushed: 2020-08-05T16:44:19.000Z (almost 6 years ago)
- Last Synced: 2025-11-29T01:25:13.661Z (7 months ago)
- Topics: async, crawler, spider
- Language: Python
- Homepage: http://xpaw.readthedocs.io
- Size: 772 KB
- Stars: 6
- Watchers: 0
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.rst
- License: LICENSE
README
====
xpaw
====
.. image:: https://travis-ci.org/jadbin/xpaw.svg?branch=master
    :target: https://travis-ci.org/jadbin/xpaw

.. image:: https://coveralls.io/repos/jadbin/xpaw/badge.svg?branch=master
    :target: https://coveralls.io/github/jadbin/xpaw?branch=master

.. image:: https://img.shields.io/badge/license-Apache 2-blue.svg
    :target: https://github.com/jadbin/xpaw/blob/master/LICENSE
Key Features
============
- A web scraping framework used to crawl web pages
- Data extraction tools used to extract structured data from web pages
Spider Example
==============
The following is an example spider class that crawls the top headlines from Baidu News:
.. code-block:: python

    from xpaw import Spider, HttpRequest, Selector, run_spider


    class BaiduNewsSpider(Spider):
        def start_requests(self):
            yield HttpRequest("http://news.baidu.com/", callback=self.parse)

        def parse(self, response):
            selector = Selector(response.text)
            hot = selector.css("div.hotnews a").text
            self.log("Hot News:")
            for i in range(len(hot)):
                self.log("%s: %s", i + 1, hot[i])


    if __name__ == '__main__':
        run_spider(BaiduNewsSpider)
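The spider class defines the following methods:

- ``start_requests``: returns the spider's initial requests.
- ``parse``: processes the page fetched for a request. Here it uses ``Selector`` with CSS selector syntax to extract the data we need.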
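
To make clearer what the ``div.hotnews a`` selector in ``parse`` picks out, here is a minimal sketch that approximates the same extraction using only the Python standard library. It does not use xpaw's ``Selector`` and is not how xpaw works internally. The names ``HotNewsLinkParser``, ``extract_hot_news`` and ``SAMPLE_HTML`` are hypothetical, and the sample markup is invented for illustration; the real Baidu News page may be structured differently.

.. code-block:: python

    from html.parser import HTMLParser


    class HotNewsLinkParser(HTMLParser):
        """Collect the text of every <a> nested inside <div class="hotnews">.

        Hypothetical helper: a rough stand-in for the CSS selector
        "div.hotnews a", not part of xpaw.
        """

        def __init__(self):
            super().__init__()
            self.div_stack = []   # one bool per open <div>: is it a hotnews div?
            self.in_link = False  # True while inside a matching <a>
            self.buffer = []      # text fragments of the current link
            self.titles = []      # collected link texts

        def handle_starttag(self, tag, attrs):
            if tag == "div":
                classes = (dict(attrs).get("class") or "").split()
                self.div_stack.append("hotnews" in classes)
            elif tag == "a" and any(self.div_stack):
                # The link has a hotnews div somewhere among its ancestors.
                self.in_link = True
                self.buffer = []

        def handle_endtag(self, tag):
            if tag == "div" and self.div_stack:
                self.div_stack.pop()
            elif tag == "a" and self.in_link:
                self.titles.append("".join(self.buffer).strip())
                self.in_link = False

        def handle_data(self, data):
            if self.in_link:
                self.buffer.append(data)


    def extract_hot_news(html):
        """Return the link texts found under div.hotnews, in document order."""
        parser = HotNewsLinkParser()
        parser.feed(html)
        parser.close()
        return parser.titles


    # Invented sample markup (an assumption, not the real page).
    SAMPLE_HTML = """
    <div class="hotnews">
      <ul>
        <li><a href="/a">First headline</a></li>
        <li><a href="/b">Second <b>headline</b></a></li>
      </ul>
    </div>
    <div class="other"><a href="/c">Not hot news</a></div>
    """

    if __name__ == "__main__":
        for rank, title in enumerate(extract_hot_news(SAMPLE_HTML), start=1):
            print("%s: %s" % (rank, title))

Only links that have a ``div`` with class ``hotnews`` among their ancestors are collected, which mirrors the descendant combinator in the CSS selector; the link under ``div.other`` is skipped.
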
Documentation
=============
http://xpaw.readthedocs.io/