https://github.com/ShichaoMa/structure_spider

Combine multiple requests to crawl structured data, built on Scrapy components.

# Structured Spider
Crawl structured data by building a tree of Item requests.

![](https://github.com/ShichaoMa/structure_spider/blob/master/resources/item-collector.jpg)
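To illustrate the request-tree idea, here is a minimal, self-contained sketch in plain Python (no network; the parse functions, field names, and fake responses are all hypothetical, not structure_spider's actual API). Each node of the tree fills part of a single item, and the item is complete only after every node has been visited:

```python
# Sketch of the ItemCollector idea: a request tree where each node's
# parser merges its data into one shared item. Parsers return their
# child nodes as (response_key, parser) pairs.

def parse_detail(response, item):
    item["title"] = response["title"]
    # This node has two children: comments and author info.
    return [("comments", parse_comments), ("author", parse_author)]

def parse_comments(response, item):
    item["comments"] = response["comments"]
    return []  # leaf node

def parse_author(response, item):
    item["author"] = response["author"]
    return []  # leaf node

def collect(start_key, start_parser, responses):
    """Walk the request tree, merging every parser's output into
    a single structured item, and return it once the tree is done."""
    item = {}
    pending = [(start_key, start_parser)]
    while pending:
        key, parser = pending.pop()
        pending.extend(parser(responses[key], item))
    return item

# Fake responses standing in for the pages a real crawl would fetch.
responses = {
    "detail": {"title": "Example"},
    "comments": {"comments": ["good", "nice"]},
    "author": {"author": "alice"},
}
print(collect("detail", parse_detail, responses))
# → {'title': 'Example', 'author': 'alice', 'comments': ['good', 'nice']}
```

In a real crawl each child would be a separate HTTP request; the point is that no partial item is ever emitted, only the fully assembled one.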
# USAGE
### Install structure_spider
```
dev@ubuntu:~$ pip install structure-spider
```
### Create a project
```
dev@ubuntu:~$ structure-spider create project -n myapp
New structure-spider project 'myapp', using template directory '/home/dev/.pyenv/versions/3.6.0/lib/python3.6/site-packages/structor/templates/project', created in:
/home/dev/myapp

You can start the spider with:
cd myapp
custom-redis-server -ll INFO -lf
scrapy crawl douban
```
### Start the lightweight Redis. To use a regular Redis server instead, simply comment out `CUSTOM_REDIS=True` in settings.py
```
dev@ubuntu:~$ custom-redis-server -ll INFO -lf
```
### Generate a custom spider and item
Use `create spider` to generate a ready-to-use spider. `-n` specifies the spider name; the remaining arguments define the fields to extract and their extraction rules, joined with `=`. A rule can be a regular expression, an XPath expression, or a CSS selector.

For more complex rules or data cleaning, see the wiki.
```
dev@ubuntu:~$ cd myapp/myapp/
dev@ubuntu:~/myapp/myapp$ ls
items settings.py spiders
dev@ubuntu:~/myapp/myapp$ structure-spider create spider -n zhaopin "product_id=/(\d+)\\.htm" "job=//h1/text()" "salary=//a/../../strong/text()" 'city=//ul[@class="terminal-ul clearfix"]//strong/a/text()' 'education=//span[contains(text(), "学历")]/following-sibling::strong/text()' "company=h2 > a" -ip '//td[@class="zwmc"]/div/a[1]/@href' -pp '//li[@class="pagesDown-pos"]/a/@href'
ZhaopinSpider and ZhaopinItem have been created.
dev@ubuntu:~/myapp/myapp$
```
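To make the `field=rule` convention concrete, here is a hypothetical sketch (not the tool's actual parsing code) of how such arguments could be split into a field name and a rule, and the rule type guessed:

```python
# Hypothetical sketch: split "field=rule" arguments like those above and
# guess whether the rule is an XPath, a regular expression, or a CSS
# selector. structure_spider's real parsing logic may differ.
def parse_field_arg(arg):
    field, _, rule = arg.partition("=")  # split on the first '=' only
    if rule.startswith("//"):
        kind = "xpath"
    elif any(tok in rule for tok in (r"\d", r"\.", r"\w", "(?")):
        kind = "regex"
    else:
        kind = "css"
    return field, rule, kind

print(parse_field_arg(r"product_id=/(\d+)\.htm"))  # ('product_id', '/(\\d+)\\.htm', 'regex')
print(parse_field_arg("job=//h1/text()"))          # ('job', '//h1/text()', 'xpath')
print(parse_field_arg("company=h2 > a"))           # ('company', 'h2 > a', 'css')
```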

Reference: [Crawling structured data with combined requests using structure_spider](https://zhuanlan.zhihu.com/p/28636195)
### Start the spider
```
dev@ubuntu:~/myapp/myapp$ scrapy crawl zhaopin
```
### Feed tasks
```
dev@ubuntu:~/myapp$ structure-spider feed -s zhaopin -u "https://sou.zhaopin.com/jobs/searchresult.ashx?jl=%E6%B5%8E%E5%8D%97&kw=%E9%94%80%E5%94%AE&sm=0&p=1" -c zhaopin --custom # --custom means the lightweight Redis is in use
```
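The query string in the feed URL above is percent-encoded. As a quick aside, a few lines of plain Python can decode it to see what search the task actually represents:

```python
from urllib.parse import urlsplit, parse_qs

# The feed URL from the command above.
url = ("https://sou.zhaopin.com/jobs/searchresult.ashx"
       "?jl=%E6%B5%8E%E5%8D%97&kw=%E9%94%80%E5%94%AE&sm=0&p=1")

# Decode the percent-encoded query parameters.
print(parse_qs(urlsplit(url).query))
# {'jl': ['济南'], 'kw': ['销售'], 'sm': ['0'], 'p': ['1']}
```

So the seed task searches zhaopin.com for sales (销售) jobs in Jinan (济南), page 1.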
### Check task status
```
dev@ubuntu:~/myapp$ structure-spider check zhaopin --custom
```
More resources:

[[structure_spider weekly exercise]: one-click download of Baidu mp3](https://zhuanlan.zhihu.com/p/29076630)

[Generate custom spiders with one click: crawl wherever you point!](https://zhuanlan.zhihu.com/p/33561576)

[Advanced Scrapy: a detailed look at ItemCollector, the tool for combining multiple requests into one Item!](https://zhuanlan.zhihu.com/p/33699058)