Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/gangly/webcrawler
A large-scale targeted web crawler built on Scrapy
Last synced: 4 days ago
- Host: GitHub
- URL: https://github.com/gangly/webcrawler
- Owner: gangly
- Created: 2016-02-26T02:29:02.000Z (over 8 years ago)
- Default Branch: master
- Last Pushed: 2016-02-26T02:36:27.000Z (over 8 years ago)
- Last Synced: 2024-03-19T14:43:57.172Z (8 months ago)
- Language: Python
- Size: 22.5 KB
- Stars: 7
- Watchers: 3
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# webcrawler
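A large-scale targeted crawler built on Scrapy. Its main feature is the use of proxy IPs to get around websites' anti-crawling measures.
ipport_spider crawls publicly available proxy IPs and stores them in Redis.
soufang_spider performs targeted crawling of listing information and images from SouFang (搜房网), taking its proxy IPs from Redis.
The crawled data is stored in MySQL and MongoDB.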
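
The proxy flow described above (one spider fills a pool in Redis, another draws from it) could look roughly like the sketch below. This is not the repository's code: the class names, the `proxies` key, and the choice of a Redis set are assumptions made for illustration. To keep the example runnable without a Redis server, `InMemoryProxyStore` stands in for a client and mimics only the `sadd` and `srandmember` calls that redis-py provides.

```python
import random


class InMemoryProxyStore:
    """Hypothetical stand-in for a Redis client; mimics sadd/srandmember only."""

    def __init__(self):
        self._sets = {}

    def sadd(self, key, *values):
        # Like Redis SADD: return how many members were newly added.
        members = self._sets.setdefault(key, set())
        before = len(members)
        members.update(values)
        return len(members) - before

    def srandmember(self, key):
        # Like Redis SRANDMEMBER: a random member, or None if the set is empty.
        members = self._sets.get(key)
        if not members:
            return None
        return random.choice(sorted(members))


class RedisProxyMiddleware:
    """Scrapy-style downloader middleware: picks a random proxy per request."""

    PROXY_KEY = "proxies"  # assumed key name, not taken from the repository

    def __init__(self, store):
        self.store = store

    def process_request(self, request, spider):
        proxy = self.store.srandmember(self.PROXY_KEY)
        if proxy is None:
            # Empty pool: leave the request untouched so it goes out directly.
            return None
        if isinstance(proxy, bytes):
            # A real Redis client returns bytes unless decode_responses=True.
            proxy = proxy.decode("utf-8")
        # Scrapy routes the request through the address in meta["proxy"].
        request.meta["proxy"] = "http://" + proxy
        return None
```

In a real project the store would be a `redis.Redis(...)` client, the proxy-collecting spider would call `sadd` with each `ip:port` string it finds, and the middleware would be enabled through Scrapy's `DOWNLOADER_MIDDLEWARES` setting. A set is used here so that duplicate proxies collapse automatically; a list or sorted set would work as well if ordering or scoring mattered.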