https://github.com/wondervictor/spiderman
2017 Software Course Project
crawler distribute-crawler zhihu-crawler
- Host: GitHub
- URL: https://github.com/wondervictor/spiderman
- Owner: wondervictor
- License: mit
- Created: 2017-10-12T12:03:29.000Z (over 7 years ago)
- Default Branch: master
- Last Pushed: 2018-01-09T11:58:50.000Z (about 7 years ago)
- Last Synced: 2024-12-08T15:34:05.909Z (2 months ago)
- Topics: crawler, distribute-crawler, zhihu-crawler
- Language: Python
- Homepage:
- Size: 62.5 MB
- Stars: 1
- Watchers: 2
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# SpiderMan
Spiderman is the god

### Introduction
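**2017~2018 Software Course Design Project: Zhihu Crawler**
* A simple distributed crawler (implemented with multiprocessing and queue)
* Selenium fetches the pages, which handles dynamically loaded content
* A PyTorch sentiment analysis model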
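
As a rough illustration of the "multiprocessing and queue" idea, the sketch below has a master hand URLs to worker processes through one queue and collect results through another. It is not the project's actual code: `fetch`, `worker`, and `run_master` are hypothetical names, the fetch step is a placeholder instead of a Selenium call, and everything runs on a single machine.

```python
import multiprocessing as mp


def fetch(url):
    """Placeholder for the real page fetch (hypothetical); returns fake HTML."""
    return f"<html>{url}</html>"


def worker(task_queue, result_queue):
    """Pull URLs from the task queue until a None sentinel arrives."""
    while True:
        url = task_queue.get()
        if url is None:  # sentinel: no more work for this worker
            break
        html = fetch(url)
        result_queue.put((url, len(html)))


def run_master(urls, num_workers=2, start_method=None):
    """Distribute urls to worker processes and return {url: page_length}."""
    ctx = mp.get_context(start_method)  # None selects the platform default
    task_queue = ctx.Queue()
    result_queue = ctx.Queue()

    workers = [
        ctx.Process(target=worker, args=(task_queue, result_queue))
        for _ in range(num_workers)
    ]
    for proc in workers:
        proc.start()

    for url in urls:
        task_queue.put(url)
    for _ in workers:
        task_queue.put(None)  # one sentinel per worker

    # Drain the results before joining so no worker blocks on a full pipe.
    results = dict(result_queue.get() for _ in urls)
    for proc in workers:
        proc.join()
    return results
```

Calling `run_master(["https://www.zhihu.com/question/1"])` returns a dict keyed by URL. On platforms that start processes with `spawn`, the call should sit under an `if __name__ == "__main__":` guard.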
### Architecture
#### **Distributed Version**
![](./arch/overall_arch.png)
##### Master
* URL Pool
* URL Filter (Based on BloomFilter and Regex, to remove duplicates or illegal urls)

##### Worker
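
For reference, the Master's URL Filter (Bloom filter plus regex) could look roughly like the following. This is a minimal sketch under assumptions, not the repository's implementation: the `BloomFilter` and `URLFilter` classes, the bit-array size, the hash count, and the `ALLOWED_URL` pattern are all illustrative choices.

```python
import hashlib
import re


class BloomFilter:
    """Small Bloom filter backed by a bytearray (illustrative sizes)."""

    def __init__(self, size_bits=1 << 20, num_hashes=5):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, item):
        # Double hashing: derive k bit positions from two halves of one digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.size_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


# Assumed pattern for URLs worth crawling; the real rules are not in the README.
ALLOWED_URL = re.compile(r"^https://www\.zhihu\.com/(question|people|topic)/[\w-]+")


class URLFilter:
    """Reject URLs that fail the regex or were (probably) seen before."""

    def __init__(self):
        self.seen = BloomFilter()

    def accept(self, url):
        if not ALLOWED_URL.match(url):
            return False  # illegal URL
        if url in self.seen:
            return False  # duplicate (or a rare false positive)
        self.seen.add(url)
        return True
```

A Bloom filter never reports a seen URL as new, but it can occasionally reject a new URL as a duplicate; that trade keeps memory use fixed regardless of how many URLs pass through.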
![](./arch/worker.png)
* Request with URLs from Master Node (Based on selenium and phantomjs webdriver)
* Parse the html content (questions, answers, topics, people)
* Save the parsed content to local storage.

#### Thread Manager
![](./arch/Thread.png)
* A thread pool built on top of Queue and threading.
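
A thread pool wrapped around `queue.Queue` and `threading` might be sketched as below. The `ThreadPool` class and its `submit`, `wait`, and `shutdown` methods are hypothetical names chosen for illustration; the project's own interface may differ.

```python
import threading
from queue import Queue


class ThreadPool:
    """Fixed-size pool: worker threads pull (func, args) tasks from a Queue."""

    def __init__(self, num_threads=4):
        self.tasks = Queue()
        self.threads = []
        for _ in range(num_threads):
            thread = threading.Thread(target=self._run, daemon=True)
            thread.start()
            self.threads.append(thread)

    def _run(self):
        while True:
            item = self.tasks.get()
            try:
                if item is None:  # sentinel: stop this thread
                    return
                func, args = item
                func(*args)
            except Exception as exc:
                # Keep the thread alive if a single task fails.
                print(f"task failed: {exc!r}")
            finally:
                self.tasks.task_done()

    def submit(self, func, *args):
        """Queue func(*args) for execution by any idle thread."""
        self.tasks.put((func, args))

    def wait(self):
        """Block until every queued task has been processed."""
        self.tasks.join()

    def shutdown(self):
        """Stop all threads after the tasks already queued have run."""
        for _ in self.threads:
            self.tasks.put(None)
        for thread in self.threads:
            thread.join()
```

Typical use is `pool = ThreadPool(4)`, several `pool.submit(parse, html)` calls, then `pool.wait()` and `pool.shutdown()`; here `parse` and `html` stand in for whatever work the caller supplies.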
---
### Running Process
![](./arch/process.png)
----
### Usage
````python
# run distributed version
# start master
python master.py
# start worker
python main.py

# run single version
python master.py
````
### Licence
This project is released under the **MIT** licence.