https://github.com/wondervictor/spiderman
2017 Software Course Project
crawler distribute-crawler zhihu-crawler
- Host: GitHub
- URL: https://github.com/wondervictor/spiderman
- Owner: wondervictor
- License: mit
- Created: 2017-10-12T12:03:29.000Z (over 8 years ago)
- Default Branch: master
- Last Pushed: 2018-01-09T11:58:50.000Z (over 8 years ago)
- Last Synced: 2025-01-17T08:35:58.779Z (over 1 year ago)
- Topics: crawler, distribute-crawler, zhihu-crawler
- Language: Python
- Homepage:
- Size: 62.5 MB
- Stars: 1
- Watchers: 2
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# SpiderMan
Spiderman is the god
### Introduction
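**2017~2018 Software Course Project: Zhihu Crawler**
* A simple distributed crawler (implemented with multiprocessing and queue)
* Selenium fetches the pages, which handles dynamically loaded content
* A PyTorch sentiment analysis model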
### Architecture
#### **Distributed Version**

##### Master
* URL Pool
* URL Filter (based on a Bloom filter and regular expressions; removes duplicate or invalid URLs)
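
The URL filter described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the `BloomFilter` and `UrlFilter` classes, the bit-array size, the number of hashes, and the `ALLOWED_URL` pattern are all assumptions made for the example.

```python
import hashlib
import re


class BloomFilter:
    """Fixed-size Bloom filter using double hashing over a SHA-256 digest."""

    def __init__(self, num_bits=1 << 20, num_hashes=5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        # Odd step, so the probes are distinct when num_bits is a power of two.
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


# Hypothetical whitelist pattern; the project's real regex is not shown here.
ALLOWED_URL = re.compile(r"^https://www\.zhihu\.com/(question|people|topic)/[\w-]+")


class UrlFilter:
    def __init__(self):
        self.seen = BloomFilter()

    def accept(self, url):
        """Return True only for well-formed URLs that have not been seen before."""
        if not ALLOWED_URL.match(url):
            return False
        if url in self.seen:
            return False
        self.seen.add(url)
        return True
```

A Bloom filter can report a URL as already seen when it is not (a false positive), but never the reverse, so the crawler may occasionally skip a page and will not revisit one.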
##### Worker

* Request the pages for URLs received from the Master node (based on Selenium with the PhantomJS WebDriver)
* Parse the HTML content (questions, answers, topics, people)
* Save the parsed content to local storage.
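
The three worker steps above might look like the sketch below. It makes several assumptions: `fetch` is a caller-supplied callable (in this project it would wrap a Selenium-driven browser), `parse_page` is a toy stand-in for the real question/answer/topic/people parsing, and the names `worker_loop`, `start_workers`, `STOP`, and the JSON file layout are hypothetical.

```python
import json
import multiprocessing as mp
import re
from pathlib import Path

STOP = None  # sentinel the master puts on the queue to shut a worker down

LINK_RE = re.compile(r'href="(https?://[^"]+)"')


def parse_page(url, html):
    """Toy parser: keep the <title> and outgoing links (stand-in for real parsing)."""
    title = re.search(r"<title>(.*?)</title>", html, re.S)
    return {
        "url": url,
        "title": title.group(1).strip() if title else "",
        "links": LINK_RE.findall(html),
    }


def worker_loop(task_queue, link_queue, fetch, out_dir):
    """Pull URLs until STOP arrives; fetch, parse, save, and report new links."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    saved = 0
    while True:
        url = task_queue.get()
        if url is STOP:
            break
        record = parse_page(url, fetch(url))
        with open(out_dir / f"page_{saved:06d}.json", "w", encoding="utf-8") as fh:
            json.dump(record, fh, ensure_ascii=False)
        saved += 1
        for link in record["links"]:
            link_queue.put(link)  # the master filters these before re-queueing
    return saved


def start_workers(count, task_queue, link_queue, fetch, out_root):
    """Start `count` worker processes, each writing to its own subdirectory."""
    procs = []
    for i in range(count):
        p = mp.Process(
            target=worker_loop,
            args=(task_queue, link_queue, fetch, Path(out_root) / f"worker_{i}"),
        )
        p.start()
        procs.append(p)
    return procs
```

With `multiprocessing.Queue` objects for `task_queue` and `link_queue`, the master would put one `STOP` per worker on the task queue to end the crawl. `fetch` should be a module-level function so it can be passed to a child process.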
#### Thread Manager

* A thread pool built by wrapping Queue and threading.
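
A thread pool of this kind can be sketched in a few lines. The `ThreadPool` class below is a hypothetical illustration of the Queue-plus-threading pattern, not the project's own thread manager; the method names and the `(args, value, error)` result format are assumptions.

```python
import queue
import threading


class ThreadPool:
    """Fixed-size pool: worker threads pull (func, args) jobs from a shared queue."""

    _STOP = object()  # sentinel telling a worker thread to exit

    def __init__(self, num_threads=4):
        self.tasks = queue.Queue()
        self.results = queue.Queue()
        self.threads = [
            threading.Thread(target=self._run, daemon=True)
            for _ in range(num_threads)
        ]
        for t in self.threads:
            t.start()

    def _run(self):
        while True:
            job = self.tasks.get()
            try:
                if job is self._STOP:
                    return
                func, args = job
                try:
                    self.results.put((args, func(*args), None))
                except Exception as exc:
                    # Record the failure instead of killing the worker thread.
                    self.results.put((args, None, exc))
            finally:
                self.tasks.task_done()

    def submit(self, func, *args):
        self.tasks.put((func, args))

    def wait(self):
        """Block until every submitted job has finished."""
        self.tasks.join()

    def shutdown(self):
        """Send one stop sentinel per thread, then wait for all of them to exit."""
        for _ in self.threads:
            self.tasks.put(self._STOP)
        for t in self.threads:
            t.join()
```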
---
### Running Process

----
### Usage
````bash
# run distributed version
# start master
python master.py
# start worker
python main.py

# run single version
python master.py
````
### Licence
This project is released under the **MIT** licence.