Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/SmileXie/zhihu_crawler
Crawler of zhihu.com
Last synced: 2 months ago
- Host: GitHub
- URL: https://github.com/SmileXie/zhihu_crawler
- Owner: SmileXie
- License: mit
- Created: 2015-02-17T14:07:56.000Z (almost 10 years ago)
- Default Branch: master
- Last Pushed: 2017-04-20T07:22:46.000Z (over 7 years ago)
- Last Synced: 2024-07-31T21:53:16.499Z (5 months ago)
- Language: Python
- Homepage:
- Size: 75.2 KB
- Stars: 268
- Watchers: 40
- Forks: 139
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- awesome - zhihu_crawler - Crawler of zhihu.com (Crawler)
README
小趴趴: Zhihu Edition
================================
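A crawler that collects and analyzes top answers (精华回答) on Zhihu.

* 20160502: Zhihu recently added a CAPTCHA to its login flow, so the current code can no longer log in to Zhihu automatically. You can modify the code to log in with a saved cookie and then start crawling.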
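The cookie-based workaround mentioned in the note could look roughly like the sketch below. It is not code from this repository: it assumes Python 3, a hypothetical `cookies.json` file containing cookie name/value pairs copied from a logged-in browser session, and the helper names `load_cookie_header` and `build_request` are made up for illustration. Which cookies Zhihu actually requires is not stated in this README.

```python
import json
import urllib.request
from pathlib import Path


def load_cookie_header(path):
    """Read a JSON object of cookie name/value pairs and format it as a Cookie header.

    The file layout ({"name": "value", ...}) is an assumption of this sketch.
    """
    cookies = json.loads(Path(path).read_text(encoding="utf-8"))
    return "; ".join(f"{name}={value}" for name, value in cookies.items())


def build_request(url, cookie_header):
    """Build a request that carries the saved cookies instead of logging in."""
    return urllib.request.Request(
        url,
        headers={
            "Cookie": cookie_header,
            # A browser-like User-Agent; the exact value is an arbitrary choice.
            "User-Agent": "Mozilla/5.0",
        },
    )


if __name__ == "__main__":
    # "cookies.json" is a hypothetical file name; create it from your own session.
    header = load_cookie_header("cookies.json")
    request = build_request(
        "https://www.zhihu.com/topic/19776749/top-answers?page=1", header
    )
    with urllib.request.urlopen(request) as response:
        print(response.status)
```

Keeping the cookies in a separate file means the login step can be skipped entirely, at the cost of refreshing the file by hand whenever the session expires.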
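## Algorithm Overview
* Scope: the top answers under each Zhihu topic.
* Crawling algorithm:
  * Starting from the [topic tree of the root topic](https://www.zhihu.com/topic/19776749/organize/entire), traverse the subtopics breadth-first to a depth of 3.
    ![Topic tree](https://raw.githubusercontent.com/SmileXie/zhihu_crawler/master/images/topic_tree.png)
  * For each topic, walk through its top answers page by page, for example from https://www.zhihu.com/topic/19776749/top-answers?page=1 to https://www.zhihu.com/topic/19776749/top-answers?page=50, parsing each top answer along the way.
* Parse the attributes of each top answer, including:
  * the answer's upvote count and length;
  * the answering user's id, total upvotes received, location, gender, education level, school, major, and similar profile information.

## Statistics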
* For the statistical results, see: [http://www.jianshu.com/p/6d53b34165d2](http://www.jianshu.com/p/6d53b34165d2)
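As an illustration of the crawling algorithm described earlier (breadth-first traversal of the topic tree to depth 3, then page-by-page traversal of each topic's top answers), here is a minimal Python 3 sketch. It is not the repository's implementation. The names `bfs_topics`, `top_answer_urls`, and `get_children` are hypothetical, the root topic is counted as depth 0 (the README does not say how depth is counted), and the page limit of 50 is taken from the README's example URL rather than from any documented bound. Fetching and parsing the actual pages is left out.

```python
from collections import deque

ROOT_TOPIC_ID = "19776749"  # root topic id as it appears in the README's URLs
MAX_DEPTH = 3               # assumption: the root counts as depth 0
LAST_PAGE = 50              # last page in the README's example, not a documented limit


def bfs_topics(root, get_children, max_depth=MAX_DEPTH):
    """Yield (topic_id, depth) pairs in breadth-first order, down to max_depth.

    get_children is a caller-supplied function returning the child topic ids
    of a topic; in a real crawler it would fetch and parse the topic tree page.
    """
    seen = {root}
    queue = deque([(root, 0)])
    while queue:
        topic, depth = queue.popleft()
        yield topic, depth
        if depth >= max_depth:
            continue  # do not expand topics at the depth limit
        for child in get_children(topic):
            if child not in seen:  # a topic can sit under several parents
                seen.add(child)
                queue.append((child, depth + 1))


def top_answer_urls(topic_id, last_page=LAST_PAGE):
    """Yield the top-answers page URLs of one topic, from page 1 to last_page."""
    for page in range(1, last_page + 1):
        yield f"https://www.zhihu.com/topic/{topic_id}/top-answers?page={page}"


if __name__ == "__main__":
    # A toy in-memory tree standing in for the real topic hierarchy.
    toy_tree = {ROOT_TOPIC_ID: ["a", "b"], "a": ["c"], "c": ["d"], "d": ["e"]}
    for topic, depth in bfs_topics(ROOT_TOPIC_ID, lambda t: toy_tree.get(t, [])):
        first_url = next(top_answer_urls(topic))
        print(depth, topic, first_url)
```

The `seen` set matters because the same topic may appear under more than one parent, and without it the crawler would fetch that topic's answer pages repeatedly.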