{"id":13694870,"url":"https://github.com/wycm/zhihu-crawler","last_synced_at":"2025-05-03T04:31:04.717Z","repository":{"id":111790183,"uuid":"57126883","full_name":"wycm/zhihu-crawler","owner":"wycm","description":"zhihu-crawler is a high-performance, Java-based distributed crawler project with support for a free HTTP proxy pool and horizontal scaling","archived":false,"fork":false,"pushed_at":"2019-04-02T08:35:00.000Z","size":5871,"stargazers_count":913,"open_issues_count":3,"forks_count":373,"subscribers_count":60,"default_branch":"3.0","last_synced_at":"2024-11-12T21:39:27.815Z","etag":null,"topics":["crawler","java","spider","zhihu"],"latest_commit_sha":null,"homepage":"","language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/wycm.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"License","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2016-04-26T12:36:34.000Z","updated_at":"2024-11-11T14:56:43.000Z","dependencies_parsed_at":null,"dependency_job_id":"ed67efc9-6b0d-4498-abf0-68410a872e17","html_url":"https://github.com/wycm/zhihu-crawler","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/wycm%2Fzhihu-crawler","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/wycm%2Fzhihu-crawler/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/wycm%2Fzhihu-crawler/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/wycm%2Fzhihu-crawler/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/wycm","download_url":"https://codeload.github.com/wycm/zhihu-crawler/tar.gz/refs/heads/3.0","host":{
"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":252144545,"owners_count":21701428,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["crawler","java","spider","zhihu"],"created_at":"2024-08-02T17:01:46.122Z","updated_at":"2025-05-03T04:31:02.257Z","avatar_url":"https://github.com/wycm.png","language":"Java","funding_links":[],"categories":["Java"],"sub_categories":[],"readme":"Zhihu Crawler\n====\nzhihu-crawler is a high-performance, Java-based distributed crawler that supports a pool of free HTTP proxies and horizontal scaling. Its main purpose is to crawl Zhihu data such as users, topics, questions, answers, and articles. If you find it useful, please give it a star.\n## Crawl results\n* The chart below shows simple statistics from crawling data on 1.17 million Zhihu users.\u003cbr\u003e\n![](https://github.com/wycm/zhihu-crawler/blob/2.0/src/main/resources/img/zhihu-charts.png)\n* For detailed statistics, see https://www.vwycm.cn/zhihu/charts\n\n## Requirements\n1. JDK 1.8\n2. Redis\n3. MongoDB\n\n## Quick start\n1. Edit the Redis and MongoDB settings in ```zhihu/src/main/resources/application.yaml```: [application.yaml](https://github.com/wycm/zhihu-crawler/blob/3.0/zhihu/src/main/resources/application.yaml)\n2. Run the MongoDB initialization script ```zhihu/src/main/resources/mongo-init.sql```: [mongo-init.sql](https://github.com/wycm/zhihu-crawler/blob/3.0/zhihu/src/main/resources/mongo-init.sql)\n3. Set the log path, which defaults to `/var/www/logs`: [logback-spring.xml](https://github.com/wycm/zhihu-crawler/blob/3.0/zhihu/src/main/resources/logback-spring.xml)\n4. 
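Start the crawler by running [ZhihuCrawlerApplication.java](https://github.com/wycm/zhihu-crawler/blob/3.0/zhihu/src/main/java/com/github/wycm/zhihu/ZhihuCrawlerApplication.java)\n\n## API used\n* URL: ```https://www.zhihu.com/api/v4/members/${userid}/followees```\n* Request method: GET\n* **Request parameters**\n\n| Parameter | Type | Required | Value | Description |\n| :------------ | :------------ | :------------ | :----- | :------------ |\n| include | String | Yes | ```data[*]answer_count,articles_count``` | Fields to return (more fields can be added as needed; see the example URL below) |\n| offset  | int    | Yes | 0 | Offset (adjust this value to page through the profiles of ```all followed users``` of a user) |\n| limit   | int    | Yes | 20 | Number of users returned (maximum 20; larger values have no effect) |\n\n* Example URL: ```https://www.zhihu.com/api/v4/members/wo-yan-chen-mo/followees?include=data[*].educations,employments,answer_count,business,locations,articles_count,follower_count,gender,following_count,question_count,voteup_count,thanked_count,is_followed,is_following,badge[?(type=best_answerer)].topics\u0026offset=0\u0026limit=20```\n* Response: JSON data containing the profiles of the followed users\n\n## Features\n* Makes heavy use of HTTP proxies to get around the per-client request limit. (Note: all proxies used are free, publicly listed ones. Recent testing shows that some free-proxy sites have added anti-crawling measures, so far fewer free proxies are usable than before and crawling is much slower than it used to be.)\n* Supports persistence (MongoDB).\n* Multithreaded and high-performance, with support for horizontally scaled distributed crawling.\n\n## TODO\n* Add crawling of questions, answers, and articles\n* Support real-time crawling, refreshing all trending content across Zhihu every hour\n\n## Updates\n\n#### 2019.02.21\n* Refactored the project on Spring Boot, with support for horizontal scaling and distributed crawling\n* Data persistence now uses MongoDB\n* Replaced HttpClient 4.5 with the Netty-based AsyncHttpClient\n\n#### 2018.07.09\n* Zhihu updated its site; authorization is no longer required\n* Improved unit tests\n* Fixed known bugs\n\n#### 2017.11.05\n* Zhihu updated its authorization file; changed how the authorization value is obtained.\n\n#### 2017.05.26\n* Fixed a java.lang.reflect.UndeclaredThrowableException caused by proxies returning bad data.\n\n#### 2017.03.30\n* The Zhihu API changed: the following-list page no longer returns the following count, so thread-pool tasks could not keep going. Switched the crawl mode back to the original ListPageThreadPool and DetailPageThreadPool approach.\n\n#### 2017.01.17\n* Added proxy serialization.\n* Restructured the project, greatly increasing crawl speed. The ListPageThreadPool and DetailPageThreadPool approach is no longer used; the following-list page is downloaded directly, and it already contains detailed user profiles.\n\n#### 2017.01.10\n* Dropped logged-in crawling and removed the related modules. The main simulated-login logic is in [ModelLogin.java](https://github.com/wycm/zhihu-crawler/blob/2.0/src/main/java/com/crawl/zhihu/ModelLogin.java).\n* 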
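Optimized the project structure to speed up crawling, using two thread pools: ListPageThreadPool and DetailPageThreadPool. ListPageThreadPool downloads the \"followed users\" list pages, parses out the followed users, deduplicates their URLs, and submits them to DetailPageThreadPool.\nDetailPageThreadPool downloads user detail pages, parses out each user's basic information and stores it in the database, then obtains the URL of that user's \"followed users\" list page and submits it to ListPageThreadPool.\n\n#### 2016.12.26\n* Removed unused packages; fixed ConcurrentModificationException and NoSuchElementException errors.\n* Added guest-mode (no login) crawling.\n* Added a proxy crawling module.\n\n## Disclaimer\n* This project is for personal study and exchange only. Commercial use and any improper use are strictly prohibited.\n\n## Finally\n* If you run into problems, please open an issue.\n* Code contributions are welcome.\n* Crawler discussion group: 633925314. Feel free to join.\n* If you need the data, follow the WeChat official account (basic profile data for 1.17 million Zhihu users; this data is for personal study and exchange only, and commercial or improper use is strictly prohibited): lwndso\u003cbr\u003e\n![A programmer's everyday sharing, including but not limited to crawlers and Java backend technology; feel free to follow](https://raw.githubusercontent.com/wycm/md-image/master/2019-02-28/9.png)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fwycm%2Fzhihu-crawler","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fwycm%2Fzhihu-crawler","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fwycm%2Fzhihu-crawler/lists"}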