{"id":13941609,"url":"https://github.com/f111fei/article_spider","last_synced_at":"2025-06-28T11:06:25.567Z","repository":{"id":49385964,"uuid":"116339464","full_name":"f111fei/article_spider","owner":"f111fei","description":"微信公众号爬虫","archived":false,"fork":false,"pushed_at":"2018-03-08T06:26:22.000Z","size":511,"stargazers_count":327,"open_issues_count":6,"forks_count":72,"subscribers_count":7,"default_branch":"master","last_synced_at":"2025-05-20T12:07:09.237Z","etag":null,"topics":["javascript","spider","typescript","wechat"],"latest_commit_sha":null,"homepage":null,"language":"TypeScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/f111fei.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2018-01-05T04:01:24.000Z","updated_at":"2025-03-28T18:23:55.000Z","dependencies_parsed_at":"2022-08-25T16:20:41.602Z","dependency_job_id":null,"html_url":"https://github.com/f111fei/article_spider","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/f111fei/article_spider","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/f111fei%2Farticle_spider","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/f111fei%2Farticle_spider/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/f111fei%2Farticle_spider/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/f111fei%2Farticle_spider/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/f111fei","download_url":"https://codeload.github.com/f111fei/article_spider/tar.gz/refs/heads/master","sbom_url":"https
://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/f111fei%2Farticle_spider/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":262419807,"owners_count":23308100,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["javascript","spider","typescript","wechat"],"created_at":"2024-08-08T02:01:22.518Z","updated_at":"2025-06-28T11:06:25.540Z","avatar_url":"https://github.com/f111fei.png","language":"TypeScript","readme":"## WeChat Official Account Spider (微信公众号爬虫)\n\n- [x] Crawl all article data of an official account\n- [x] Automatic CAPTCHA recognition\n- [x] Offline database containing raw article data and article images\n- [x] WeChat article preview\n- [ ] Command-line invocation\n\n### Preview\n\nConsole output:\n\n![](/images/demo_1.png)\n\nArticle data:\n\n![](/images/demo_2.png)\n\nHTML preview:\n\n![](/images/demo_3.png)\n\n### How It Works\n\nTwo crawling modes are currently supported.\n\n#### 1. Sogou WeChat Search\n\nCrawls articles from the search results of [Sogou WeChat](http://weixin.sogou.com/).\n\nPros: no login required; simple to use.\n\nCons: only the 10 most recent articles can be fetched.\n\nUse case: scheduled crawl jobs that accumulate large amounts of data over time.\n\n#### 2. Ajax Requests\n\nIntercepts the Ajax request parameters of the account's article list and simulates the WeChat client to fetch the article list and article details.\n\nPros: can fetch every article of an official account.\n\nCons: requires logging in to WeChat and manually capturing the Cookie and other parameters with external tools before use.\n\nUse case: a one-time bulk crawl of an account, followed by Sogou-mode crawls to keep the data up to date.\n\n### Usage\n\n#### Prerequisites\n\nNode.js & npm, the Chrome browser, and the WeChat desktop client (Mac or Windows).\n\n#### Setup and Build\n\n    git clone git@github.com:f111fei/article_spider.git\n    cd article_spider\n    npm install typescript -g\n    npm install\n    tsc\n\n#### Configuration\n\nEdit the `config.json` file in the project root. The fields are defined as follows:\n\n```\ninterface Config {\n    // Required. The WeChat ID of the official account to crawl.\n    name: string;\n    // Optional. Ruokuai (若快) CAPTCHA service credentials, used for automatic CAPTCHA recognition in Sogou mode.\n    ruokuai: {\n        username: string;\n        password: string;\n    };\n    wechat: {\n        // Optional. Start page for crawling. Default: 0.\n        start?: number;\n        // Optional. Maximum number of articles to crawl. Default: unlimited.\n        maxNum?: number;\n        // Optional. Crawl mode (sougou, all). Default: all.\n        mode?: string;\n        // Effective in all mode: the account's biz parameter; see below for how to obtain it.\n        biz?: string;\n        // Effective in all mode: the current cookie value; see below for how to obtain it.\n        cookie?: string;\n        // Effective in all mode: the current appmsg_token value; see below for how to obtain it.\n        appmsg_token?: string;\n    };\n}\n```\n\n#### Capturing the Ajax Request Parameters\n\nSkip this section if the crawl mode is `sougou`.\n\nTo obtain the Ajax parameters, capture the request that loads the article list and extract the key fields biz, cookie, and appmsg_token. The steps below use the `NASA爱好者` account as an example.\n\n1. Open the official account --- top-right corner --- View History Messages.\n\n![](/images/1.png)\n\n> Note: the `name` field in the config should be the WeChat ID shown here, `nasawatch`, not the display name `NASA爱好者`.\n\n2. In the window that opens, choose \"Open with default browser (Chrome)\" from the menu bar to open the article list page in Chrome.\n\n![](/images/2.png)\n\n3. If the browser shows `请在微信客户端打开链接。` (\"Please open this link in the WeChat client.\"), the URL is encrypted; follow the steps below to obtain the correct URL. Otherwise skip this step.\n\nClose the WeChat client, locate the WeChat desktop executable, and start it from the command line.\n\nOn Windows, typically:\n\n    \"C:\\\\Program Files (x86)\\\\Tencent\\\\WeChat\\\\WeChat.exe\" --remote-debugging-port=9222\n\nOn Mac, typically:\n\n    \"/Applications/WeChat.app/Contents/MacOS/WeChat\" --remote-debugging-port=9222\n\nOpen the history messages page again as in step 1.\n\nOpen the URL `http://127.0.0.1:9222/json` in Chrome.\n\n![](/images/3.png)\n\nCopy the url field and open it in a new tab to see the correct history page.\n\n4. On the history page, right-click --- Inspect to open Chrome DevTools --- switch to the Network tab --- reload the page. Find the cookie, biz, and appmsg_token fields in the panel on the right and fill them into `config.json`.\n\n> Scroll the list down to load the next page, find the request whose URL starts with `https://mp.weixin.qq.com/mp/profile_ext?action=getmsg`, and inspect its parameters.\n\n![](/images/4.png)\n\nThese fields may expire within a few hours; repeat the steps above to obtain fresh ones.\n\n#### Starting the Spider\n\n    npm start\n\nCrawled article metadata, images, and raw article data are stored in the db folder under the project root.\n","funding_links":[],"categories":["TypeScript"],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ff111fei%2Farticle_spider","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ff111fei%2Farticle_spider","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ff111fei%2Farticle_spider/lists"}