{"id":20714168,"url":"https://github.com/ksky521/mpspider","last_synced_at":"2025-04-23T08:12:08.673Z","repository":{"id":57303609,"uuid":"125501570","full_name":"ksky521/mpspider","owner":"ksky521","description":"公众号文章抓取\u0026生成kindle电子书","archived":false,"fork":false,"pushed_at":"2021-10-25T01:21:10.000Z","size":622,"stargazers_count":59,"open_issues_count":1,"forks_count":8,"subscribers_count":5,"default_branch":"master","last_synced_at":"2025-04-23T08:12:02.283Z","etag":null,"topics":["ebook","gitbook","kindle","website","wechat-spider"],"latest_commit_sha":null,"homepage":null,"language":"JavaScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ksky521.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2018-03-16T10:31:27.000Z","updated_at":"2025-02-17T09:28:56.000Z","dependencies_parsed_at":"2022-08-24T17:11:32.061Z","dependency_job_id":null,"html_url":"https://github.com/ksky521/mpspider","commit_stats":null,"previous_names":[],"tags_count":2,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ksky521%2Fmpspider","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ksky521%2Fmpspider/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ksky521%2Fmpspider/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ksky521%2Fmpspider/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ksky521","download_url":"https://codeload.github.com/ksky521/mpspider/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_cou
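nt":250395288,"owners_count":21423400,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ebook","gitbook","kindle","website","wechat-spider"],"created_at":"2024-11-17T02:30:04.745Z","updated_at":"2025-04-23T08:12:08.650Z","avatar_url":"https://github.com/ksky521.png","language":"JavaScript","funding_links":[],"categories":[],"sub_categories":[],"readme":"# WeChat official account article scraper \u0026 Kindle ebook generator\n\nScrapes the historical articles of a WeChat official account, converts them to markdown files, generates a gitbook project, and can finally produce a Kindle book.\n\n**PS**:\n\n1. Requires the ebook-convert dependency\n2. gitbook needs node 6.x; 8.x does not work, and other versions are untested\n3. Generating mobi requires configuring `book.json`\n\n## Scraping modes\n\nTwo scraping modes are supported:\n\n1. Start from one of the account's roundup articles. Some accounts publish a year-end summary article, for example [this article](https://mp.weixin.qq.com/s/CIPosICgva9haqstMDIHag)\n2. Use anyproxy as a proxy to scrape the account's message history, ignoring non-article messages and secondary-headline articles\n\n**PS**: A roundup article here means an article page of an official account, for example [this article](https://mp.weixin.qq.com/s/CIPosICgva9haqstMDIHag) from 「架构师之路」\n\n## Workflow\n\n1. Scrape the article\n2. Parse the links in the article that point to other official-account articles\n3. Continue scraping the linked articles\n4. Rewrite links to those articles as local relative paths\n5. Download the images in the articles\n6. Rewrite image URLs as local relative paths\n7. Generate the gitbook project\n8. 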
Use gitbook + ebook-convert to generate the Kindle file\n\nSteps 1-6 are fully automatic; step 7 depends on your own needs\n\n## Installation\n\n```bash\nnpm i mpspider -g\n```\n\n### Usage\n\n```bash\n# Mode 1\nmpspider article https://mp.weixin.qq.com/s/CIPosICgva9haqstMDIHag -d dest_path\n# Mode 2: configure the proxy manually, then tap 「查看历史文章」 (view history articles) in the official account; see below for details. Both the mobile and PC WeChat lists are supported\nmpspider proxy -d dest_path -p proxy_port\n```\n\nAfter scraping, a gitbook project is created in `dest_path`\n\n### Generating the ebook\n\nRun:\n\n```bash\n# Enter the generated gitbook directory\ncd dest_path\n# Create README.md; gitbook reports an error without it\ntouch README.md\n# Optionally create book.json; see the gitbook documentation\ngitbook serve\n# Open the served address to preview the result\n# -------\n# Generate the ebook\ngitbook mobi ./ name.mobi\n\n```\n\n## Configuring the anyproxy proxy to scrape https pages\n\n### Configure the anyproxy https certificate\n\nSee: http://anyproxy.io/cn/#%E8%AF%81%E4%B9%A6%E9%85%8D%E7%BD%AE\n\n### Start anyproxy\n\n```bash\nanyproxy --rule lib/anyproxyRule.js\n```\n\n## Using the config file `mpspider.config.js`\n\nSupported options:\n\n-   book.json options: 'author', 'title', 'description'\n-   `summarySort`: article sort function, written like an `Array.sort` comparator; it receives `item` objects with fields such as `mid`, `title`, `content`, `release`, and `uri`. `release` is the pinyin file name, and **articles are sorted by `release` by default**\n-   `filter`: article filter function; the article array `items` is filtered with `items.filter(option.filter)`. Each item contains `mid`, `title`, and `content`\n-   `listFilter`: article list filter, used only in proxy mode; it filters based on the JSON objects of the article list. Commonly used fields:\n    -   app_msg_ext_info: `author`, `title`, `copyright_stat`, `content_url`, `source_url`, `digest`, `content`, `cover`, `is_multi`, etc.\n    -   comm_msg_info: `datetime`, the publication timestamp\n-   `turndown`: supports four options, `keep`, `remove`, `rule`, and `plugins`, which map to the corresponding turndown settings\n-   `afterConverter`: after turndown converts the html to markdown, the `content` string is passed to this function, which must `return` the processed string\n\nExample:\n\n```js\nconst turndownPluginGfm = require('turndown-plugin-gfm');\nmodule.exports = {\n    filter: item =\u003e {\n        if (item.title.indexOf('广告') !== -1) {\n            return false;\n        }\n        return true;\n    },\n    turndown: {\n        keep: 'span',\n        remove: 'span',\n        rule: {\n            strikethrough: {\n                filter: ['del', 's', 'strike'],\n                replacement: function(content) {\n                    
return '~' + content + '~';\n                }\n            }\n        },\n        plugins: [turndownPluginGfm.gfm, turndownPluginGfm.tables]\n    },\n    afterConverter: content =\u003e {\n        return content.replace(/\u003c(.+?)\u003e/g, (i, m) =\u003e {\n            return `\u0026lt;${m}\u003e`;\n        });\n    }\n};\n```\n\n## Development\n\nAfter cloning the source with git, enter the folder and run `npm i`\n\n-   index.js: entry file; uses `commander` and `ora` to handle commands\n-   getList.js: extracts the article list from the roundup article\n-   proxySpider.js: scrapes via the anyproxy proxy mode\n-   dealMPList.js: file used by the proxy-based scraping\n-   unfetchMids.js: extracts the articles linked from within the article list\n-   getImages.js: downloads the images referenced in the articles and replaces their URLs with local paths\n-   createBook.js: generates the gitbook markdown files and `summary.md`, and rewrites the in-article links\n\n## Ebook dependencies\n\n-   ebook-convert: `brew install caskroom/cask/calibre`\n-   gitbook: `npm i gitbook-cli -g`\n\n## Kindle screenshots\n\n![Table of contents](./screen_capture/1.jpeg)\n\n![Article with images](./screen_capture/2.jpeg)\n\n![Plain article](./screen_capture/3.jpeg)\n\n## Support the author\n\n![Support the author](./wechat.jpeg)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fksky521%2Fmpspider","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fksky521%2Fmpspider","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fksky521%2Fmpspider/lists"}