{"id":17311988,"url":"https://github.com/boris-code/boris-spider","last_synced_at":"2025-08-19T10:31:32.098Z","repository":{"id":44862207,"uuid":"257936664","full_name":"Boris-code/boris-spider","owner":"Boris-code","description":"boris-spider是一款使用Python语言编写的爬虫框架，于多年的爬虫业务中不断磨合而诞生，相比于scrapy，该框架更易上手，且又满足复杂的需求，支持分布式及批次采集。","archived":false,"fork":false,"pushed_at":"2022-01-21T20:23:43.000Z","size":262,"stargazers_count":83,"open_issues_count":3,"forks_count":24,"subscribers_count":4,"default_branch":"master","last_synced_at":"2024-12-09T18:13:14.933Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"https://spider-doc.readthedocs.io","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Boris-code.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2020-04-22T15:08:13.000Z","updated_at":"2024-12-09T14:25:17.000Z","dependencies_parsed_at":"2022-09-14T01:14:14.728Z","dependency_job_id":null,"html_url":"https://github.com/Boris-code/boris-spider","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Boris-code%2Fboris-spider","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Boris-code%2Fboris-spider/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Boris-code%2Fboris-spider/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Boris-code%2Fboris-spider/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Boris-code","download_url":"https://codeload.github.com/Boris-code/boris-spi
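der/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":230345782,"owners_count":18211997,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-10-15T12:42:02.111Z","updated_at":"2024-12-18T22:08:04.197Z","avatar_url":"https://github.com/Boris-code.png","language":"Python","readme":"# boris-spider\n\n![](https://img.shields.io/badge/python-3.6-brightgreen)\n\n## Important Notice\n\n⚠️⚠️⚠️ This framework has been renamed to feapder and this project is deprecated. Please use the new project instead.\n\nNew project: [https://github.com/Boris-code/feapder](https://github.com/Boris-code/feapder)\n\nDocumentation: [https://boris.org.cn/feapder](https://boris.org.cn/feapder)\n\n## Introduction\n\n**boris-spider** is a web crawling framework written in Python, refined over many years of production crawling work. Compared with scrapy, it is easier to get started with while still meeting complex requirements, and it supports distributed and batch crawling.\n\nOfficial documentation: https://spider-doc.readthedocs.io\n\nNotes on crawler development experience: https://mp.weixin.qq.com/s/cIUNatRCUtlAi0HAkbmcwA\n\n## Features\n\n### 1. Periodic crawling\n\nPeriodic crawling is a common requirement, for example fetching product sales figures once a day. We call each such period a batch.\n\nThe usual approach for this kind of crawler is to set up a scheduled job that starts it once a day. But what happens if, for some reason, the scheduled job fails to start the program? For example, the server runs short of resources and the process is killed as soon as it starts.\n\nAnd how do you ensure that every record is updated within every batch?\n\nThis framework supports batch crawling. It introduces the concept of a batch table, which records the crawl status of each batch in detail.\n\n![-w899](http://markdown-media.oss-cn-beijing.aliyuncs.com/2020/12/20/16084680404224.jpg?x-oss-process=style/markdown-media)\n\n### 2. Distributed crawling\n\nDistributed crawling is essential when dealing with massive amounts of data. This framework supports it natively, and crawlers can be restarted at any time without losing tasks.\n\n### 3. 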
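Comprehensive alerting\n\nTo keep data complete, accurate, and timely, the framework has a built-in alerting mechanism. With these alerts, you can monitor crawler status in real time.\n\n![-w657](http://markdown-media.oss-cn-beijing.aliyuncs.com/2020/12/20/16084718683378.jpg?x-oss-process=style/markdown-media)\n\n![-w501](http://markdown-media.oss-cn-beijing.aliyuncs.com/2020/12/20/16084718974597.jpg?x-oss-process=style/markdown-media)\n\n![-w416](http://markdown-media.oss-cn-beijing.aliyuncs.com/2020/12/29/16092335882158.jpg?x-oss-process=style/markdown-media)\n\n\n## Framework Flowchart\n\n![boris-spider -1-](http://markdown-media.oss-cn-beijing.aliyuncs.com/2020/06/08/borisspider-1.png?x-oss-process=style/markdown-media)\n\n### Modules\n\n* spider **framework scheduling core**\n* parser_control **parser controller**, responsible for scheduling parsers\n* collector **task collector**, fetches tasks in batches from the task queue into memory, reducing the access frequency and concurrency against the task queue database\n* parser **data parser**\n* start_request the function that issues the initial tasks\n* item_buffer **data buffer queue**, writes data to the database in batches\n* request_buffer **request task buffer queue**, writes request tasks to the task queue in batches\n* request **data downloader**, wraps requests and downloads data from the internet\n* response **response object**, wraps response and supports xpath, css, re, and other parsing methods. Automatically handles garbled Chinese text caused by encoding issues\n\n### Workflow\n\n1. spider schedules **start_request** to produce tasks\n2. **start_request** sends tasks to request_buffer\n3. spider schedules **request_buffer** to store tasks in the task queue database in batches\n4. spider schedules **collector** to fetch tasks in batches from the task queue into an in-memory queue\n5. spider schedules **parser_control** to take tasks from the collector's in-memory queue\n6. **parser_control** schedules **request** to request data\n7. **request** requests and downloads the data\n8. request passes the downloaded data to **response** for further wrapping\n9. The wrapped **response** is returned to **parser_control** (the diagram shows multiple parser_control instances, representing multiple threads)\n10. parser_control schedules the corresponding **parser** to parse the returned response (the multiple parser groups in the diagram represent parsers for different websites)\n11. parser_control distributes the data items extracted by the parser and any newly generated requests to **item_buffer** and **request_buffer**\n12. 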
spider schedules **item_buffer** and **request_buffer** to write the data to storage in batches\n\n\n\n## Requirements\n\n- Python 3.6.0+\n- Works on Linux, Windows, macOS\n\n## Installation\n\nFrom PyPI:\n\n    pip3 install boris-spider\n\nFrom Git:\n\n    pip3 install git+https://github.com/Boris-code/boris-spider.git\n    \n\nOn Windows, if installing bitarray fails, install bitarray manually and then install this framework. Steps:\n\n    Download and extract: https://github.com/ilanschnell/bitarray/archive/1.5.3.zip\n    cd bitarray-1.5.3\n    python setup.py install\n\n\n## Quick Start\n\nCreate a spider\n\n    spider create -p first_spider    \n\nThe generated spider code looks like this:\n\n\n    import spider\n\n\n    class FirstSpider(spider.SingleSpider):\n        def start_requests(self, *args, **kws):\n            # Seed the crawl with the initial request\n            yield spider.Request(\"https://www.baidu.com\")\n    \n        def parser(self, request, response):\n            # print(response.text)\n            # Print the value attribute of the submit button\n            print(response.xpath('//input[@type=\"submit\"]/@value').extract_first())\n    \n    \n    if __name__ == \"__main__\":\n        FirstSpider().start()\n        \nRun it directly and it prints:\n\n    Thread-2|2020-05-19 18:23:41,128|request.py|get_response|line:283|DEBUG| \n                    -------------- FirstSpider.parser request for ----------------\n                    url  = https://www.baidu.com\n                    method = GET\n                    body = {'timeout': 22, 'stream': True, 'verify': False, 'headers': {'User-Agent': 'Mozilla/5.0 (Windows NT 4.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2049.0 Safari/537.36'}}\n                    \n    百度一下\n    Thread-2|2020-05-19 18:23:41,727|parser_control.py|run|line:415|INFO| parser 等待任务 ...\n    FirstSpider|2020-05-19 18:23:44,735|single_spider.py|run|line:83|DEBUG| 无任务，爬虫结束\n    \n\n## Bonus\n\nThe framework's utils/tools.py module collects utility functions the author has accumulated over many years, more than 100 in total, with further updates from time to time. They are handy to reuse in your own code. 
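\n    \n## Learning and Discussion\n\nFor more details on using the framework, see the official documentation: https://spider-doc.readthedocs.io\n\nIf you run into problems while learning, you can join the QQ group below.\n\nGroup number: 750614606\n\n![WechatIMG188](http://markdown-media.oss-cn-beijing.aliyuncs.com/2020/04/08/wechatimg188.jpeg)\n\nZhishi Xingqiu (Knowledge Planet):\n\n![Zhishi Xingqiu](http://markdown-media.oss-cn-beijing.aliyuncs.com/2020/02/16/zhi-shi-xing-qiu.jpeg)\n\nThe Zhishi Xingqiu group shares practical crawling material from time to time, covering topics including but not limited to JavaScript reverse-engineering techniques, crawler framework analysis, and general crawling techniques.","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fboris-code%2Fboris-spider","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fboris-code%2Fboris-spider","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fboris-code%2Fboris-spider/lists"}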