{"id":21314109,"url":"https://github.com/bitxx/pholcus","last_synced_at":"2025-07-12T01:31:08.992Z","repository":{"id":40293092,"uuid":"190396598","full_name":"bitxx/pholcus","owner":"bitxx","description":"Fixes and improvements to the Go-based henrylee2cn/pholcus crawler framework, made to meet my own needs","archived":false,"fork":false,"pushed_at":"2023-10-11T21:13:26.000Z","size":3301,"stargazers_count":24,"open_issues_count":5,"forks_count":25,"subscribers_count":2,"default_branch":"main","last_synced_at":"2024-11-19T14:54:09.225Z","etag":null,"topics":["crawler","golang","pholcus"],"latest_commit_sha":null,"homepage":"","language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/bitxx.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2019-06-05T13:08:20.000Z","updated_at":"2023-11-26T08:15:38.000Z","dependencies_parsed_at":"2024-06-18T22:55:50.899Z","dependency_job_id":"5542d7da-b0f0-43f0-afb8-b69c39ac3b84","html_url":"https://github.com/bitxx/pholcus","commit_stats":null,"previous_names":["bitxx/pholcus","jason-wj/pholcus"],"tags_count":2,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bitxx%2Fpholcus","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bitxx%2Fpholcus/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bitxx%2Fpholcus/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bitxx%2Fpholcus/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/bitxx","download_url":"https://codeload.gi
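thub.com/bitxx/pholcus/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":225778849,"owners_count":17522710,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["crawler","golang","pholcus"],"created_at":"2024-11-21T18:10:38.595Z","updated_at":"2024-11-21T18:10:39.367Z","avatar_url":"https://github.com/bitxx.png","language":"Go","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Pholcus Crawler Framework (Modified Version)\nThe original official crawler framework, [henrylee2cn/pholcus](https://github.com/henrylee2cn/pholcus), is essentially no longer updated, so I made a number of fixes and improvements to it for my own learning.\nIf you are interested, see the original project's introduction.\nThanks to [henrylee2cn](https://github.com/henrylee2cn).\nBecause crawler projects are sensitive from a policy standpoint, this project may be adjusted at any time to follow the status of the original author's project (for example, deleting the project or adding a disclaimer).\nTo reiterate: this project is for learning purposes only.\n\n# Disclaimer\nThis software is for academic research only. Users must comply with the laws and regulations of their jurisdiction and must not use it for illegal purposes! For example, [news reports](https://github.com/HiddenStrawberry/Crawler_Illegal_Cases_In_China) of crawler developers in mainland China facing lawsuits and violations appear frequently.\nFormal notice: users bear sole responsibility for all consequences of illegal or non-compliant use!\n\n## 2019-08-17 Update\n1. When the history contains failed records, you can configure in the Spider whether to crawl only those records. If this is not set, the crawler retries the failed records and also continues crawling the defined rules. This is mainly meant for incremental updates: crawl the site once, collect the failed records, then crawl only those records, since there is no need to re-crawl the entire site just to retry the failures.\n\n## 2019-07-21 Update\n1. Updated various details and exception handling.\n2. Refactored the proxy module and simplified the proxy logic.\n3. For working with views, see /pholcus/web/bindata_assetfs_usage. Before use, first unzip views.zip in that directory and modify the pages as needed.\n4. The log file name can now be changed as needed in /pholcus/config/config.go, which makes it easier to tell apart the crawlers for different sites.\n5. Many other functional changes, not listed individually here.\n\n## Historical notes below, some of which may be useful for reference\n\n## Usage tips\n1. For each request, it is best to set `DialTimeout` and `ConnTimeout`. The framework default is 2 minutes, which has a large impact during bulk crawling; I keep it at around 15 seconds.\n2. Using a proxy together with the timeouts above is the best option.\n\n## Personal impressions from using this project\nFor heavyweight crawling, I found its performance and usability much better than those of many other open-source projects.\n\n## Current fixes and improvements\n1. To allow the crawl interval to be switched dynamically when calling ctx.Sleep in a crawler rule, add the following at line 200 of `pholcus/app/crawler/crawler.go`:\n```\nself.setPauseTime()\n```  \n\n2. 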
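The framework uses a map to import crawl results into mongo. Originally, the default value of a parameter in the crawler context could not be nil (it had to be set to the empty string \"\"), which causes the parameter to be included whenever the data is processed as JSON. This wastes space (during ETL the empty field is written to mongo again, which wastes resources) and is easy to misinterpret.  \nTo address this, comment out the following code at line 282 of `pholcus/app/downloader/request/request.go`:\n```\nif defaultValue == nil {\n\tpanic(\"*Request.GetTemp()的defaultValue不能为nil，错误位置：key=\" + key)\n}\n```  \n\n3. (Pending; currently reverted.) When multiple requests are issued, the header is reused. Under some crawler rules this causes request errors and the crawl can never continue (which can be mistaken for the IP being blocked). To address this, comment out the following code at line 60 of `pholcus/app/downloader/surfer/param.go`:\n```\nparam.header = req.GetHeader()\n```\n4. When a page request fails and no data is retrieved (the message is `convert err xxx`), the crawler deadlocks if the number of errors exceeds the goroutine limit. To address this, add the following line at line 643 of `pholcus/app/spider/context.go`:\n```\nself.text = []byte(\"\") // prevent self.text from being nil\n```\n\n5. Added a method that returns the raw []byte data of the fetched page. Without it, fetching images or processing the data yourself was difficult. To address this, add the following code at line 579 of `pholcus/app/spider/context.go`:  \n```  \n// GetBytes returns plain bytes crawled.\nfunc (self *Context) GetBytes() []byte {\n\tif self.text == nil {\n\t\tself.initText()\n\t}\n\treturn self.text\n}\n```\n\n6. When the response encoding is \"image/jpeg\" or no encoding is specified, do not transcode (by default the content is converted to UTF-8, which corrupts the display of images and similar content). Add the following entry at line 641 of `pholcus/app/spider/context.go`:\n```  \n\"image/jpeg\",\"\"\n```\n\n7. Some sites may change their URLs. If crawling continues, the changed URLs are treated as new ones, which duplicates data already crawled from the old URLs. To solve this, add the following at line 19 of `pholcus/app/downloader/request/request.go`:\n```  \nUrlAlias      string          // URL alias, mainly to keep URL changes on a site from affecting deduplication. (If the site's URL changes, just put the old URL here.)\n```\nAlso add the following at line 142:\n```  \n// Unique returns the unique identifier of the request.\nfunc (self *Request) Unique() string {\n\tif self.unique == \"\" {\n\t\tif self.UrlAlias != \"\" {\n\t\t\tblock := md5.Sum([]byte(self.Spider + self.Rule + self.UrlAlias + self.Method))\n\t\t\tself.unique = hex.EncodeToString(block[:])\n\t\t} else {\n\t\t\tblock := md5.Sum([]byte(self.Spider + self.Rule + self.Url + self.Method))\n\t\t\tself.unique = hex.EncodeToString(block[:])\n\t\t}\n\t}\n\treturn self.unique\n}\n```\nFrom then on, you only need to set `UrlAlias` in the Request to the old root address.\n\n8. 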
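To surface proxy errors more clearly, add the following at line 239 of `pholcus/app/aid/proxy/proxy.go`:\nPS: I once made a mistake where the proxy test kept failing, and only later learned that the proxy IP needs the `http://` prefix. It went unnoticed because the source code ignored the error reported below.\n```  \nif err != nil {\n\tlogs.Log.Informational(\" *     [%v] proxy test error: \" + err.Error())\n}\n```\n\n9. Placed the source of two helper libraries, [henrylee2cn/teleport](https://github.com/henrylee2cn/teleport) and [henrylee2cn/goutil](https://pholcus/common/goutil), directly in the `/pholcus/common` directory.\n\n10. Added a package of example crawler rules to the project root.\n\n11. You can decide manually whether a given link should be deduplicated. A NeedUrlUnique parameter was added to Request and to the library; by default, links are not deduplicated:\n\n```txt\napp/scheduler/matrix.go, line 169\nif ok \u0026\u0026 req.NeedUrlUnique {\n\n}\n\n```\n\n12. Improved mongo support: the admin username and password can now be used for encrypted storage. If username is empty, it is assumed that no account or password is needed.\n\n13. Fixed the problem where the web page could not open automatically while a crawl was in progress (IE11 could occasionally open it). The page-opening mechanism now listens asynchronously for the socket's open state. \nProblem: in the original home() method in `web/views/js/app.js`, if the browser had been closed while a crawl was already running, then mode=-1, and calling home() invoked ws.send() directly. At that point ws had not yet opened its connection, so the page failed to open properly.\n    \nFix: in the home() method in `web/views/js/app.js`, listen asynchronously for the ws open event:\n```html\n// add an event listener\nws.addEventListener('open', function () {\n    Open('refresh');\n});\n```\n\n\n\nRemaining adjustments will be made gradually as needs arise.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbitxx%2Fpholcus","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbitxx%2Fpholcus","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbitxx%2Fpholcus/lists"}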