{"id":20578793,"url":"https://github.com/cclient/goworkscript","last_synced_at":"2026-06-10T07:31:12.212Z","repository":{"id":93977305,"uuid":"89773258","full_name":"cclient/goworkscript","owner":"cclient","description":"业余时间写的go小工具脚本","archived":false,"fork":false,"pushed_at":"2017-09-06T13:31:58.000Z","size":2221,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-03-06T11:43:48.961Z","etag":null,"topics":["clawler","go","spider"],"latest_commit_sha":null,"homepage":null,"language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/cclient.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2017-04-29T08:32:56.000Z","updated_at":"2019-03-15T03:08:59.000Z","dependencies_parsed_at":null,"dependency_job_id":"8f018366-b4fd-48d5-9d72-813ff4f11472","html_url":"https://github.com/cclient/goworkscript","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/cclient/goworkscript","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cclient%2Fgoworkscript","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cclient%2Fgoworkscript/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cclient%2Fgoworkscript/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cclient%2Fgoworkscript/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/cclient","download_url":"https://codeload.github.com/cc
lient/goworkscript/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cclient%2Fgoworkscript/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34142637,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-10T02:00:07.152Z","response_time":89,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["clawler","go","spider"],"created_at":"2024-11-16T06:14:36.313Z","updated_at":"2026-06-10T07:31:12.195Z","avatar_url":"https://github.com/cclient.png","language":"Go","funding_links":[],"categories":[],"sub_categories":[],"readme":"A few Go script tools\n\nIn practice I use Go as a simplified C, a lightweight Java, a Node.js without callback hell, and a faster Python.\n\nThis project is not a large piece of engineering. Everything in it is lightweight, closer to the kind of spare-time scripts one would write in Python.\n\n## src/client/pdf\n\nDownloads PDFs from the Geekbang website.\n\nIndex source (I stumbled on one result, found that many more could be found the same way, and wrote this script to fetch them all):\n\nhttp://cn.bing.com/search?q=site:ppt.geekbang.org+ppt\n\nGoogle and Bing return essentially the same results.\n\nThe table-creation statement is in\nsrc/client/pdf/db.sql\n\n\nThe site's own PDF id is used for deduplication.\n\nFor example:\n\n'1','Bing','580984e1c891a','如何打造大规模互联网企业的 监控告警平台','http://ppt.geekbang.org/slide/download/467/580984e1c891a.pdf','如何打造大规模互联网企业的 监控告警平台-- 以携程hickwall为例 author rhtang@ctrip.com'\n\nHere the site id is 580984e1c891a\n\n\nGoogle, !@#$%\u0026, is not something everyone can !@#$%\u0026, so to suit more people (and to save some proxy traffic) Bing is used as the search engine.\n\nThe main bottleneck is the network I/O of downloading files, so parallelism gains little. Parsing and downloading are therefore fully serialized.\n\nThat search engines can crawl these files shows that Geekbang allows access without login. It is unclear whether this is intentional or the result of a flaw in the site's permission design.\n\n\nSample of the PDF file information collected by the crawler\n\n\n## src/client/spider 
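\n\nBackground parallel web page fetching (only the basic GET request is kept; add proxies, custom headers, or multi-node distribution yourself if you need them).\n\nDecoupled through Redis:\n\nThe client submits URL information to Redis.\n\nThe backend requests the URLs in parallel batches (20 parallel requests by default).\n\nFlow\n\n1. 1000 entries in total are submitted to Redis (pushed onto the head of the list).\n\n2. Every 10 seconds, 100 entries are read from Redis (from the tail of the list).\n\n3. Because the parallelism is 20, the 100 entries are split into 5 groups, and the groups are processed in FIFO order.\n\n4. Once they have been read, the 100 entries are removed from the list (from the tail).\n\n5. When the batch finishes, the next 100 entries are read from Redis (from the tail of the list, as in step 2).\n\nGo's HTTP client reuses TCP connections under the hood, so requests are efficient.\n\nThe concurrency handling is fairly crude and could be wrapped into something more general.\n\n\n## src/client/wangpiao\n\nParses content scraped from Wangpiao (网票网). The site has since been redesigned, so parts of this no longer work.\n\nHow to run\n\nEdit src/app.go\n\nRun go run src/app.go\n\nIt can also be built into a Docker image and run that way.\n\n*————————————————————————————*\n\n\nTaken together, the pieces above cover the most basic crawling and parsing functionality.\n\nhttps://github.com/cclient/gowebframework +  + src/client/spider \n\ncan be combined into a simple crawler service backend: submit the URLs to be requested, then read the fetched page content straight from Redis and parse it.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcclient%2Fgoworkscript","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcclient%2Fgoworkscript","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcclient%2Fgoworkscript/lists"}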