{"id":13464558,"url":"https://github.com/hu17889/go_spider","last_synced_at":"2026-04-08T12:03:15.754Z","repository":{"id":17161000,"uuid":"19928024","full_name":"hu17889/go_spider","owner":"hu17889","description":"[爬虫框架 (golang)] An awesome Go concurrent Crawler(spider) framework. The crawler is flexible and modular. It can be expanded to an Individualized crawler easily or you can use the default crawl components only. ","archived":false,"fork":false,"pushed_at":"2017-11-16T01:58:55.000Z","size":2710,"stargazers_count":1830,"open_issues_count":23,"forks_count":471,"subscribers_count":154,"default_branch":"master","last_synced_at":"2025-08-13T20:43:30.412Z","etag":null,"topics":["crawler","go","pipeline","schedule","spider"],"latest_commit_sha":null,"homepage":"","language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":"CyanogenMod/android_packages_apps_AudioFX","license":"mpl-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/hu17889.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2014-05-19T03:23:26.000Z","updated_at":"2025-08-04T15:28:35.000Z","dependencies_parsed_at":"2022-07-14T04:00:41.203Z","dependency_job_id":null,"html_url":"https://github.com/hu17889/go_spider","commit_stats":null,"previous_names":[],"tags_count":4,"template":false,"template_full_name":null,"purl":"pkg:github/hu17889/go_spider","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hu17889%2Fgo_spider","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hu17889%2Fgo_spider/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hu17889%2Fgo_spider/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hu17889
%2Fgo_spider/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/hu17889","download_url":"https://codeload.github.com/hu17889/go_spider/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hu17889%2Fgo_spider/sbom","scorecard":{"id":471345,"data":{"date":"2025-08-11","repo":{"name":"github.com/hu17889/go_spider","commit":"85ede20bf88b6861235d89765dacbc732695ab59"},"scorecard":{"version":"v5.2.1-40-gf6ed084d","commit":"f6ed084d17c9236477efd66e5b258b9d4cc7b389"},"score":3.6,"checks":[{"name":"Code-Review","score":4,"reason":"Found 6/15 approved changesets -- score normalized to 4","details":null,"documentation":{"short":"Determines if the project requires human code review before pull requests (aka merge requests) are merged.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#code-review"}},{"name":"Packaging","score":-1,"reason":"packaging workflow not detected","details":["Warn: no GitHub/GitLab publishing workflow detected."],"documentation":{"short":"Determines if the project is published as a package that others can easily download, install, easily update, and uninstall.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#packaging"}},{"name":"Dangerous-Workflow","score":-1,"reason":"no workflows found","details":null,"documentation":{"short":"Determines if the project's GitHub Action workflows avoid dangerous patterns.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#dangerous-workflow"}},{"name":"Maintained","score":0,"reason":"0 commit(s) and 0 issue activity found in the last 90 days -- score normalized to 0","details":null,"documentation":{"short":"Determines if the project is \"actively 
maintained\".","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#maintained"}},{"name":"Token-Permissions","score":-1,"reason":"No tokens found","details":null,"documentation":{"short":"Determines if the project's workflows follow the principle of least privilege.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#token-permissions"}},{"name":"Pinned-Dependencies","score":-1,"reason":"no dependencies found","details":null,"documentation":{"short":"Determines if the project has declared and pinned the dependencies of its build process.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#pinned-dependencies"}},{"name":"Binary-Artifacts","score":10,"reason":"no binaries found in the repo","details":null,"documentation":{"short":"Determines if the project has generated executable (binary) artifacts in the source repository.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#binary-artifacts"}},{"name":"CII-Best-Practices","score":0,"reason":"no effort to earn an OpenSSF best practices badge detected","details":null,"documentation":{"short":"Determines if the project has an OpenSSF (formerly CII) Best Practices Badge.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#cii-best-practices"}},{"name":"Security-Policy","score":0,"reason":"security policy file not detected","details":["Warn: no security policy file detected","Warn: no security file to analyze","Warn: no security file to analyze","Warn: no security file to analyze"],"documentation":{"short":"Determines if the project has published a security policy.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#security-policy"}},{"name":"Fuzzing","score":0,"reason":"project is not fuzzed","details":["Warn: no 
fuzzer integrations found"],"documentation":{"short":"Determines if the project uses fuzzing.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#fuzzing"}},{"name":"Vulnerabilities","score":10,"reason":"0 existing vulnerabilities detected","details":null,"documentation":{"short":"Determines if the project has open, known unfixed vulnerabilities.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#vulnerabilities"}},{"name":"License","score":10,"reason":"license file detected","details":["Info: project has a license file: LICENSE:0","Info: FSF or OSI recognized license: Mozilla Public License 2.0: LICENSE:0"],"documentation":{"short":"Determines if the project has defined a license.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#license"}},{"name":"Signed-Releases","score":-1,"reason":"no releases found","details":null,"documentation":{"short":"Determines if the project cryptographically signs release artifacts.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#signed-releases"}},{"name":"Branch-Protection","score":0,"reason":"branch protection not enabled on development/release branches","details":["Warn: branch protection not enabled for branch 'master'"],"documentation":{"short":"Determines if the default and release branches are protected with GitHub's branch protection settings.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#branch-protection"}},{"name":"SAST","score":0,"reason":"SAST tool is not run on all commits -- score normalized to 0","details":["Warn: 0 commits out of 21 are checked with a SAST tool"],"documentation":{"short":"Determines if the project uses static code 
analysis.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#sast"}}]},"last_synced_at":"2025-08-19T13:57:34.181Z","repository_id":17161000,"created_at":"2025-08-19T13:57:34.181Z","updated_at":"2025-08-19T13:57:34.181Z"},"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31554110,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-08T10:21:54.569Z","status":"ssl_error","status_checked_at":"2026-04-08T10:21:38.171Z","response_time":54,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["crawler","go","pipeline","schedule","spider"],"created_at":"2024-07-31T14:00:46.273Z","updated_at":"2026-04-08T12:03:15.713Z","avatar_url":"https://github.com/hu17889.png","language":"Go","funding_links":[],"categories":["All","开源类库","Go","Misc","Open source library","Repositories"],"sub_categories":["爬虫","Crawlers"],"readme":"go_spider\n=========\n\n[![Build Status](https://travis-ci.org/hu17889/go_spider.svg)](https://travis-ci.org/hu17889/go_spider)\n\n\nA crawler for vertical communities, implemented in Go. 
\n\n![image](https://raw.githubusercontent.com/hu17889/doc/master/go_spider/img/logo.png)\n\n\nLatest stable release: [Version 1.2 (Sep 23, 2014)](https://github.com/hu17889/go_spider/releases).\n\n\n* [![go_spider discussion group](http://pub.idqqimg.com/wpa/images/group.png)](http://shang.qq.com/wpa/qunwpa?idkey=29f4d06e7fa2b401bc231274d08ada879db777bbf955a44c0e598aaf3d574963) QQ group number: 337344607\n\n\n## Features\n\n* Concurrent \n* Suited to vertical communities\n* Flexible and modular\n* Native Go implementation\n* Can be expanded to an individualized crawler easily\n\n\n## Requirements\n\n* Go 1.2 or higher\n\n## Documentation\n\n[Chinese documentation](https://github.com/hu17889/go_spider/wiki/%E4%B8%AD%E6%96%87%E6%96%87%E6%A1%A3) and [FAQ](https://github.com/hu17889/go_spider/wiki/%E5%B8%B8%E8%A7%81%E9%97%AE%E9%A2%98%E4%B8%8E%E5%8A%9F%E8%83%BD%E8%AF%B4%E6%98%8E).\n\n\n## Installation\n\n```\ngo get github.com/hu17889/go_spider\ngo get github.com/PuerkitoBio/goquery\ngo get github.com/bitly/go-simplejson\ngo get golang.org/x/net/html/charset\n```\n\nThis project is built on [simplejson](https://github.com/bitly/go-simplejson/blob/master/simplejson.go) and [goquery](https://github.com/PuerkitoBio/goquery).\n\nIn China, you can download the packages from [http://gopm.io/](http://gopm.io/).\n\n## Use example\n\nHere is an example that crawls GitHub content. Run it to try out the crawl process.\n* `go install github.com/hu17889/go_spider/example/github_repo_page_processor`\n* `./bin/github_repo_page_processor`\n\nMore examples here: [examples](https://github.com/hu17889/go_spider/tree/master/example).\n\n\n## Make your spider\n\n``` Go\n    // Spider input:\n    //  a PageProcesser;\n    //  a task name, used in the Pipeline to label records.\n    spider.NewSpider(NewMyPageProcesser(), \"TaskName\").\n        AddUrl(\"https://github.com/hu17889?tab=repositories\", \"html\"). // Start URL; \"html\" is the response type (\"html\" or \"json\")\n        AddPipeline(pipeline.NewPipelineConsole()).                    
// Print results to the screen\n        SetThreadnum(3).                                               // Crawl with three goroutines\n        Run()\n```\n\n- Use default modules \n\n - Downloader: HttpDownloader\n - Scheduler: QueueScheduler\n - Pipeline: PipelineConsole, PipelineFile\n\n- Use your modules\n\nJust copy one of the default modules and modify it.\n\nIf you write a Downloader module, plug it in with `Spider.SetDownloader(your_downloader)`.\n\nIf you write a Pipeline module, plug it in with `Spider.AddPipeline(your_pipeline)`.\n\nIf you write a Scheduler module, plug it in with `Spider.SetScheduler(your_scheduler)`.\n\n\n## Extensions\n\nThe extensions folder contains modules and other tools shared by contributors. You are welcome to push your own code, provided it is free of known bugs.\n\n## Modules\n\n### Spider\n\n**Summary:** Crawler initialization, concurrency management, default modules, module management, and configuration.\n\n**Functions:** \n\n- Crawler startup functions: Get, GetAll, Run\n- Add requests: AddUrl, AddUrls, AddRequest, AddRequests\n- Set the main modules: AddPipeline (several pipeline modules may be added), SetScheduler, SetDownloader\n- Set config: SetExitWhenComplete, SetThreadnum (number of concurrent workers), SetSleepTime (sleep time after each crawl)\n- Monitor: OpenFileLog, OpenFileLogDefault (enable file logging, written by the **mlog** package), CloseFileLog, OpenStrace (enable trace output printed to the screen via stderr), CloseStrace\n\n### Downloader\n\n**Summary:** The Spider takes a Request from the Scheduler that holds the URL to be crawled. The Downloader then downloads the result (html, json, jsonp, or text) of the Request. The result is saved in a Page for parsing by the PageProcesser.\nHTML parsing is based on the **goquery** package. JSON parsing is based on the **simplejson** package. JSONP is converted to JSON. The text form represents plain-text content with no parser applied. \n\n**Functions:**\n\n- Download: download the content of the crawl target. 
The result contains the data body, headers, cookies, and request info.\n\n### PageProcesser\n\n**Summary:** The PageProcesser module only parses results. It extracts results (key-value pairs) and the URLs to be crawled in the next step. \nThe key-value pairs are saved in PageItems and the URLs are pushed to the Scheduler.\n\n**Functions:**\n\n- Process: parse the crawled target.\n\n### Page\n\n**Summary:** Stores information about a request, including its downloaded result.\n\n**Functions:** \n\n- Get result: GetJson, GetHtmlParser, GetBodyStr (plain text)\n- Get information about the target: GetRequest, GetCookies, GetHeader\n- Get the status of the crawl: IsSucc (whether the download succeeded), Errormsg (error info from the Downloader)\n- Set config: SetSkip, GetSkip (if skip is true, the result is not output by the Pipeline), AddTargetRequest, AddTargetRequests (save URLs to be crawled in the next stage), AddTargetRequestWithParams, AddTargetRequestsWithParams, AddField (save key-value pairs after parsing)\n\n\n### Scheduler\n\n**Summary:** The Scheduler module is a Request queue. URLs parsed by the PageProcesser are pushed onto the queue.\n\n**Functions:**\n\n- Push\n- Poll\n- Count\n\n### Pipeline\n\n**Summary:** The Pipeline module outputs the results and saves them wherever you want. 
The default modules are PipelineConsole (output to stdout) and PipelineFile (output to a file).\n\n**Functions:**\n\n- Process\n\n\n### Request\n\n**Summary:** The Request module holds the configuration for an HTTP request, such as the URL, headers, and cookies.\n\n**Functions:**\n\n- Process\n\n\n\n## License\ngo_spider is licensed under the [Mozilla Public License Version 2.0](https://github.com/hu17889/go_spider/blob/master/LICENSE).\n\nMozilla summarizes the license scope as follows:\n\u003e MPL: The copyleft applies to any files containing MPLed code.\n\n\nThat means:\n  * You can **use** the **unchanged** source code both privately and commercially\n  * You **needn't publish** the source code of your library as long as the files licensed under the MPL 2.0 are **unchanged**\n  * You **must publish** the source code of any **changed files** licensed under the MPL 2.0 under a) the MPL 2.0 itself or b) a compatible license (e.g. GPL 3.0 or Apache License 2.0)\n\nPlease read the [MPL 2.0 FAQ](http://www.mozilla.org/MPL/2.0/FAQ.html) if you have further questions regarding the license.\n\nYou can read the full terms here: [LICENSE](https://raw.github.com/go-sql-driver/mysql/master/LICENSE).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhu17889%2Fgo_spider","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fhu17889%2Fgo_spider","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhu17889%2Fgo_spider/lists"}