{"id":30930271,"url":"https://github.com/peteryangs/article-spider","last_synced_at":"2025-09-10T10:44:05.443Z","repository":{"id":40525222,"uuid":"341757720","full_name":"PeterYangs/article-spider","owner":"PeterYangs","description":"文章采集工具 Article collection tool","archived":false,"fork":false,"pushed_at":"2025-09-08T02:52:58.000Z","size":55710,"stargazers_count":142,"open_issues_count":1,"forks_count":26,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-09-08T04:28:07.738Z","etag":null,"topics":["article-spider","go","golang","scraper","spider"],"latest_commit_sha":null,"homepage":"","language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/PeterYangs.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-02-24T02:43:48.000Z","updated_at":"2025-09-08T02:52:33.000Z","dependencies_parsed_at":"2023-09-25T13:48:45.649Z","dependency_job_id":"ce769a88-b0ce-4481-bcf1-7c4c4829f0bc","html_url":"https://github.com/PeterYangs/article-spider","commit_stats":null,"previous_names":[],"tags_count":99,"template":false,"template_full_name":null,"purl":"pkg:github/PeterYangs/article-spider","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PeterYangs%2Farticle-spider","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PeterYangs%2Farticle-spider/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PeterYangs%2Farticle-spider/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PeterYangs%2Farticle-spider/mani
fests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/PeterYangs","download_url":"https://codeload.github.com/PeterYangs/article-spider/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PeterYangs%2Farticle-spider/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":274448137,"owners_count":25287120,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-09-10T02:00:12.551Z","response_time":83,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["article-spider","go","golang","scraper","spider"],"created_at":"2025-09-10T10:44:04.256Z","updated_at":"2025-09-10T10:44:05.421Z","avatar_url":"https://github.com/PeterYangs.png","language":"Go","funding_links":[],"categories":[],"sub_categories":[],"readme":"### article-spider是一个用go编写的爬取文章工具。支持两种模式，常规爬取模式和浏览器自动化模式\n\n[中文文档](https://www.kancloud.cn/peter_yang/article-spiderv3/2624485)\n\u003chr/\u003e\n\n声明：该爬虫仅供学习使用，如产生任何法律后果，本人概不负责\n\n**安装**\n\n```shell\ngo get github.com/PeterYangs/article-spider/v4\n```\n\n[v1版本](https://github.com/PeterYangs/article-spider/tree/v1)\n\n[v2版本](https://github.com/PeterYangs/article-spider/tree/v2)\n\n\n\n\n**快速开始**\n\n```go\npackage main\n\nimport (\n\t\"context\"\n\tarticleSpider \"github.com/PeterYangs/article-spider/v4\"\n)\n\nfunc main() {\n\n\tf := articleSpider.Form{\n\t\tHost:         
\"https://www.925g.com/\",\n\t\tChannel:      \"/zixun_page[PAGE].html/\",\n\t\tListSelector: \"body \u003e div.ny-container.uk-background-default \u003e div.wrap \u003e div \u003e div.commonLeftDiv.uk-float-left \u003e div \u003e div.bdDiv \u003e div \u003e ul \u003e li\",\n\t\tHrefSelector: \" a\",\n\t\tPageStart:    1,\n\t\tLength:       2,\n\t\tDetailFields: map[string]articleSpider.Field{\n\t\t\t\"title\": {ExcelHeader: \"J\", Types: articleSpider.Text, Selector: \"body \u003e div.ny-container.uk-background-default \u003e div.wrap \u003e div \u003e div.commonLeftDiv.uk-float-left \u003e div \u003e div.articleDiv \u003e div.hd \u003e h1\"},\n\t\t\t\"img\": {ExcelHeader: \"H\", Types: articleSpider.Image, Selector: \"body \u003e div.ny-container.uk-background-default \u003e div.wrap \u003e div \u003e div.commonLeftDiv.uk-float-left \u003e div \u003e div.articleDiv \u003e div.bd img:nth-child(1)\", ImageDir: \"app\", ImagePrefix: func(form *articleSpider.Form, path string) string {\n\n\t\t\t\treturn \"app\"\n\t\t\t}},\n\t\t\t\"content\": {ExcelHeader: \"I\", Types: articleSpider.HtmlWithImage, Selector: \"body \u003e div.ny-container.uk-background-default \u003e div.wrap \u003e div \u003e div.commonLeftDiv.uk-float-left \u003e div \u003e div.articleDiv \u003e div.bd\", ImagePrefix: func(form *articleSpider.Form, path string) string {\n\n\t\t\t\treturn \"/api\"\n\t\t\t}},\n\t\t},\n\t\tListFields:            map[string]articleSpider.Field{},\n\t\tCustomExcelHeader:     true,\n\t\tDetailCoroutineNumber: 5,\n\t}\n\n\ts := articleSpider.NewSpider(f, articleSpider.Normal, context.Background())\n\n\ts.Start()\n\n}\n```\n\n[一些例子](https://github.com/PeterYangs/article-spider/tree/master/example)\n\n**常用属性**\n\n```\n\tHost                       string                                   //网站域名\n\tChannel                    string                                   //栏目链接，页码用[PAGE]替换\n\tPageStart                  int                                      //页码起始页\n\tLength       
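              int                                      // number of pages to crawl\n\tListSelector               string                                   // list selector\n\tHrefSelector               string                                   // link (a tag) selector, relative to the list selector\n\tDisableAutoCoding          bool                                     // whether to disable automatic charset conversion\n\tDetailFields               map[string]Field                         // field selectors for the detail page\n\tListFields                 map[string]Field                         // field selectors for the list page; not yet supported when crawling an API\n\tHttpTimeout                time.Duration                            // request timeout\n\tHttpHeader                 map[string]string                        // request headers\n\tHttpProxy                  string                                   // proxy\n\tChannelFunc                func(form *Form) []string                // custom channel links\n\tDetailCoroutineNumber      int                                      // number of goroutines used to crawl detail pages\n\tLazyImageAttrName          string                                   // lazy-load image attribute; defaults to data-original\n\tDisableImageExtensionCheck bool                                     // disable the image extension check; when disabled, every image extension is forced to png\n\tAllowImageExtension        []string                                 // image extensions allowed for download\n\tDefaultImg                 func(form *Form, item Field) string      // default image to use when an image fails\n\tMiddleSelector             []string                                 // intermediate-layer selectors (a tag selectors), used when the detail page is several levels deep; not yet supported in Auto mode\n\tCustomExcelHeader          bool                                     // use custom Excel column headers\n\tResultCallback             func(item map[string]string, form *Form) // custom callback that receives each crawl result\n\tApiConversion              func(html string, form *Form) []string   // extracts links from an API response\n\tAutoPrefixEvent            func(chromedpCtx context.Context)        // hook run before crawling in Auto mode\n\tAutoListWaitSelector       string                                   // selector to wait for on the list page (Auto mode)\n\tAutoNextPageMode           NextPageMode                             // next-page mode (Auto mode; currently supports regular pagination and load-more)\n\tAutoDetailForceNewTab      bool                                     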
//自动模式详情页强制打开新窗口(必须是a链接)\n\tAutoDetailWaitSelector     string                                   //详情等待选择器（用于自动化爬取）\n\tAutoNextSelector           string                                   //下一页选择器（用于自动化爬取）\n\tFilterError                bool                                     //过滤错误的行\n\tDetailUrls                 []string                                 //详情页列表\n\n```\n\n\u003cbr\u003e\n\n**设置header(包含cookie)**\n\n```go\npackage main\n\nimport (\n\t\"context\"\n\tarticleSpider \"github.com/PeterYangs/article-spider/v4\"\n)\n\nfunc main() {\n\n\tf := articleSpider.Form{\n\t\tHost:         \"https://www.925g.com/\",\n\t\tChannel:      \"/zixun_page[PAGE].html/\",\n\t\tListSelector: \"body \u003e div.ny-container.uk-background-default \u003e div.wrap \u003e div \u003e div.commonLeftDiv.uk-float-left \u003e div \u003e div.bdDiv \u003e div \u003e ul \u003e li\",\n\t\tHrefSelector: \" a\",\n\t\tPageStart:    1,\n\t\tLength:       2,\n\t\tDetailFields: map[string]articleSpider.Field{\n\t\t\t\"title\": {ExcelHeader: \"J\", Types: articleSpider.Text, Selector: \"body \u003e div.ny-container.uk-background-default \u003e div.wrap \u003e div \u003e div.commonLeftDiv.uk-float-left \u003e div \u003e div.articleDiv \u003e div.hd \u003e h1\"},\n\t\t\t\"img\": {ExcelHeader: \"H\", Types: articleSpider.Image, Selector: \"body \u003e div.ny-container.uk-background-default \u003e div.wrap \u003e div \u003e div.commonLeftDiv.uk-float-left \u003e div \u003e div.articleDiv \u003e div.bd img:nth-child(1)\", ImageDir: \"app\", ImagePrefix: func(form *articleSpider.Form, path string) string {\n\n\t\t\t\treturn \"app\"\n\t\t\t}},\n\t\t\t\"content\": {ExcelHeader: \"I\", Types: articleSpider.HtmlWithImage, Selector: \"body \u003e div.ny-container.uk-background-default \u003e div.wrap \u003e div \u003e div.commonLeftDiv.uk-float-left \u003e div \u003e div.articleDiv \u003e div.bd\", ImagePrefix: func(form *articleSpider.Form, path string) string {\n\n\t\t\t\treturn 
\"/api\"\n\t\t\t}},\n\t\t},\n\t\tListFields:            map[string]articleSpider.Field{},\n\t\tCustomExcelHeader:     true,\n\t\tDetailCoroutineNumber: 5,\n\t\tHttpHeader: map[string]string{\n\t\t\t\"cookie\":     \"xx\",\n\t\t\t\"user-agent\": \"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36\",\n\t\t},\n\t}\n\n\ts := articleSpider.NewSpider(f, articleSpider.Normal,context.Background())\n\n\ts.Start()\n\n}\n\n```\n\n**自定义分页链接**\n\n```go\npackage main\n\nimport (\n\t\"context\"\n\tarticleSpider \"github.com/PeterYangs/article-spider/v4\"\n)\n\nfunc main() {\n\n\tf := articleSpider.Form{\n\t\tHost: \"https://www.925g.com\",\n\t\tChannelFunc: func(form *articleSpider.Form) []string {\n\n\t\t\treturn []string{\n\t\t\t\t\"/zixun_page1.html/\",\n\t\t\t\t\"/zixun_page2.html/\",\n\t\t\t\t\"/zixun_page3.html/\",\n\t\t\t\t\"/zixun_page4.html/\",\n\t\t\t\t\"/zixun_page5.html/\",\n\t\t\t\t\"/zixun_page6.html/\",\n\t\t\t\t\"/zixun_page7.html/\",\n\t\t\t\t\"/zixun_page8.html/\",\n\t\t\t\t\"/zixun_page9.html/\",\n\t\t\t}\n\t\t},\n\t\tListSelector: \"body \u003e div.ny-container.uk-background-default \u003e div.wrap \u003e div \u003e div.commonLeftDiv.uk-float-left \u003e div \u003e div.bdDiv \u003e div \u003e ul \u003e li\",\n\t\tHrefSelector: \" a\",\n\t\tPageStart:    1,\n\t\tLength:       2,\n\t\tDetailFields: map[string]articleSpider.Field{\n\t\t\t\"title\": {ExcelHeader: \"J\", Types: articleSpider.Text, Selector: \"body \u003e div.ny-container.uk-background-default \u003e div.wrap \u003e div \u003e div.commonLeftDiv.uk-float-left \u003e div \u003e div.articleDiv \u003e div.hd \u003e h1\"},\n\t\t\t\"img\": {ExcelHeader: \"H\", Types: articleSpider.Image, Selector: \"body \u003e div.ny-container.uk-background-default \u003e div.wrap \u003e div \u003e div.commonLeftDiv.uk-float-left \u003e div \u003e div.articleDiv \u003e div.bd img:nth-child(1)\", ImageDir: \"app\", ImagePrefix: func(form 
*articleSpider.Form, path string) string {\n\n\t\t\t\treturn \"app\"\n\t\t\t}},\n\t\t\t\"content\": {ExcelHeader: \"I\", Types: articleSpider.HtmlWithImage, Selector: \"body \u003e div.ny-container.uk-background-default \u003e div.wrap \u003e div \u003e div.commonLeftDiv.uk-float-left \u003e div \u003e div.articleDiv \u003e div.bd\", ImagePrefix: func(form *articleSpider.Form, path string) string {\n\n\t\t\t\treturn \"/api\"\n\t\t\t}},\n\t\t},\n\t\tListFields:            map[string]articleSpider.Field{},\n\t\tCustomExcelHeader:     true,\n\t\tDetailCoroutineNumber: 5,\n\t}\n\n\ts := articleSpider.NewSpider(f, articleSpider.Normal,context.Background())\n\n\ts.Start()\n\n}\n\n```\n\n**详情页中间层**\n\n```go\npackage main\n\nimport (\n\t\"context\"\n\tarticleSpider \"github.com/PeterYangs/article-spider/v4\"\n)\n\nfunc main() {\n\n\tf := articleSpider.Form{\n\t\tHost:           \"https://www.ahjingcheng.com\",\n\t\tChannel:        \"/show/dongzuo--------[PAGE]---/\",\n\t\tListSelector:   \"body \u003e div:nth-child(5) \u003e div \u003e div.col-lg-wide-75.col-xs-1.padding-0 \u003e div:nth-child(2) \u003e div \u003e div.stui-pannel_bd \u003e ul \u003e li\",\n\t\tHrefSelector:   \" div \u003e a\",\n\t\tPageStart:      1,\n\t\tLength:         2,\n\t\tMiddleSelector: []string{\"body \u003e div:nth-child(3) \u003e div \u003e div.col-lg-wide-75.col-xs-1.padding-0 \u003e div:nth-child(1) \u003e div \u003e div:nth-child(2) \u003e div.stui-content__thumb \u003e a\"},\n\t\tDetailFields: map[string]articleSpider.Field{\n\t\t\t\"url\": {Types: articleSpider.Regular, Selector: `\"url\":\"([0-9A-Za-z/\\\\._:]+)\",\"url_next\"`, RegularIndex: 1},\n\t\t},\n\n\t\tDetailCoroutineNumber: 1,\n\t\tHttpHeader: map[string]string{\n\t\t\t\"cookie\":     \"Hm_lvt_66246be1ec92d6574526bda37cf445cc=1633767654; Hm_lvt_56a5b64a8f7a92a018377c693e064bdf=1633767654; 
recente=%5B%7B%22vod_name%22%3A%22%E4%B8%80%E7%BA%A7%E6%8C%87%E6%8E%A7%22%2C%22vod_url%22%3A%22https%3A%2F%2Fwww.ahjingcheng.com%2Fplay%2F119516-1-1%2F%22%2C%22vod_part%22%3A%22%E6%AD%A3%E7%89%87%22%7D%2C%7B%22vod_name%22%3A%22%E5%85%BB%E8%80%81%E5%BA%84%E5%9B%AD%22%2C%22vod_url%22%3A%22https%3A%2F%2Fwww.ahjingcheng.com%2Fplay%2F119506-1-1%2F%22%2C%22vod_part%22%3A%221080P%22%7D%2C%7B%22vod_name%22%3A%22%E4%B8%96%E7%95%8C%E4%B8%8A%E6%9C%80%E7%BE%8E%E4%B8%BD%E7%9A%84%E6%88%91%E7%9A%84%E5%A5%B3%22%2C%22vod_url%22%3A%22https%3A%2F%2Fwww.ahjingcheng.com%2Fplay%2F59426-1-1%2F%22%2C%22vod_part%22%3A%22%E5%85%A8%E9%9B%86%22%7D%2C%7B%22vod_name%22%3A%22%E6%9C%BA%E6%A2%B0%E5%B8%882%EF%BC%9A%E5%A4%8D%E6%B4%BB%E8%8B%B1%E6%96%87%E7%89%88%22%2C%22vod_url%22%3A%22https%3A%2F%2Fwww.ahjingcheng.com%2Fplay%2F91322-1-1%2F%22%2C%22vod_part%22%3A%22%E9%AB%98%E6%B8%85%22%7D%5D; Hm_lvt_66246be1ec92d6574526bda37cf445cc=1633767654; Hm_lvt_56a5b64a8f7a92a018377c693e064bdf=1633767654; PHPSESSID=7sfu1ui3crco1a817vocccl2u1; Hm_lpvt_66246be1ec92d6574526bda37cf445cc=1633914645; Hm_lpvt_56a5b64a8f7a92a018377c693e064bdf=1633914645\",\n\t\t\t\"user-agent\": \"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36\",\n\t\t},\n\t}\n\n\ts := articleSpider.NewSpider(f, articleSpider.Normal,context.Background())\n\n\ts.Start()\n\n}\n\n```\n\n**自行处理爬取结果**\n\n```go\npackage main\n\nimport (\n\t\"fmt\"\n\t\"context\"\n\tarticleSpider \"github.com/PeterYangs/article-spider/v4\"\n)\n\nfunc main() {\n\n\tf := articleSpider.Form{\n\t\tHost:         \"https://www.925g.com\",\n\t\tChannel:      \"/zixun_page[PAGE].html/\",\n\t\tListSelector: \"body \u003e div.ny-container.uk-background-default \u003e div.wrap \u003e div \u003e div.commonLeftDiv.uk-float-left \u003e div \u003e div.bdDiv \u003e div \u003e ul \u003e li\",\n\t\tHrefSelector: \" a\",\n\t\tPageStart:    1,\n\t\tLength:       10,\n\t\tListFields: 
map[string]articleSpider.Field{\n\n\t\t\t\"title\": {ExcelHeader: \"K\", Types: articleSpider.Text, Selector: \" a \u003e div \u003e span\"},\n\t\t},\n\t\tCustomExcelHeader:     true,\n\t\tDetailCoroutineNumber: 2,\n\t\tResultCallback: func(item map[string]string, form *articleSpider.Form) {\n\n\t\t\tfor s2, s3 := range item {\n\n\t\t\t\tfmt.Println(s2, \":\", s3)\n\n\t\t\t}\n\n\t\t},\n\t}\n\n\ts := articleSpider.NewSpider(f, articleSpider.Normal,context.Background())\n\n\ts.Start()\n\n}\n\n```\n\n**爬取列表是api的网页**\n\n```go\npackage main\n\nimport (\n\t\"context\"\n\t\"encoding/json\"\n\tarticleSpider \"github.com/PeterYangs/article-spider/v4\"\n)\n\nfunc main() {\n\n\tf := articleSpider.Form{\n\t\tHost:      \"http://www.tiyuxiu.com\",\n\t\tChannel:   \"/data/list_0_[PAGE].json?__t=16339338\",\n\t\tPageStart: 1,\n\t\tLength:    10,\n\t\tDetailFields: map[string]articleSpider.Field{\n\n\t\t\t\"title\":   {Types: articleSpider.Text, Selector: \"h1\"},\n\t\t\t\"content\": {Types: articleSpider.HtmlWithImage, Selector: \"#main-content\"},\n\t\t},\n\t\t//CustomExcelHeader:     true,\n\t\tDetailCoroutineNumber: 2,\n\t\tApiConversion: func(html string, form *articleSpider.Form) []string {\n\n\t\t\ttype list struct {\n\t\t\t\tUrl string\n\t\t\t}\n\n\t\t\tvar l []list\n\n\t\t\tjson.Unmarshal([]byte(html), \u0026l)\n\n\t\t\tvar temp []string\n\n\t\t\tfor _, l2 := range l {\n\n\t\t\t\ttemp = append(temp, l2.Url)\n\n\t\t\t}\n\n\t\t\treturn temp\n\n\t\t},\n\t}\n\n\ts := articleSpider.NewSpider(f, articleSpider.Api,context.Background()).Debug()\n\n\ts.Start()\n}\n```\n\n**自动化模式**\n\n```go\npackage main\n\nimport (\n\t\"context\"\n\t\"fmt\"\n\tarticleSpider \"github.com/PeterYangs/article-spider/v4\"\n)\n\nfunc main() {\n\n\ts := articleSpider.NewSpider(articleSpider.Form{\n\n\t\tHost:         \"https://www.925g.com\",\n\t\tChannel:      \"/zixun/\",\n\t\tListSelector: \"body \u003e div.ny-container.uk-background-default \u003e div.wrap \u003e div \u003e 
div.commonLeftDiv.uk-float-left \u003e div \u003e div.bdDiv \u003e div \u003e ul \u003e li\",\n\t\tHrefSelector: \"  a\",\n\t\t//下一页选择器\n\t\tAutoNextSelector: \"body \u003e div.ny-container.uk-background-default \u003e div.wrap \u003e div \u003e div.commonLeftDiv.uk-float-left \u003e div \u003e div.bdDiv \u003e ul \u003e li:nth-child(11) \u003e a\",\n\t\t//列表等待选择器\n\t\t//AutoListWaitSelector: \"body \u003e div.ny-container.uk-background-default \u003e div.wrap \u003e div \u003e div.commonLeftDiv.uk-float-left \u003e div \u003e div.bdDiv \u003e div \u003e ul \u003e li:nth-child(1)\",\n\t\t//详情等待选择器\n\t\tAutoDetailWaitSelector: \"body \u003e div.ny-container.uk-background-default \u003e div.wrap \u003e div \u003e div.commonLeftDiv.uk-float-left \u003e div \u003e div.articleDiv \u003e div.hd \u003e h1\",\n\t\tLength:                 3,\n\t\tDetailFields: map[string]articleSpider.Field{\n\t\t\t\"title\": {ExcelHeader: \"J\", Types: articleSpider.Text, Selector: \"body \u003e div.ny-container.uk-background-default \u003e div.wrap \u003e div \u003e div.commonLeftDiv.uk-float-left \u003e div \u003e div.articleDiv \u003e div.hd \u003e h1\"},\n\t\t\t\"content\": {ExcelHeader: \"H\", Types: articleSpider.HtmlWithImage, Selector: \"body \u003e div.ny-container.uk-background-default \u003e div.wrap \u003e div \u003e div.commonLeftDiv.uk-float-left \u003e div \u003e div.articleDiv \u003e div.bd\", ImageDir: \"app\", ImagePrefix: func(form *articleSpider.Form, path string) string {\n\n\t\t\t\treturn \"app\"\n\t\t\t}},\n\t\t},\n\n\t\t//cookie\n\t\tHttpHeader: map[string]string{\n\t\t\t\"cookie\": \"user_cookie=Vmod7XlkHN; UM_distinctid=17b805b421c1e0-0005d3dc1ac8ea-c343365-1fa400-17b805b421dda7; url_data=https://www.925g.com/zixun/,https://www.925g.com/; PHPSESSID=3m0ee50ba4r40jq3fleob2n71i; CNZZDATA1278942394=1852940385-1600066493-%7C1635143024; Hm_lvt_46233f03c62deb1e98a07bf1e1708415=1634807167,1634887947,1634955841,1635153418; 
Hm_lpvt_46233f03c62deb1e98a07bf1e1708415=1635153430\",\n\t\t},\n\t}, articleSpider.Auto,context.Background())\n\n\terr := s.Start()\n\n\tif err != nil {\n\n\t\tfmt.Println(err)\n\t}\n\n}\n```\n\n**自动化模式爬取加载更多页面**\n```go\npackage main\n\nimport (\n\t\"context\"\n\tarticleSpider \"github.com/PeterYangs/article-spider/v4\"\n\n\t\"github.com/chromedp/chromedp\"\n)\n\nfunc main() {\n\n\tf := articleSpider.Form{\n\t\tHost:         \"https://www.btcfans.com\",\n\t\tChannel:      \"/zh-cn/wallet\",\n\t\tListSelector: \"body \u003e div.page-width.page-content \u003e div.main-content \u003e div \u003e div.module-content \u003e ul \u003e li\",\n\t\tHrefSelector: \" a\",\n\t\t//下一页选择器\n\t\tAutoNextSelector: \"body \u003e div.page-width.page-content \u003e div.main-content \u003e div \u003e div.module-content \u003e a\",\n\t\t//列表等待选择器\n\t\tAutoListWaitSelector: \"body \u003e div.page-width.page-content \u003e div.main-content \u003e div \u003e div.module-content \u003e ul \u003e li:nth-child(1)\",\n\t\t//详情等待选择器\n\t\tAutoDetailWaitSelector: \"body \u003e div.page-width.page-content \u003e div.main-content \u003e div.wallet-detail-page \u003e div.info_1 \u003e div.name \u003e div.name-ch\",\n\t\tLength:                 4,\n\t\tDetailFields: map[string]articleSpider.Field{\n\t\t\t\"title\": {ExcelHeader: \"G\", Types: articleSpider.Text, Selector: \"body \u003e div.page-width.page-content \u003e div.main-content \u003e div.wallet-detail-page \u003e div.info_1 \u003e div.name \u003e div.name-ch\"},\n\t\t\t\"content\": {Types: articleSpider.HtmlWithImage, Selector: \"body \u003e div.page-width.page-content \u003e div.main-content \u003e div.wallet-detail-page \u003e div.wallet-des \u003e div \u003e p\", ExcelHeader: \"E\", ImagePrefix: func(form *articleSpider.Form, imageName string) string {\n\n\t\t\t\treturn \"/api/uploads\"\n\t\t\t}, ImageDir: \"game[date:md]/[random:1-100]\"},\n\t\t\t\"desc\":    {Types: articleSpider.Attr, Selector: \"meta[name=\\\"description\\\"]\", 
AttrKey: \"content\", ExcelHeader: \"H\"},\n\t\t\t\"keyword\": {Types: articleSpider.Attr, Selector: \"meta[name=\\\"keywords\\\"]\", AttrKey: \"content\", ExcelHeader: \"K\"},\n\t\t\t\"img\":     {Types: articleSpider.Image, Selector: \"body \u003e div.page-width.page-content \u003e div.main-content \u003e div.wallet-detail-page \u003e div.info_1 \u003e div.cover \u003e img\", ExcelHeader: \"F\", ImageDir: \"game[date:md]/[random:1-100]\"},\n\t\t\t\"type\":    {Types: articleSpider.Fixed, Selector: \"2\", ExcelHeader: \"L\"},\n\t\t\t//\"size\":    {Types: fileTypes.SingleField, Selector: \"#dinfo \u003e p.base \u003e i:nth-child(3)\", ExcelHeader: \"M\"},\n\t\t},\n\n\t\t//cookie\n\t\tHttpHeader: map[string]string{\n\t\t\t\"user-agent\": \"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36\",\n\t\t\t\"cookie\":     \"lang=zh-CN; lang=zh-CN; lang=zh-CN; _ga=GA1.1.1532009431.1641283813; UM_distinctid=17e24238a22739-0fc0995e9cfdad-c343365-1fa400-17e24238a2352e; guid=cff3a072d6ca30b80ee729f0884a8596f65d9a28; CNZZDATA5291371=cnzz_eid%3D1358048227-1641278212-%26ntime%3D1641338428; CNZZDATA1278599438=848177868-1641279863-%7C1641340242; Hm_lvt_ddaa34551214df42d1e5f11974f9f744=1641283822,1641346329; _csrf=3f62bc78510faa5fecfbf404cbee0ec56d1c4f3a; s_a=1; _ga_76F07DJEB4=GS1.1.1641346328.3.1.1641346978.0; Hm_lpvt_ddaa34551214df42d1e5f11974f9f744=1641346980\",\n\t\t},\n\t\t//下一页模式\n\t\tAutoNextPageMode:  articleSpider.LoadMore,\n\t\tCustomExcelHeader: true,\n\t\t//爬取前置事件\n\t\tAutoPrefixEvent: func(chromedpCtx context.Context) {\n\n\t\t\t//关闭弹窗\n\t\t\tchromedp.Run(\n\t\t\t\tchromedpCtx,\n\n\t\t\t\tchromedp.Click(\"#Alert \u003e div \u003e div.sure_btn\", chromedp.ByQuery),\n\t\t\t)\n\n\t\t},\n\t}\n\n\ts := articleSpider.NewSpider(f, articleSpider.Auto,context.Background())\n\n\ts.Start()\n\n}\n```\n\n**代理**\n\n```go\npackage main\n\nimport (\n\t\"context\"\n\tarticleSpider 
\"github.com/PeterYangs/article-spider/v4\"\n)\n\nfunc main() {\n\n\tf := articleSpider.Form{\n\t\tHost:         \"https://www.cgcosplay.jp\",\n\t\tChannel:      \"/product-list?page=[PAGE]\",\n\t\tListSelector: \"#inner_main_container \u003e section \u003e div \u003e div.page_contents.clearfix.alllist_contents \u003e div \u003e div.itemlist_box.tiled_list_box.layout_photo \u003e div \u003e ul \u003e li\",\n\t\tHrefSelector: \" div \u003e a\",\n\t\tPageStart:    1,\n\t\tLength:       10,\n\t\tListFields: map[string]articleSpider.Field{\n\t\t\t\"title\": {ExcelHeader: \"A\", Types: articleSpider.Text, Selector: \"div \u003e a \u003e div \u003e div.list_item_data \u003e p.item_name \u003e span.goods_name\"},\n\t\t\t\"price\": {ExcelHeader: \"B\", Types: articleSpider.Text, Selector: \"div \u003e a \u003e div \u003e div.list_item_data \u003e div \u003e div \u003e p.selling_price \u003e span.figure\"},\n\t\t\t\"img\": {ExcelHeader: \"C\", Types: articleSpider.Image, Selector: \"  div \u003e a \u003e div \u003e div.list_item_photo \u003e div \u003e div\", ImageDir: \"cgcosplay_image\", ImagePrefix: func(form *articleSpider.Form, path string) string {\n\n\t\t\t\treturn \"cgcosplay_image\"\n\t\t\t}},\n\t\t},\n\t\tCustomExcelHeader:     true,\n\t\tDetailCoroutineNumber: 10,\n\t\tLazyImageAttrName:     \"data-src\",\n\t\tHttpProxy:             \"http://127.0.0.1:4780\",\n\t}\n\n\ts := articleSpider.NewSpider(f, articleSpider.Normal,context.Background())\n\n\ts.Start()\n\n}\n```\n\n**排除不需要的元素**\n```go\npackage main\n\nimport (\n\t\"context\"\n\tarticleSpider \"github.com/PeterYangs/article-spider/v4\"\n\t\n)\n\nfunc main() {\n\n\tf := articleSpider.Form{\n\t\tHost:         \"http://www.3h3.com\",\n\t\tChannel:      \"/news/g_38_[PAGE].html\",\n\t\tListSelector: \"body \u003e div.main \u003e div \u003e div \u003e div.col-l \u003e ul.ul-info \u003e li\",\n\t\tHrefSelector: \"  div.pic \u003e a\",\n\t\tPageStart:    2,\n\t\tLength:       1,\n\t\tDetailFields: 
map[string]articleSpider.Field{\n\t\t\t\"content\": {Types: articleSpider.HtmlWithImage, Selector: \"body \u003e div.main \u003e div \u003e div \u003e div.col-l \u003e div.art-body\", NotSelector: []string{\"body \u003e div.main \u003e div \u003e div \u003e div.col-l \u003e div.art-body \u003e div\"}},\n\n\t\t},\n\n\t}\n\n\ts := articleSpider.NewSpider(f, articleSpider.Normal,context.Background())\n\n\ts.Start()\n\n}\n\n```\n\n**根据详情页链接爬取**\n\n```go\npackage main\n\nimport (\n\t\"context\"\n\tarticleSpider \"github.com/PeterYangs/article-spider/v4\"\n)\n\nfunc main() {\n\n\tf := articleSpider.Form{\n\t\tHost: \"https://www.925g.com/\",\n\n\t\tDetailUrls: []string{\n\n\t\t\t\"https://www.925g.com/gonglue/138499.html\",\n\t\t\t\"https://www.925g.com/gonglue/138498.html\",\n\t\t\t\"https://www.925g.com/gonglue/138497.html\",\n\t\t\t\"https://www.925g.com/gonglue/138496.html\",\n\t\t\t\"https://www.925g.com/gonglue/138495.html\",\n\t\t\t\"https://www.925g.com/gonglue/138494.html\",\n\t\t},\n\t\tDetailFields: map[string]articleSpider.Field{\n\t\t\t\"title\": {Types: articleSpider.Text, Selector: \"body \u003e div.ny-container.uk-background-default \u003e div.wrap \u003e div \u003e div.commonLeftDiv.uk-float-left \u003e div \u003e div.articleDiv \u003e div.hd \u003e h1\"},\n\t\t\t\"img\":   {Types: articleSpider.Image, Selector: \"body \u003e div.ny-container.uk-background-default \u003e div.wrap \u003e div \u003e div.commonLeftDiv.uk-float-left \u003e div \u003e div.articleDiv \u003e div.bd img:nth-child(1)\", ImageDir: \"[date:md]/[random:1-100]\"},\n\t\t\t\"content\": {Types: articleSpider.HtmlWithImage, Selector: \"body \u003e div.ny-container.uk-background-default \u003e div.wrap \u003e div \u003e div.commonLeftDiv.uk-float-left \u003e div \u003e div.articleDiv \u003e div.bd\", ImagePrefix: func(form *articleSpider.Form, path string) string {\n\n\t\t\t\treturn \"/api\"\n\t\t\t}, ImageDir: \"[date:md]/[random:1-100]\"},\n\t\t},\n\t\tDetailCoroutineNumber: 
3,\n\t\tFilterError:           true,\n\t}\n\n\ts := articleSpider.NewSpider(f, articleSpider.Url, context.Background())\n\n\ts.Start()\n\n}\n```\n\n**Filtering results**\n```go\npackage main\n\nimport (\n\t\"context\"\n\tarticleSpider \"github.com/PeterYangs/article-spider/v4\"\n\t\"strings\"\n\t\"time\"\n)\n\nfunc main() {\n\n\tf := articleSpider.Form{\n\t\tHost:         \"https://www.xyzs.com\",\n\t\tChannel:      \"/app/soft/index_[PAGE].html\",\n\t\tListSelector: \"body \u003e div.wrapper \u003e section.aplist \u003e ul \u003e li\",\n\n\t\tPageStart: 51,\n\t\tLength:    100,\n\n\t\tListFields: map[string]articleSpider.Field{\n\t\t\t\"title\": {Types: articleSpider.Text, Selector: \" a \u003e p.name\"},\n\t\t},\n\n\t\tDetailCoroutineNumber: 1,\n\t\tFilterError:           true,\n\t\tFilter: func(m map[string]string) bool {\n\n\t\t\tdefer time.Sleep(100 * time.Millisecond)\n\n\t\t\tif strings.Contains(m[\"title\"], \"直播\") {\n\n\t\t\t\treturn true\n\t\t\t}\n\n\t\t\treturn false\n\n\t\t},\n\t}\n\n\ts := articleSpider.NewSpider(f, articleSpider.Normal, context.Background())\n\n\ts.Start()\n\n}\n```\n\n\n**Notes on image save paths**\n\nImage path settings in **Field**\n\n\nImageDir: the directory generated for images; this path appears in the results and supports dynamic values\nImagePrefix: the image prefix path; does not appear in the results\n\nGlobal settings\n\nSetImageDir(path): prefix for saved images; does not appear in the results; defaults to image\n\nSetSavePath(path): folder where images are saved; does not appear in the results\n\nImage save path is assembled as: savePath + imageDir (global) + imageDir (field) + file name\nImage path in the results is assembled as: imagePrefix + ImageDir + file name\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpeteryangs%2Farticle-spider","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fpeteryangs%2Farticle-spider","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpeteryangs%2Farticle-spider/lists"}