{"id":18137950,"url":"https://github.com/shaobeichen/shenjianshou_spiders","last_synced_at":"2025-04-06T17:22:41.598Z","repository":{"id":143908572,"uuid":"75546656","full_name":"shaobeichen/shenjianshou_spiders","owner":"shaobeichen","description":"基于神箭手云爬虫平台的简单例子","archived":false,"fork":false,"pushed_at":"2022-12-08T02:09:33.000Z","size":13,"stargazers_count":18,"open_issues_count":0,"forks_count":2,"subscribers_count":4,"default_branch":"master","last_synced_at":"2025-02-12T23:29:01.991Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/shaobeichen.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2016-12-04T15:40:44.000Z","updated_at":"2023-03-05T06:55:13.000Z","dependencies_parsed_at":null,"dependency_job_id":"b6dcecf0-8c6c-4e9d-a9ba-c05151825f81","html_url":"https://github.com/shaobeichen/shenjianshou_spiders","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/shaobeichen%2Fshenjianshou_spiders","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/shaobeichen%2Fshenjianshou_spiders/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/shaobeichen%2Fshenjianshou_spiders/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/shaobeichen%2Fshenjianshou_spiders/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/shaobeichen","download_ur
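l":"https://codeload.github.com/shaobeichen/shenjianshou_spiders/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247518673,"owners_count":20951846,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-01T15:07:47.625Z","updated_at":"2025-04-06T17:22:41.565Z","avatar_url":"https://github.com/shaobeichen.png","language":null,"funding_links":[],"categories":[],"sub_categories":[],"readme":"# shenjianshou_spiders\nA simple example built on the Shenjianshou (神箭手) cloud crawler platform\n\n-----------------------------------------------\n[No longer maintained] The platform is no longer officially recommended either; the Houyi Collector (后羿采集器) is recommended instead.\n-----------------------------------------------\n\n\u003e Since you are here, you presumably already know what the Shenjianshou cloud crawler platform does and what you want from it.\n\u003e In what follows I will show how to write a simple crawler as quickly as possible, explaining each configuration property along the way.\n\n\n\n--------------------------------\n####**Enter the crawler market**\n\n\nFirst open the crawler market and log in. Link: [Crawler Market].\n![Image description](http://img.blog.csdn.net/20161204233506561)\n\nYou can also use other people's crawlers and APIs here, but that is not our goal. I looked through most of the crawlers myself, and very few authors open-source their code; only the official [GitHub](https://github.com/ShenJianShou/crawler_samples) repository has a few examples, and those are still somewhat hard for beginners.\n\nShenjianshou also has developer documentation. If you really want to write a crawler, read through it once first: [Developer documentation](http://docs.shenjianshou.cn/).\n\nA first pass should give you a rough picture, even if you still do not know where to start. That is fine; here comes the main part.\n\n--------------------------------\n####**Create a crawler**\n\n\nGo to My Console or My Crawlers and click New Application.\n\nIn the dialog that appears, choose to develop it yourself, enter a name, and click Create.\n\nThen open the project.\n\n--------------------------------\n####**Edit the code**\n\nThis is a crawler of mine that collects articles from a small site called Niuren Weixin (牛人微信).\n\n\n```\nvar configs = {\n  domains: [\"weixin.niurenqushi.com\"],\n  // Domains the crawler may fetch pages from; URLs outside these domains are ignored to speed up crawling\n  \n  scanUrls: [\"http://weixin.niurenqushi.com/\"],\n  // Entry URLs: the crawler starts from these links, and they are also the links a monitoring crawler watches\n  \n  contentUrlRegexes: 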
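[\"http://weixin\\\\.niurenqushi\\\\.com/article/list\\\\-\\\\d+.html\"],\n  // URL rules for “content pages”. A content page is a page that contains the content to scrape; for example, http://www.qiushibaike.com/article/117844937 is a content page on Qiushibaike (糗事百科)\n  // Note: the pattern above looks like a list-page URL and the helperUrlRegexes pattern below looks like an article URL, so the two may need to be swapped\n  \n  helperUrlRegexes: [\"http://weixin\\\\.niurenqushi\\\\.com/article/2016-11-30/\\\\d+.html\"],\n  // URL rules for “list pages”. For sites that have list pages, this setting can greatly increase the crawl rate. A list page is a page that contains a list of content pages; for example, http://www.qiushibaike.com/8hr/page/2/?s=4867046 is a list page on Qiushibaike\n  \n  enableJS: false,\n  // Whether to render pages with JavaScript. Defaults to false; set it to true if JS rendering is needed\n  \n  interval: 3000,\n  // Interval between page fetches, in milliseconds\n  \n  fields: [\n  // Extraction rules for content pages. The rules are made up of fields; each field is one item of data to extract\n    {\n        name: \"article_title\", // Field name; any name you like\n        selector: \"//div[contains(@class,'contitle')]/h1\", // XPath of the element that holds the content; here, the h1 inside a div whose class contains contitle\n        required: false // Whether the field is required (false means it may be empty)\n    },\n    {\n        name: \"article_content\",\n        selector: \"//div[contains(@id,'contentbody')]\",\n        required: false\n    },\n    {\n        name: \"article_publish_time\",\n        selector: \"//div[contains(@class,'contitle')]//div\",\n        required: false\n    },\n    {\n        name: \"article_topic\",\n        selector: \"//a[contains(@class,'ly')]\",\n        required: false\n    }\n  ]\n};\n\n// Callback invoked after a field's content has been extracted; use it to post-process the extracted content\nconfigs.afterExtractField = function(fieldName, data, page){\n  if (fieldName == \"article_content\") {\n    return cacheImg(data); // Returns URLs that can be hosted on the image cloud server; omit this if you only want to keep the data locally\n  }\n  if (fieldName == \"article_publish_time\") {\n    data = Date.parse(new Date())/1000+\"\"; // Replaces the scraped time with the current Unix timestamp in seconds, as a string (the scraped value itself is not parsed)\n  }\n  return data;\n};\n  \nvar crawler = new 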
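Crawler(configs);\ncrawler.start(); // Start the crawler\n```\n\nYou can test it first in the test panel on the right.\n\n####**Crawl results**\n\n![Image description](http://img.blog.csdn.net/20161204233643753)\nClick Overview on the left, then Start in the upper-right corner.\n\nWait a moment.\n\nThen click Crawl Results on the left.\n\n####**Publish the results**\nWhether you want to publish the data to a website or save it, the platform supports both.\n\nTo export the data as an Excel spreadsheet, click Export to File on the left, choose the options you need, and click Generate File.\n\nTo publish to a website, see the explanation here: [Data publishing](http://docs.shenjianshou.cn/use/datapub/useDataPublish.html)\n\nReady-made interfaces for many integrated site platforms are available there and can be used directly; I published my data with wecenter.\n\nIf the data is published but its images do not display, try the image hosting offered by the Shenjianshou platform. There are three options: Alibaba (阿里), Qiniu (七牛), and Shenjianshou. For convenience I used Shenjianshou.\n\n[How do I host images on Shenjianshou?](http://docs.shenjianshou.cn/use/picture/useSJSPhotoStorage.html)\n\n![Image description](http://docs.shenjianshou.cn/images/publish/use_sjs_photo_storage/use_sjs_photo_storage_img_12.jpg)\n\n\u003e \n\u003e If you like it, please give it a star on GitHub!\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fshaobeichen%2Fshenjianshou_spiders","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fshaobeichen%2Fshenjianshou_spiders","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fshaobeichen%2Fshenjianshou_spiders/lists"}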