{"id":17242873,"url":"https://github.com/imcuttle/jandan-spider","last_synced_at":"2025-04-14T03:30:26.725Z","repository":{"id":45113052,"uuid":"62325504","full_name":"imcuttle/jandan-spider","owner":"imcuttle","description":"jandan-spider","archived":false,"fork":false,"pushed_at":"2022-01-07T14:28:12.000Z","size":8,"stargazers_count":6,"open_issues_count":1,"forks_count":4,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-03-23T02:34:02.351Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"JavaScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/imcuttle.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2016-06-30T16:16:31.000Z","updated_at":"2018-06-26T05:27:45.000Z","dependencies_parsed_at":"2022-09-22T17:12:09.091Z","dependency_job_id":null,"html_url":"https://github.com/imcuttle/jandan-spider","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/imcuttle%2Fjandan-spider","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/imcuttle%2Fjandan-spider/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/imcuttle%2Fjandan-spider/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/imcuttle%2Fjandan-spider/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/imcuttle","download_url":"https://codeload.github.com/imcuttle/jandan-spider/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248815446,"o
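wners_count":21165927,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-10-15T06:14:18.557Z","updated_at":"2025-04-14T03:30:26.689Z","avatar_url":"https://github.com/imcuttle.png","language":"JavaScript","funding_links":[],"categories":[],"sub_categories":[],"readme":"# “A rogue is not scary; an educated rogue is”\n\u003e I finished my compilers exam the day before yesterday and my networking exam today, so I went straight back to tinkering with code. I spent a day exploring web scraping in `nodejs`, which comes down to `tcp` and `http` connections.\n\nThe result is a spider that crawls the [girl pictures on Jandan](http://jandan.net/) and saves them locally.\n\u003c!--more--\u003e\n\n# Approach\n1. Simulate a visit to Jandan by sending an `HTTP request message`\n2. Once the page's HTML is retrieved, match it against a regular expression to extract the image URLs\n3. For each image URL, send another `HTTP request message` and save the image data locally\n\nWith the plan sketched out, I got to work.\n\n# It was not all smooth sailing\n## Can't get the `HTML`?\nFollowing [http://chenxi.name/60.html](http://chenxi.name/60.html), I first made a plain call with the `request` package, but it did not work: the request was redirected to a [block notice page](http://jandan.net/block.php)\n![png](http://moyuyc.github.io/images/jandan-block.png)\nJandan has some anti-scraping measures in place to prevent abusive crawling.\nBut that was not going to stop me. **Why does the image page load normally in a browser?**\nSo I opened the browser console, copied the page request in its cmd format, pasted it into the command line, and got the correct `HTML` back.\n![png](http://moyuyc.github.io/images/jandan2.png)\n![png](http://moyuyc.github.io/images/jandan3.png)\nSo I figured the problem was in the request headers. I copied the headers from the browser and used the `http` package in `nodejs` to open an HTTP connection.\n```javascript\nrequire('http').get({\n        hostname:'jandan.net',\n        path:'/',\n        // Note: Node's http module reads request headers from the `headers` option;\n        // a `header` key is ignored, which may be why the copied headers had no effect.\n        header:{\n            ...\n        }\n    },function(res){\n        \n    })\n```\nStrangely, the response was still a 302 redirect to the block notice page.\n\nOut of options, I dropped down to a lower-level API, the `net` package, to open a TCP connection and send data formatted as an `HTTP request message`.\n```javascript\nvar net = require('net');\nvar header = require('fs').readFileSync('./header.txt').toString();\n\nmodule.exports = function (path,callback) {\n    const socket = net.createConnection(80,'jandan.net');\n\n    // Send the request line followed by the raw headers read from header.txt\n    socket.write(\n       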
 'GET '+path+' HTTP/1.1\\r\\n'+\n        header\n    );\n\n    socket.setEncoding('utf-8');\n    // The request asks for keep-alive, so the server may not close the connection;\n    // after 4 seconds of inactivity, hand back whatever has been received so far.\n    socket.setTimeout(4000,function () {\n        callback(html);\n        console.error(new Error('Time OUT'));\n        socket.end();\n    });\n\n    var html = '';\n    socket.on('data',function (chunk) {\n        html+=chunk;\n    });\n\n    socket.on('end',function () {\n        console.log('disconnected from server');\n    });\n}\n\n```\n`header.txt`\n```\nHost: jandan.net\nConnection: keep-alive\nCache-Control: max-age=0\nAccept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8\nUpgrade-Insecure-Requests: 1\nUser-Agent: Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36\nReferer: http://jandan.net/v\nAccept-Language: zh-CN,zh;q=0.8\nCookie: gif-click-load=on; bad-click-load=on; PHPSESSID=u1gnmqnpb75injakbgvkb6r413; 4036050675=c119Yp%2BLrMWuv%2BWMyYtq3x6vTdbFzaTbUyoiLt%2Fv; jdna=596e6fb28c1bb47f949e65e1ae03f7f5#1467288596467; Hm_lvt_fd93b7fb546adcfbcf80c4fc2b54da2c=1467287791; Hm_lpvt_fd93b7fb546adcfbcf80c4fc2b54da2c=1467288598; _ga=GA1.2.330681373.1467287790\n\n```\n**Note: header.txt must end with two `\\r\\n` sequences to mark the end of the request headers**\nThis finally worked, though I cannot say exactly what differs between the two approaches. I hope a kind reader can tell me.\n\n## Data transfer: synchronous or asynchronous?\nThe recursive function below uses `Promise.all` as a synchronization point to avoid opening too many TCP connections at once. (After switching to this approach, TCP read/write errors dropped noticeably but still occur. If any expert knows how to fix this, I would appreciate the help.)\n```javascript\nfunction run(i,low) {\n    if(i\u003clow) return;\n    spider('/ooxx/page-'+i,function (html) {\n        // Collect the image URLs found in the page HTML\n        var images = [];\n        html.replace(/\u003cimg.+?src=\"(http.+?sina.+?)\"/g,function (m,c) {\n            images.unshift(c);\n        });\n        var page = i;\n        // One promise per image; each resolves once its response starts arriving\n        var proms = images.map((x,i,a)=\u003e{\n            return new Promise((resolve,reject)=\u003e{\n                var req = http.get(x,function (res) {\n                    res.on('error',function (err) {\n                        console.error(err);\n                        resolve('fail');\n                    });\n                    var filename = 
x.substr(x.lastIndexOf('/')+1);\n                    download(dir+'/'+filename,res);\n                    console.log('PAGE:'+page+'...'+filename+'...'+(i+1)+'/'+a.length);\n                    resolve('done');\n                }).end();\n            });\n        });\n        Promise.all(proms)\n            .then((values)=\u003e{\n                // Once this page's images have been queued for download, recurse to the next page.\n                run(i-1,low);\n            });\n    });\n}\n```\n\nIn the end, the folder looks like this!\n![png](http://moyuyc.github.io/images/jandan4.png)\n# Here is the code link, and off we go\n[jandan-spider](https://github.com/moyuyc/jandan-spider)\n\nFollow my blog [moyuyc.github.io](http://moyuyc.github.io/), where a seasoned driver with technical chops takes you for a ride!\n\n![png](http://moyuyc.github.io/images/girl1.gif)\n![png](http://moyuyc.github.io/images/girl6.gif)\n![png](http://moyuyc.github.io/images/girl2.jpg)\n![png](http://moyuyc.github.io/images/girl3.jpg)\n![png](http://moyuyc.github.io/images/girl4.jpg)\n![png](http://moyuyc.github.io/images/girl5.jpg)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fimcuttle%2Fjandan-spider","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fimcuttle%2Fjandan-spider","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fimcuttle%2Fjandan-spider/lists"}