{"id":13717335,"url":"https://github.com/zhang2333/light-crawler","last_synced_at":"2025-09-06T04:38:13.993Z","repository":{"id":2308204,"uuid":"46206710","full_name":"zhang2333/light-crawler","owner":"zhang2333","description":"a simplified directed customizable website crawler","archived":false,"fork":false,"pushed_at":"2024-02-29T15:28:46.000Z","size":115,"stargazers_count":74,"open_issues_count":0,"forks_count":21,"subscribers_count":4,"default_branch":"master","last_synced_at":"2025-08-09T09:54:11.619Z","etag":null,"topics":["crawler","node-js"],"latest_commit_sha":null,"homepage":"","language":"JavaScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/zhang2333.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2015-11-15T06:07:05.000Z","updated_at":"2025-04-19T13:38:32.000Z","dependencies_parsed_at":"2024-11-14T05:41:46.232Z","dependency_job_id":null,"html_url":"https://github.com/zhang2333/light-crawler","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/zhang2333/light-crawler","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zhang2333%2Flight-crawler","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zhang2333%2Flight-crawler/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zhang2333%2Flight-crawler/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zhang2333%2Flight-crawler/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/o
wners/zhang2333","download_url":"https://codeload.github.com/zhang2333/light-crawler/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zhang2333%2Flight-crawler/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":273858844,"owners_count":25180766,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-09-06T02:00:13.247Z","response_time":2576,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["crawler","node-js"],"created_at":"2024-08-03T00:01:20.908Z","updated_at":"2025-09-06T04:38:13.972Z","avatar_url":"https://github.com/zhang2333.png","language":"JavaScript","funding_links":[],"categories":["JavaScript"],"sub_categories":[],"readme":"## Light Crawler - A Directed Web Crawler\r\n\r\n[![Build Status](https://travis-ci.org/zhang2333/light-crawler.svg)](https://travis-ci.org/zhang2333/light-crawler)\r\n\r\n[![NPM Status](https://nodei.co/npm/light-crawler.png?downloads=true\u0026downloadRank=true)](https://nodei.co/npm/light-crawler/)\r\n\r\n[![NPM Downloads](https://nodei.co/npm-dl/light-crawler.png)](https://nodei.co/npm/light-crawler/)\r\n\r\nA simplified directed web crawler, easy for scraping web pages and downloading other resources.\r\n\r\nEnglish Doc(Here) or [中文文档](https://github.com/zhang2333/light-crawler/blob/master/README_zh_CN.md).\r\n\r\n### Installation\r\n\r\n```shell\r\nnpm install 
light-crawler\r\n```\r\n\r\n### Example\r\n\r\n```javascript\r\nconst Crawler = require('light-crawler');\r\n// create an instance of Crawler\r\nlet c = new Crawler();\r\n// add a url or an array of urls to request\r\nc.addTask('http://www.xxx.com');\r\n// define a scraping rule\r\nc.addRule(function (result) {\r\n\t// result has 2 props: task and body\r\n\t// result.task: id, url, and any other props you added\r\n\t// result.body is the HTML of the page\r\n\t// scrape result.body; you can use cheerio\r\n});\r\n// start your crawler now\r\nc.start().then(() =\u003e {\r\n\tconsole.log('Finished!');\r\n});\r\n```\r\n### Crawler Properties\r\n\r\nIn light-crawler, a page request is called a `task`. Tasks are put into the task-pool and executed in order.\r\n\r\n- `settings`: basic settings of the crawler\r\n  - `id`: id of the crawler, integer or string, default: `null`\r\n  - `interval`: crawling interval in ms, default: `0`; or a range to pick a random value from, e.g. `[200,500]`\r\n  - `retry`: retry times, default: `3`\r\n  - `concurrency`: an integer that determines how many tasks run in parallel, default: `1`\r\n  - `skipDuplicates`: whether to skip duplicate tasks (same url), default: `false`\r\n\r\n  - `requestOpts`: request options of tasks, **these are the global request options**\r\n    - `timeout`: default: `10000`\r\n    - `proxy`: proxy address\r\n    - `headers`: headers of the request, default: `{}`\r\n    - or other settings in [request opts][request-opts]\r\n\r\n- `taskCounter`: counts all finished tasks, whether they failed or not\r\n- `failCounter`: counts all failed tasks\r\n- `doneCounter`: counts tasks that are done\r\n- `started`: boolean\r\n- `finished`: boolean\r\n- `errLog`: records all error info during crawling\r\n- `downloadDir`: directory for downloaded files, default: `../__dirname`\r\n- `drainAwait`: the crawler finishes when the task-pool is drained. This prop makes the crawler wait for newly added tasks after the task-pool is drained. default: `0`(ms)\r\n- `tasksSize`: size of the task-pool; excess tasks are held in the buffer of the task-pool, default: `50`\r\n- `logger`: show the console log, default: `false`\r\n\r\n### Crawler API\r\n\r\n* `Crawler(opts: object)`\r\n\r\n constructor of `Crawler`\r\n \r\n```javascript\r\n// e.g.:\r\nlet c = new Crawler({\r\n\tinterval: 1000,\r\n\tretry: 5,\r\n\t// ... other props of `crawler.settings`\r\n\trequestOpts: {\r\n\t\ttimeout: 5000,\r\n\t\tproxy: 'http://xxx'\r\n\t\t// ... other props of `crawler.requestOpts`\r\n\t}\r\n});\r\n```\r\n* `tweak(opts: object)`\r\n\r\n tweak settings of crawler\r\n* `addTasks(urls: string or array[, props: object])`\r\n\r\n add tasks to the task-pool\r\n\r\n```javascript\r\n// e.g.\r\n\r\n// add single task\r\n\r\n// input: url\r\nc.addTask('http://www.google.com');\r\n\r\n// input: url, prop\r\n// set request options for the task (overrides the global options)\r\nc.addTask('http://www.google.com', {\r\n\tname: 'google',\r\n\trequestOpts: { timeout: 1 }\r\n});\r\n\r\n// input: url, next (processor of the task)\r\n// crawler rules will not process this task again\r\nc.addTask('http://www.google.com', function (result) {\r\n\tconsole.log('the task is done');\r\n});\r\n\r\n// input: url, prop, next\r\nc.addTask('http://www.google.com', { name: 'google' }, function (result) {\r\n\tconsole.log('the task is done');\r\n});\r\n\r\n// or input an object\r\nc.addTask({\r\n\turl: 'http://www.google.com',\r\n\ttype: 'SE',\r\n\tnext: function (result) {\r\n\t\tconsole.log('the task is done');\r\n\t}\r\n});\r\n\r\n// add multiple tasks\r\n\r\n// input: an array of strings\r\nc.addTasks(['http://www.google.com','http://www.yahoo.com']);\r\n\r\n// add props for tasks\r\nc.addTasks(['http://www.google.com','http://www.yahoo.com'], { type: 'SE' });\r\n// read these props in the processing function\r\nc.addRule(function (result) {\r\n\tif (result.task.type == 'SE') {\r\n\t\tconsole.log('Search Engine');\r\n\t}\r\n});\r\n\r\n// input: an array of objects\r\nc.addTasks([\r\n\t{\r\n\t\turl: 'http://www.google.com',\r\n\t\tname: 
'google'\r\n\t},\r\n\t{\r\n\t\turl: 'http://www.sohu.com',\r\n\t\tname: 'sohu'\r\n\t}\r\n]);\r\n\r\n```\r\n\r\n* `addRule(reg: string|object, func: function)`\r\n\r\n define a rule for scraping\r\n\r\n```javascript\r\n// e.g.:\r\nlet tasks = [\r\n\t'http://www.google.com/123', \r\n\t'http://www.google.com/2546', \r\n\t'http://www.google.com/info/foo',\r\n\t'http://www.google.com/info/123abc'\r\n];\r\nc.addTasks(tasks);\r\nc.addRule('http://www.google.com/[0-9]*', function (result) {\r\n\t// matches tasks[0] and tasks[1]\r\n});\r\nc.addRule('http://www.google.com/info/**', function (result) {\r\n\t// matches tasks[2] and tasks[3]\r\n});\r\n// or omit the rule string\r\nc.addRule(function (result) {\r\n\t// matches every url in tasks\r\n});\r\n\r\n// $ (i.e. cheerio.load(result.body)) is an optional arg\r\nc.addRule(function (result, $){\r\n    console.log($('title').text());\r\n});\r\n```\r\n\u003e Tip: light-crawler transforms every `.` in the rule string, so you can write `www.a.com` directly instead of `www\\\\.a\\\\.com`.\r\nIf you need `.*`, use `**`, as in the example above. If you have to use `.`, write `\u003c.\u003e`.\r\n\r\n* `start()`\r\n\r\n start the crawler\r\n```javascript\r\n// e.g.:\r\nc.start().then(function () {\r\n\t// on finished\r\n\tconsole.log('done!');\r\n});\r\n```\r\n\r\n* `pause()`\r\n\r\n pause the crawler\r\n\r\n* `resume()`\r\n\r\n resume the crawler\r\n\r\n* `isPaused()`\r\n\r\n whether the crawler is paused\r\n \r\n* `stop()`\r\n\r\n stop the crawler\r\n \r\n* `uniqTasks()`\r\n\r\n remove duplicate tasks (deep comparison)\r\n\r\n* `log(info: string, isErr: boolean, type: int)`\r\n\r\n crawler's logger\r\n\r\n```javascript\r\n// e.g.:\r\n// if it's an error, it is appended to c.errLog\r\nc.log('some problems', true);\r\n// console print: \r\n// [c.settings.id, if set]some problems\r\n\r\n// type is the color code of the first '[...]', e.g. '[Crawler is Finished]'\r\n// 1 red, 2 green, 3 yellow, 4 blue, 5 magenta, 6 cyan, and so on\r\nc.log('[Parsed]blahblah~', false, 4);\r\n// console print: \r\n// [c.settings.id, if set][Parsed]([Parsed] will be blue)blahblah~\r\n\r\n// you can run something after every log() call\r\nconst fs = require('fs');\r\nc.on('afterLog', function (info, isErr, type) {\r\n\tfs.appendFileSync('c.log', info); // append info to c.log\r\n\t// ...\r\n});\r\n\r\n// you can even replace log()\r\nc.log = function (info, isErr, type) {\r\n\t// log something....\r\n};\r\n```\r\n\r\n### Download Files\r\nJust add `downloadTask: true` to any task you want to download.\r\n```javascript\r\n// e.g.:\r\n// specify download directory\r\nc.tweak({ downloadDir: 'D:\\\\yyy' });\r\n\r\nlet file = 'http://xxx/abc.jpg';\r\n// 'abc.jpg' will be downloaded into 'D:\\\\yyy'\r\nc.addTask(file, {downloadTask: true});\r\n// or you can specify its name\r\nc.addTask(file, {downloadTask: true, downloadFile: 'mine.jpg'});\r\n// or specify a dir relative to 'D:\\\\yyy'\r\n// if this directory ('jpg') doesn't exist, the crawler will create it\r\nc.addTask(file, {downloadTask: true, downloadFile: 'jpg/mine.jpg'});\r\n// or specify an absolute path\r\nc.addTask(file, {downloadTask: true, downloadFile: 'C:\\\\pics\\\\mine.jpg'});\r\n```\r\n\r\n### Events\r\n\r\n* `start`\r\n\r\n after the crawler is started\r\n```js\r\n// e.g.\r\nc.on('start', function () {\r\n    console.log('started!');\r\n});\r\n```\r\n\r\n* `beforeCrawl`\r\n\r\n task's props: `id`, `url`, `retry`, `working`, `requestOpts`, `downloadTask`, `downloadFile`, and so on\r\n```js\r\n// e.g.\r\nc.on('beforeCrawl', function (task) {\r\n    console.log(task);\r\n});\r\n```\r\n\r\n* `drain`\r\n\r\n when the task-pool and its buffer are drained\r\n```js\r\n// e.g.\r\nc.on('drain', function () {\r\n    // perform something\r\n});\r\n```\r\n\r\n* `error`\r\n\r\n### Utils API\r\n\r\n* `getLinks(html: string, baseUrl: string)`\r\n\r\n get all links in the element\r\n\r\n```js\r\n// e.g.:\r\nlet html = `\r\n  \u003cdiv\u003e\r\n\t\u003cul\u003e\r\n\t\t\u003cli\u003e\r\n            \u003ca href=\"http://link.com/a/1\"\u003e1\u003c/a\u003e\r\n            \u003ca href=\"a/2\"\u003e2\u003c/a\u003e\r\n            \u003ca href=\"b/3\"\u003e3\u003c/a\u003e\r\n\t\t\u003c/li\u003e\r\n\t\t\u003cli\u003e\u003ca href=\"4\"\u003e4\u003c/a\u003e\u003c/li\u003e\r\n\t\t\u003cli\u003efoo\u003c/li\u003e\r\n\t\u003c/ul\u003e\r\n\u003c/div\u003e\r\n`;\r\nlet links = Crawler.getLinks(html, 'http://link.com/index.html');\r\nconsole.log(links);\r\n// ['http://link.com/a/1','http://link.com/a/2','http://link.com/b/3','http://link.com/4']\r\n\r\n// you can also use cheerio\r\nconst cheerio = require('cheerio');\r\nlet $ = cheerio.load(html);\r\nlinks = Crawler.getLinks($('ul'));\r\n```\r\n\r\n* `getImages(html: string, baseUrl: string)`\r\n\r\n like `getLinks`, but gets `src` from `\u003cimg\u003e`.\r\n\r\n* `loadHeaders(file: string)`\r\n\r\n load request headers from file\r\n`example.headers`\r\n```\r\nAccept:text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8\r\nAccept-Encoding:gzip, deflate, sdch\r\nAccept-Language:zh-CN,zh;q=0.8,en;q=0.6\r\nCache-Control:max-age=0\r\nConnection:keep-alive\r\nCookie:csrftoken=Wwb44iw\r\nHost:abc\r\nUpgrade-Insecure-Requests:1\r\nUser-Agent:Mozilla/5.0 (Windows NT 6.1; WOW64)\r\n...\r\n```\r\nload this file and set the request headers\r\n```js\r\nlet headers = Crawler.loadHeaders('example.headers');\r\nc.tweak({\r\n\trequestOpts: {\r\n\t\theaders: headers\r\n\t}\r\n});\r\n```\r\n\r\n* `getRegWithPath(fromUrl: string)`\r\n\r\n get the reg string for the path of fromUrl\r\n \r\n```js\r\nlet reg = Crawler.getRegWithPath('http://www.google.com/test/something.html');\r\n// reg: http://www.google.com/test/**\r\n```\r\n\r\n### Advanced Usage\r\n\r\n* `addRule`\r\n\r\n```js\r\n// since 1.5.10, a scraping rule can be an object\r\nc.addTask('http://www.baidu.com', { name: 'baidu', type: 'S.E.' });\r\nc.addTask('http://www.google.com', { name: 'google', type: 'S.E.' 
});\r\n// the following rules have the same reg string but different names\r\nc.addRule({ reg: 'www.**.com', name: 'baidu' }, function (r) {\r\n    // scraping r.body\r\n});\r\nc.addRule({ reg: 'www.**.com', name: 'google' }, function (r) {\r\n    // scraping r.body\r\n});\r\n\r\n// a match function allows more complex rules\r\n// boolean match(task)\r\nc.addTask('http://www.baidu.com', { tag: 3 });\r\nc.addTask('http://www.google.com', { tag: 50 });\r\nc.addRule({ reg: 'www.**.com', match: function (task) {\r\n\t\treturn task.tag \u003e 10;\r\n}}, function (r) {\r\n    // scrape google\r\n});\r\n```\r\n\r\n* `loadRule`\r\n\r\n reuse rules\r\n\r\n```js\r\n// lc-rules.js\r\nexports.crawlingGoogle = {\r\n    reg: 'www.**.com',\r\n    name: 'google',\r\n    scrape: function (r, $) {\r\n        // ...\r\n    }\r\n};\r\n\r\n// crawler.js\r\nlet { crawlingGoogle } = require('./lc-rules');\r\nlet c = new Crawler();\r\nc.addTask('http://www.google.com', { name: 'google' });\r\nc.loadRule(crawlingGoogle);\r\n\r\n// or expand the function named 'scrape'\r\n// implement the 'expand' in 'loadRule'\r\n// you can also use 'this' (the Crawler) in 'addRule' or 'loadRule'\r\ncrawlingGoogle = {\r\n    // ...\r\n    scrape: function (r, $, expand) {\r\n        expand($('title').text());\r\n    }\r\n};\r\n\r\ncrawlerAAA.loadRule(crawlingGoogle, function (text) {\r\n    console.log(text);\r\n    this.addTask('www.abc.com');\r\n});\r\n\r\ncrawlerBBB.loadRule(crawlingGoogle, function (text) {\r\n    console.log(text.toLowerCase());\r\n});\r\n```\r\n\r\n* `removeRule`\r\n\r\n remove some rules\r\n\r\n```js\r\n// by its 'ruleName'\r\nlet rule = {\r\n    // ...\r\n    ruleName: 'someone'\r\n    // ...\r\n};\r\nc.loadRule(rule);\r\nc.removeRule('someone');\r\n```\r\n\r\n[request-opts]: 
https://github.com/request/request#requestoptions-callback\r\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fzhang2333%2Flight-crawler","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fzhang2333%2Flight-crawler","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fzhang2333%2Flight-crawler/lists"}