{"id":18428098,"url":"https://github.com/airtoxin/stackable-crawler","last_synced_at":"2025-04-13T20:54:15.193Z","repository":{"id":57369179,"uuid":"65473948","full_name":"airtoxin/stackable-crawler","owner":"airtoxin","description":"middleware based lightweight crawler framework","archived":false,"fork":false,"pushed_at":"2016-08-23T23:56:09.000Z","size":16,"stargazers_count":2,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-04-13T20:54:12.781Z","etag":null,"topics":["crawler","javascript","lightweight"],"latest_commit_sha":null,"homepage":null,"language":"JavaScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/airtoxin.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2016-08-11T14:00:23.000Z","updated_at":"2017-10-10T23:32:07.000Z","dependencies_parsed_at":"2022-09-26T16:41:15.415Z","dependency_job_id":null,"html_url":"https://github.com/airtoxin/stackable-crawler","commit_stats":null,"previous_names":[],"tags_count":2,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/airtoxin%2Fstackable-crawler","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/airtoxin%2Fstackable-crawler/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/airtoxin%2Fstackable-crawler/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/airtoxin%2Fstackable-crawler/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/airtoxin","download_url":"https://codeload.github.com/airtoxin/stackable-crawler/tar.gz/refs/heads/
master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248782281,"owners_count":21160716,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["crawler","javascript","lightweight"],"created_at":"2024-11-06T05:12:47.983Z","updated_at":"2025-04-13T20:54:15.176Z","avatar_url":"https://github.com/airtoxin.png","language":"JavaScript","funding_links":[],"categories":[],"sub_categories":[],"readme":"# stackable-crawler [![Build Status](https://travis-ci.org/airtoxin/stackable-crawler.svg?branch=master)](https://travis-ci.org/airtoxin/stackable-crawler)\n\nA middleware-based, lightweight crawler framework.\n\n## Features\n\n+ Customizable pre-request and pre-process middleware stacks (for logging, caching, normalizing, etc.)\n+ Cancelable crawler\n+ Customizable caching strategy\n+ Parallel crawling\n+ Pause/Resume crawling (lets the crawler sleep)\n+ Error handling (enables retries)\n\n## Install\n\n`$ npm i -S stackable-crawler`\n\n## QuickStart\n\n```js\nimport StackableCrawler, {\n  CancelRequest\n} from 'stackable-crawler';\n\nconst crawler = new StackableCrawler({\n  prerequest: [\n    options =\u003e {\n      console.log('options:', options);\n      return options;\n    }\n  ],\n  processor([response, body]) {\n    return new Promise((resolve, reject) =\u003e {\n      saveFileFunction(body, error =\u003e {\n        if (error) return reject(error);\n        resolve();\n      });\n    });\n  }\n});\n\ncrawler.on('error', (error, url) =\u003e {\n  console.error(error, url);\n});\n\ncrawler.add('https://github.com/');\n```\n\n### What 
can I do in `prerequest`?\n\nThe `prerequest` middleware stack can apply side effects to the request options. `options` is the request options object of the [request](https://github.com/request/request) module. If a prerequest middleware throws a `CancelRequest` error, the request to that URL is canceled.\n\n### What can I do in `preprocess`?\n\nThe `preprocess` middleware stack can apply side effects to `response`, `body`, and `requestOptions`. These also come from the [request](https://github.com/request/request) module.\n\n### Friendly crawler?\n\nUse a sleep function:\n\n```js\nconst sleepCrawler = (crawler, sleepTime, interval) =\u003e {\n  setTimeout(() =\u003e {\n    // Register the listener before stopping so the 'stopped' event cannot be missed.\n    crawler.once('stopped', () =\u003e {\n      setTimeout(() =\u003e crawler.start(), sleepTime);\n    });\n    crawler.stop();\n  }, interval);\n};\n```\n\n## Documents\n\n### StackableCrawler class (Default export)\n\nThe crawler class. Extends EventEmitter3.\n\n#### constructor\n\nTakes one argument, a configuration object.\n\n```js\n{\n  concurrency: 1, // maximum number of parallel crawls\n  prerequest: [], // prerequest middleware functions\n  requestCache() {}, // cache strategy\n  preprocess: [], // preprocess middleware functions\n  processor() {}, // main callback that handles the crawled document\n}\n```\n\n##### prerequest middleware\n\nFunction that processes `requestOptions`. The default argument is `{ uri: url }`. The function must return the new (or mutated) requestOptions, or a Promise[requestOptions].\n\n##### requestCache function\n\nFunction that returns a cached value or `undefined`. The type of the cached value is T or Promise[T]. If a cached value is returned, the crawler passes that value through to the processor function. If `undefined` is returned, the crawler fetches the document as usual.\n\n##### preprocess middleware\n\nFunction that processes `[response, body, requestOptions]`. `response` and `body` are the fetch results from the [request](https://github.com/request/request) module. It also returns the new argument or a Promise.\n\n##### processor function\n\nThe main function. You can do anything here. 
It can return a promise.\n\n#### crawler#add(url)\n\nAdds a URL to the crawl task queue.\n\n#### crawler#stop()\n\nPauses the crawler. If the crawler has one or more running tasks, they keep running until they finish, but no new tasks are started.\n\n#### crawler#start()\n\nResumes the crawler if it is paused.\n\n#### crawler#on(), crawler#once(), ...\n\nThese methods are inherited from EventEmitter3.\n\nAvailable events and arguments:\n\n+ event: `error`, args: `[error, url]`\n+ event: `stopped`\n\n### CancelRequest class (Named export)\n\nError class. If a CancelRequest error is thrown (`throw new CancelRequest()`), the crawler cancels the request to that URL without raising an error.\n\n### middlewares object (Named export)\n\nSimple bundled middlewares.\n\n#### middlewares.filterUrl\n\nPasses only valid URLs. If the URL is invalid, this middleware throws a CancelRequest error.\n\n#### middlewares.urlCache\n\nFunction that returns a simple in-memory cache middleware. Call `urlCache()` to use it. If the requested URL is already cached, this middleware throws a CancelRequest error.\n\n#### middlewares.logger\n\nA very simple console.log middleware.\n\n#### middlewares.body2cheerio\n\nReplaces body with a cheerio object (`$`). Subsequent middlewares and the processor are called with `[response, $, requestOptions]`.\n\n## License\n\nMIT\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fairtoxin%2Fstackable-crawler","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fairtoxin%2Fstackable-crawler","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fairtoxin%2Fstackable-crawler/lists"}