{"id":17531176,"url":"https://github.com/bartozzz/crawlerr","last_synced_at":"2025-04-23T20:22:56.458Z","repository":{"id":48423836,"uuid":"56944425","full_name":"Bartozzz/crawlerr","owner":"Bartozzz","description":"A simple and fully customizable web crawler/spider for Node.js with server-side DOM. Comes with elegant and hell-simple APIs.","archived":false,"fork":false,"pushed_at":"2021-07-27T04:58:56.000Z","size":1141,"stargazers_count":25,"open_issues_count":21,"forks_count":7,"subscribers_count":4,"default_branch":"development","last_synced_at":"2025-04-12T13:46:15.624Z","etag":null,"topics":["crawler","jsdom","nodejs","scraper","spider","web-crawler"],"latest_commit_sha":null,"homepage":"https://npmjs.com/package/crawlerr","language":"JavaScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Bartozzz.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2016-04-23T22:55:57.000Z","updated_at":"2024-02-05T07:45:44.000Z","dependencies_parsed_at":"2022-08-24T09:43:12.904Z","dependency_job_id":null,"html_url":"https://github.com/Bartozzz/crawlerr","commit_stats":null,"previous_names":[],"tags_count":2,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Bartozzz%2Fcrawlerr","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Bartozzz%2Fcrawlerr/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Bartozzz%2Fcrawlerr/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Bartozzz%2Fcrawlerr/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Bartozzz","download
_url":"https://codeload.github.com/Bartozzz/crawlerr/tar.gz/refs/heads/development","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":250506754,"owners_count":21441838,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["crawler","jsdom","nodejs","scraper","spider","web-crawler"],"created_at":"2024-10-20T17:23:07.903Z","updated_at":"2025-04-23T20:22:56.426Z","avatar_url":"https://github.com/Bartozzz.png","language":"JavaScript","readme":"\u003cdiv align=\"center\"\u003e\n  \u003ch1\u003ecrawlerr\u003c/h1\u003e\n\n[![Greenkeeper badge](https://badges.greenkeeper.io/Bartozzz/crawlerr.svg)](https://greenkeeper.io/)\n[![Build Status](https://img.shields.io/travis/Bartozzz/crawlerr.svg)](https://travis-ci.org/Bartozzz/crawlerr/)\n[![License](https://img.shields.io/github/license/Bartozzz/crawlerr.svg)](LICENSE)\n[![npm version](https://img.shields.io/npm/v/crawlerr.svg)](https://www.npmjs.com/package/crawlerr)\n[![npm downloads](https://img.shields.io/npm/dt/crawlerr.svg)](https://www.npmjs.com/package/crawlerr)\n  \u003cbr\u003e\n\n**crawlerr** is a simple yet powerful web crawler for Node.js, based on Promises. It allows you to crawl only specific URLs, matched with [_wildcards_](https://github.com/Bartozzz/wildcard-named#wildcard-named). It uses a [_Bloom filter_](https://en.wikipedia.org/wiki/Bloom_filter) for caching. 
A browser-like feeling.\n\u003c/div\u003e\n\n\u003cbr /\u003e\n\n- **Simple:** our crawler is simple to use;\n- **Elegant:** provides a verbose, Express-like API;\n- **MIT Licensed**: free for personal and commercial use;\n- **Server-side DOM**: we use [JSDOM](https://github.com/jsdom/jsdom) to make you feel like you're in a browser;\n- **Configurable pool size**, **retries**, **rate limit** and more;\n\n## Installation\n\n```bash\n$ npm install crawlerr\n```\n\n## Usage\n\n`crawlerr(base [, options])`\n\nYou can find several examples in the [`examples/`](https://github.com/Bartozzz/crawlerr/tree/development/examples) directory. Here are some of the most important ones:\n\n### Example 1: _Requesting title from a page_\n\n```javascript\nconst spider = crawlerr(\"http://google.com/\");\n\nspider.get(\"/\")\n  .then(({ req, res, uri }) =\u003e console.log(res.document.title))\n  .catch(error =\u003e console.log(error));\n```\n\n### Example 2: _Scanning a website for specific links_\n\n```javascript\nconst spider = crawlerr(\"http://blog.npmjs.org/\");\n\nspider.when(\"/post/[digit:id]/[all:slug]\", ({ req, res, uri }) =\u003e {\n  const post = req.param(\"id\");\n  const slug = req.param(\"slug\").split(\"?\")[0];\n\n  console.log(`Found post with id: ${post} (${slug})`);\n});\n```\n\n### Example 3: _Server side DOM_\n\n```javascript\nconst spider = crawlerr(\"http://example.com/\");\n\nspider.get(\"/\").then(({ req, res, uri }) =\u003e {\n  const document = res.document;\n  const elementA = document.getElementById(\"someElement\");\n  const elementB = document.querySelector(\".anotherForm\");\n\n  console.log(elementA.innerHTML);\n  console.log(elementB.innerHTML);\n});\n```\n\n### Example 4: _Setting cookies_\n\n```javascript\nconst url = \"http://example.com/\";\nconst spider = crawlerr(url);\n\nspider.request.setCookie(spider.request.cookie(\"foobar=…\"), url);\nspider.request.setCookie(spider.request.cookie(\"session=…\"), url);\n\nspider.get(\"/profile\").then(({ req, res, uri }) =\u003e {\n  //… 
spider.request.getCookieString(url);\n  //… spider.request.getCookies(url);\n});\n```\n\n## API\n\n### `crawlerr(base [, options])`\n\nCreates a new `Crawlerr` instance for a specific website with custom `options`. All routes will be resolved to `base`.\n\n| Option       | Default | Description                                    |\n|:-------------|:--------|:-----------------------------------------------|\n| `concurrent` | `10`    | How many requests can run simultaneously       |\n| `interval`   | `250`   | How often new requests are sent (in ms)        |\n| …            | `null`  | See [`request` defaults](https://github.com/request/request#requestdefaultsoptions) for more information   |\n\n\u003cbr /\u003e\n\n#### **public** `.get(url)`\n\nRequests `url`. Returns a `Promise` which resolves with `{ req, res, uri }`, where:\n- `req` is the [Request object](#request);\n- `res` is the [Response object](#response);\n- `uri` is the absolute `url` (resolved from `base`).\n\n**Example:**\n\n```javascript\nspider\n  .get(\"/\")\n  .then(({ res, req, uri }) =\u003e …);\n```\n\n\u003cbr /\u003e\n\n#### **public** `.when(pattern)`\n\nSearches the entire website for URLs which match the specified `pattern`. `pattern` can include named [wildcards](https://github.com/Bartozzz/wildcard-named), which can then be retrieved via `req.param`.\n\n**Example:**\n\n```javascript\nspider\n  .when(\"/users/[digit:userId]/repos/[digit:repoId]\", ({ res, req, uri }) =\u003e …);\n```\n\n\u003cbr /\u003e\n\n#### **public** `.on(event, callback)`\n\nExecutes a `callback` for a given `event`. 
For more information about which events are emitted, refer to [queue-promise](https://github.com/Bartozzz/queue-promise).\n\n**Example:**\n\n```javascript\nspider.on(\"error\", …);\nspider.on(\"resolve\", …);\n```\n\n\u003cbr /\u003e\n\n#### **public** `.start()`/`.stop()`\n\nStarts/stops the crawler.\n\n**Example:**\n\n```javascript\nspider.start();\nspider.stop();\n```\n\n\u003cbr /\u003e\n\n#### **public** `.request`\n\nA configured [`request`](https://github.com/request/request) object which is used by [`retry-request`](https://github.com/stephenplusplus/retry-request) when crawling webpages. Extends `request.jar()`. Can be configured when initializing a new crawler instance through `options`. See [crawler options](https://github.com/Bartozzz/crawlerr#crawlerrbase--options) and the [`request` documentation](https://github.com/request/request) for more information.\n\n**Example:**\n\n```javascript\nconst url = \"https://example.com\";\nconst spider = crawlerr(url);\nconst request = spider.request;\n\nrequest.post(`${url}/login`, (err, res, body) =\u003e {\n  request.setCookie(request.cookie(\"session=…\"), url);\n  // Subsequent requests will include this cookie\n\n  spider.get(\"/profile\").then(…);\n  spider.get(\"/settings\").then(…);\n});\n```\n\n---\n\n### Request\n\n\u003csup\u003eExtends the default `Node.js` [incoming message](https://nodejs.org/api/http.html#http_class_http_incomingmessage).\u003c/sup\u003e\n\n#### **public** `get(header)`\n\nReturns the value of an HTTP `header`. The `Referrer` header field is special-cased: `Referrer` and `Referer` are interchangeable.\n\n**Example:**\n\n```javascript\nreq.get(\"Content-Type\"); // =\u003e \"text/plain\"\nreq.get(\"content-type\"); // =\u003e \"text/plain\"\n```\n\n\u003cbr /\u003e\n\n#### **public** `is(...types)`\n\nChecks whether the incoming request has the \"Content-Type\" header field and whether it matches the given MIME `type`. 
Based on [type-is](https://www.npmjs.com/package/type-is).\n\n**Example:**\n\n```javascript\n// Returns true with \"Content-Type: text/html; charset=utf-8\"\nreq.is(\"html\");\nreq.is(\"text/html\");\nreq.is(\"text/*\");\n```\n\n\u003cbr /\u003e\n\n#### **public** `param(name [, default])`\n\nReturns the value of param `name` when present, or `default` otherwise:\n- checks route placeholders, ex: `user/[all:username]`;\n- checks body params, ex: `id=12, {\"id\":12}`;\n- checks query string params, ex: `?id=12`.\n\n**Example:**\n\n```javascript\n// .when(\"/users/[all:username]/[digit:someID]\")\nreq.param(\"username\");  // /users/foobar/123456 =\u003e foobar\nreq.param(\"someID\");    // /users/foobar/123456 =\u003e 123456\n```\n\n---\n\n### Response\n\n#### **public** `jsdom`\n\nReturns the [JSDOM](https://www.npmjs.com/package/jsdom) object.\n\n\u003cbr /\u003e\n\n#### **public** `window`\n\nReturns the DOM window for the response content. Based on [JSDOM](https://www.npmjs.com/package/jsdom).\n\n\u003cbr /\u003e\n\n#### **public** `document`\n\nReturns the DOM document for the response content. Based on [JSDOM](https://www.npmjs.com/package/jsdom).\n\n**Example:**\n\n```javascript\nres.document.getElementById(…);\nres.document.getElementsByTagName(…);\n// …\n```\n\n## Tests\n\n```bash\nnpm test\n```\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbartozzz%2Fcrawlerr","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbartozzz%2Fcrawlerr","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbartozzz%2Fcrawlerr/lists"}