{"id":13397859,"url":"https://github.com/brendonboshell/supercrawler","last_synced_at":"2026-01-12T02:27:57.862Z","repository":{"id":41203337,"uuid":"63552531","full_name":"brendonboshell/supercrawler","owner":"brendonboshell","description":"A web crawler. Supercrawler automatically crawls websites. Define custom handlers to parse content. Obeys robots.txt, rate limits and concurrency limits.","archived":false,"fork":false,"pushed_at":"2022-12-30T18:25:30.000Z","size":680,"stargazers_count":374,"open_issues_count":23,"forks_count":61,"subscribers_count":11,"default_branch":"master","last_synced_at":"2024-09-30T23:18:37.146Z","etag":null,"topics":["crawler","distributed-crawler","robot","sitemap","web-crawler"],"latest_commit_sha":null,"homepage":"","language":"JavaScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/brendonboshell.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2016-07-17T21:08:09.000Z","updated_at":"2024-09-21T12:50:59.000Z","dependencies_parsed_at":"2023-01-31T13:00:25.704Z","dependency_job_id":null,"html_url":"https://github.com/brendonboshell/supercrawler","commit_stats":null,"previous_names":[],"tags_count":11,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/brendonboshell%2Fsupercrawler","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/brendonboshell%2Fsupercrawler/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/brendonboshell%2Fsupercrawler/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/brendonboshell%2Fsupercrawler/manifests",
"owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/brendonboshell","download_url":"https://codeload.github.com/brendonboshell/supercrawler/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":243505050,"owners_count":20301546,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["crawler","distributed-crawler","robot","sitemap","web-crawler"],"created_at":"2024-07-30T18:01:49.331Z","updated_at":"2026-01-12T02:27:57.823Z","avatar_url":"https://github.com/brendonboshell.png","language":"JavaScript","readme":"# Node.js Web Crawler\n\n[![npm](https://img.shields.io/npm/v/supercrawler.svg?maxAge=2592000)]()\n[![npm](https://img.shields.io/npm/l/supercrawler.svg?maxAge=2592000)]()\n[![GitHub issues](https://img.shields.io/github/issues/brendonboshell/supercrawler.svg?maxAge=2592000)]()\n[![David](https://img.shields.io/david/brendonboshell/supercrawler.svg?maxAge=2592000)]()\n[![David](https://img.shields.io/david/dev/brendonboshell/supercrawler.svg?maxAge=2592000)]()\n[![Travis](https://img.shields.io/travis/brendonboshell/supercrawler.svg?maxAge=2592000)]()\n\nSupercrawler is a Node.js web crawler. It is designed to be highly configurable and easy to use.\n\nWhen Supercrawler successfully crawls a page (which could be an image, a text document or any other file), it will fire your custom content-type handlers. Define your own custom handlers to parse pages, save data and do anything else you need.\n\n## Features\n\n* **Link Detection**. 
Supercrawler will parse crawled HTML documents, identify\n  links and add them to the queue.\n* **Robots Parsing**. Supercrawler will request robots.txt and check the rules\n  before crawling. It will also identify any sitemaps.\n* **Sitemaps Parsing**. Supercrawler will read links from XML sitemap files,\n  and add links to the queue.\n* **Concurrency Limiting**. Supercrawler limits the number of requests sent out\n  at any one time.\n* **Rate Limiting**. Supercrawler will add a delay between requests to avoid\n  bombarding servers.\n* **Exponential Backoff Retry**. Supercrawler will retry failed requests after 1 hour, then 2 hours, then 4 hours, etc. To use this feature, you must use the database-backed or Redis-backed crawl queue.\n* **Hostname Balancing**. Supercrawler will fairly split requests between\ndifferent hostnames. To use this feature, you must use the Redis-backed crawl queue.\n\n## How It Works\n\n**Crawling** is controlled by an instance of the `Crawler` object, which acts like a web client. It is responsible for coordinating with the *priority queue*, sending requests according to the concurrency and rate limits, checking the robots.txt rules and dispatching content to the custom *content handlers* to be processed. Once started, it will automatically crawl pages until you ask it to stop.\n\nThe **Priority Queue** or **UrlList** keeps track of which URLs need to be crawled, and the order in which they are to be crawled. The Crawler will pass new URLs discovered by the content handlers to the priority queue. When the crawler is ready to crawl the next page, it will call the `getNextUrl` method. This method will work out which URL should be crawled next, based on implementation-specific rules. Any retry logic is handled by the queue.\n\nThe **Content Handlers** are functions which take content buffers and do some further processing with them. 
You will almost certainly want to create your own content handlers to analyze pages or store data, for example. The content handlers tell the Crawler about new URLs that should be crawled in the future. Supercrawler provides content handlers to parse links from HTML pages, analyze robots.txt files for `Sitemap:` directives and parse sitemap files for URLs.\n\n## Get Started\n\nFirst, install Supercrawler.\n\n```\nnpm install supercrawler --save\n```\n\nSecond, create an instance of `Crawler`.\n\n```js\nvar supercrawler = require(\"supercrawler\");\n\n// 1. Create a new instance of the Crawler object, providing configuration\n// details. Note that configuration cannot be changed after the object is\n// created.\nvar crawler = new supercrawler.Crawler({\n  // By default, Supercrawler uses a simple FIFO queue, which doesn't support\n  // retries or memory of crawl state. For any non-trivial crawl, you should\n  // create a database. Provide your database config to the constructor of\n  // DbUrlList.\n  urlList: new supercrawler.DbUrlList({\n    db: {\n      database: \"crawler\",\n      username: \"root\",\n      password: secrets.db.password,\n      sequelizeOpts: {\n        dialect: \"mysql\",\n        host: \"localhost\"\n      }\n    }\n  }),\n  // Time (ms) between requests\n  interval: 1000,\n  // Maximum number of requests at any one time.\n  concurrentRequestsLimit: 5,\n  // Time (ms) to cache the results of robots.txt queries.\n  robotsCacheTime: 3600000,\n  // User agent to use during the crawl.\n  userAgent: \"Mozilla/5.0 (compatible; supercrawler/1.0; +https://github.com/brendonboshell/supercrawler)\",\n  // Custom options to be passed to request.\n  request: {\n    headers: {\n      'x-custom-header': 'example'\n    }\n  }\n});\n```\n\nThird, add some content handlers.\n\n```js\n// Get \"Sitemap:\" directives from robots.txt\ncrawler.addHandler(supercrawler.handlers.robotsParser());\n\n// Crawl sitemap files and extract their 
URLs.\ncrawler.addHandler(supercrawler.handlers.sitemapsParser());\n\n// Pick up \u003ca href\u003e links from HTML documents\ncrawler.addHandler(\"text/html\", supercrawler.handlers.htmlLinkParser({\n  // Restrict discovered links to the following hostnames.\n  hostnames: [\"example.com\"]\n}));\n\n// Match an array of content-types\ncrawler.addHandler([\"text/plain\", \"text/html\"], myCustomHandler);\n\n// Custom content handler for HTML pages.\ncrawler.addHandler(\"text/html\", function (context) {\n  var sizeKb = Buffer.byteLength(context.body) / 1024;\n  logger.info(\"Processed\", context.url, \"Size=\", sizeKb, \"KB\");\n});\n```\n\nFourth, add a URL to the queue and start the crawl.\n\n```js\ncrawler.getUrlList()\n  .insertIfNotExists(new supercrawler.Url(\"http://example.com/\"))\n  .then(function () {\n    return crawler.start();\n  });\n```\n\nThat's it! Supercrawler will handle the crawling for you. You only have to define your custom behaviour in the content handlers.\n\n## Crawler\n\nEach `Crawler` instance represents a web crawler. You can configure your\ncrawler with the following options:\n\n| Option | Description |\n| --- | --- |\n| urlList | Custom instance of `UrlList` type queue. Defaults to `FifoUrlList`, which processes URLs in the order that they were added to the queue; once they are removed from the queue, they cannot be recrawled. |\n| interval | Number of milliseconds between requests. Defaults to 1000. |\n| concurrentRequestsLimit | Maximum number of concurrent requests. Defaults to 5. |\n| robotsEnabled | Indicates if the robots.txt is downloaded and checked. Defaults to `true`. |\n| robotsCacheTime | Number of milliseconds that robots.txt should be cached for. Defaults to 3600000 (1 hour). |\n| robotsIgnoreServerError | Indicates if a `500` status code response for robots.txt should be ignored. Defaults to `false`. |\n| userAgent | User agent to use for requests. This can be either a string or a function that takes the URL being crawled. 
Defaults to `Mozilla/5.0 (compatible; supercrawler/1.0; +https://github.com/brendonboshell/supercrawler)`. |\n| request | Object of options to be passed to [request](https://github.com/request/request). Note that request does not support an asynchronous (and distributed) cookie jar. |\n\nExample usage:\n\n```js\nvar crawler = new supercrawler.Crawler({\n  interval: 1000,\n  concurrentRequestsLimit: 1\n});\n```\n\nThe following methods are available:\n\n| Method | Description |\n| --- | --- |\n| getUrlList | Get the `UrlList` type instance. |\n| getInterval | Get the interval setting. |\n| getConcurrentRequestsLimit | Get the maximum number of concurrent requests. |\n| getUserAgent | Get the user agent. |\n| start | Start crawling. |\n| stop | Stop crawling. |\n| addHandler(handler) | Add a handler for all content types. |\n| addHandler(contentType, handler) | Add a handler for a specific content type. If `contentType` is a string, then (for example) 'text' will match 'text/html', 'text/plain', etc. If `contentType` is an array of strings, the page content type must match exactly. |\n\nThe `Crawler` object fires the following events:\n\n| Event | Description |\n| --- | --- |\n| crawlurl(url) | Fires when crawling starts with a new URL. |\n| crawledurl(url, errorCode, statusCode, errorMessage) | Fires when crawling of a URL is complete. `errorCode` is `null` if no error occurred. `statusCode` is set if and only if the request was successful. `errorMessage` is `null` if no error occurred. |\n| urllistempty | Fires when the URL list is (intermittently) empty. |\n| urllistcomplete | Fires when the URL list is permanently empty, barring URLs added by external sources. This only makes sense when running Supercrawler in non-distributed fashion. |\n\n## DbUrlList\n\n`DbUrlList` is a queue backed with a database, such as MySQL, Postgres or SQLite. 
You can use any database engine supported by Sequelize.\n\nIf a request fails, this queue will ensure the request gets retried at some point in the future. The next request is scheduled 1 hour into the future. After that, the period of delay doubles for each failure.\n\nOptions:\n\n| Option | Description |\n| --- | --- |\n| opts.db.database | Database name. |\n| opts.db.username | Database username. |\n| opts.db.password | Database password. |\n| opts.db.sequelizeOpts | Options to pass to sequelize. |\n| opts.db.table | Table name to store URL queue. Default = 'url' |\n| opts.recrawlInMs | Number of milliseconds before a URL is recrawled. Default = 31536000000 (1 year) |\n\nExample usage:\n\n```js\nnew supercrawler.DbUrlList({\n  db: {\n    database: \"crawler\",\n    username: \"root\",\n    password: \"password\",\n    sequelizeOpts: {\n      dialect: \"mysql\",\n      host: \"localhost\"\n    }\n  }\n})\n```\n\nThe following methods are available:\n\n| Method | Description |\n| --- | --- |\n| insertIfNotExists(url) | Insert a `Url` object. |\n| upsert(url) | Upsert `Url` object. |\n| getNextUrl() | Get the next `Url` to be crawled. |\n\n## RedisUrlList\n\n`RedisUrlList` is a queue backed with Redis.\n\nIf a request fails, this queue will ensure the request gets retried at some point in the future. The next request is scheduled 1 hour into the future. After that, the period of delay doubles for each failure.\n\nIt also balances requests between different hostnames. So, for example, if you\ncrawl a sitemap file with 10,000 URLs, the next 10,000 requests will not be stuck on\nthe same host.\n\nOptions:\n\n| Option | Description |\n| --- | --- |\n| opts.redis | Options passed to [ioredis](https://github.com/luin/ioredis). |\n| opts.delayHalfLifeMs | Hostname delay factor half-life. Requests are delayed by an amount of time proportional to the number of pages crawled for a hostname, but this factor exponentially decays over time. Default = 3600000 (1 hour). 
|\n| opts.expiryTimeMs | Amount of time before recrawling a successful URL. Default = 2592000000 (30 days). |\n| opts.initialRetryTimeMs | Amount of time to wait before first retry after a failed URL. Default = 3600000 (1 hour) |\n\nExample usage:\n\n```js\nnew supercrawler.RedisUrlList({\n  redis: {\n    host: \"127.0.0.1\"\n  }\n})\n```\n\nThe following methods are available:\n\n| Method | Description |\n| --- | --- |\n| insertIfNotExists(url) | Insert a `Url` object. |\n| upsert(url) | Upsert `Url` object. |\n| getNextUrl() | Get the next `Url` to be crawled. |\n\n## FifoUrlList\n\nThe `FifoUrlList` is the default URL queue powering the crawler. You can add\nURLs to the queue, and they will be crawled in the same order (FIFO).\n\nNote that, with this queue, URLs are only crawled once, even if the request\nfails. If you need retry functionality, you must use `DbUrlList`.\n\nThe following methods are available:\n\n| Method | Description |\n| --- | --- |\n| insertIfNotExists(url) | Insert a `Url` object. |\n| upsert(url) | Upsert `Url` object. |\n| getNextUrl() | Get the next `Url` to be crawled. |\n\n## Url\n\nA `Url` represents a URL to be crawled, or a URL that has already been\ncrawled. It is uniquely identified by an absolute-path URL, but also contains\ninformation about errors and status codes.\n\n| Option | Description |\n| --- | --- |\n| url | Absolute-path string URL. |\n| statusCode | HTTP status code or `null`. |\n| errorCode | String error code or `null`. |\n\nExample usage:\n\n```js\nvar url = new supercrawler.Url({\n  url: \"https://example.com\"\n});\n```\n\nYou can also call it with just a string URL:\n\n```js\nvar url = new supercrawler.Url(\"https://example.com\");\n```\n\nThe following methods are available:\n\n| Method | Description |\n| --- | --- |\n| getUniqueId | Get the unique identifier for this object. |\n| getUrl | Get the absolute-path string URL. |\n| getErrorCode | Get the error code, or `null` if it is empty. 
|\n| getStatusCode | Get the status code, or `null` if it is empty. |\n\n## handlers.htmlLinkParser\n\nA function that returns a handler which parses an HTML page and identifies any\nlinks.\n\n| Option | Description |\n| --- | --- |\n| hostnames | Array of hostnames that are allowed to be crawled. |\n| urlFilter(url, pageUrl) | Function that takes a URL and returns `true` if it should be included. |\n\nExample usage:\n\n```js\nvar hlp = supercrawler.handlers.htmlLinkParser({\n  hostnames: [\"example.com\"]\n});\n```\n\n```js\nvar hlp = supercrawler.handlers.htmlLinkParser({\n  urlFilter: function (url) {\n    return url.indexOf(\"page1\") === -1;\n  }\n});\n```\n\n## handlers.robotsParser\n\nA function that returns a handler which parses a robots.txt file. Robots.txt\nfiles are automatically crawled, and sent through the same content handler\nroutines as any other file. This handler will look for any `Sitemap: ` directives,\nand add those XML sitemaps to the crawl.\n\nIt will ignore any files that are not `/robots.txt`.\n\nIf you want to extract the URLs from those XML sitemaps, you will also need\nto add a sitemap parser.\n\n| Option | Description |\n| --- | --- |\n| urlFilter(sitemapUrl, robotsTxtUrl) | Function that takes a URL and returns `true` if it should be included. |\n\nExample usage:\n\n```js\nvar rp = supercrawler.handlers.robotsParser();\ncrawler.addHandler(\"text/plain\", supercrawler.handlers.robotsParser());\n```\n\n## handlers.sitemapsParser\n\nA function that returns a handler which parses an XML sitemap file. It will\npick up any URLs matching `sitemapindex \u003e sitemap \u003e loc, urlset \u003e url \u003e loc`.\n\nIt will also handle a gzipped file, since that is part of the sitemaps\nspecification.\n\n| Option | Description |\n| --- | --- |\n| urlFilter | Function that takes a URL (including sitemap entries) and returns `true` if it should be included. 
|\n\nExample usage:\n\n```js\nvar sp = supercrawler.handlers.sitemapsParser();\ncrawler.addHandler(supercrawler.handlers.sitemapsParser());\n```\n\n## Changelog\n\n### 2.0.0\n\n* [Added] `crawledurl` event to contain the error message, thanks [hjr3](https://github.com/hjr3).\n* [Changed] `sitemapsParser` to apply `urlFilter` on the sitemaps entries, thanks [hjr3](https://github.com/hjr3).\n* [Added] `Crawler` to take `userAgent` option as a function, thanks [hjr3](https://github.com/hjr3).\n\n### 1.7.2\n\n* [Fixed] Update DbUrlList to use symbol operators, thanks [hjr3](https://github.com/hjr3).\n\n### 1.7.1\n\n* [Changed] Updated dependencies, thanks [MrRefactoring](https://github.com/MrRefactoring/supercrawler).\n\n### 1.7.0\n\n* [Changed] `Crawler#addHandler` can now take an array of content-types to match, thanks [taina0407](https://github.com/taina0407).\n\n### 1.6.0\n\n* [Added] Added `opts.db.table` option to `DbUrlList` ([adversinc](https://github.com/adversinc)).\n* [Added] Added `recrawlInMs` option to `DbUrlList` ([adversinc](https://github.com/adversinc)).\n* [Added] Added the `urlFilter` option to `htmlLinkParser` ([adversinc](https://github.com/adversinc)).\n\n### 1.5.0\n\n* [Added] Added the `robotsEnabled` (default `true`) option to allow the\nrobots.txt check to be disabled ([cbess](https://github.com/cbess)).\n\n### 1.4.0\n\n* [Added] Added the `robotsIgnoreServerError` option to accept a robots.txt 500 error code as \"allow all\" rather than \"deny all\" (default), thanks [cbess](https://github.com/cbess).\n\n### 1.3.3\n\n* [Fix] Updated dependencies, thanks [cbess](https://github.com/cbess).\n\n### 1.3.1\n\n* [Fix] `htmlLinkParser` should detect links matching the `area[href]` selector.\n\n### 1.3.0\n\n* [Added] Crawler fires the `crawledurl` event when the crawl of a specific URL is\ncomplete (whether successful or not).\n\n### 1.2.0\n\n* [Added] Crawler fires the `urllistcomplete` event when the UrlList is permanently\nempty (compare with 
`urllistempty`, which may fire intermittently).\n\n### 1.1.0\n\n* [Added] Ability to provide custom options to the `request` library.\n\n### 1.0.0\n\n* [Fixed] Removed warnings from unit tests.\n* [Changed] Updated dependencies.\n* [Changed] Make API stable - release 1.0.0.\n\n### 0.16.1\n\n* [Fixed] Treats 410 the same as 404 for robots.txt requests.\n\n### 0.16.0\n\n* [Added] Support for `gzipContentTypes` option to `sitemapsParser`. Example: `gzipContentTypes: 'application/gzip'` and `gzipContentTypes: ['application/gzip']`.\n\n### 0.15.1\n\n* [Fixed] Support for multiple \"User-agent\" lines in robots.txt files.\n\n### 0.15.0\n\n* [Added] Redis based queue.\n\n### 0.14.0\n\n* [Added] Crawler emits `redirect`, `links` and `httpError` events.\n\n### 0.13.1\n\n* [Fixed] `DbUrlList` doesn't fetch the existing record from the database unless\nthere was an error.\n\n### 0.13.0\n\n* [Added] `errorMessage` column on `urls` table that gives more information\nabout, e.g., a handler error that occurred.\n\n### 0.12.1\n\n* [Fixed] Downgrade to cheerio 0.19, to fix a memory leak issue.\n\n### 0.12.0\n\n* [Change] Rather than calling content handlers with (body, url), they are\nnow called with a single `context` argument. This allows you to pass information\nforwards via handlers. For example, you might cache the `cheerio` parsing\nso you don't parse with every content handler.\n\n### 0.11.0\n\n* [Added] Event called `handlersError` is emitted if any of the handlers\nreturns an error.\n\n### 0.10.4\n\n* [Fixed] Shortened `urlHash` field to 40 characters, in case tables are using\n`utf8mb4` collations for strings.\n\n### 0.10.3\n\n* [Fixed] URLs are now crawled in a random order. Improved the `getNextUrl`\nfunction of `DbUrlList` to use a more optimized query.\n\n### 0.10.2\n\n* [Fixed] When content handler throws an exception / rejects a Promise, it will\nbe marked as an error. 
(And scheduled for a retry if using `DbUrlList`).\n\n### 0.10.1\n\n* [Fixed] Request sends `Accept-Encoding: gzip, deflate` header, so the\nresponses arrive compressed (saving data transfer).\n\n### 0.10.0\n\n* [Added] Support for a custom URL filter on the `robotsParser` function.\n\n### 0.9.1\n\n* [Fixed] Performance improvement for sitemaps parser. A very large sitemap\npreviously took 25 seconds, now takes 1-2 seconds.\n\n### 0.9.0\n\n* [Added] Support for a custom URL filter on the `sitemapsParser` function.\n\n### 0.8.0\n\n* [Changed] Sitemaps parser now extracts `\u003cxhtml:link rel=\"alternate\"\u003e` URLs,\nin addition to the `\u003cloc\u003e` URLs.\n\n### 0.7.0\n\n* [Added] Support for optional `insertIfNotExistsBulk` method which can insert\na large list of URLs into the crawl queue.\n* [Changed] `DbUrlList` supports the bulk insert method.\n\n### 0.6.1\n\n* [Fix] Support sitemaps with content type `application/gzip` as well as\n`application/x-gzip`.\n\n### 0.6.0\n\n* [Added] Crawler fires the `urllistempty` and `crawlurl` events. It also\ncaptures the `RangeError` event when the URL list is empty.\n\n### 0.5.0\n\n* [Changed] `htmlLinkParser` now also picks up `link` tags where `rel=alternate`.\n\n### 0.4.0\n\n* [Changed] Supercrawler no longer follows redirects on crawled URLs. Supercrawler will now add a redirected URL to the queue as a separate entry. We still follow redirects for the `/robots.txt` that is used for checking rules; but not for `/robots.txt` added to the queue.\n\n### 0.3.3\n\n* [Fix] `DbUrlList` to mark a URL as taken, and ensure it never returns a URL that is being crawled in another concurrent request. 
This has required a new field called `holdDate` on the `url` table.\n\n### 0.3.2\n\n* [Fix] Time-based unit tests made more reliable.\n\n### 0.3.1\n\n* [Added] Support for Travis CI.\n\n### 0.3.0\n\n* [Added] Content type passed as third argument to all content type handlers.\n* [Added] Sitemaps parser to extract sitemap URLs and urlset URLs.\n* [Changed] Content handlers receive Buffers rather than strings for the first argument.\n* [Fix] Robots.txt checking to work for the first crawled URL. There was a bug that caused robots.txt to be ignored if it wasn't in the cache.\n\n### 0.2.3\n\n* [Added] A robots.txt parser that identifies `Sitemap:` directives.\n\n### 0.2.2\n\n* [Fixed] Support for URLs up to 10,000 characters long. This required a new `urlHash` SHA1 field on the `url` table, to support the unique index.\n\n### 0.2.1\n\n* [Added] Extensive documentation.\n\n### 0.2.0\n\n* [Added] Status code is updated in the queue for successfully crawled pages (HTTP code \u003c 400).\n* [Added] A new error type `error.RequestError` for all errors that occur when requesting a page.\n* [Added] `DbUrlList` queue object that stores URLs in a SQL database. Includes exponential backoff retry logic.\n* [Changed] Interface to `DbUrlList` and `FifoUrlList` is now via methods `insertIfNotExists`, `upsert` and `getNextUrl`. 
Previously, it was just `insert` (which also updated) and `upsert`, but we need a way to differentiate between discovered URLs which should not update the crawl state.\n\n### 0.1.0\n\n* [Added] `Crawler` object, supporting rate limiting, concurrent requests limiting, robots.txt caching.\n* [Added] `FifoUrlList` object, a first-in, first-out in-memory list of URLs to be crawled.\n* [Added] `Url` object, representing a URL in the crawl queue.\n* [Added] `htmlLinkParser`, a function to extract links from crawled HTML documents.\n","funding_links":[],"categories":["JavaScript","All","Repository"],"sub_categories":["Crawler"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbrendonboshell%2Fsupercrawler","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbrendonboshell%2Fsupercrawler","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbrendonboshell%2Fsupercrawler/lists"}