{"id":15825364,"url":"https://github.com/dsc8x/node-scraper","last_synced_at":"2025-04-01T18:31:02.752Z","repository":{"id":57111769,"uuid":"135034655","full_name":"dsc8x/node-scraper","owner":"dsc8x","description":"Scraping websites made easy! A minimalistic yet powerful tool for collecting data from websites.","archived":false,"fork":false,"pushed_at":"2019-01-03T17:54:53.000Z","size":146,"stargazers_count":10,"open_issues_count":0,"forks_count":0,"subscribers_count":3,"default_branch":"master","last_synced_at":"2025-03-27T04:01:58.942Z","etag":null,"topics":["axios","cheerio","javascript","node","scraper","scraping","website-scraper"],"latest_commit_sha":null,"homepage":"","language":"JavaScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/dsc8x.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2018-05-27T10:00:49.000Z","updated_at":"2025-01-09T19:31:02.000Z","dependencies_parsed_at":"2022-08-21T10:31:01.852Z","dependency_job_id":null,"html_url":"https://github.com/dsc8x/node-scraper","commit_stats":null,"previous_names":["dsc8x/node-scraper","epegzz/node-scraper"],"tags_count":5,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dsc8x%2Fnode-scraper","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dsc8x%2Fnode-scraper/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dsc8x%2Fnode-scraper/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dsc8x%2Fnode-scraper/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/dsc8x","download_u
rl":"https://codeload.github.com/dsc8x/node-scraper/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":246635949,"owners_count":20809331,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["axios","cheerio","javascript","node","scraper","scraping","website-scraper"],"created_at":"2024-10-05T09:08:56.977Z","updated_at":"2025-04-01T18:31:02.451Z","avatar_url":"https://github.com/dsc8x.png","language":"JavaScript","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003ch1 align=\"center\"\u003enode-scraper\u003c/h1\u003e\n\n\u003cdiv align=\"center\"\u003e\n  \u003cstrong\u003eScraping websites made easy!\u003c/strong\u003e\n\u003c/div\u003e\n\u003cdiv align=\"center\"\u003e\n  A minimalistic yet powerful tool for collecting data from websites.\n\u003c/div\u003e\n\u003cbr/\u003e\n\n\u003cp align=\"center\"\u003e\n  \u003ca target=\"_blank\" href=\"https://travis-ci.org/epegzz/node-scraper\"\u003e\n    \u003cimg alt=\"Travis\" src=\"https://img.shields.io/travis/epegzz/node-scraper.svg?style=flat-square\"\u003e\n  \u003c/a\u003e\n  \u003ca target=\"_blank\" href=\"https://codeclimate.com/github/epegzz/node-scraper/maintainability\"\u003e\n    \u003cimg alt=\"Maintainability\" src=\"https://img.shields.io/codeclimate/maintainability/epegzz/node-scraper.svg?style=flat-square\"\u003e\n  \u003c/a\u003e\n  \u003ca target=\"_blank\" href=\"https://codecov.io/gh/epegzz/node-scraper\"\u003e\n    \u003cimg alt=\"Codecov\" 
src=\"https://img.shields.io/codecov/c/github/epegzz/node-scraper.svg?style=flat-square\"\u003e\n  \u003c/a\u003e\n  \u003ca target=\"_blank\" href=\"https://www.npmjs.com/package/@epegzz/node-scraper\"\u003e\n    \u003cimg alt=\"npm version\" src=\"https://img.shields.io/npm/v/@epegzz/node-scraper.svg?style=flat-square\"\u003e\n  \u003c/a\u003e\n  \u003ca target=\"_blank\" href=\"https://www.npmjs.com/package/@epegzz/node-scraper\"\u003e\n    \u003cimg alt=\"npm installs\" src=\"https://img.shields.io/npm/dm/@epegzz/node-scraper.svg?style=flat-square\"\u003e\n  \u003c/a\u003e\n  \u003ca target=\"_blank\" href=\"https://david-dm.org/epegzz/node-scraper\"\u003e\n    \u003cimg alt=\"dependencies\" src=\"https://img.shields.io/david/epegzz/node-scraper.svg?style=flat-square\"\u003e\n  \u003c/a\u003e\n\u003c/p\u003e\n\n# Table of Contents\n- [Features](#features)\n- [Installing](#installing)\n- [Concept](#concept)\n- [Example](#example)\n- [API](#api)\n  - [find(selector, [node])](#findselector-node-parse-the-dom-of-the-website)\n  - [follow(url, [parser], [context])](#followurl-parser-context-add-another-url-to-parse)\n  - [capture(url, parser, [context])](#captureurl-parser-context-parse-urls-without-yielding-the-results)\n\n# Features\n\n- __Generator based:__ It will only scrape as fast as you can consume the results\n- __Powerful HTML parsing:__ Uses the popular cheerio library under the hood\n- __Easy to test:__ Uses Axios to make network requests, which can be easily mocked\n\n# Installing\n\nusing npm\n```sh\nnpm install @epegzz/node-scraper --save\n```\n\nusing yarn\n```sh\nyarn add @epegzz/node-scraper\n```\n\n# Concept\n\nnode-scraper is very minimalistic: You provide the URL of the website you want\nto scrape and a parser function that converts HTML into Javascript objects.\n\nParser functions are implemented as generators, which means they will `yield` results\n instead of returning them. 
That guarantees that network requests are made only\n as fast as the results are consumed.\n Once you stop consuming the results, no further network requests are made ✨\n\n# Example\n\n```js\nconst scrape = require('@epegzz/node-scraper')\n\n// Start scraping our made-up website `https://car-list.com` and console log the results\n//\n// This will print:\n//   { brand: 'Ford', model: 'Focus', ratings: [{ value: 5, comment: 'Excellent car!'}]}\n//   { brand: 'Audi', model: 'A8', ratings: [{ value: 4.5, comment: 'I like it'}, {value: 5, comment: 'Best car I ever owned'}]}\n//   ...\n;(async function() {\n  const scrapeResults = scrape('https://car-list.com', parseCars)\n  for await (const carListing of scrapeResults) {\n    console.log(JSON.stringify(carListing))\n  }\n})()\n\n/**\n * https://car-list.com\n *\n * \u003cbody\u003e\n *   \u003cul\u003e\n *     \u003cli class=\"car\"\u003e\n *       \u003cspan class=\"brand\"\u003eFord\u003c/span\u003e\n *       \u003cspan class=\"model\"\u003eFocus\u003c/span\u003e\n *       \u003ca class=\"ratings\" href=\"/ratings/ford-focus\"\u003eshow ratings\u003c/a\u003e\n *     \u003c/li\u003e\n *     ...\n *   \u003c/ul\u003e\n * \u003c/body\u003e\n */\nasync function* parseCars({ find, follow, capture }) {\n  const cars = find('.car')\n  for (const car of cars) {\n    yield {\n      brand: car.find('.brand').text(),\n      model: car.find('.model').text(),\n      ratings: await capture(car.find('a.ratings').attr('href'), parseCarRatings)\n    }\n  }\n  follow(find('.next-page').attr('href'))\n}\n\n/**\n * https://car-list.com/ratings/ford-focus\n *\n * \u003cbody\u003e\n *   \u003cul\u003e\n *     \u003cli class=\"rating\"\u003e\n *       \u003cspan class=\"value\"\u003e5\u003c/span\u003e\n *       \u003cspan class=\"comment\"\u003eExcellent car!\u003c/span\u003e\n *     \u003c/li\u003e\n *     ...\n *   \u003c/ul\u003e\n * \u003c/body\u003e\n */\nfunction* parseCarRatings({ find }) {\n  const ratings = find('.rating')\n  for (const rating of 
ratings) {\n    yield {\n      value: rating.find('.value').text(),\n      comment: rating.find('.comment').text(),\n    }\n  }\n}\n\n```\n\n# API\n\n## Usage\n\nHere's the basic usage:\n\n```js\n  // import scraper\n  const scrape = require('@epegzz/node-scraper')\n\n  // define a parser function\n  function* parser() {\n    // ...\n  }\n\n  // call scraper with URL and parser\n  const scrapeResults = scrape('https://some-website.com', parser)\n\n  // consume scrape results (`for await` must run inside an async function)\n  ;(async function() {\n    for await (const scrapedItem of scrapeResults) {\n      console.log(JSON.stringify(scrapedItem))\n    }\n  })()\n```\n\nInstead of calling the scraper with a URL, you can also call it with an [Axios\nrequest config object](https://github.com/axios/axios#request-config) to gain more control over the requests:\n\n```js\nconst scrapeResults = scrape({\n  url: 'https://some-website.com',\n  timeout: 5000,\n}, parser)\n```\n\n## Creating a parser function\n\nA parser function is a synchronous or asynchronous generator function which receives\nthree utility functions as arguments: [find](#findselector-node-parse-the-dom-of-the-website), [follow](#followurl-parser-context-add-another-url-to-parse) and [capture](#captureurl-parser-context-parse-urls-without-yielding-the-results).\n\nA fourth parser function argument is the `context` variable, which can be passed using the `scrape`, `follow` or `capture` function.\n\nWhatever the generator function `yield`s can be consumed as a scrape result.\n\n```js\nasync function* parseCars({ find, follow, capture }) {\n  const cars = find('.car')\n  for (const car of cars) {\n    yield {\n      brand: car.find('.brand').text(),\n      model: car.find('.model').text(),\n      ratings: await capture(car.find('a.ratings').attr('href'), parseCarRatings)\n    }\n  }\n  follow(find('a.next-page').attr('href'))\n}\n\n;(async function() {\n  const scrapeResults = scrape('https://car-list.com', parseCars)\n  for await (const car of scrapeResults) {\n    // whatever is yielded by the 
parser ends up here\n    console.log(JSON.stringify(car))\n  }\n})()\n```\n\n\n### `find(selector, [node])` Parse the DOM of the website\n\nThe `find` function allows you to extract data from the website.\nIt's basically just performing a [Cheerio](https://cheerio.js.org) query, so check out their\n[documentation](https://github.com/cheeriojs/cheerio) for details on how to use it.\n\nThink of `find` as the `$` in their documentation, loaded with the HTML contents of the\nscraped website.\n\n__Example__:\n\n```js\n  // yields the href and text of all links from the webpage\n  for (const link of find('a')) {\n    yield {\n        linkHref: link.attr('href'),\n        linkText: link.text(),\n    };\n  }\n```\n\nThe major difference between cheerio's `$` and node-scraper's `find` is that the results of `find`\nare iterable. So you can do `for (const element of find(selector)) { … }` instead of having\nto use a `.each` callback, which is important if we want to yield results.\n\nThe other difference is that you can pass an optional `node` argument to `find`. This\nlimits the search to that particular node's inner HTML instead of searching\nthe whole document.\n\n\n### `follow(url, [parser], [context])` Add another URL to parse\n\nThe main use-case for the `follow` function is scraping paginated websites.\nIn that case you would use the href of the \"next\" button to let the scraper `follow` to the next page:\n\n```js\nasync function* parser({ find, follow }) {\n  ...\n  follow(find('a.next-page').attr('href'))\n}\n```\n\nThe `follow` function will by default use the current parser to parse the\nresults of the new URL. You can, however, provide a different parser if you like.\n\n\n### `capture(url, parser, [context])` Parse URLs without yielding the results\n\nThe `capture` function is somewhat similar to the `follow` function: It takes\na new URL and a parser function as arguments to scrape data. 
But instead of yielding the data as scrape results,\nit returns them as an array.\n\nThis is useful if you want to add more details to a scraped object, where getting those details requires\nan additional network request:\n\n```js\nasync function* parseCars({ find, follow, capture }) {\n  const cars = find('.car')\n  for (const car of cars) {\n    yield {\n      brand: car.find('.brand').text(),\n      model: car.find('.model').text(),\n      ratings: await capture(car.find('a.ratings').attr('href'), parseCarRatings)\n    }\n  }\n}\n```\n\nIn the example above the ratings for each car are located on a nested car\ndetails page. We are therefore making a `capture` call. All `yield`s from the\n`parseCarRatings` parser will be added to the resulting array that we're\nassigning to the `ratings` property.\n\nNote that we have to use `await`, because network requests are always asynchronous.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdsc8x%2Fnode-scraper","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdsc8x%2Fnode-scraper","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdsc8x%2Fnode-scraper/lists"}