# JavaScript Web Scraping

[![Oxylabs promo code](https://raw.githubusercontent.com/oxylabs/product-integrations/refs/heads/master/Affiliate-Universal-1090x275.png)](https://oxylabs.io/pages/gitoxy?utm_source=877&utm_medium=affiliate&groupid=877&utm_content=javascript-web-scraping-github&transaction_id=102f49063ab94276ae8f116d224b67)

[![](https://dcbadge.limes.pink/api/server/Pds3gBmKMH?style=for-the-badge&theme=discord)](https://discord.gg/Pds3gBmKMH) [![YouTube](https://img.shields.io/badge/YouTube-Oxylabs-red?style=for-the-badge&logo=youtube&logoColor=white)](https://www.youtube.com/@oxylabs)

## Required software
Only two pieces of software are needed:

1. Node.js (which comes with npm, the package manager for Node.js)
2. Any code editor

## Set up Node.js project
Before writing any code, create a folder where the JavaScript files will be stored. These files will contain all the code required for web scraping.

Once the folder is created, navigate to it and run the initialization command:

```bash
npm init -y
```

## Installing Node.js packages

Install Axios on its own, or install all three packages used in this tutorial at once:

```bash
npm install axios
```

```bash
npm install axios cheerio json2csv
```

## JavaScript web scraping – a practical example

One of the most common web scraping scenarios is scraping e-commerce stores. A good place to start is the fictional book store http://books.toscrape.com/. The site looks and behaves much like a real store, but it exists purely for learning web scraping.

### Creating selectors
The first step in JavaScript web scraping is creating selectors. The purpose of a selector is to identify the specific element to be queried.

Begin by opening the URL http://books.toscrape.com/catalogue/category/books/mystery_3/index.html in Chrome or Firefox. Once the page loads, right-click the title of the genre, Mystery, and select Inspect.
This should open the Developer Tools with `<h1>Mystery</h1>` selected in the Elements tab.

![](https://images.prismic.io/oxylabs-sm/OWYyNGNmOWItMzBjYS00NjJjLWIyY2YtNDU1MGYyM2FjMjQz_copy-selector-for-web-scraping-with-node-js.jpg?auto=compress,format&rect=0,0,1222,720&w=1222&h=720&fm=webp&dpr=2&q=50)

The simplest way to create a selector is to right-click this `h1` tag in the Developer Tools, point to Copy, and then click Copy Selector. This creates a selector like the following:

```css
#content_inner > article > div.row > div.col-sm-6.product_main > h1
```

This selector is valid and works well. The only problem is that this method creates a long selector, which makes the code difficult to understand and maintain.

After spending some time with the page, it becomes clear that there is only one `h1` tag on it, which allows for a very short selector:

```css
h1
```

## Scraping the genre
The first step is to define the constants that hold references to Cheerio and Axios.

```javascript
const cheerio = require("cheerio");
const axios = require("axios");
```

The address of the page being scraped is saved in the variable `url` for readability:

```javascript
const url = "http://books.toscrape.com/catalogue/category/books/mystery_3/index.html";
```

Axios has a `get()` method that sends an HTTP GET request.
Note that this is an asynchronous method and thus needs the `await` prefix:

```javascript
const response = await axios.get(url);
```

If there is a need to pass additional headers, for example a `User-Agent`, they can be sent as the second parameter:

```javascript
const response = await axios.get(url, {
  headers: {
    "User-Agent": "custom-user-agent string",
  },
});
```

This particular site does not need any special headers, which makes it easier to learn on.

Axios supports both the Promise pattern and the async-await pattern; this tutorial focuses on the latter. The response has a few attributes such as `headers` and `data`. The HTML we want is in the `data` attribute. It can be loaded into a queryable object using the `cheerio.load()` method.

```javascript
const $ = cheerio.load(response.data);
```

Cheerio's `load()` method returns a reference to the document, which can be stored in a constant with any name. To make the code look and feel more like jQuery code, a `$` is used instead of a name.

Finding a specific element within the document is as easy as writing `$(selector)`. In this particular case, it would be `$("h1")`.

The `text()` method will be used everywhere when writing web scraping code with JavaScript, as it gets the text inside any element. Here it extracts the genre name into a local variable:

```javascript
const genre = $("h1").text();
```

Finally, `console.log()` simply prints the variable's value to the console.

```javascript
console.log(genre);
```

To handle errors, the code is wrapped in a try-catch block. Note that it is good practice to use `console.error` for errors and `console.log` for other messages.

Here is the complete code put together.
Save it as `genre.js` in the folder created earlier, where `npm init` was run.

```javascript
const cheerio = require("cheerio");
const axios = require("axios");
const url = "http://books.toscrape.com/catalogue/category/books/mystery_3/index.html";

async function getGenre() {
  try {
    const response = await axios.get(url);
    const document = cheerio.load(response.data);
    const genre = document("h1").text();
    console.log(genre);
  } catch (error) {
    console.error(error);
  }
}
getGenre();
```

The final step is to run the script with Node.js. Open the terminal and run this command:

```bash
node genre.js
```

The output of this code is the genre name:

```
Mystery
```

Congratulations! This was the first program that uses JavaScript and Node.js for web scraping. Time to do more complex things!

## Scraping book listings
Let's try scraping listings. Here is the same page, with the book listing of the Mystery genre – http://books.toscrape.com/catalogue/category/books/mystery_3/index.html

The first step is to analyze the page and understand its HTML structure. Load the page in Chrome, press F12, and examine the elements.

Each book is wrapped in an `<article>` tag. This means all the books can be selected at once and a loop run to extract individual book details. Once the HTML is parsed with Cheerio, the jQuery-style `each()` function can be used to run that loop. Start by extracting the titles of all the books:

```javascript
const books = $("article"); // selector to get all books
books.each(function () {
  // running a loop over every matched element
  const title = $(this).find("h3 a").text(); // extract the book title
  console.log(title); // print the book title
});
```

As is evident from the code above, the extracted details need to be saved somewhere inside the loop.
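Stepping back from Cheerio for a moment, this collect-inside-a-loop pattern can be sketched in plain JavaScript; the objects below are hypothetical stand-ins for the elements matched by `$("article")`:

```javascript
// Hypothetical stand-ins for elements matched by $("article").
const elements = [
  { title: "Sharp Objects" },
  { title: "In a Dark, Dark Wood" },
];

const titles = []; // declared outside the loop so values survive each iteration
elements.forEach(function (el) {
  titles.push(el.title); // store the value instead of only logging it
});

console.log(titles);
```

The key point is that the array is declared outside the loop; anything declared inside the callback is discarded on every iteration.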
The best idea is to store these values in an array. Other attributes of each book can be extracted as well and stored as a JSON object in that array.

Here is the complete code. Create a new file, paste this code, and save it as `books.js` in the same folder where `npm init` was run:

```javascript
const cheerio = require("cheerio");
const axios = require("axios");
const mystery = "http://books.toscrape.com/catalogue/category/books/mystery_3/index.html";
const books_data = [];

async function getBooks(url) {
  try {
    const response = await axios.get(url);
    const $ = cheerio.load(response.data);

    const books = $("article");
    books.each(function () {
      const title = $(this).find("h3 a").text();
      const price = $(this).find(".price_color").text();
      const stock = $(this).find(".availability").text().trim();
      books_data.push({ title, price, stock }); // store in the array
    });
    console.log(books_data); // print the array
  } catch (err) {
    console.error(err);
  }
}
getBooks(mystery);
```

Run this file using Node.js from the terminal:

```bash
node books.js
```

This should print the array of books on the console. The only limitation of this code is that it scrapes a single page. The next section covers how pagination can be handled.

## Handling pagination

Listings like this are usually spread over multiple pages. While every site may paginate in its own way, the most common pattern is a next button on every page, with the exception of the last page, which has no next-page link.

The pagination logic for these situations is rather simple: create a selector for the next-page link.
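As an aside, Node.js ships a built-in `URL` class that resolves a relative link against the page it was found on; this is a more general alternative to concatenating a fixed prefix. The `href` value here is a hypothetical example:

```javascript
// Resolve a relative next-page href against the current page URL
// using Node's built-in URL class. "page-2.html" is a hypothetical href.
const pageUrl = "http://books.toscrape.com/catalogue/category/books/mystery_3/index.html";
const nextHref = "page-2.html";
const nextUrl = new URL(nextHref, pageUrl).href;
console.log(nextUrl);
```

The `URL` constructor replaces the last path segment of the base URL, so `index.html` becomes `page-2.html` without any manual string handling.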
If the selector matches an element, take its `href` attribute value and call the `getBooks` function recursively with this new URL.

Immediately after the `books.each()` loop, add these lines:

```javascript
if ($(".next a").length > 0) {
  const next_page = baseUrl + $(".next a").attr("href"); // convert to an absolute URL
  getBooks(next_page); // recursive call with the new URL
}
```

Note that the `href` returned above is a relative URL. To convert it into an absolute URL, the simplest way is to concatenate a fixed prefix to it. This fixed part of the URL is stored in the `baseUrl` variable:

```javascript
const baseUrl = "http://books.toscrape.com/catalogue/category/books/mystery_3/";
```

Once the scraper reaches the last page, the Next button is absent and the recursion stops. At that point, the array holds book information from all the pages. The final step is to save the data.

## Saving scraped data to CSV
If web scraping with JavaScript is easy, saving the data to a CSV file is even easier. It can be done with two packages: `fs` and `json2csv`. The file system is handled by the built-in `fs` package, while `json2csv` needs to be installed:

```bash
npm install json2csv
```

After the installation, create a constant that stores this package's `Parser`:

```javascript
const j2cp = require("json2csv").Parser;
```

Access to the file system is needed to write the file to disk.
For this, initialize the `fs` package:

```javascript
const fs = require("fs");
```

Find the place in the code where the array with all the scraped data is available, and insert the following lines to create the CSV file:

```javascript
const parser = new j2cp();
const csv = parser.parse(books_data); // JSON to CSV in memory
fs.writeFileSync("./books.csv", csv); // the CSV is now written to disk
```

Here is the complete script put together. Save it as a `.js` file in the Node.js project folder. Once it is run with the `node` command in the terminal, data from all the pages will be available in the `books.csv` file.

```javascript
const fs = require("fs");
const j2cp = require("json2csv").Parser;
const axios = require("axios");
const cheerio = require("cheerio");

const mystery = "http://books.toscrape.com/catalogue/category/books/mystery_3/index.html";

const books_data = [];

async function getBooks(url) {
  try {
    const response = await axios.get(url);
    const $ = cheerio.load(response.data);

    const books = $("article");
    books.each(function () {
      const title = $(this).find("h3 a").text();
      const price = $(this).find(".price_color").text();
      const stock = $(this).find(".availability").text().trim();
      books_data.push({ title, price, stock });
    });
    // console.log(books_data);
    const baseUrl = "http://books.toscrape.com/catalogue/category/books/mystery_3/";
    if ($(".next a").length > 0) {
      const next = baseUrl + $(".next a").attr("href");
      getBooks(next);
    } else {
      const parser = new j2cp();
      const csv = parser.parse(books_data);
      fs.writeFileSync("./books.csv", csv);
    }
  } catch (err) {
    console.error(err);
  }
}

getBooks(mystery);
```

Run this file using Node.js from the terminal:

```bash
node books.js
```
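For reference, the heavy lifting that `json2csv`'s `Parser` performs can be approximated in a few lines of plain JavaScript: take the header row from the object keys, then quote and join each field. The sample rows below are hypothetical:

```javascript
// Minimal sketch of JSON-to-CSV conversion: header from object keys,
// double-quote each field, escape embedded quotes by doubling them.
const rows = [
  { title: "Sharp Objects", price: "£47.82", stock: "In stock" },
  { title: "In a Dark, Dark Wood", price: "£19.63", stock: "In stock" },
];

const header = Object.keys(rows[0]);
const quote = (value) => `"${String(value).replace(/"/g, '""')}"`;
const csv = [
  header.map(quote).join(","),
  ...rows.map((row) => header.map((key) => quote(row[key])).join(",")),
].join("\n");

console.log(csv);
```

Using a maintained library is still the better choice for real scrapers, since edge cases such as missing keys and embedded newlines add up quickly.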