{"id":18614885,"url":"https://github.com/andrewjbateman/node-puppeteer-webscraper","last_synced_at":"2026-05-08T15:50:31.128Z","repository":{"id":96860798,"uuid":"395697134","full_name":"AndrewJBateman/node-puppeteer-webscraper","owner":"AndrewJBateman","description":":clipboard: Node.js with puppeteer to extract web content from Google Chrome","archived":false,"fork":false,"pushed_at":"2022-03-22T10:17:47.000Z","size":345,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":3,"default_branch":"main","last_synced_at":"2024-12-27T02:44:51.310Z","etag":null,"topics":["cheerio","html5","javascript","nodejs","puppeteer","webscrapping"],"latest_commit_sha":null,"homepage":"","language":"JavaScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/AndrewJBateman.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-08-13T15:08:31.000Z","updated_at":"2022-03-22T10:17:50.000Z","dependencies_parsed_at":null,"dependency_job_id":"f3fb401d-9fea-497e-8af7-92a0ab5bb607","html_url":"https://github.com/AndrewJBateman/node-puppeteer-webscraper","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AndrewJBateman%2Fnode-puppeteer-webscraper","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AndrewJBateman%2Fnode-puppeteer-webscraper/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AndrewJBateman%2Fnode-puppeteer-webscraper/releases","manifests_url":"https://rep
os.ecosyste.ms/api/v1/hosts/GitHub/repositories/AndrewJBateman%2Fnode-puppeteer-webscraper/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/AndrewJBateman","download_url":"https://codeload.github.com/AndrewJBateman/node-puppeteer-webscraper/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":239406449,"owners_count":19633024,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cheerio","html5","javascript","nodejs","puppeteer","webscrapping"],"created_at":"2024-11-07T03:27:19.563Z","updated_at":"2025-11-03T03:30:30.320Z","avatar_url":"https://github.com/AndrewJBateman.png","language":"JavaScript","funding_links":[],"categories":[],"sub_categories":[],"readme":"# :zap: Node Puppeteer Webscraper\n\n* Node.js used with [Puppeteer](https://www.npmjs.com/package/puppeteer) \u0026 [Cheerio](https://www.npmjs.com/package/cheerio) to gather data from web pages\n* PhotosScraper code from [LearnWebCode](https://www.youtube.com/channel/UCHRp19HU7Y2LwfI0Ai6WAGQ) - see [:clap: Inspiration](#clap-inspiration) below. 
Also includes an IMDb film data scraper.\n* **Note:** to open web links in a new window use: _ctrl+click on link_\n\n![GitHub repo size](https://img.shields.io/github/repo-size/AndrewJBateman/node-puppeteer-webscraper?style=plastic)\n![GitHub pull requests](https://img.shields.io/github/issues-pr/AndrewJBateman/node-puppeteer-webscraper?style=plastic)\n![GitHub Repo stars](https://img.shields.io/github/stars/AndrewJBateman/node-puppeteer-webscraper?style=plastic)\n![GitHub last commit](https://img.shields.io/github/last-commit/AndrewJBateman/node-puppeteer-webscraper?style=plastic)\n\n## :page_facing_up: Table of contents\n\n* [:zap: Node Puppeteer Webscraper](#zap-node-puppeteer-webscraper)\n\t* [:page_facing_up: Table of contents](#page_facing_up-table-of-contents)\n\t* [:books: General info](#books-general-info)\n\t* [:camera: Screenshots](#camera-screenshots)\n\t* [:signal_strength: Technologies](#signal_strength-technologies)\n\t* [:floppy_disk: Setup](#floppy_disk-setup)\n\t* [:wrench: Testing](#wrench-testing)\n\t* [:computer: Code Examples](#computer-code-examples)\n\t* [:cool: Features](#cool-features)\n\t* [:clipboard: Status, Testing \u0026 To-Do List](#clipboard-status-testing--to-do-list)\n\t* [:clap: Inspiration/General Tools](#clap-inspirationgeneral-tools)\n\t* [:file_folder: License](#file_folder-license)\n\t* [:envelope: Contact](#envelope-contact)\n\n## :books: General info\n\n* Puppeteer bundles a compatible version of Chromium and runs it headless by default.\n* `photosScraper.js` extracts photos from the LearnWebCode website and stores them.\n* Cheerio functions are used in `imdbScraper.js` to access data from the HTML web page.\n* `imdbScraper.js` uses the [JS array map method](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Array/map) to produce CSV and JSON files with the film title, year, rating \u0026 url extracted from the HTML.\n\n## :camera: Screenshots\n\n![Frontend screenshot](./imgs/imdb.png)\n\n## :signal_strength: 
Technologies\n\n* [Node.js v14](https://nodejs.org/) JavaScript runtime using the [Chrome V8 engine](https://v8.dev/)\n* [Puppeteer v13](https://www.npmjs.com/package/puppeteer) Node library providing a headless automation API for Chrome and Chromium-based web browsers\n* [cheerio v1](https://www.npmjs.com/package/cheerio) to parse markup and provide an API for traversing/manipulating the resulting data structure\n* [objects-to-csv v1](https://www.npmjs.com/package/objects-to-csv) to convert an array of JavaScript objects to comma-separated values (CSV) format and save it as a file.\n\n## :floppy_disk: Setup\n\n* Install dependencies using `npm i`\n* `node photosScraper` to run photo data extracting code\n* `node imdbScraper` to run film data extracting code\n* Image and data files are generated\n\n## :wrench: Testing\n\n* N/A\n\n## :computer: Code Examples\n\n* `imdbScraper.js` code that builds a Cheerio collection of film objects using map(), then returns them as a plain array using get()\n\n```javascript\n const results = $('tr')\n  .map((index, element) =\u003e {\n   // title - convert to text\n   const titleElement = $(element).find('.titleColumn \u003e a');\n   const title = titleElement.text();\n\n   // year - remove unwanted ( and )\n   const yearElement = $(element).find('.titleColumn \u003e span');\n   const year = yearElement.text().replace('(', '').replace(')', '');\n\n   // imdbRating - convert to text\n   const ratingElement = $(element).find('.imdbRating \u003e strong');\n   const rating = ratingElement.text();\n\n   // url - take href attribute\n   const urlElement = $(element).find('.titleColumn \u003e a');\n   const urlAttr = urlElement.attr('href');\n   const url = `http://imdb.com${urlAttr}`;\n\n   // rows without a title (e.g. the header row) return null and are skipped\n   return title !== '' ? { index, title, year, rating, url } : null;\n  })\n  .get();\n```\n\n## :cool: Features\n\n* Puppeteer can be used to fill in web site data fields. It can also be used to extract the latest news/prices etc. 
from websites which could be made automatic using a server cron job.\n\n## :clipboard: Status, Testing \u0026 To-Do List\n\n* Status: Working\n* To-Do: Add more Web scraping code - a news site for example\n\n## :clap: Inspiration/General Tools\n\n* [LearnWebCode: Web Scraping with Puppeteer \u0026 Node.js: Chrome Automation](https://www.youtube.com/watch?v=lgyszZhAZOI\u0026t=392s)\n* [Puppeteer Documentation](https://devdocs.io/puppeteer/)\n* [Array.prototype.map()](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Array/map)\n* [Stack overflow: What does the get() function do in cheerio?](https://stackoverflow.com/questions/54164509/what-does-the-get-function-do-in-cheerio)\n\n## :file_folder: License\n\n* N/A\n\n## :envelope: Contact\n\n* Repo created by [ABateman](https://github.com/AndrewJBateman), email: gomezbateman@yahoo.com\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fandrewjbateman%2Fnode-puppeteer-webscraper","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fandrewjbateman%2Fnode-puppeteer-webscraper","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fandrewjbateman%2Fnode-puppeteer-webscraper/lists"}