Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/andrewjbateman/node-puppeteer-webscraper
:clipboard: Node.js with puppeteer to extract web content from Google Chrome
cheerio html5 javascript nodejs puppeteer webscrapping
Last synced: about 1 month ago
- Host: GitHub
- URL: https://github.com/andrewjbateman/node-puppeteer-webscraper
- Owner: AndrewJBateman
- Created: 2021-08-13T15:08:31.000Z (over 3 years ago)
- Default Branch: main
- Last Pushed: 2022-03-22T10:17:47.000Z (almost 3 years ago)
- Last Synced: 2024-11-07T03:33:18.479Z (3 months ago)
- Topics: cheerio, html5, javascript, nodejs, puppeteer, webscrapping
- Language: JavaScript
- Homepage:
- Size: 337 KB
- Stars: 0
- Watchers: 3
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# :zap: Node Puppeteer Webscraper
* Node.js used with [Puppeteer](https://www.npmjs.com/package/puppeteer) & [Cheerio](https://www.npmjs.com/package/cheerio) to gather data from web pages
* PhotosScraper code from [LearnWebCode](https://www.youtube.com/channel/UCHRp19HU7Y2LwfI0Ai6WAGQ) - see [:clap: Inspiration/General Tools](#clap-inspirationgeneral-tools) below. Also includes an IMDb film data scraper.
* **Note:** to open web links in a new window use: _ctrl+click on link_

![GitHub repo size](https://img.shields.io/github/repo-size/AndrewJBateman/node-puppeteer-webscraper?style=plastic)
![GitHub pull requests](https://img.shields.io/github/issues-pr/AndrewJBateman/node-puppeteer-webscraper?style=plastic)
![GitHub Repo stars](https://img.shields.io/github/stars/AndrewJBateman/node-puppeteer-webscraper?style=plastic)
![GitHub last commit](https://img.shields.io/github/last-commit/AndrewJBateman/node-puppeteer-webscraper?style=plastic)

## :page_facing_up: Table of contents
* [:zap: Node Puppeteer Webscraper](#zap-node-puppeteer-webscraper)
* [:page_facing_up: Table of contents](#page_facing_up-table-of-contents)
* [:books: General info](#books-general-info)
* [:camera: Screenshots](#camera-screenshots)
* [:signal_strength: Technologies](#signal_strength-technologies)
* [:floppy_disk: Setup](#floppy_disk-setup)
* [:wrench: Testing](#wrench-testing)
* [:computer: Code Examples](#computer-code-examples)
* [:cool: Features](#cool-features)
* [:clipboard: Status, Testing & To-Do List](#clipboard-status-testing--to-do-list)
* [:clap: Inspiration/General Tools](#clap-inspirationgeneral-tools)
* [:file_folder: License](#file_folder-license)
* [:envelope: Contact](#envelope-contact)

## :books: General info
* Puppeteer contains a version of Chrome and runs headless by default.
* PhotosScraper.js extracts photos from the LearnWebCode website and stores them.
* Cheerio functions were used in the imdbScraper to access data from the HTML web page
* ImdbScraper.js uses the [JS array map method](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Array/map) to produce CSV and JSON files containing the film title, year, rating & URL extracted from the HTML

## :camera: Screenshots
![Frontend screenshot](./imgs/imdb.png)
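
The CSV and JSON output described under General info can be sketched without any dependencies. The repo itself lists objects-to-csv for this step, so the `toCsv` helper below is only a stand-in, and the `films` rows (titles, ratings and URLs) are made-up placeholders rather than scraped data.

```javascript
// Hypothetical sample rows, shaped like the objects the IMDb scraper returns.
const films = [
  { index: 1, title: 'Example Film One', year: '1994', rating: '9.2', url: 'http://imdb.com/title/example-1/' },
  { index: 2, title: 'Example, Film "Two"', year: '1972', rating: '9.1', url: 'http://imdb.com/title/example-2/' },
];

// Quote one CSV field and double any embedded quotes, so commas and quotes survive.
const toCsvField = (value) => `"${String(value).replace(/"/g, '""')}"`;

// Build CSV text: a header row from the first object's keys, then one line per object.
function toCsv(rows) {
  if (rows.length === 0) return '';
  const headers = Object.keys(rows[0]);
  const lines = rows.map((row) => headers.map((key) => toCsvField(row[key])).join(','));
  return [headers.join(','), ...lines].join('\n');
}

const csv = toCsv(films);
const json = JSON.stringify(films, null, 2);

console.log(csv);
```

Both strings could then be saved with `fs.writeFileSync`, for example to `films.csv` and `films.json` (file names assumed here).
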
## :signal_strength: Technologies
* [Node.js v14](https://nodejs.org/) JavaScript runtime using the [Chrome V8 engine](https://v8.dev/)
* [Puppeteer v13](https://www.npmjs.com/package/puppeteer) Node library that provides a headless automation tool and API for Chrome and Chromium-based web browsers
* [cheerio v1](https://www.npmjs.com/package/cheerio) to parse markup and provide an API for traversing/manipulating the resulting data structure
* [objects-to-csv v1](https://www.npmjs.com/package/objects-to-csv) to convert an array of JavaScript objects to Comma-Separated Values (CSV) format that is saved as a file.

## :floppy_disk: Setup
* Install dependencies using `npm i`
* `node photosScraper` to run photo data extracting code
* `node imdbScraper` to run film data extracting code
* image and data files are generated

## :wrench: Testing
* N/A
## :computer: Code Examples
* `imdbScraper.js` code that builds a Cheerio object of per-row results using `map()`, then converts it to a plain array using `get()`
```javascript
// assumes `$` is a Cheerio instance loaded with the page HTML
const results = $('tr')
  .map((index, element) => {
    // title - convert to text
    const titleElement = $(element).find('.titleColumn > a');
    const title = $(titleElement).text();

    // year - remove unwanted ( and )
    const yearElement = $(element).find('.titleColumn > span');
    const year = yearElement.text().replace('(', '').replace(')', '');

    // imdbRating - convert to text
    const ratingElement = $(element).find('.imdbRating > strong');
    const rating = ratingElement.text();

    // url - take href attribute
    const urlElement = $(element).find('.titleColumn > a');
    const urlAttr = urlElement.attr('href');
    const url = `http://imdb.com${urlAttr}`;

    // returning null makes Cheerio's map() skip rows without a title
    return title !== '' ? { index, title, year, rating, url } : null;
  })
  .get();
```

## :cool: Features
* Puppeteer can fill in website form fields and extract the latest news, prices, etc. from websites. The scraping could be automated with a server cron job.
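
Filling in a field and reading back a result might look like the sketch below. It is not code from this repo: the `searchAndRead` helper, its option names, and every selector and URL in the usage comment are hypothetical. The function takes an already opened Puppeteer `page`, so it can also be exercised with a stub object.

```javascript
// Sketch only: `page` is expected to be a Puppeteer Page (or a stub with the same methods).
async function searchAndRead(page, { url, inputSelector, submitSelector, resultSelector, query }) {
  await page.goto(url);

  // Type the query into the form field.
  await page.type(inputSelector, query);

  // Start waiting for the navigation before clicking, so the event is not missed.
  await Promise.all([page.waitForNavigation(), page.click(submitSelector)]);

  // Read the trimmed text of the first element matching resultSelector.
  return page.$eval(resultSelector, (el) => el.textContent.trim());
}

// Possible usage with Puppeteer installed (all values below are placeholders):
//
//   const puppeteer = require('puppeteer');
//   const browser = await puppeteer.launch();
//   const page = await browser.newPage();
//   const price = await searchAndRead(page, {
//     url: 'https://example.com/search',
//     inputSelector: '#query',
//     submitSelector: 'button[type="submit"]',
//     resultSelector: '.price',
//     query: 'widget',
//   });
//   await browser.close();
```

A scheduled job, such as the cron job mentioned above, could call a function like this at fixed intervals and store each result.
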
## :clipboard: Status, Testing & To-Do List
* Status: Working
* To-Do: Add more web scraping code, for example for a news site

## :clap: Inspiration/General Tools
* [LearnWebCode: Web Scraping with Puppeteer & Node.js: Chrome Automation](https://www.youtube.com/watch?v=lgyszZhAZOI&t=392s)
* [Puppeteer Documentation](https://devdocs.io/puppeteer/)
* [Array.prototype.map()](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Array/map)
* [Stack Overflow: What does the get() function do in cheerio?](https://stackoverflow.com/questions/54164509/what-does-the-get-function-do-in-cheerio)

## :file_folder: License
* N/A
## :envelope: Contact
* Repo created by [ABateman](https://github.com/AndrewJBateman), email: [email protected]