https://github.com/kernix13/puppeteer-web-scraper

How to use Puppeteer for web scraping

Host: GitHub
URL: https://github.com/kernix13/puppeteer-web-scraper
Owner: Kernix13
Created: 2022-11-10T21:08:08.000Z (about 2 years ago)
Default Branch: main
Last Pushed: 2023-02-26T01:36:19.000Z (almost 2 years ago)
Last Synced: 2024-11-09T22:25:01.835Z (3 months ago)
Topics: puppeteer
Language: JavaScript
Homepage:
Size: 12.7 KB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md


# Web Scraping With Puppeteer

Notes from Brad Traversy's video [Intro To Web Scraping With Puppeteer](https://youtu.be/S67gyqnYHmI), his [GitHub repo](https://github.com/bradtraversy/courses-scrape), and the [Puppeteer Getting Started docs](https://pptr.dev/#getting-started).

Puppeteer Methods:

- `launch()`, `newPage()`, `goto()`, `screenshot()`, `pdf()`, `content()`, `evaluate()`, `$$eval()`, `type()`, `close()`, `waitForSelector()`, `click()`
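
Of the methods listed, `type()`, `click()`, and `waitForSelector()` are not demonstrated in the notes below. The following is only a sketch of how they might fit together: the helper name `search` and all three selectors are hypothetical and would need to be replaced with selectors from the site being scraped.

```js
/**
 * Fill in a search box, submit it, and wait for results to render.
 * `page` is a Puppeteer Page. `search` and every selector here are
 * hypothetical names used only for illustration.
 */
async function search(page, term) {
  await page.type('#search-input', term);        // type into a form field
  await page.click('#search-button');            // click the submit button
  await page.waitForSelector('.search-result');  // wait until a result appears

  // Collect the visible text of every result.
  return page.$$eval('.search-result', items =>
    items.map(item => item.innerText)
  );
}
```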

## Install and Setup

- Run `npm init -y` then `npm i puppeteer`

- Create the entry-point file `index.js` or any name you like

- Add a start script to `package.json`: `"start": "node index"`

- Then run `npm start` to run `index.js`

- `require` Puppeteer in `index.js`

- All the methods shown are asynchronous, so call them with `await` inside an `async` function

Steps:

1. Open the browser using the method `launch()`

1. Use the method `newPage()` to create a new Page object

1. Then use the `goto()` method to go to a specific page: `await page.goto('URL');`

1. At the end of the function, remember to close the browser with the `close()` method
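
The steps above can be sketched as a single function. This is a minimal sketch rather than the code from the video: the helper name `visit` and the example URL are placeholders, and the Puppeteer module is passed in as an argument (an assumption made here so the function can be tried with a stand-in object instead of a real browser).

```js
/**
 * Open a browser, visit a URL, and always close the browser afterwards.
 * `puppeteer` is the module returned by require('puppeteer');
 * `visit` is a hypothetical helper name used only for this sketch.
 */
async function visit(puppeteer, url) {
  const browser = await puppeteer.launch();  // 1. open the browser
  try {
    const page = await browser.newPage();    // 2. create a new Page object
    await page.goto(url);                    // 3. go to the page
    return await page.title();               // do some work with the page
  } finally {
    await browser.close();                   // 4. close, even if a step above throws
  }
}

// Usage, assuming Puppeteer was installed with `npm i puppeteer`:
// visit(require('puppeteer'), 'https://example.com').then(console.log);
```

Putting `close()` in a `finally` block is a design choice of this sketch: it keeps a failed `goto()` from leaving a browser process running.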

### Simple Examples

**SCREENSHOT**:

- Use the `screenshot()` method - pass in an object with a `path` key, which is the directory and filename to save to - then run `npm start`

- By default the image covers only the viewport, not the whole page - to capture the whole page, add a `fullPage` property set to `true`

**PDF**:

- Similar to above, but use the `pdf()` method, change the file extension from `.png` to `.pdf`, and replace `fullPage` with `format` to select the page size, e.g. `'A4'`

- **NOTE**: In the resulting PDF the cookies notice appears at the bottom of every page, some styles are applied while others (like the background image) are not, and the layout is the mobile view - COOL!

```js
/* Create a screenshot */
await page.screenshot({ path: 'kwd.png', fullPage: true });

/* Create a PDF of the page */
await page.pdf({ path: 'kwd.pdf', format: 'A4' });
```

## Targeting the content

- Grab all the HTML using the `content()` method

- To target specific elements use the `evaluate()` method, which is a higher-order function: pass it a callback, which runs in the page and so has access to the `document` object

- For all the links on a page, use `querySelectorAll`, which returns a _NodeList_ - it is like an array in that it is iterable, but it lacks most array methods such as `map()`

- So wrap it in `Array.from()` - `Array.from()` takes an optional 2nd param, a mapping function that is called on each item

```js
/* Get the entire HTML page content */
const html = await page.content();

/* Target specific elements */
const title = await page.evaluate(() => document.title);

/* Get all the text */
const text = await page.evaluate(() => document.body.innerText);

/* Get all the links */
const links = await page.evaluate(() =>
  Array.from(document.querySelectorAll('a'), item => item.href)
);
```

- To get deeply nested content you need to be more specific - inspect the structure of the website to work out what to put in the `querySelectorAll` selector, then return an object for each match

- The object returned from the callback is wrapped in parens `()`, or else the curly braces would be interpreted as a code block

> ABSOLUTELY CRUCIAL TO REMEMBER USING `({})` FOR THE `Array.from` CALLBACK

Alternate syntax for the same task without using `Array.from()`:

- Use `$$eval()` and pass the selector (IDs/classes) as its 1st argument - this replaces both `Array.from()` and `querySelectorAll()` - the 2nd param is a callback that receives the matched elements as an array, so use `.map()` on it, again with parens around the curly braces to return an object

```js
/* Get all his courses */
const courses = await page.evaluate(() =>
  Array.from(document.querySelectorAll('#courses .card'), item => ({
    title: item.querySelector('.card-body h3').innerText,
    level: item.querySelector('.card-body .level').innerText,
    url: item.querySelector('.card-footer a').href,
    promo: item.querySelector('.card-footer .promo-code .promo').innerText,
  }))
);

/* Alternate for above using $$eval() */
const courses2 = await page.$$eval('#courses .card', items =>
  items.map(item => ({
    title: item.querySelector('.card-body h3').innerText,
    level: item.querySelector('.card-body .level').innerText,
    url: item.querySelector('.card-footer a').href,
    promo: item.querySelector('.card-footer .promo-code .promo').innerText,
  }))
);
```
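
The need for parens around an object literal in an arrow function can be checked in plain JavaScript, without Puppeteer. In this small sketch the `cards` object is made-up sample data standing in for a NodeList:

```js
// Array-like sample data standing in for a NodeList (hypothetical values).
const cards = { length: 2, 0: { name: 'HTML' }, 1: { name: 'CSS' } };

// Without parens the braces are parsed as a function body, `title:` is read
// as a label, and nothing is returned.
const broken = Array.from(cards, item => { title: item.name });

// With parens the braces are an object literal, which is returned.
const fixed = Array.from(cards, item => ({ title: item.name }));

console.log(broken); // [ undefined, undefined ]
console.log(fixed);  // [ { title: 'HTML' }, { title: 'CSS' } ]
```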

### File System

- Use the built-in Node.js `fs` (file system) module to write the values to a file - it ships with Node.js, so there is nothing to install

```js
const fs = require('fs');

/* Write data to a JSON file */
fs.writeFile('courses.json', JSON.stringify(courses2), err => {
  if (err) throw err;
  console.log('File saved');
});
```
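
Since the rest of the script already uses `async`/`await`, the promise-based `fs` API is one alternative to the callback version. This is a minimal sketch, not something from the video: the helper name `saveJson` is a placeholder, and the 2-space indentation is just a readability choice.

```js
const fs = require('fs/promises');

/**
 * Write any JSON-serializable value to a file, indented by 2 spaces.
 * `saveJson` is a hypothetical helper name used only for this sketch.
 */
async function saveJson(filePath, data) {
  await fs.writeFile(filePath, JSON.stringify(data, null, 2));
}

// Usage inside the async scraping function:
// await saveJson('courses.json', courses2);
```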

## innerText vs textContent

- `textContent` gets the content of all elements, including `<script>` and `<style>` elements. In contrast, `innerText` only shows "human-readable" elements

- `textContent` returns every element in the node. In contrast, `innerText` is aware of styling and won't return the text of "hidden" elements

- `innerText` triggers a reflow to ensure up-to-date computed styles. Reflows can be computationally expensive, and thus should be avoided when possible

- Setting either `textContent` or `innerText` removes the element's existing child nodes and replaces them with the new text
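
To see the difference on a real page, both properties can be read from the same element in one call. A minimal sketch, assuming `page` is a Puppeteer Page that has already loaded a URL; the helper name `compareText` is hypothetical.

```js
/**
 * Return both text properties of the first element matching `selector`.
 * `page` is a Puppeteer Page; `compareText` is a hypothetical helper name.
 */
async function compareText(page, selector) {
  return page.$eval(selector, el => ({
    innerText: el.innerText,      // rendered text only, triggers a reflow
    textContent: el.textContent,  // all text nodes, including hidden ones
  }));
}

// Usage inside the async scraping function:
// const { innerText, textContent } = await compareText(page, 'body');
```

For an element with a child hidden by `display: none`, `textContent` should include the hidden text while `innerText` should not.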

`page.content()` returns the full source of the page - would getting the `innerHTML` of the `body` element be better?