https://github.com/beenotung/playwright-cache.ts

Cache web pages fetched using Playwright.

cache caching fetch playwright playwright-cache scraping typescript web-scraping

Host: GitHub
URL: https://github.com/beenotung/playwright-cache.ts
Owner: beenotung
License: bsd-2-clause
Created: 2023-11-03T06:52:22.000Z (about 1 year ago)
Default Branch: master
Last Pushed: 2023-11-03T07:07:25.000Z (about 1 year ago)
Last Synced: 2024-08-09T08:27:28.180Z (5 months ago)
Topics: cache, caching, fetch, playwright, playwright-cache, scraping, typescript, web-scraping
Language: TypeScript
Homepage: https://www.npmjs.com/package/playwright-cache.ts
Size: 8.79 KB
Stars: 2
Watchers: 2
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

# playwright-cache.ts

Cache web pages fetched using Playwright.

[![npm Package Version](https://img.shields.io/npm/v/playwright-cache.ts)](https://www.npmjs.com/package/playwright-cache.ts)

This is particularly useful for web scraping where you might want to reduce the load on the server, or speed up the process by reusing previously fetched web pages.

## Installation

To install this library, you can use npm, pnpm or yarn:

```bash
npm install playwright-cache.ts
```

or

```bash
pnpm install playwright-cache.ts
```

or

```bash
yarn add playwright-cache.ts
```

## Usage Example

```typescript
import { chromium } from 'playwright'
import { PlaywrightCache, toOrigin } from 'playwright-cache.ts'

async function main() {
  let browser = await chromium.launch({ headless: false })
  let page = await browser.newPage()

  let cache = new PlaywrightCache({
    cacheDir: '.cache',
    getMode: 'navigate',
  })

  let url = 'https://en.wikipedia.org/wiki/Factorial'

  // Get the HTML of the url, reusing the cached copy when available
  let html = await cache.cachedGetPageContent(page, url)

  // Make sure the page is on the url's origin so relative links resolve correctly
  await toOrigin(page, url, { waitUntil: 'domcontentloaded' })

  // Parse the HTML in the browser and keep only same-origin links
  let links = await page.evaluate(html => {
    let document = new DOMParser().parseFromString(html, 'text/html')
    return Array.from(document.querySelectorAll('a'), a => a.href).filter(
      link => {
        try {
          return new URL(link).origin === location.origin
        } catch (error) {
          // skip hrefs that are not valid URLs
          return false
        }
      },
    )
  }, html)

  console.log(links.length, 'links')
  console.log(links)

  await page.close()
  await browser.close()
}

main().catch(error => console.error(error))
```

## API with TypeScript Signature

### Class: PlaywrightCache

Below is the API of the `PlaywrightCache` `constructor()` and its main method, `cachedGetPageContent()`, which returns the HTML of the given URL:

```typescript
import { Page } from 'playwright'

export type GotoOptions = Parameters<Page['goto']>[1]

export class PlaywrightCache {
  constructor(options?: CacheOption)

  /**
   * @description get the html payload of the given url, auto save and reuse caches
   */
  cachedGetPageContent(
    page: Page,
    url: string,
    options?: GotoOptions,
  ): Promise<string>
}

export type CacheOption = {
  /**
   * @description the directory to store cached web pages
   * @default '.cache'
   */
  cacheDir?: string

  /**
   * @description reuse cached web pages within this interval period (in milliseconds)
   * @default 15*60*1000 (15 minutes)
   */
  cacheInterval?: number

  /**
   * @description to fetch() the web page within the same origin, or to navigate with page.goto()
   * @default 'navigate'
   */
  getMode?: GetMode
}

export type GetMode = 'fetch' | 'navigate'
```
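To make the `cacheInterval` option more concrete, here is a minimal sketch of interval-based reuse. It is not the library's implementation: `IntervalCache` and `Entry` are hypothetical names, the cache lives in memory rather than in `cacheDir`, and treating an entry as stale once its age reaches the interval is an assumption.

```typescript
// Hypothetical in-memory stand-in for the on-disk cache (illustration only).
type Entry = { html: string; savedAt: number }

class IntervalCache {
  private entries = new Map<string, Entry>()
  private cacheInterval: number

  // Default mirrors the documented cacheInterval default of 15 minutes.
  constructor(cacheInterval: number = 15 * 60 * 1000) {
    this.cacheInterval = cacheInterval
  }

  // Store the HTML together with the time it was saved.
  set(url: string, html: string, now: number = Date.now()): void {
    this.entries.set(url, { html, savedAt: now })
  }

  // Return the cached HTML while it is younger than cacheInterval,
  // otherwise undefined so the caller knows to load the page again.
  get(url: string, now: number = Date.now()): string | undefined {
    const entry = this.entries.get(url)
    if (!entry) return undefined
    const age = now - entry.savedAt
    return age < this.cacheInterval ? entry.html : undefined
  }
}
```

Passing `now` explicitly keeps the sketch easy to reason about: with an interval of 1000 ms, an entry saved at time 0 is reused at time 999 and treated as expired at time 1000.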

### Helper Function: toOrigin()

Below is the API of the helper function `toOrigin()`. Call it before `new DOMParser().parseFromString(html, 'text/html')` inside `page.evaluate()` so that relative `href` values resolve against the intended origin.

```typescript
import { Page } from 'playwright'

export type GotoOptions = Parameters<Page['goto']>[1]

/**
 * @description goto the url's origin if not already, for evaluating relative links of a[href]
 */
export function toOrigin(
  page: Page,
  url: string,
  options: GotoOptions,
): Promise<void>
```
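The reason the current origin matters can be sketched with the standard `URL` constructor, which resolves a relative `href` against a base URL. The helpers `resolveHref()` and `isSameOrigin()` below are hypothetical and not part of this library; they only illustrate how the base changes the result.

```typescript
// Hypothetical helper: resolve an href against the URL of the page it came from.
// Returns null when the href cannot be parsed as a URL.
function resolveHref(href: string, pageUrl: string): string | null {
  try {
    return new URL(href, pageUrl).href
  } catch {
    return null
  }
}

// Hypothetical helper: check whether a link points to the same origin as the page.
function isSameOrigin(href: string, pageUrl: string): boolean {
  const resolved = resolveHref(href, pageUrl)
  return resolved !== null && new URL(resolved).origin === new URL(pageUrl).origin
}

const pageUrl = 'https://en.wikipedia.org/wiki/Factorial'

// A root-relative link takes its scheme and host from the base URL.
console.log(resolveHref('/wiki/Gamma_function', pageUrl))

// A protocol-relative link to another host is not same-origin.
console.log(isSameOrigin('//example.com/page', pageUrl))
```

If the same relative `href` were resolved while the page sat on a different site, it would take that site's origin instead, which is the situation `toOrigin()` is meant to avoid.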

## License

This project is licensed under [BSD-2-Clause](./LICENSE).

This is free, libre, and open-source software. It comes down to four essential freedoms [[ref]](https://seirdy.one/2021/01/27/whatsapp-and-the-domestication-of-users.html#fnref:2):

- The freedom to run the program as you wish, for any purpose
- The freedom to study how the program works, and change it so it does your computing as you wish
- The freedom to redistribute copies so you can help others
- The freedom to distribute copies of your modified versions to others