Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/beenotung/playwright-cache.ts
Cache web pages fetched using the Playwright.
https://github.com/beenotung/playwright-cache.ts
cache caching fetch playwright playwright-cache scraping typescript web-scraping
Last synced: about 2 months ago
JSON representation
Cache web pages fetched using the Playwright.
- Host: GitHub
- URL: https://github.com/beenotung/playwright-cache.ts
- Owner: beenotung
- License: bsd-2-clause
- Created: 2023-11-03T06:52:22.000Z (about 1 year ago)
- Default Branch: master
- Last Pushed: 2023-11-03T07:07:25.000Z (about 1 year ago)
- Last Synced: 2024-08-09T08:27:28.180Z (5 months ago)
- Topics: cache, caching, fetch, playwright, playwright-cache, scraping, typescript, web-scraping
- Language: TypeScript
- Homepage: https://www.npmjs.com/package/playwright-cache.ts
- Size: 8.79 KB
- Stars: 2
- Watchers: 2
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# playwright-cache.ts
Cache web pages fetched using the Playwright.
[![npm Package Version](https://img.shields.io/npm/v/playwright-cache.ts)](https://www.npmjs.com/package/playwright-cache.ts)
This is particularly useful for web scraping where you might want to reduce the load on the server, or speed up the process by reusing previously fetched web pages.
## Installation
To install this library, you can use npm, pnpm or yarn:
```bash
npm install playwright-cache.ts
```or
```bash
pnpm install playwright-cache.ts
```or
```bash
yarn add playwright-cache.ts
```## Usage Example
```typescript
import { chromium } from 'playwright'
import { PlaywrightCache, toOrigin } from 'playwright-cache.ts'async function main() {
let browser = await chromium.launch({ headless: false })
let page = await browser.newPage()let cache = new PlaywrightCache({
cacheDir: '.cache',
getMode: 'navigate',
})let url = 'https://en.wikipedia.org/wiki/Factorial'
let html = await cache.cachedGetPageContent(page, url)
await toOrigin(page, url, { waitUntil: 'domcontentloaded' })
let links = await page.evaluate(html => {
let document = new DOMParser().parseFromString(html, 'text/html')
return Array.from(document.querySelectorAll('a'), a => a.href).filter(
link => {
try {
return new URL(link).origin == location.origin
} catch (error) {
return false
}
},
)
}, html)console.log(links.length, 'links')
console.log(links)await page.close()
await browser.close()
}main()
```## API with Typescript Signature
### Class: PlaywrightCache
Below is the API of `PlaywrightCache`'s `constructor()` and it's main method `cachedGetPageContent()`, which returns the html of the given url:
```typescript
import { Page } from 'playwright'
export type GotoOptions = Parameters[1]export class PlaywrightCache {
constructor(options?: CacheOption)/**
* @description get the html payload of the given url, auto save and reuse caches
*/
cachedGetPageContent(
page: Page,
url: string,
options?: GotoOptions,
): Promise
}export type CacheOption = {
/**
* @description the directory to store cached web pages
* @default '.cache'
* */
cacheDir?: string
/**
* @description reuse cached web pages within this interval period (in milliseconds)
* @default 15*60*1000 (15 minutes)
*/
cacheInterval?: number
/**
* @description to fetch() the web page within the same origin, or to navigate with page.goto()
* @default 'navigate'
*/
getMode?: GetMode
}export type GetMode = 'fetch' | 'navigate'
```### Helper Function: toOrigin()
Below is the API of a helper function `toOrigin()`, which can be called before calling `new DOMParser().parseFromString(html,'text/html')` in `page.evaluate()` to make sure it handles relative href properly.
```typescript
import { Page } from 'playwright'
export type GotoOptions = Parameters[1]/**
* @description goto the url's origin if not already, for evaluating relative links of a[href]
*/
export function toOrigin(
page: Page,
url: string,
options: GotoOptions,
): Promise
```## License
This project is licensed with [BSD-2-Clause](./LICENSE)
This is free, libre, and open-source software. It comes down to four essential freedoms [[ref]](https://seirdy.one/2021/01/27/whatsapp-and-the-domestication-of-users.html#fnref:2):
- The freedom to run the program as you wish, for any purpose
- The freedom to study how the program works, and change it so it does your computing as you wish
- The freedom to redistribute copies so you can help others
- The freedom to distribute copies of your modified versions to others