https://github.com/arianrhodsandlot/html-play
Fetch and parse web pages with Node.js like a boss.
- Host: GitHub
- URL: https://github.com/arianrhodsandlot/html-play
- Owner: arianrhodsandlot
- License: mit
- Created: 2024-01-23T03:27:15.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2024-02-12T10:19:17.000Z (12 months ago)
- Last Synced: 2024-02-12T11:35:43.271Z (12 months ago)
- Topics: fetch, headless-chrome, playwright, scraping
- Language: HTML
- Homepage:
- Size: 765 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: readme.md
- Changelog: changelog.md
- License: license
# HTML-Play
Fetch and parse web pages with Node.js like a boss.
## Features
+ Intuitive APIs for extracting useful contents like links and images.
+ CSS selectors.
+ Mocked user-agent (like a real web browser).
+ Full JavaScript support.
```js
await htmlPlay(url, { browser: true })
```
Using Chromium under the hood by default, thanks to [Playwright](https://playwright.dev).
## Recipes
+ Grab a list of all links and images on the page.
```js
import { htmlPlay } from 'html-play'
const { dom } = await htmlPlay('https://nodejs.org')
// Will print all link URLs on the page
console.log(dom.links)
// Will print all image URLs on the page
console.log(dom.images)
```
+ Select an element with a CSS selector.
```js
import { htmlPlay } from 'html-play'
const { dom } = await htmlPlay('https://nodejs.org')
const intro = dom.find('#home-intro', { containing: 'Node' })
// Will print: 'Node.js® is an open-source, cross-platform...'
console.log(intro.text)
```
+ Let's grab some wallpapers from unsplash.
```js
import { htmlPlay } from 'html-play'
const { dom } = await htmlPlay('https://unsplash.com/t/wallpapers')
const elements = dom.findAll('img[itemprop=thumbnailUrl]')
const images = elements.map(({ image }) => image)
// Will print something like
// ['https://images.unsplash.com/photo-1705834008920-b08bf6a05223', ...]
console.log(images)
```
+ Let's load some news from Hacker News.
```js
import { htmlPlay } from 'html-play'
const { dom } = await htmlPlay('https://news.ycombinator.com')
const titles = dom.findAll('.titleline')
const news = titles.map(({ text, link }) => [text, link])
// Will print something like
// [['news 1', 'http://xxx.com'], ['news 2', 'http://yyy.com'], ...]
console.log(news)
```
+ Load a dynamic website, which means its content is generated by JavaScript!
```js
// Search for images of "flower" with Google
import { htmlPlay } from 'html-play'
const { dom } = await htmlPlay('https://www.google.com/search?&q=flower&tbm=isch', { browser: true })
// Filtering is still needed if you want this to work...
console.log(dom.images)
```
+ Send requests with custom cookies.
```js
import { htmlPlay } from 'html-play'
const { dom } = await htmlPlay('https://httpbin.org/cookies', {
  fetch: { fetchInit: { headers: { Cookie: 'a=1; b=2;' } } },
})
// Will print { "cookies": { "a": "1", "b": "2" } }
console.log(dom.text)
```
## Installation
```sh
npm i html-play
```
If you want to use a browser to "run" the page before parsing, you'll need to install Chromium with Playwright.
```sh
npm i playwright
npx playwright install chromium
```
## APIs
+ ### Methods
#### `htmlPlay`
Fetch a certain URL and return its response with the parsed DOM.
##### Example:
```js
import { htmlPlay } from 'html-play'
const { dom } = await htmlPlay('http://example.com')
```
##### Parameters:
+ `url`
  Type: `string`
  The URL to fetch.
+ `options` (Optional)
  Type: `object`
  Default: `{ fetch: true }`
  + `fetch` (Optional)
    Type: `boolean | object`
    Default: `true`
    If set to `true`, [the Fetch API](https://developer.mozilla.org/en-US/docs/Web/API/fetch) is used to load the requested URL. You can also specify options for [the Fetch API](https://developer.mozilla.org/en-US/docs/Web/API/fetch) by passing an `object` here.
    + `fetcher` (Optional)
      Type: `function`
      The fetch function to use. A polyfill can be passed here.
    + `fetchInit` (Optional)
      Type: `object`
      The options passed to the fetch function. See [fetch#options](https://developer.mozilla.org/en-US/docs/Web/API/fetch#options). You can set HTTP headers or cookies here.
  + `browser` (Optional)
    Type: `boolean | object`
    Default: `false`
    If set to `true`, Playwright is used to load the requested URL. You can also specify options for Playwright by passing an `object` here.
    + `browser` (Optional)
      Type: `object`
      [The Playwright Browser instance](https://playwright.dev/docs/api/class-browser) to use.
    + `page` (Optional)
      Type: `object`
      [The Playwright Page instance](https://playwright.dev/docs/api/class-page) to use.
    + `launchOptions` (Optional)
      The `launchOptions` passed to Playwright when launching the browser. See [BrowserType#browser-type-launch](https://playwright.dev/docs/api/class-browsertype#browser-type-launch).
    + `beforeNavigate` (Optional)
      A custom hook function that will be called before the page is loaded. `page` and `browser` can be accessed here as properties of its first parameter to interact with the page.
    + `afterNavigate` (Optional)
      A custom hook function that will be called after the page is loaded. `page` and `browser` can be accessed here as properties of its first parameter to interact with the page.
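
The `fetcher` and `afterNavigate` options described above both accept plain functions, so they can be prepared separately and reused. The following is a minimal sketch rather than part of the library: `withRetry` and `waitForImages` are hypothetical helper names, and it assumes that `fetcher` is called with the same `(input, init)` arguments as the standard fetch function and that `page` is a regular Playwright `Page`.

```js
// Hypothetical helper: wrap any fetch-compatible function so that network
// errors and 5xx responses are retried a limited number of times.
// Assumption: html-play calls `fetcher` like the standard fetch(input, init).
function withRetry(fetcher, retries = 2) {
  return async function retryingFetch(input, init) {
    let lastError
    for (let attempt = 0; attempt <= retries; attempt += 1) {
      try {
        const response = await fetcher(input, init)
        // Hand back successful responses, or whatever we got on the last attempt.
        if (response.status < 500 || attempt === retries) return response
      } catch (error) {
        lastError = error
      }
    }
    // Only reached when the final attempt threw.
    throw lastError
  }
}

// Hypothetical hook for browser mode: wait until at least one <img> element
// is attached to the page. `waitForSelector` is a Playwright Page method.
async function waitForImages({ page }) {
  await page.waitForSelector('img', { state: 'attached', timeout: 10000 })
}

// Option objects shaped like the `options` parameter documented above.
const fetchOptions = { fetch: { fetcher: withRetry(globalThis.fetch) } }
const browserOptions = { browser: { afterNavigate: waitForImages } }
```

Either object could then be passed as the second argument, for example `htmlPlay(url, fetchOptions)` or `htmlPlay(url, browserOptions)`.
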
##### Returns:
A `Promise` of the [`Response`](#Response) instance (see below).
+ ### Classes
#### `Response`
##### Properties
+ `url`
Type: `string`
The URL of the response. If the response is redirected from another URL, the value will be the final redirected URL.
+ `status`
Type: `number`
The HTTP status code of the response.
+ `content`
Type: `string`
The response content as a plain string.
+ `dom`
Type: `object`
The parsed root DOM. See [`DOMElement`](#DOMElement).
+ `json`
Type: `object | undefined`
The parsed response JSON. If the response body is not valid JSON, it will be `undefined`.
+ `rawBrowserResponse`
Type: `object`
The raw response object returned by Playwright.
+ `rawFetchResponse`
Type: `object`
The raw response object returned by [the Fetch API](https://developer.mozilla.org/en-US/docs/Web/API/fetch).
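
As an illustration of how the properties above fit together, here is a small sketch that checks `status` and then prefers `json` over `content`. `readPayload` is a hypothetical helper, not something the library provides, and it relies only on the documented `url`, `status`, `json`, and `content` properties.

```js
// Hypothetical helper: turn a response into a payload, or throw on HTTP errors.
function readPayload(response) {
  const { url, status, json, content } = response
  if (status >= 400) {
    throw new Error(`Request to ${url} failed with status ${status}`)
  }
  // `json` is `undefined` when the body is not valid JSON,
  // so fall back to the plain string content in that case.
  return json === undefined ? content : json
}

// Possible usage (requires html-play to be installed):
// const payload = readPayload(await htmlPlay('https://httpbin.org/cookies'))
```
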
#### `DOMElement`
##### Properties
+ `html`
Type: `string`
The "[`outerHTML`](https://developer.mozilla.org/en-US/docs/Web/API/Element/outerHTML)" of this element.
+ `link`
Type: `string`
If the element is an [anchor element](https://developer.mozilla.org/en-US/docs/Web/HTML/Element/a), this will be the absolute URL of the element's link; otherwise it will be an empty string.
+ `links`
Type: `string[]`
The link URLs of all the [anchor elements](https://developer.mozilla.org/en-US/docs/Web/HTML/Element/a) inside this element.
+ `text`
Type: `string`
The text of the element, with whitespace and line breaks stripped.
+ `rawText`
Type: `string`
The original text of the element.
+ `image`
Type: `string`
If the element is an [image embed element](https://developer.mozilla.org/en-US/docs/Web/HTML/Element/img), this will be the absolute URL of the element's image; otherwise it will be an empty string.
+ `images`
Type: `string[]`
All the image URLs inside this element.
+ `backgroundImage`
Type: `string`
The background image source extracted from the element's inline style.
+ `element`
Type: `object`
The corresponding `JSDOM` element object.
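
Because `links` and `images` are plain arrays of URL strings, they can be post-processed with ordinary JavaScript. The sketch below shows one possible filter that keeps unique `http(s)` URLs on a given host. `filterByHost` is a hypothetical helper, and the defensive handling of non-URL strings is an assumption about what a scraped page might contain.

```js
// Hypothetical helper: keep unique http(s) URLs that belong to the given host.
function filterByHost(urls, hostname) {
  const kept = new Set()
  for (const url of urls) {
    let parsed
    try {
      parsed = new URL(url)
    } catch {
      continue // Skip strings that are not absolute URLs.
    }
    const isWeb = parsed.protocol === 'http:' || parsed.protocol === 'https:'
    if (isWeb && parsed.hostname === hostname) kept.add(parsed.href)
  }
  return [...kept]
}

// Possible usage (requires html-play to be installed):
// const { dom } = await htmlPlay('https://nodejs.org')
// console.log(filterByHost(dom.links, 'nodejs.org'))
```
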
##### Methods
+ `find`
  Find the first matched child `DOMElement` inside this element.
  ##### Parameters
  + `selector`
    Type: `string`
    The CSS selector to use.
  + `options` (Optional)
    Type: `object`
    + `containing` (Optional)
      Type: `string`
      Check if the element contains the specified substring.
+ `findAll`
  Find all matched child `DOMElement`s inside this element.
  ##### Parameters
  + `selector`
    Type: `string`
    The CSS selector to use.
  + `options` (Optional)
    Type: `object`
    + `containing` (Optional)
      Type: `string`
      Check if the element contains the specified substring.
+ `getAttribute`
  Returns the element's first attribute whose qualified name is `qualifiedName`, or `undefined` if there is no such attribute.
  ##### Parameters
  + `qualifiedName`
    Type: `string`
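
The `find` method and its `containing` option can be combined into small fallbacks when a page layout is uncertain. The following sketch tries several selectors in order and returns the text of the first match. `firstText` is a hypothetical helper, and it assumes that `find` returns a falsy value when nothing matches, which the documentation above does not state.

```js
// Hypothetical helper: try several CSS selectors in order and return
// the text of the first element that matches.
function firstText(dom, selectors, options = {}) {
  for (const selector of selectors) {
    const element = dom.find(selector, options)
    // Assumption: `find` returns a falsy value when nothing matches.
    if (element) return element.text
  }
  return ''
}

// Possible usage (requires html-play to be installed):
// const { dom } = await htmlPlay('https://nodejs.org')
// console.log(firstText(dom, ['#home-intro', 'h1'], { containing: 'Node' }))
```
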
## Credits
This project is heavily inspired by [Requests-HTML](https://github.com/psf/requests-html), another fabulous library for Python.
## License
[MIT](license)