https://github.com/ncarlier/htmlgrabr
A Node.js library to grab and clean HTML content.
- Host: GitHub
- URL: https://github.com/ncarlier/htmlgrabr
- Owner: ncarlier
- License: mit
- Created: 2018-11-04T17:41:52.000Z (almost 7 years ago)
- Default Branch: master
- Last Pushed: 2023-01-06T01:36:52.000Z (almost 3 years ago)
- Last Synced: 2025-07-06T11:03:25.169Z (3 months ago)
- Language: TypeScript
- Homepage: https://ncarlier.github.io/htmlgrabr/
- Size: 2.65 MB
- Stars: 5
- Watchers: 2
- Forks: 0
- Open Issues: 19
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
README
# HTMLGrabr library

A Node.js library to grab and clean HTML content.

### Features
- Extract page content from a URL (`HTMLGrabr.grabUrl(url: URL): Promise<GrabbedPage>`)
- Extract page content from a string (`HTMLGrabr.grab(s: string): GrabbedPage`)
- Extract Open Graph properties
- Clean the page content:
  - Extract main HTML content using [mozilla-readability](https://github.com/mozilla/readability)
  - Sanitize HTML content using [DOMPurify](https://github.com/cure53/DOMPurify), with some extras:
    - Remove unwanted links or images
    - Remove tracking pixels
    - Remove unwanted attributes (such as `style`, `class`, `id`, ...)
    - And more

### Usage

```bash
npm install --save htmlgrabr
```

Then in your code:

```javascript
const HTMLGrabr = require('htmlgrabr').HTMLGrabr
const { URL } = require('url')

const grabber = new HTMLGrabr()
grabber.grabUrl(new URL('https://about.readflow.app'))
  .then(page => {
    console.log(page)
  }, err => {
    console.error(err)
  })
```

### API

Create new instance:
```js
const HTMLGrabr = require('htmlgrabr').HTMLGrabr
const grabber = new HTMLGrabr(config)
```

Configuration object:

```typescript
interface GrabberConfig {
  debug?: boolean                     // Print debug logs if true
  pretty?: boolean                    // Beautify HTML content if true
  isBlockedHost?: BlockedHostCtrlFunc // Function used to detect unwanted URLs
  rewriteURL?: URLRewriterFunc        // Function used to rewrite HTML src attributes
  rules?: Map                         // Rule definitions (see below)
  headers?: Headers                   // HTTP headers to set
}
```

Rule definition:

```typescript
export interface Rule {
  selector: string             // HTML query selector
  type: 'redirect' | 'content' // Rule type:
                               // - 'redirect' will use 'src' or 'href' attribute to redirect content extraction
                               // - 'content' to specify content to extract
}
```

Grab a page:

```js
// grabUrl returns a promise that resolves to a GrabbedPage
const result = grabber.grabUrl(new URL('https://...'))
```

Result object:

```typescript
interface GrabbedPage {
  title: string        // Page title
  url: string | null   // Source URL
  image: string | null // Page illustration
  html: string         // HTML content
  text: string         // Text content (from HTML)
  excerpt: string      // Excerpt (from meta data or HTML)
  length: number       // Read length
  images: ImageMeta[]  // Embedded image URLs
}
```

---
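
The `GrabbedPage` fields listed above can be consumed like any plain object. The sketch below shows one possible way to turn a result into a short summary. It does not call the library: `samplePage` is made-up data shaped like the interface, and `summarizePage` together with its 200 words-per-minute default are hypothetical helpers introduced here for illustration, not part of htmlgrabr.

```javascript
// Made-up sample shaped like the GrabbedPage interface (not real library output).
const samplePage = {
  title: 'Example article',
  url: 'https://example.com/article',
  image: null,
  html: '<p>Hello <b>world</b>, this is a short example.</p>',
  text: 'Hello world, this is a short example.',
  excerpt: 'Hello world',
  length: 37,
  images: []
}

// Hypothetical helper: builds a compact summary from a GrabbedPage-like object.
// The reading time is estimated from the word count of `text`, which is an
// assumption of this sketch and not how the library computes `length`.
function summarizePage (page, wordsPerMinute = 200) {
  const words = page.text.trim().split(/\s+/).filter(Boolean).length
  return {
    title: page.title || '(untitled)',
    source: page.url || '(no source URL)', // `url` may be null
    excerpt: page.excerpt,
    words,
    minutes: Math.max(1, Math.ceil(words / wordsPerMinute)),
    imageCount: page.images.length
  }
}

console.log(summarizePage(samplePage))
```

With a real grabber instance, the same helper could presumably be chained onto the promise, for example `grabber.grabUrl(url).then(summarizePage)`.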