https://github.com/wrtnlabs/web-content-extractor
An LLM-free library for extracting main content from HTML strings via Text Density analysis
Last synced: 5 months ago
- Host: GitHub
- URL: https://github.com/wrtnlabs/web-content-extractor
- Owner: wrtnlabs
- License: mit
- Created: 2025-02-13T05:59:27.000Z (8 months ago)
- Default Branch: main
- Last Pushed: 2025-03-27T04:49:45.000Z (7 months ago)
- Last Synced: 2025-04-05T13:01:38.504Z (6 months ago)
- Language: TypeScript
- Homepage:
- Size: 83 KB
- Stars: 175
- Watchers: 3
- Forks: 13
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# web-content-extractor
A small and fast library for extracting content from HTML.
It is one implementation of the paper [DOM Based Content Extraction via Text Density](https://ofey.me/assets/pdf/cetd-sigir11.pdf).
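To give a feel for the idea behind the paper, here is a minimal sketch of the basic text density measure: the number of characters in a node's subtree divided by the number of tags in that subtree, with a tag count of zero treated as one. The `DomNode` type and the `countSubtree` and `textDensity` functions are hypothetical names made up for this illustration; they are not part of this library's API, and the library's actual implementation may differ.

```ts
// Hypothetical minimal tree type for illustration only;
// this is NOT the library's internal representation.
interface DomNode {
  tag: string;
  text: string; // text directly inside this node
  children: DomNode[];
}

interface Counts {
  chars: number; // characters in the whole subtree
  tags: number; // descendant tags in the subtree
}

// Count characters and descendant tags in one recursive pass.
function countSubtree(node: DomNode): Counts {
  let chars = node.text.length;
  let tags = 0;
  for (const child of node.children) {
    const c = countSubtree(child);
    chars += c.chars;
    tags += c.tags + 1; // the child itself plus its descendants
  }
  return { chars, tags };
}

// Basic text density: characters per tag. A node without
// descendant tags uses a tag count of 1 to avoid dividing by zero.
function textDensity(node: DomNode): number {
  const { chars, tags } = countSubtree(node);
  return chars / Math.max(tags, 1);
}

const paragraph = (text: string): DomNode => ({ tag: "p", text, children: [] });
const anchor = (text: string): DomNode => ({ tag: "a", text, children: [] });

const article: DomNode = {
  tag: "article",
  text: "",
  children: [paragraph("x".repeat(100)), paragraph("y".repeat(100))],
};
const menu: DomNode = {
  tag: "ul",
  text: "",
  children: [anchor("Home"), anchor("Blog"), anchor("Docs"), anchor("Help")],
};

console.log(textDensity(article)); // 200 characters / 2 tags = 100
console.log(textDensity(menu)); // 16 characters / 4 tags = 4
```

Long runs of text with few tags score high, while link-heavy menus score low, which is what lets a density-based extractor separate main content from page chrome.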
## Installation
To install via NPM:
```bash
npm i @wrtnlabs/web-content-extractor
```
## Usage
```ts
import { extractContent } from "@wrtnlabs/web-content-extractor";

// `html` is the HTML string of the page to extract content from
const { title, description, content, contentHtmls, links } =
  extractContent(html);

console.log("title", title);
console.log("description", description);

console.log("content", content); // The content of the page; string
for (const fragment of contentHtmls) {
  console.log("fragment", fragment); // The fragment of the content; string
}

for (const link of links) {
  console.log("url", link.url); // The URL of the link
  console.log("content", link.content); // The content of the link
}
```
## Note
It strips tags that are considered non-content, including:
- `script`
- `noscript`
- `style`
- `nav`
- `header`
- `footer`
- `img`
- `svg`
- `video`
- `audio`
- `form`
- `label`
- `input`
- `select`
- `option`
- `button`
- `object`
- `embed`
- `iframe`
- `canvas`
- `map`
- `area`
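
As a rough illustration of what stripping these tags means, the sketch below prunes them from a simplified tree. The `SimpleNode` type, the `NON_CONTENT_TAGS` set, and the `stripNonContent` function are hypothetical names introduced for this example; the library does this work internally on the HTML you pass in, and its actual implementation may differ.

```ts
// Tag names taken from the list above.
const NON_CONTENT_TAGS = new Set([
  "script", "noscript", "style", "nav", "header", "footer",
  "img", "svg", "video", "audio", "form", "label", "input",
  "select", "option", "button", "object", "embed", "iframe",
  "canvas", "map", "area",
]);

// Hypothetical minimal tree type for illustration only;
// this is NOT the library's internal representation.
interface SimpleNode {
  tag: string;
  text: string;
  children: SimpleNode[];
}

// Return a copy of the tree with every non-content subtree removed.
function stripNonContent(node: SimpleNode): SimpleNode {
  return {
    ...node,
    children: node.children
      .filter((child) => !NON_CONTENT_TAGS.has(child.tag.toLowerCase()))
      .map(stripNonContent),
  };
}

const page: SimpleNode = {
  tag: "body",
  text: "",
  children: [
    { tag: "nav", text: "Home | About", children: [] },
    {
      tag: "div",
      text: "",
      children: [
        { tag: "p", text: "Main article text.", children: [] },
        { tag: "BUTTON", text: "Share", children: [] },
      ],
    },
    { tag: "footer", text: "Copyright", children: [] },
  ],
};

const cleaned = stripNonContent(page);
console.log(cleaned.children.map((c) => c.tag)); // [ 'div' ]
console.log(cleaned.children[0].children.map((c) => c.tag)); // [ 'p' ]
```

Because whole subtrees are dropped, any text nested inside a stripped tag (for example, links inside `nav`) also disappears from the result.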