https://github.com/wrtnlabs/web-content-extractor

An LLM-free library for extracting main content from HTML strings via text density analysis

Last synced: 5 months ago

Host: GitHub
URL: https://github.com/wrtnlabs/web-content-extractor
Owner: wrtnlabs
License: mit
Created: 2025-02-13T05:59:27.000Z (8 months ago)
Default Branch: main
Last Pushed: 2025-03-27T04:49:45.000Z (7 months ago)
Last Synced: 2025-04-05T13:01:38.504Z (6 months ago)
Language: TypeScript
Homepage:
Size: 83 KB
Stars: 175
Watchers: 3
Forks: 13
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

README

# web-content-extractor

A small and fast library for extracting content from HTML.

It is one implementation of the paper [DOM Based Content Extraction via Text Density](https://ofey.me/assets/pdf/cetd-sigir11.pdf).
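
To give a rough feel for the idea behind the paper, here is a minimal sketch of a basic text density measure: the number of characters in a subtree divided by the number of tags it contains. The `SketchNode` type and the function names are hypothetical and exist only for illustration; this is not the library's actual code, and the paper refines the measure further (for example by accounting for link text).

```ts
// Hypothetical minimal tree type for illustration only;
// not the library's internal representation.
interface SketchNode {
  tag: string;
  text: string; // text placed directly inside this node
  children: SketchNode[];
}

// Count all characters in the subtree rooted at `node`.
function charCount(node: SketchNode): number {
  return node.children.reduce(
    (sum, child) => sum + charCount(child),
    node.text.length,
  );
}

// Count all descendant tags of `node` (the node itself is not counted).
function tagCount(node: SketchNode): number {
  return node.children.reduce((sum, child) => sum + 1 + tagCount(child), 0);
}

// Text density: characters per tag. A tag count of 0 is treated as 1
// so that leaf nodes do not cause a division by zero.
function textDensity(node: SketchNode): number {
  return charCount(node) / Math.max(tagCount(node), 1);
}
```

Under this measure, a subtree with long paragraphs and few tags scores high, while a navigation block made of many short links scores low, which is what lets the main content stand out.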

## Installation

To install via NPM:

```bash
npm i @wrtnlabs/web-content-extractor
```

## Usage

```ts
import { extractContent } from "@wrtnlabs/web-content-extractor";

// `html` is the page's HTML source as a string (placeholder value for illustration).
const html =
  "<html><body><article><p>Hello, world.</p></article></body></html>";

const { title, description, content, contentHtmls, links } =
  extractContent(html);

console.log("title", title);
console.log("description", description);
console.log("content", content); // The content of the page; string

for (const fragment of contentHtmls) {
  console.log("fragment", fragment); // A fragment of the content; string
}

for (const link of links) {
  console.log("url", link.url); // The URL of the link
  console.log("content", link.content); // The content of the link
}
```
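
As a small follow-up, the returned `links` can be post-processed like any other array. The sketch below renders them as a Markdown list. The `ExtractedLink` interface and the `linksToMarkdown` helper are hypothetical: the shape is inferred from the usage example (a `url` and a `content` field, both assumed to be strings) and is not taken from the library's type definitions.

```ts
// Assumed shape of one entry in `links`, inferred from the usage example;
// the library's real type may differ.
interface ExtractedLink {
  url: string;
  content: string;
}

// Render links as a Markdown bullet list, skipping links without visible text.
function linksToMarkdown(links: ExtractedLink[]): string {
  return links
    .filter((link) => link.content.trim().length > 0)
    .map((link) => `- [${link.content.trim()}](${link.url})`)
    .join("\n");
}
```

Calling `linksToMarkdown(links)` with the `links` value from the example above would then yield one bullet per link that has text.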

## Note

The extractor strips tags that are considered non-content, including:

- `script`
- `noscript`
- `style`
- `nav`
- `header`
- `footer`
- `img`
- `svg`
- `video`
- `audio`
- `form`
- `label`
- `input`
- `select`
- `option`
- `button`
- `object`
- `embed`
- `iframe`
- `canvas`
- `map`
- `area`
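
The stripping step can be pictured as pruning a tree before any scoring happens. The following is a minimal sketch under that assumption; the `HtmlNode` type, the `NON_CONTENT_TAGS` set, and `pruneNonContent` are hypothetical names for illustration, and the library's real implementation may work differently.

```ts
// Hypothetical minimal tree type for illustration only.
interface HtmlNode {
  tag: string;
  text: string;
  children: HtmlNode[];
}

// The tag names listed above, collected into a set for fast lookup.
const NON_CONTENT_TAGS = new Set([
  "script", "noscript", "style", "nav", "header", "footer", "img", "svg",
  "video", "audio", "form", "label", "input", "select", "option", "button",
  "object", "embed", "iframe", "canvas", "map", "area",
]);

// Return a copy of `node` in which every descendant whose tag is in
// NON_CONTENT_TAGS has been removed together with its subtree.
// The root node itself is always kept.
function pruneNonContent(node: HtmlNode): HtmlNode {
  return {
    ...node,
    children: node.children
      .filter((child) => !NON_CONTENT_TAGS.has(child.tag.toLowerCase()))
      .map(pruneNonContent),
  };
}
```

Because the function returns a new tree rather than mutating its input, the original structure stays available for comparison or debugging.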
