https://github.com/woojubb/html-article-extractor

A web page content extractor

Host: GitHub
URL: https://github.com/woojubb/html-article-extractor
Owner: woojubb
License: MIT
Created: 2019-01-03T01:43:07.000Z (over 7 years ago)
Default Branch: master
Last Pushed: 2024-08-13T05:30:20.000Z (almost 2 years ago)
Last Synced: 2025-10-20T10:36:09.839Z (8 months ago)
Topics: article-extracting, article-extractor, crawler, crawling, extraction, extractor
Language: JavaScript
Size: 22.5 KB
Stars: 21
Watchers: 0
Forks: 1
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

README

# html-article-extractor

A web page content extractor for news websites.

# installation

```shell
npm install html-article-extractor
```

# usage

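```javascript
// JSDOM is provided by the jsdom package
var { JSDOM } = require("jsdom");
var htmlArticleExtractor = require("html-article-extractor");

// Parse the page markup ("..." stands in for the HTML string)
var dom = new JSDOM("...");
var body = dom.window.document.body;

// Pass the <body> element to the extractor
var result = htmlArticleExtractor(body);
console.log(result);
```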

Outputs:

```
{
    html: 'contents',
    text: 'contents'
}
```
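
Before storing or displaying a result, it can help to check that the extractor actually found something. The sketch below is not part of this package: `summarizeArticle` is a hypothetical helper, and it assumes only the `{ html, text }` shape shown in the output above. The `sample` object is a hand-written stand-in for what `htmlArticleExtractor(body)` would return.

```javascript
// Hypothetical helper (not part of html-article-extractor):
// summarizes an object of the documented { html, text } shape.
function summarizeArticle(result) {
  // Treat a missing result or missing fields as empty strings.
  var text = (result && typeof result.text === "string") ? result.text.trim() : "";
  var html = (result && typeof result.html === "string") ? result.html : "";

  // Split on runs of whitespace; an empty text yields zero words.
  var words = text === "" ? [] : text.split(/\s+/);

  return {
    hasContent: text.length > 0,
    wordCount: words.length,
    textLength: text.length,
    htmlLength: html.length
  };
}

// Stand-in for a real extraction result (assumed values, for illustration only).
var sample = {
  html: "<p>Hello extracted world</p>",
  text: "Hello extracted world"
};

console.log(summarizeArticle(sample));
```

In a crawler, a result where `hasContent` is `false` could be skipped or logged for review rather than saved.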

# example

```shell
git clone https://github.com/jungyoun/html-article-extractor
cd html-article-extractor
npm install
node example/crawler.js
```

# demo

https://online-article-extractor.herokuapp.com/
