https://github.com/woojubb/html-article-extractor

# html-article-extractor
A web page content extractor for news websites.

# installation
```sh
npm install html-article-extractor
```

# usage
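```javascript
var { JSDOM } = require("jsdom");
var htmlArticleExtractor = require("html-article-extractor");

// Parse the page's HTML ("..." stands for the markup) and pass its <body> to the extractor.
var dom = new JSDOM("...");
var body = dom.window.document.body;
var result = htmlArticleExtractor(body);
console.log(result);
```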

This prints an object holding the extracted markup (`html`) and its plain text (`text`):
```
{
html: '

contents
',
text: 'contents'
}
```
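The returned object can be post-processed like any plain JavaScript object. The sketch below is a minimal illustration that assumes only the `{ html, text }` shape shown above. `summarizeArticle`, its `previewLength` parameter, and the sample markup are hypothetical names and values introduced here, not part of html-article-extractor.

```javascript
// Hypothetical helper (not part of html-article-extractor): builds a small
// summary from the { html, text } object that the extractor returns.
function summarizeArticle(result, previewLength) {
  var limit = previewLength || 80;
  // Collapse runs of whitespace so the preview stays on one line.
  var text = (result.text || "").replace(/\s+/g, " ").trim();
  var words = text === "" ? [] : text.split(" ");
  return {
    wordCount: words.length,
    preview: text.length > limit ? text.slice(0, limit) + "..." : text,
    hasHtml: Boolean(result.html && result.html.trim())
  };
}

// Stand-in for the value returned by htmlArticleExtractor(body);
// the markup here is an assumed example, not real extractor output.
var sample = { html: "<p>contents</p>", text: "contents" };
console.log(summarizeArticle(sample));
```

In a crawler, a check such as `wordCount === 0` could serve as a cheap signal that no article body was found on a page, though what counts as "too short" is a judgment call for each site.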

# example
```
git clone https://github.com/woojubb/html-article-extractor
cd html-article-extractor
npm install
node example/crawler.js
```

# demo
https://online-article-extractor.herokuapp.com/