https://github.com/woojubb/html-article-extractor
A web page content extractor
- Host: GitHub
- URL: https://github.com/woojubb/html-article-extractor
- Owner: woojubb
- License: MIT
- Created: 2019-01-03T01:43:07.000Z (over 7 years ago)
- Default Branch: master
- Last Pushed: 2024-08-13T05:30:20.000Z (almost 2 years ago)
- Last Synced: 2025-10-20T10:36:09.839Z (8 months ago)
- Topics: article-extracting, article-extractor, crawler, crawling, extraction, extractor
- Language: JavaScript
- Size: 22.5 KB
- Stars: 21
- Watchers: 0
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# html-article-extractor
A web page content extractor for news websites.
# installation
```
npm install html-article-extractor
```
# usage
```javascript
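// jsdom provides the DOM implementation used to parse the HTML string
var { JSDOM } = require("jsdom");
var htmlArticleExtractor = require("html-article-extractor");

var dom = new JSDOM("...");
var body = dom.window.document.body;
// Pass the <body> element; the result holds the extracted html and text
var result = htmlArticleExtractor(body);
console.log(result);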
```
Outputs:
```
{
  html: 'contents',
  text: 'contents'
}
```
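
The result is a plain object, so it can be post-processed like any other value. The following is a minimal sketch, assuming only that the result has the `html` and `text` string fields shown above. `summarizeArticle` and its `maxChars` parameter are hypothetical names used for illustration and are not part of html-article-extractor.

```javascript
// Hypothetical helper (not part of html-article-extractor): builds a short
// summary from a result object shaped like { html: string, text: string }.
function summarizeArticle(result, maxChars) {
  // Collapse runs of whitespace so the excerpt reads as a single line
  var text = (result.text || "").replace(/\s+/g, " ").trim();
  var words = text === "" ? [] : text.split(" ");
  // Truncate long text and mark the cut with an ellipsis
  var excerpt = text.length > maxChars ? text.slice(0, maxChars) + "..." : text;
  return { wordCount: words.length, excerpt: excerpt };
}

// Stand-in for the object returned by htmlArticleExtractor(body)
var sample = {
  html: "<p>Hello   extracted\n world</p>",
  text: "Hello   extracted\n world"
};
console.log(summarizeArticle(sample, 10));
```

Keeping this kind of logic outside the extractor call means it can be tested against hand-written sample objects without parsing any HTML.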
# example
```
git clone https://github.com/woojubb/html-article-extractor
cd html-article-extractor
npm install
node example/crawler.js
```
# demo
https://online-article-extractor.herokuapp.com/