https://github.com/mitica/ascrape-js
Extracts article content from a web page.
- Host: GitHub
- URL: https://github.com/mitica/ascrape-js
- Owner: mitica
- Created: 2016-07-14T04:35:34.000Z (about 9 years ago)
- Default Branch: master
- Last Pushed: 2017-02-25T13:43:17.000Z (over 8 years ago)
- Last Synced: 2025-06-26T23:58:59.605Z (3 months ago)
- Topics: article-extracting, cheerio
- Language: JavaScript
- Size: 225 KB
- Stars: 10
- Watchers: 2
- Forks: 5
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# ascrape
Node.js module for extracting web page content using Cheerio.
This module is based on [luin](https://github.com/luin/readability)'s readability project.
## Install
```
npm install ascrape
```
## Usage
```
var scrape = require('ascrape');

scrape(html [, options], callback);
```
**Where**
- **html** is a URL or HTML code.
- **options** is an optional options object.
- **callback** is the callback to run: callback(error, article, meta)
## Example
```
var scrape = require('ascrape');

scrape('http://howtonode.org/really-simple-file-uploads', function(err, article, meta) {
  // Main Article
  console.log(article.content.text());

  // Title
  console.log(article.title);

  // Article HTML Source Code
  console.log(article.content.html());
});
```
**NB** If the page is marked with a charset other than utf-8, it is converted automatically. Charsets such as GBK and GB2312 are also supported.
## Options
ascrape passes the options directly to request. See the request library's documentation for all available options.
ascrape has one additional option:
- **preprocess** - a function that checks or modifies the downloaded source before it is passed to ascrape.
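A size check like this can be factored into a reusable function. The following is a minimal sketch, not part of ascrape: `makeSizeLimit` is a hypothetical helper name, and the limit of 10 characters is chosen only to keep the demonstration short. It assumes nothing beyond the `(source, response, contentType, callback)` signature that ascrape documents for `preprocess`.

```javascript
// Hypothetical factory (not part of ascrape): builds a preprocess function
// with the (source, response, contentType, callback) signature.
function makeSizeLimit(maxBodySize) {
  return function preprocess(source, response, contentType, callback) {
    if (source.length > maxBodySize) {
      // Reject oversized pages before they reach the extractor.
      return callback(new Error('too big'));
    }
    // Pass the source through unchanged.
    callback(null, source);
  };
}

// Exercise the hook directly with stand-in values; no network access needed.
var results = [];
var limit = makeSizeLimit(10);

limit('short', null, 'text/html', function (err, source) {
  results.push(err ? err.message : source);
});
limit('this body is longer than ten characters', null, 'text/html', function (err, source) {
  results.push(err ? err.message : source);
});

console.log(results);
```

A function built this way could then be supplied as the `preprocess` option, for example `{ preprocess: makeSizeLimit(1024 * 1024) }`.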
```
scrape(url, {
  preprocess: function(source, response, contentType, callback) {
    if (source.length > maxBodySize) {
      return callback(new Error('too big'));
    }
    callback(null, source);
  }
}, function(err, article, response) {
  //...
});
```
### Article object
- **content** - The article content of the web page, as a Cheerio object. It is `false` if extraction failed.
- **title** - The article title of the web page. It may differ from the text in the `<title>` tag.
- **excerpt** - The article description, taken from any description, og:description or twitter:description `<meta>` tag.
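Because `content` can be `false`, it is worth checking it before calling Cheerio methods on it. The sketch below shows one way to do that. `summarizeArticle` is a hypothetical helper, not part of ascrape, and `fakeArticle` is a stand-in object that only mimics the `text()` method of a Cheerio object so the example runs without network access.

```javascript
// Hypothetical helper (not part of ascrape): turns an article object into a
// plain summary, guarding against content === false.
function summarizeArticle(article) {
  if (!article || article.content === false) {
    return {
      ok: false,
      title: (article && article.title) || null,
      excerpt: null,
      text: null
    };
  }
  return {
    ok: true,
    title: article.title || null,
    excerpt: article.excerpt || null,
    // With ascrape, content is a Cheerio object, so text() is available.
    text: article.content.text().trim()
  };
}

// Stand-in for a real article; only the fields described above are mimicked.
var fakeArticle = {
  title: 'Example title',
  excerpt: 'Short description',
  content: { text: function () { return '  Body text  '; } }
};

var summary = summarizeArticle(fakeArticle);
var failed = summarizeArticle({ title: 'No body', content: false });

console.log(summary);
console.log(failed);
```

Inside a real `scrape` callback, the same helper would be called with the `article` argument after checking `err`.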
## License
MIT License