https://github.com/Tjatse/node-readability
Scrape or crawl articles from any site automatically. Make any web page readable, whether it is in Chinese or English.
- Host: GitHub
- URL: https://github.com/Tjatse/node-readability
- Owner: Tjatse
- Created: 2014-05-10T02:42:52.000Z (over 10 years ago)
- Default Branch: master
- Last Pushed: 2018-08-01T06:37:53.000Z (over 6 years ago)
- Last Synced: 2024-10-14T01:37:00.480Z (3 months ago)
- Language: JavaScript
- Homepage:
- Size: 573 KB
- Stars: 343
- Watchers: 11
- Forks: 36
- Open Issues: 7
Metadata Files:
- Readme: README.md
- Changelog: HISTORY.md
Awesome Lists containing this project
- awesome-nodejs-cn - read-art - Extract readable content from any page (Packages / Humanize)
- awesome-nodejs-cn - read-art - **star:343** Extract readable content from any page (Packages / Humanize)
- awesome-nodejs - read-art - Extract readable content from any page. (Packages / Humanize)
- awesome-nodejs - node-readability - Scrape/Crawl article from any site automatically. Make any web page readable, no matter Chinese or English. - ★ 271 (Humanize)
- awesome-node - read-art - Extract readable content from any page. (Packages / Humanize)
- awesome-nodejs-cn - read-art - Extract readable content from any page. (Contents / Humanize)
README
read-art [![NPM version](https://badge.fury.io/js/read-art.svg)](http://badge.fury.io/js/read-art) [![Build Status](https://travis-ci.org/Tjatse/node-readability.svg?branch=master)](https://travis-ci.org/Tjatse/node-readability) [![js-standard-style](https://img.shields.io/badge/code%20style-standard-brightgreen.svg)](http://standardjs.com/)
=========
[![NPM](https://nodei.co/npm/read-art.png?downloads=true&downloadRank=true&stars=true)](https://nodei.co/npm/read-art/)

1. Readability extraction based on Arc90's Readability.
2. Scrape articles from any page (automatically).
3. Make any web page readable, whether it is in Chinese or English.

> *Quickly scrape the titles and content of web articles; well suited to Node.js crawlers and to feeding ElasticSearch.*

## Guide
- [Features](https://github.com/Tjatse/node-readability/wiki/Handbook#features)
- [Performance](https://github.com/Tjatse/node-readability/wiki/Handbook#perfs)
- [Installation](https://github.com/Tjatse/node-readability/wiki/Handbook#ins)
- [Usage](https://github.com/Tjatse/node-readability/wiki/Handbook#usage)
- [Debug](https://github.com/Tjatse/node-readability/wiki/Handbook#debug)
- [Score Rule](https://github.com/Tjatse/node-readability/wiki/Handbook#score_rule)
- [Extract Selectors](https://github.com/Tjatse/node-readability/wiki/Handbook#selectors)
- [Image Fallback](https://github.com/Tjatse/node-readability/wiki/Handbook#imgfallback)
- [Threshold](https://github.com/Tjatse/node-readability/wiki/Handbook#threshold)
- [Customize Settings](https://github.com/Tjatse/node-readability/wiki/Handbook#cus_sets)
- [Output](https://github.com/Tjatse/node-readability/wiki/Handbook#output)
- [Notes](https://github.com/Tjatse/node-readability/wiki/Handbook#notes)

## How it works
In my case, the [spider](https://github.com/Tjatse/spider2) processes about **1500k documents per day**, with a maximum crawling speed of **1.2k/minute** and an average of **1k/minute**. The memory cost is about **200 MB** per spider kernel, and the accuracy is about 90%; the remaining 10% can be fixed by customizing [Score Rules](https://github.com/Tjatse/node-readability/wiki/Handbook#score_rule) or [Selectors](https://github.com/Tjatse/node-readability/wiki/Handbook#selectors). It's better than any other readability module.
> (4) Server info:
> * 20M fibre-optic bandwidth
> * 8 × Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz
> * 32G memory
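
As a rough sanity check on the figures above, the per-minute rates can be converted to daily totals. This is only a back-of-the-envelope sketch: it assumes the crawler runs continuously for 24 hours at the quoted rate, and the constant and function names are hypothetical, not part of read-art or spider2.

```javascript
// Figures quoted in the paragraph above.
const AVG_DOCS_PER_MINUTE = 1000;  // "avg 1k/minute"
const PEAK_DOCS_PER_MINUTE = 1200; // "maximum 1.2k/minute"
const ACCURACY = 0.9;              // "about 90%"

// Assumption: the spider crawls non-stop for a full day.
const MINUTES_PER_DAY = 24 * 60;

// Convert a per-minute crawl rate into documents per day.
function dailyThroughput(docsPerMinute) {
  return docsPerMinute * MINUTES_PER_DAY;
}

const avgPerDay = dailyThroughput(AVG_DOCS_PER_MINUTE);
const peakPerDay = dailyThroughput(PEAK_DOCS_PER_MINUTE);

// Documents per day that may need custom Score Rules or Selectors.
const needTuningPerDay = Math.round(avgPerDay * (1 - ACCURACY));

console.log('average per day:', avgPerDay);
console.log('peak per day:', peakPerDay);
console.log('may need tuning per day:', needTuningPerDay);
```

At the average rate this gives 1,440,000 documents per day, which is in line with the "about 1500k" quoted above, and roughly 144,000 of those would fall into the 10% that may need customized rules.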