https://github.com/Tjatse/node-readability

Scrape or crawl articles from any site automatically. Make any web page readable, whether Chinese or English.

Host: GitHub
URL: https://github.com/Tjatse/node-readability
Owner: Tjatse
Created: 2014-05-10T02:42:52.000Z (about 11 years ago)
Default Branch: master
Last Pushed: 2018-08-01T06:37:53.000Z (almost 7 years ago)
Last Synced: 2025-03-20T04:34:54.927Z (3 months ago)
Language: JavaScript
Homepage:
Size: 573 KB
Stars: 343
Watchers: 10
Forks: 36
Open Issues: 7
Metadata Files:
- Readme: README.md
- Changelog: HISTORY.md

Awesome Lists containing this project

awesome-nodejs-cn - read-art - **star:343** Extract readable content from any page (Packages / Humanize)
awesome-nodejs - read-art - Extract readable content from any page. (Packages / Humanize)
awesome-node - read-art - Extract readable content from any page. (Packages / Humanize)
awesome-nodejs - node-readability - Scrape/Crawl article from any site automatically. Make any web page readable, no matter Chinese or English. - ★ 271 (Humanize)
awesome-nodejs-cn - read-art - Extract readable content from any page (Packages / Humanize)
fucking-awesome-nodejs - read-art - Extract readable content from any page. (Packages / Humanize)
awesome-nodejs-cn - read-art - Extract readable content from any page. (Directory / Humanize)

README

read-art [![NPM version](https://badge.fury.io/js/read-art.svg)](http://badge.fury.io/js/read-art) [![Build Status](https://travis-ci.org/Tjatse/node-readability.svg?branch=master)](https://travis-ci.org/Tjatse/node-readability) [![js-standard-style](https://img.shields.io/badge/code%20style-standard-brightgreen.svg)](http://standardjs.com/)
=========

[![NPM](https://nodei.co/npm/read-art.png?downloads=true&downloadRank=true&stars=true)](https://nodei.co/npm/read-art/)

1. Readability implementation modeled on Arc90's.

2. Scrapes the article from any page automatically.

3. Makes any web page readable, whether Chinese or English.

> *Quickly scrapes article titles and content from web pages; well suited to Node.js crawlers, and built to feed ElasticSearch.*
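
To give a feel for what an Arc90-style readability pass does, here is a minimal, simplified sketch of the scoring idea: each text block earns points for commas and length, and the total is scaled down by its link density. This is not read-art's actual implementation. The functions `scoreBlock`, `linkDensity`, and `pickArticle`, and the pre-parsed `blocks` shape, are hypothetical names chosen for illustration. Counting the full-width comma for Chinese text is also an assumption.

```javascript
'use strict'

// Score one text block, loosely following the Arc90 heuristic:
// 1 base point, 1 point per comma, 1 point per 100 characters (max 3).
// Assumption: the full-width comma is counted too, for Chinese text.
function scoreBlock (text) {
  const commas = (text.match(/[,，]/g) || []).length
  const lengthBonus = Math.min(Math.floor(text.length / 100), 3)
  return 1 + commas + lengthBonus
}

// Link density: the share of a block's text that sits inside links.
function linkDensity (block) {
  if (block.text.length === 0) return 0
  return block.linkText.length / block.text.length
}

// Return the block with the highest density-adjusted score, or null.
// `blocks` is a hypothetical pre-parsed list: [{ id, text, linkText }].
function pickArticle (blocks) {
  let best = null
  for (const block of blocks) {
    const score = scoreBlock(block.text) * (1 - linkDensity(block))
    if (best === null || score > best.score) {
      best = { id: block.id, score: score }
    }
  }
  return best
}

const sentence = 'The quick brown fox, having rested, jumped over the lazy dog. '
const blocks = [
  { id: 'nav', text: 'Home, News, Sports, About', linkText: 'Home, News, Sports, About' },
  { id: 'main', text: sentence.repeat(4), linkText: '' }
]

console.log(pickArticle(blocks).id)
```

The navigation block is all link text, so its score collapses to zero and the prose block wins. Real pages that defeat a heuristic like this are the cases the Score Rule and Selectors settings in the Guide are meant to handle.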

## Guide

- [Features](https://github.com/Tjatse/node-readability/wiki/Handbook#features)

- [Performance](https://github.com/Tjatse/node-readability/wiki/Handbook#perfs)

- [Installation](https://github.com/Tjatse/node-readability/wiki/Handbook#ins)

- [Usage](https://github.com/Tjatse/node-readability/wiki/Handbook#usage)

- [Debug](https://github.com/Tjatse/node-readability/wiki/Handbook#debug)

- [Score Rule](https://github.com/Tjatse/node-readability/wiki/Handbook#score_rule)

- [Extract Selectors](https://github.com/Tjatse/node-readability/wiki/Handbook#selectors)

- [Image Fallback](https://github.com/Tjatse/node-readability/wiki/Handbook#imgfallback)

- [Threshold](https://github.com/Tjatse/node-readability/wiki/Handbook#threshold)

- [Customize Settings](https://github.com/Tjatse/node-readability/wiki/Handbook#cus_sets)

- [Output](https://github.com/Tjatse/node-readability/wiki/Handbook#output)

- [Notes](https://github.com/Tjatse/node-readability/wiki/Handbook#notes)

## How it works

In my case, the [spider](https://github.com/Tjatse/spider2) processes about **1500k documents per day**, with a maximum crawling speed of **1.2k/minute** and an average of **1k/minute**. Memory cost is about **200 MB** per spider kernel, and accuracy is about 90%; the remaining 10% can be fixed by customizing [Score Rules](https://github.com/Tjatse/node-readability/wiki/Handbook#score_rule) or [Selectors](https://github.com/Tjatse/node-readability/wiki/Handbook#selectors). It's better than any other readability module.

> (4) Server info:

> * 20M fibre-optic bandwidth

> * 8 × Intel(R) Xeon(R) E5-2650 v2 @ 2.60GHz CPUs

> * 32G memory
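
As a quick sanity check on the figures above, the per-minute rates can be converted to a daily total. The helper name `docsPerDay` is hypothetical, and the calculation assumes the crawler runs around the clock.

```javascript
'use strict'

// Convert a per-minute crawl rate into documents per day,
// assuming the spider runs continuously for 24 hours.
function docsPerDay (docsPerMinute) {
  return docsPerMinute * 60 * 24
}

const average = docsPerDay(1000) // 1,440,000 at the average rate
const peak = docsPerDay(1200) // 1,728,000 at the maximum rate

console.log(average, peak)
```

The quoted figure of about 1500k documents per day falls between these two values, so it is consistent with the stated average and maximum rates.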
