https://github.com/zjhiphop/cnext

A Chinese content extractor for web page.
extractor machine-learning web

Last synced: 5 months ago
Host: GitHub
URL: https://github.com/zjhiphop/cnext
Owner: zjhiphop
Created: 2019-04-24T08:03:48.000Z (about 7 years ago)
Default Branch: master
Last Pushed: 2019-04-24T08:07:44.000Z (about 7 years ago)
Last Synced: 2025-04-09T01:47:12.371Z (about 1 year ago)
Topics: extractor, machine-learning, web
Language: HTML
Size: 3.54 MB
Stars: 0
Watchers: 2
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: readme.md

README

Content Extractor
================

> A text-based content extractor built on modern techniques such as machine learning.

Road Map
========

1. Web content extractor

2. Email content extractor

3. IM content extractor

TODO
====

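1. Chinese content extractor

    (1) Preprocessing: parse the web page into a DOM tree and remove the invisible nodes.

    (2) Get the candidate text blocks: compute the text density of each block from the page's DOM tree, and take the parent block of every text block whose text density exceeds that of the block as a candidate block.

    (3) Get the tag path set: compute the TPR value of each tag path, set a threshold, and collect the set of candidate paths for main-text nodes.

    (4) Extract the main text: intersect the candidate path set from (3) with the set of paths in the text blocks from (2), extract the text of the nodes on the paths in the intersection, and output it as the main text of the page.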
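
The four steps above can be sketched in a few lines of Python using only the standard library. This is an illustrative sketch, not the project's implementation, and it rests on several assumptions that the README does not confirm: TPR is read as a text-to-tag-path ratio (characters under a tag path divided by the number of nodes sharing that path), a "text block" is any element that directly holds text, text density is characters per element in a subtree, the density of a block is compared against the density of the whole page, and the default TPR threshold is the mean TPR. The names `Node`, `TreeBuilder`, `extract`, `INVISIBLE`, `VOID` and `SAMPLE` are hypothetical, and the parser assumes reasonably well-formed HTML.

```python
from collections import defaultdict
from html.parser import HTMLParser

INVISIBLE = {"script", "style", "noscript", "head", "template"}  # assumed list
VOID = {"br", "img", "hr", "input", "meta", "link"}  # elements with no end tag


class Node:
    def __init__(self, tag, parent=None):
        self.tag, self.parent, self.children, self.text = tag, parent, [], ""
        self.path = tag if parent is None else parent.path + "/" + tag
        if parent is not None:
            parent.children.append(self)

    def walk(self):
        yield self
        for child in self.children:
            yield from child.walk()

    def density(self):
        # Assumed definition: characters per element in this subtree.
        nodes = list(self.walk())
        return sum(len(n.text) for n in nodes) / len(nodes)


class TreeBuilder(HTMLParser):
    """Step (1): build a simple tree and drop invisible elements."""

    def __init__(self):
        super().__init__()
        self.root = self.cur = Node("root")
        self.skip = 0  # nesting depth inside an invisible element

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        if self.skip or tag in INVISIBLE:
            self.skip += 1
        else:
            self.cur = Node(tag, self.cur)

    def handle_endtag(self, tag):
        if tag in VOID:
            return
        if self.skip:
            self.skip -= 1
        elif self.cur.parent is not None:
            self.cur = self.cur.parent

    def handle_data(self, data):
        if not self.skip:
            self.cur.text += data.strip()


def extract(html, tpr_threshold=None):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    text_nodes = [n for n in builder.root.walk() if n.text]
    if not text_nodes:
        return ""

    # Step (2): parents of text blocks that are denser than the whole page.
    page_density = builder.root.density()
    blocks = [n.parent for n in text_nodes
              if n.parent is not None and n.density() > page_density]
    block_paths = {m.path for b in blocks for m in b.walk() if m.text}

    # Step (3): TPR per tag path, then keep the paths above the threshold.
    chars, count = defaultdict(int), defaultdict(int)
    for n in text_nodes:
        chars[n.path] += len(n.text)
        count[n.path] += 1
    tpr = {path: chars[path] / count[path] for path in chars}
    if tpr_threshold is None:
        tpr_threshold = sum(tpr.values()) / len(tpr)  # assumed default: mean
    tpr_paths = {path for path, value in tpr.items() if value > tpr_threshold}

    # Step (4): intersect both path sets and output the matching text.
    keep = tpr_paths & block_paths
    return "\n".join(n.text for n in text_nodes if n.path in keep)


SAMPLE = """<html><head><title>T</title><script>var x=1;</script></head>
<body>
<div><a href="#">Home</a><a href="#">News</a><a href="#">About</a></div>
<div><p>First long paragraph of the article body text, long enough to dominate.</p>
<p>Second long paragraph of the article body text, also fairly long.</p></div>
<div><span>(c)</span></div>
</body></html>"""

if __name__ == "__main__":
    print(extract(SAMPLE))
```

On the sample page, the navigation links and the footer fall below both thresholds, so only the two paragraphs survive the intersection. A real page would need a more tolerant parser and tuned thresholds.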
