https://github.com/bmoscon/articleparse

Heuristic text extraction from news sites in Python3
analysis boilerplate-removal heuristics python text-analysis text-extraction

Host: GitHub
URL: https://github.com/bmoscon/articleparse
Owner: bmoscon
License: other
Created: 2013-11-08T23:07:42.000Z (over 11 years ago)
Default Branch: master
Last Pushed: 2017-12-31T23:44:17.000Z (over 7 years ago)
Last Synced: 2025-05-07T18:10:03.596Z (about 1 month ago)
Topics: analysis, boilerplate-removal, heuristics, python, text-analysis, text-extraction
Language: Python
Homepage:
Size: 29.3 KB
Stars: 10
Watchers: 3
Forks: 4
Open Issues: 3
Metadata Files:
- Readme: README.md
- License: LICENSE

README

ArticleParse
============

[![License](https://img.shields.io/badge/license-XFree86-blue.svg)](LICENSE)

A library that strips boilerplate HTML from news articles and uses heuristic analysis to identify the body of the article. It ranks the text sections of a page by their probability of being news content.
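As a rough illustration of the first step, the sketch below splits a page into text sections and drops `script` and `style` content using only the standard library. The names `SectionExtractor` and `extract_sections`, and the choice of block-level tags, are assumptions made for this example; they are not ArticleParse's actual API.

```python
from html.parser import HTMLParser

# Hypothetical helper for illustration only; not part of ArticleParse.
BLOCK_TAGS = {"p", "div", "li", "td", "h1", "h2", "h3", "article", "section"}
SKIP_TAGS = {"script", "style"}


class SectionExtractor(HTMLParser):
    """Collect the text of each block-level element and count its anchors."""

    def __init__(self):
        super().__init__()
        self.sections = []    # list of (text, anchor_count) tuples
        self._buf = []        # text fragments of the current section
        self._anchors = 0     # <a> tags seen in the current section
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def _flush(self):
        # Collapse whitespace and store the section if it has any text.
        text = " ".join("".join(self._buf).split())
        if text:
            self.sections.append((text, self._anchors))
        self._buf = []
        self._anchors = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag in BLOCK_TAGS:
            self._flush()
        elif tag == "a":
            self._anchors += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth:
            self._skip_depth -= 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if not self._skip_depth:
            self._buf.append(data)

    def close(self):
        super().close()
        self._flush()


def extract_sections(html):
    """Return a list of (text, anchor_count) tuples in document order."""
    parser = SectionExtractor()
    parser.feed(html)
    parser.close()
    return parser.sections
```

Each returned tuple pairs a section's text with the number of links it contains, which is the raw material a later scoring step would need.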

The analysis currently uses the following features:

* Section Length
* Section Position
* Number of Anchors in a Section
* Anchor Density in a Section
* Word Count
* Uppercase Word Count
* Average Word Length
* Average Sentence Length
* Number of Sentences
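The features listed above could be computed along the following lines. This is a minimal sketch, not the library's implementation: the function names, the definition of anchor density as anchors per word, the reading of "uppercase word" as an all-caps word, and the toy scoring formula are all assumptions chosen for illustration.

```python
import re


def section_features(text, anchor_count, position):
    """Compute simple per-section features (hypothetical helper)."""
    words = re.findall(r"[A-Za-z0-9']+", text)
    # Naive sentence split on terminal punctuation; abbreviations are not handled.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n_words = len(words)
    n_sent = len(sentences)
    return {
        "length": len(text),
        "position": position,
        "anchors": anchor_count,
        # Assumption: density is measured as anchors per word.
        "anchor_density": anchor_count / n_words if n_words else 0.0,
        "word_count": n_words,
        # Assumption: an "uppercase word" is written entirely in capitals.
        "uppercase_words": sum(1 for w in words if w.isupper()),
        "avg_word_length": sum(map(len, words)) / n_words if n_words else 0.0,
        "avg_sentence_length": n_words / n_sent if n_sent else 0.0,
        "sentences": n_sent,
    }


def rank_sections(sections):
    """Rank (text, anchor_count) pairs, most article-like first.

    The score is a made-up example: long sections with few links score higher.
    """
    scored = []
    for position, (text, anchor_count) in enumerate(sections):
        f = section_features(text, anchor_count, position)
        score = f["word_count"] * (1.0 - min(f["anchor_density"], 1.0))
        scored.append((score, text))
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```

With this scoring, a link-heavy navigation block such as `("Home News", 2)` ranks below a paragraph of running text. A real ranker would weigh all of the features rather than just two.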

This is a work in progress. I have manually tested it on several news websites, but extensive testing still needs to be performed.

Supports Python 3.
