https://github.com/slaveofcode/boilerpipe3
A fork of boilerpipe with Python 3 support and small fixes, ported from https://pypi.python.org/pypi/boilerpipe-py3.
- Host: GitHub
- URL: https://github.com/slaveofcode/boilerpipe3
- Owner: slaveofcode
- Created: 2016-10-22T19:22:14.000Z (about 9 years ago)
- Default Branch: master
- Last Pushed: 2020-04-10T15:31:11.000Z (over 5 years ago)
- Last Synced: 2025-06-25T14:17:13.488Z (5 months ago)
- Language: Python
- Homepage:
- Size: 7.23 MB
- Stars: 45
- Watchers: 2
- Forks: 15
- Open Issues: 6
Metadata Files:
- Readme: README.md
README
# boilerpipe3
A Python interface to Boilerpipe, a library for boilerplate removal and full-text extraction from HTML pages.
Installation
============
You can install this library directly from the GitHub repository by running this command:

    pip install git+ssh://git@github.com/slaveofcode/boilerpipe3@master

Or install it from PyPI:

    pip install boilerpipe3
Configuration
=============
Dependencies:
jpype, charade
The boilerpipe jar files are fetched and included automatically when the package is built.
Usage
=====
Make sure JAVA_HOME is set correctly, because jpype depends on this setting.
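A quick pre-flight check can catch a missing or wrong JAVA_HOME before jpype tries to start the JVM. The sketch below uses only the standard library. ``java_home_is_usable`` is a hypothetical helper name, not part of boilerpipe3, and it only verifies that the variable points to an existing directory, not that the directory holds a working Java installation.

```python
import os


def java_home_is_usable(env=None):
    """Return True if JAVA_HOME is set and points to an existing directory.

    Hypothetical helper for illustration; boilerpipe3 does not provide it.
    """
    # Fall back to the real process environment when no mapping is given.
    env = os.environ if env is None else env
    java_home = env.get("JAVA_HOME")
    # An unset or empty variable is treated as unusable.
    return bool(java_home) and os.path.isdir(java_home)


if __name__ == "__main__":
    if not java_home_is_usable():
        print("JAVA_HOME is not set or does not point to a directory")
```

Running such a check before importing boilerpipe may make a misconfigured environment easier to diagnose than a failure raised during JVM startup.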
The constructor takes a keyword argument ``extractor``, which must be one of the available boilerpipe extractor types:
- DefaultExtractor
- ArticleExtractor
- ArticleSentencesExtractor
- KeepEverythingExtractor
- KeepEverythingWithMinKWordsExtractor
- LargestContentExtractor
- NumWordsRulesExtractor
- CanolaExtractor
If no extractor is passed, the DefaultExtractor is used. The input is supplied through an additional keyword argument: either ``html`` for HTML text or ``url`` for a page address.
    from boilerpipe.extract import Extractor

    # your_url is a placeholder for the address of the page to process
    extractor = Extractor(extractor='ArticleExtractor', url=your_url)
Then, to extract relevant content:
    extracted_text = extractor.getText()  # extracted content as plain text
    extracted_html = extractor.getHTML()  # extracted content as HTML
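The ``html`` keyword described above can be used in the same way as ``url`` when the page source is already in memory. The following is a minimal sketch, assuming the package is installed and a JVM is available. ``extract_text_from_html`` and ``EXTRACTOR_TYPES`` are hypothetical names introduced here for illustration, and the name check simply mirrors the list of extractor types given earlier.

```python
# Extractor type names as listed in this README.
EXTRACTOR_TYPES = (
    "DefaultExtractor",
    "ArticleExtractor",
    "ArticleSentencesExtractor",
    "KeepEverythingExtractor",
    "KeepEverythingWithMinKWordsExtractor",
    "LargestContentExtractor",
    "NumWordsRulesExtractor",
    "CanolaExtractor",
)


def extract_text_from_html(html, extractor="DefaultExtractor"):
    """Extract the main text from an HTML string (hypothetical helper)."""
    # Reject unknown names early, before any JVM work is attempted.
    if extractor not in EXTRACTOR_TYPES:
        raise ValueError("unknown extractor type: %r" % (extractor,))
    # Imported lazily so the check above works without Java installed.
    from boilerpipe.extract import Extractor
    return Extractor(extractor=extractor, html=html).getText()
```

With this in place, a call such as ``extract_text_from_html(page_source, extractor='ArticleExtractor')`` should return the extracted text for a string ``page_source`` that holds the HTML of a page.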