Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/slaveofcode/boilerpipe3
A fork of boilerpipe with Python 3 support and small fixes, ported from the source at https://pypi.python.org/pypi/boilerpipe-py3.
- Host: GitHub
- URL: https://github.com/slaveofcode/boilerpipe3
- Owner: slaveofcode
- Created: 2016-10-22T19:22:14.000Z (about 8 years ago)
- Default Branch: master
- Last Pushed: 2020-04-10T15:31:11.000Z (over 4 years ago)
- Last Synced: 2024-11-06T04:05:48.372Z (14 days ago)
- Language: Python
- Homepage:
- Size: 7.23 MB
- Stars: 45
- Watchers: 3
- Forks: 15
- Open Issues: 6
Metadata Files:
- Readme: README.md
README
# boilerpipe3
Python interface to Boilerpipe, Boilerplate Removal and Fulltext Extraction from HTML pages

Installation
============
You can install this library directly from the GitHub repository by executing:

pip install git+ssh://git@github.com/slaveofcode/boilerpipe3@master

Or from the official PyPI:

pip install boilerpipe3
Configuration
=============
Dependencies: jpype, charade

The boilerpipe jar files will get fetched and included automatically when building the package.
Usage
=====
Be sure to have JAVA_HOME set properly, since jpype depends on this setting.
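As a quick sanity check (not part of boilerpipe3 itself), you can verify that the variable is visible to Python before importing the library:

```python
import os

# jpype locates the JVM through JAVA_HOME; checking it early gives a
# clearer error than a failed JVM startup deep inside an import.
java_home = os.environ.get("JAVA_HOME")
if java_home is None:
    print("Warning: JAVA_HOME is not set; jpype will be unable to start the JVM")
else:
    print(f"Using JVM from {java_home}")
```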
The constructor takes a keyword argument ``extractor``, being one of the available boilerpipe extractor types:
- DefaultExtractor
- ArticleExtractor
- ArticleSentencesExtractor
- KeepEverythingExtractor
- KeepEverythingWithMinKWordsExtractor
- LargestContentExtractor
- NumWordsRulesExtractor
- CanolaExtractor

If no extractor is passed, DefaultExtractor is used. Additional keyword arguments are either ``html`` for HTML text or ``url``.
from boilerpipe.extract import Extractor
extractor = Extractor(extractor='ArticleExtractor', url=your_url)

Then, to extract relevant content:
extracted_text = extractor.getText()
extracted_html = extractor.getHTML()