https://github.com/slaveofcode/boilerpipe3

A fork of boilerpipe with Python 3 support and small fixes, ported from https://pypi.python.org/pypi/boilerpipe-py3.

Host: GitHub
URL: https://github.com/slaveofcode/boilerpipe3
Owner: slaveofcode
Created: 2016-10-22T19:22:14.000Z (about 9 years ago)
Default Branch: master
Last Pushed: 2020-04-10T15:31:11.000Z (over 5 years ago)
Last Synced: 2025-06-25T14:17:13.488Z (5 months ago)
Language: Python
Homepage:
Size: 7.23 MB
Stars: 45
Watchers: 2
Forks: 15
Open Issues: 6
Metadata Files:
- Readme: README.md

# boilerpipe3

Python interface to Boilerpipe, Boilerplate Removal and Fulltext Extraction from HTML pages

Installation
============

You can install this library directly from the GitHub repository by running this command:

    pip install git+ssh://git@github.com/slaveofcode/boilerpipe3@master

Or from the official PyPI index:

    pip install boilerpipe3

Configuration
=============

Dependencies:

jpype, charade

The boilerpipe jar files are fetched and included automatically when the package is built.

Usage
=====

Make sure JAVA_HOME is set correctly, since jpype depends on this setting.
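A quick way to catch a missing JAVA_HOME before jpype is involved is to check the environment yourself. The following is a minimal sketch, not part of boilerpipe3: the helper name `check_java_home` is hypothetical, and it only checks that the variable is set and points to an existing directory, not that the directory contains a usable JVM.

```python
import os
from pathlib import Path


def check_java_home(environ=os.environ):
    """Return JAVA_HOME as a Path, or raise if it is unset or not a directory.

    Hypothetical helper for illustration; boilerpipe3 does not provide it.
    """
    java_home = environ.get("JAVA_HOME")
    if not java_home:
        raise RuntimeError("JAVA_HOME is not set; jpype depends on this setting")
    path = Path(java_home)
    if not path.is_dir():
        raise RuntimeError(f"JAVA_HOME points to a missing directory: {path}")
    return path
```

Calling `check_java_home()` at the start of a script fails early with a readable message instead of an error from deeper in the stack.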

The constructor takes a keyword argument ``extractor``, which must be one of the available boilerpipe extractor types:

- DefaultExtractor
- ArticleExtractor
- ArticleSentencesExtractor
- KeepEverythingExtractor
- KeepEverythingWithMinKWordsExtractor
- LargestContentExtractor
- NumWordsRulesExtractor
- CanolaExtractor

If no extractor is passed, the DefaultExtractor is used. Pass the input with an additional keyword argument, either ``html`` for HTML text or ``url``.

    from boilerpipe.extract import Extractor

    # your_url: the URL of the page to process (a str you define)
    extractor = Extractor(extractor='ArticleExtractor', url=your_url)

Then, to extract relevant content:

    extracted_text = extractor.getText()

    extracted_html = extractor.getHTML()
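The pieces above (extractor type, ``html`` or ``url`` input, then ``getText()``) can be combined into a small wrapper. This is a sketch under stated assumptions rather than the library's own API: `EXTRACTOR_TYPES`, `build_extractor_kwargs`, and `extract_text` are hypothetical names, and the rule that exactly one of ``html`` or ``url`` must be given is a choice made by this sketch, not something the README states.

```python
# The extractor names listed in this README.
EXTRACTOR_TYPES = (
    "DefaultExtractor",
    "ArticleExtractor",
    "ArticleSentencesExtractor",
    "KeepEverythingExtractor",
    "KeepEverythingWithMinKWordsExtractor",
    "LargestContentExtractor",
    "NumWordsRulesExtractor",
    "CanolaExtractor",
)


def build_extractor_kwargs(extractor="DefaultExtractor", html=None, url=None):
    """Validate the arguments and return the kwargs to pass to Extractor.

    Hypothetical helper for illustration; boilerpipe3 does not provide it.
    """
    if extractor not in EXTRACTOR_TYPES:
        raise ValueError(f"unknown extractor type: {extractor!r}")
    # Assumption of this sketch: exactly one input source is given.
    if (html is None) == (url is None):
        raise ValueError("pass exactly one of 'html' or 'url'")
    kwargs = {"extractor": extractor}
    if html is not None:
        kwargs["html"] = html
    else:
        kwargs["url"] = url
    return kwargs


def extract_text(**options):
    """Return the extracted text for the given options."""
    # Imported lazily so the validation above can run without a JVM.
    from boilerpipe.extract import Extractor

    return Extractor(**build_extractor_kwargs(**options)).getText()
```

For example, `extract_text(extractor='ArticleExtractor', html=page_source)` would process an HTML string you already hold in `page_source`, while a misspelled extractor name raises a `ValueError` before the JVM is touched.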
