https://github.com/gregors/boilerpipe-ruby
Pure ruby implementation of the Boilerpipe content extraction algorithm tuned for online articles
boilerpipe boilerpipe-algorithm content-extraction news webscraping
Last synced: about 2 months ago
- Host: GitHub
- URL: https://github.com/gregors/boilerpipe-ruby
- Owner: gregors
- License: other
- Created: 2016-03-11T05:34:40.000Z (almost 9 years ago)
- Default Branch: main
- Last Pushed: 2021-02-21T23:58:11.000Z (almost 4 years ago)
- Last Synced: 2024-09-27T22:34:00.873Z (3 months ago)
- Topics: boilerpipe, boilerpipe-algorithm, content-extraction, news, webscraping
- Language: Ruby
- Homepage:
- Size: 240 KB
- Stars: 41
- Watchers: 2
- Forks: 5
- Open Issues: 3
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE.txt
# Boilerpipe
[![CircleCI](https://circleci.com/gh/gregors/boilerpipe-ruby/tree/main.svg?style=shield)](https://circleci.com/gh/gregors/boilerpipe-ruby/tree/main)
[![Gem Version](https://badge.fury.io/rb/boilerpipe-ruby.svg)](https://badge.fury.io/rb/boilerpipe-ruby)

A pure Ruby implementation of the boilerpipe algorithm.

This is a text extraction utility first written by Christian Kohlschütter - [presentation](http://videolectures.net/wsdm2010_kohlschutter_bdu/)
I went directly to the original author's GitHub repository, https://github.com/kohlschutter/boilerpipe, and forked that code base here: https://github.com/gregors/boilerpipe.
I saw other gems making use of boilerpipe via the [free API](http://boilerpipe-web.appspot.com), but depending on the time of day the API goes down because it exceeds its hosting plan. I also checked out some gems that use JRuby, but I ran into all kinds of dependency and bug issues. So I made some tweaks on my fork and created a new [jruby-boilerpipe gem](https://rubygems.org/gems/jruby-boilerpipe).
This solution works great if you're using JRuby, but I wanted a pure Ruby solution to use on MRI. Open vim - start coding...
Here's a high-level [diagram](boilerpipe_flow.md) of how the system works.
# TLDR
Just use either ArticleExtractor, DefaultExtractor or KeepEverythingExtractor - try out the others when you feel like experimenting...
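
One way to act on that advice is to try the stricter extractor first and fall back to the more permissive ones when it returns nothing. The following is a minimal sketch, not part of the gem: `extract_with_fallback` is a hypothetical helper, and it assumes only that each extractor responds to `.text(html)`, as the extractors in the Usage section do.

```ruby
# Hypothetical helper (not provided by boilerpipe-ruby).
# Tries each extractor in order and returns the first non-blank result,
# or nil when every extractor comes back empty.
#
# Assumption: each extractor responds to `.text(html)` and returns a String
# (or nil), as in Boilerpipe::Extractors::ArticleExtractor.text(content).
def extract_with_fallback(html, extractors)
  extractors.each do |extractor|
    text = extractor.text(html).to_s.strip
    return text unless text.empty?
  end
  nil
end

# With the gem loaded, a call might look like this (left as a comment so the
# sketch runs without the gem installed):
#
#   extract_with_fallback(content, [
#     Boilerpipe::Extractors::ArticleExtractor,
#     Boilerpipe::Extractors::DefaultExtractor,
#     Boilerpipe::Extractors::KeepEverythingExtractor
#   ])
```

The ordering goes from most to least selective, so the broadest extractor is used only as a last resort. Whether an empty result is the right trigger for falling back depends on your pages; a minimum length check may suit some sites better.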
Presently the following extractors are implemented:
* [x] ArticleExtractor
* [x] ArticleSentenceExtractor
* [x] CanolaExtractor
* [x] DefaultExtractor
* [x] KeepEverythingExtractor
* [x] KeepEverythingWithMinKWordsExtractor
* [x] LargestContentExtractor
* [x] NumWordsRulesExtractor

## Installation
Add this line to your application's Gemfile:
```ruby
gem 'boilerpipe-ruby', require: 'boilerpipe'
```

And then execute:

    $ bundle

Or install it yourself as:

    $ gem install boilerpipe-ruby

## Usage

    gregors$ irb
    > require 'boilerpipe'
    => true
    > require 'open-uri'
    => true
    > content = URI.open('https://blog.carbonfive.com/2017/08/28/always-squash-and-rebase-your-git-commits/').read; true;
    > Boilerpipe::Extractors::ArticleExtractor.text(content).slice(0..40)
    => "Always Squash and Rebase your Git Commits"
    > Boilerpipe::Extractors::DefaultExtractor.text(content).slice(0..40)
    => "Posted on\nWhat is the squash rebase workf"
    > Boilerpipe::Extractors::LargestContentExtractor.text(content).slice(0, 40)
    => "git push origin master\nWhy should you ad"
    > Boilerpipe::Extractors::KeepEverythingExtractor.text(content).slice(0..40)
    => "Toggle Navigation\nCarbon Five\nAbout\nWork\n"

## Development
After checking out the repo, run `bin/setup` to install dependencies. Then, run `rake spec` to run the tests. You can also run `bin/console` for an interactive prompt that will allow you to experiment.
To install this gem onto your local machine, run `bundle exec rake install`.
### Running Tests on Docker
The default run command will run the tests

    docker build -t boilerpipe .
    docker run -it --rm boilerpipe

## Contributing
Bug reports and pull requests are welcome on GitHub at https://github.com/gregors/boilerpipe-ruby.