{"id":17872568,"url":"https://github.com/gregors/boilerpipe-ruby","last_synced_at":"2025-06-10T20:04:24.246Z","repository":{"id":54385581,"uuid":"53642353","full_name":"gregors/boilerpipe-ruby","owner":"gregors","description":"Pure ruby implementation of the Boilerpipe content extraction algorithm tuned for online articles","archived":false,"fork":false,"pushed_at":"2021-02-21T23:58:11.000Z","size":246,"stargazers_count":43,"open_issues_count":3,"forks_count":5,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-05-14T23:51:16.513Z","etag":null,"topics":["boilerpipe","boilerpipe-algorithm","content-extraction","news","webscraping"],"latest_commit_sha":null,"homepage":"","language":"Ruby","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/gregors.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2016-03-11T05:34:40.000Z","updated_at":"2025-01-09T09:58:48.000Z","dependencies_parsed_at":"2022-08-13T14:10:11.089Z","dependency_job_id":null,"html_url":"https://github.com/gregors/boilerpipe-ruby","commit_stats":null,"previous_names":[],"tags_count":11,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gregors%2Fboilerpipe-ruby","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gregors%2Fboilerpipe-ruby/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gregors%2Fboilerpipe-ruby/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gregors%2Fboilerpipe-ruby/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/gregors","download_ur
l":"https://codeload.github.com/gregors/boilerpipe-ruby/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gregors%2Fboilerpipe-ruby/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":259143566,"owners_count":22811903,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["boilerpipe","boilerpipe-algorithm","content-extraction","news","webscraping"],"created_at":"2024-10-28T10:43:25.831Z","updated_at":"2025-06-10T20:04:24.227Z","avatar_url":"https://github.com/gregors.png","language":"Ruby","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Boilerpipe\n\n[![CircleCI](https://circleci.com/gh/gregors/boilerpipe-ruby/tree/main.svg?style=shield)](https://circleci.com/gh/gregors/boilerpipe-ruby/tree/main)\n[![Gem Version](https://badge.fury.io/rb/boilerpipe-ruby.svg)](https://badge.fury.io/rb/boilerpipe-ruby)\n\nA pure Ruby implementation of the boilerpipe algorithm.\n\nThis is a text extraction utility first written by Christian Kohlschütter - [presentation](http://videolectures.net/wsdm2010_kohlschutter_bdu/)\n\nI went directly to the original author's GitHub repository https://github.com/kohlschutter/boilerpipe and forked that code base here https://github.com/gregors/boilerpipe.\n\nI saw other gems making use of boilerpipe via the [free API](http://boilerpipe-web.appspot.com) but, depending on the time of day, the API goes down because it exceeds its hosting plan. I also checked out some gems making use of JRuby but I ran into all kinds of dependency and bug issues. 
So I made some tweaks on my fork and created a new [jruby-boilerpipe gem](https://rubygems.org/gems/jruby-boilerpipe).\n\nThis solution works great if you're using JRuby but I wanted a pure Ruby solution to use on MRI. Open vim - start coding...\n\nHere's a high-level [diagram](boilerpipe_flow.md) of how the system works.\n\n# TLDR\n\nJust use either ArticleExtractor, DefaultExtractor or KeepEverythingExtractor - try out the others when you feel like experimenting...\n\nPresently the following extractors are implemented:\n* [x] ArticleExtractor\n* [x] ArticleSentenceExtractor\n* [x] CanolaExtractor\n* [x] DefaultExtractor\n* [x] KeepEverythingExtractor\n* [x] KeepEverythingWithMinKWordsExtractor\n* [x] LargestContentExtractor\n* [x] NumWordsRulesExtractor\n\n\n## Installation\n\nAdd this line to your application's Gemfile:\n\n```ruby\ngem 'boilerpipe-ruby', require: 'boilerpipe'\n```\n\nAnd then execute:\n\n    $ bundle\n\nOr install it yourself as:\n\n    $ gem install boilerpipe-ruby\n\n## Usage\n\n    gregors$ irb\n    \u003e require 'boilerpipe'\n     =\u003e true\n    \u003e require 'open-uri'\n     =\u003e true\n    \u003e content = URI.open('https://blog.carbonfive.com/2017/08/28/always-squash-and-rebase-your-git-commits/').read; true;\n    \n    \u003e Boilerpipe::Extractors::ArticleExtractor.text(content).slice(0..40)\n     =\u003e \"Always Squash and Rebase your Git Commits\" \n    \n    \u003e Boilerpipe::Extractors::DefaultExtractor.text(content).slice(0..40)\n     =\u003e \"Posted on\\nWhat is the squash rebase workf\"\n    \n    \u003e Boilerpipe::Extractors::LargestContentExtractor.text(content).slice(0, 40)\n     =\u003e \"git push origin master\\nWhy should you ad\"\n    \n    \u003e Boilerpipe::Extractors::KeepEverythingExtractor.text(content).slice(0..40)\n     =\u003e \"Toggle Navigation\\nCarbon Five\\nAbout\\nWork\\n\"\n\n## Development\n\nAfter checking out the repo, run `bin/setup` to install dependencies. Then, run `rake spec` to run the tests. 
You can also run `bin/console` for an interactive prompt that will allow you to experiment.\n\nTo install this gem onto your local machine, run `bundle exec rake install`.\n\n### Running Tests on Docker\n\nThe default run command will run the tests\n\n    docker build -t boilerpipe .\n    docker run -it --rm boilerpipe\n\n## Contributing\n\nBug reports and pull requests are welcome on GitHub at https://github.com/gregors/boilerpipe-ruby.\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgregors%2Fboilerpipe-ruby","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgregors%2Fboilerpipe-ruby","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgregors%2Fboilerpipe-ruby/lists"}