https://github.com/cantino/ruby-readability
Port of arc90's readability project to Ruby
- Host: GitHub
- URL: https://github.com/cantino/ruby-readability
- Owner: cantino
- License: apache-2.0
- Created: 2010-02-07T20:08:01.000Z (almost 15 years ago)
- Default Branch: master
- Last Pushed: 2024-11-23T18:17:00.000Z (about 2 months ago)
- Last Synced: 2025-01-03T16:52:23.397Z (8 days ago)
- Language: Ruby
- Homepage:
- Size: 734 KB
- Stars: 928
- Watchers: 34
- Forks: 171
- Open Issues: 17
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- my-awesome-github-stars - cantino/ruby-readability - Port of arc90's readability project to Ruby (Ruby)
README
Ruby Readability
================

Ruby Readability is a tool for extracting the primary readable content of a
webpage. It is a Ruby port of arc90's readability project.

Build Status
------------

[![Ruby](https://github.com/cantino/ruby-readability/actions/workflows/ruby.yml/badge.svg?branch=master)](https://github.com/cantino/ruby-readability/actions/workflows/ruby.yml)

Install
-------

Command line:

    (sudo) gem install ruby-readability

Bundler:

    gem "ruby-readability", :require => 'readability'

Example
-------

    require 'rubygems'
    require 'readability'
    require 'open-uri'

    source = URI.parse('http://lab.arc90.com/experiments/readability/').read
    puts Readability::Document.new(source).content

Options
-------

You may provide options to `Readability::Document.new`, including:

* `:tags`: the base whitelist of tags to sanitize, defaults to `%w[div p]`;
* `:remove_empty_nodes`: remove `<p>` tags that have no text content; also
removes `<p>` tags that contain only images;
* `:attributes`: whitelist of allowed attributes;
* `:debug`: provide debugging output, defaults false; supports setting a Proc;
* `:encoding`: if the page is of a known encoding, you can specify it; if left
unspecified, the encoding will be guessed (only in Ruby 1.9.x). If you wish
to disable guessing, supply `:do_not_guess_encoding => true`;
* `:html_headers`: in Ruby 1.9.x these will be passed to the
`guess_html_encoding` gem to aid with guessing the HTML encoding;
* `:ignore_image_format`: for use with `#images`. For example:
`:ignore_image_format => ["gif", "png"]`;
* `:min_image_height`: set a minimum image height for `#images`;
* `:min_image_width`: set a minimum image width for `#images`.
* `:blacklist` and `:whitelist` allow you to explicitly scope to, or remove, CSS selectors.

Command Line Tool
-----------------

Readability comes with a command-line tool for experimentation in
`bin/readability`.

    Usage: readability [options] URL
        -d, --debug                      Show debug output
        -i, --images                     Keep images and links
        -h, --help                       Show this message

Images
------

You can get a list of images in the content area with `Document#images`. This
feature requires that the `fastimage` gem be installed.

    rbody = Readability::Document.new(body, :tags => %w[div p img a], :attributes => %w[src href], :remove_empty_nodes => false)
    rbody.images

Related Projects
----------------

* [readability.cr](https://github.com/joenas/readability.cr) - Port of ruby-readability's port of arc90's readability project to Crystal
* [newspaper](https://github.com/codelucas/newspaper) is an advanced news extraction, article extraction, and content curation library for Python.

Potential Issues
----------------

If you're on a Mac and are getting segmentation faults, see the discussion at
and consider updating
your version of `libxml2`. Version 2.7.8 of `libxml2`, installed with `brew`,
worked for me:

    gem install nokogiri -- --with-xml2-include=/usr/local/Cellar/libxml2/2.7.8/include/libxml2 --with-xml2-lib=/usr/local/Cellar/libxml2/2.7.8/lib --with-xslt-dir=/usr/local/Cellar/libxslt/1.1.26

Or if you're using bundler and Rails 3, you can run this command to make
bundler always globally build `nokogiri` this way:

    bundle config build.nokogiri -- --with-xml2-include=/usr/local/Cellar/libxml2/2.7.8/include/libxml2 --with-xml2-lib=/usr/local/Cellar/libxml2/2.7.8/lib --with-xslt-dir=/usr/local/Cellar/libxslt/1.1.26

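
The two commands above share the same three build flags, and the paths hard-code the `libxml2` and `libxslt` versions. If your installed versions or Homebrew location differ, a small script can regenerate both commands. This is only a sketch: `nokogiri_build_flags` and its `cellar`, `libxml2`, and `libxslt` parameters are hypothetical names introduced here, not part of ruby-readability or nokogiri, and the defaults simply mirror the paths shown above.

```ruby
# Hypothetical helper (not part of ruby-readability or nokogiri): assembles
# the nokogiri build flags shown above from a Homebrew Cellar path and the
# installed libxml2 / libxslt versions.
def nokogiri_build_flags(cellar: '/usr/local/Cellar', libxml2: '2.7.8', libxslt: '1.1.26')
  [
    "--with-xml2-include=#{cellar}/libxml2/#{libxml2}/include/libxml2",
    "--with-xml2-lib=#{cellar}/libxml2/#{libxml2}/lib",
    "--with-xslt-dir=#{cellar}/libxslt/#{libxslt}"
  ].join(' ')
end

# Print both commands so they can be copied into a shell.
flags = nokogiri_build_flags
puts "gem install nokogiri -- #{flags}"
puts "bundle config build.nokogiri -- #{flags}"
```

Pass different values, for example `nokogiri_build_flags(libxml2: '2.9.0')`, if the versions under your Cellar directory do not match the ones in this README.
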
License
-------

This code is under the Apache License 2.0. See .

Ruby port by cantino, starrhorne, libc, and iterationlabs. Special thanks to fizx and marcosinger.