https://github.com/reneklacan/webtractor

Host: GitHub
URL: https://github.com/reneklacan/webtractor
Owner: reneklacan
License: other
Created: 2014-05-25T17:11:30.000Z (over 10 years ago)
Default Branch: master
Last Pushed: 2014-05-26T18:21:42.000Z (over 10 years ago)
Last Synced: 2024-09-17T23:23:07.095Z (about 2 months ago)
Language: Ruby
Size: 129 KB
Stars: 1
Watchers: 2
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

# Webtractor

Webtractor is a Ruby library that extracts the main content from webpages
such as news articles and blog posts. The result is just the main content,
without any boilerplate (menus, footers, comments, etc.).

## Installation

You can install it directly via gem:

```
gem install webtractor
```

Or you can put it in your Gemfile:

```ruby
gem 'webtractor'
```

Then run:

```
bundle install
```

## Basic usage

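```ruby
# Assumes the gem's entry point matches its name.
require 'webtractor'

extractor = Webtractor::Extractor.new
# The URL is the argument to extract_from_url, so it must be on the same line.
result = extractor.extract_from_url 'http://techcrunch.com/2014/05/24/dont-believe-anyone-who-tells-you-learning-to-code-is-easy/'
puts result.text
```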

Or

```ruby
extractor = Webtractor::Extractor.new
result = extractor.extract '...'
```

Or

```ruby
page = Nokogiri::HTML(...)
extractor = Webtractor::Extractor.new
result = extractor.extract_from_xml page
```

You can also access the Nokogiri document from the result via its `xml` attribute:

```ruby
puts result.xml.xpath('...').text
```

## Advanced usage

The process of extracting the main content from a webpage is simple. It
applies a sequence of filters to the document, and each filter receives the
output of the previously applied filter as its input.
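The chaining described above can be sketched in plain Ruby. This is only an illustration of the idea, not Webtractor's actual implementation: `StripTags`, `SqueezeWhitespace`, and `Pipeline` are hypothetical names, and the "page" here is a plain string rather than a Nokogiri document.

```ruby
# Hypothetical stand-ins for filters; these are not Webtractor classes.
class StripTags
  # Remove anything that looks like an HTML tag.
  def process(page)
    page.gsub(/<[^>]+>/, '')
  end
end

class SqueezeWhitespace
  # Collapse runs of whitespace and trim the ends.
  def process(page)
    page.gsub(/\s+/, ' ').strip
  end
end

# Hypothetical pipeline: feeds each filter the output of the previous one.
class Pipeline
  def initialize(filters)
    @filters = filters
  end

  def run(page)
    @filters.reduce(page) { |current, filter| filter.process(current) }
  end
end

pipeline = Pipeline.new([StripTags.new, SqueezeWhitespace.new])
puts pipeline.run("<p>Hello,   <b>world</b></p>\n")
```

Because each filter only sees the previous filter's output, the order of the filters matters: swapping two filters can change the final result.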

You can list the names of the default filters:

```ruby
p Webtractor::Filters::DefaultFilter.new.filters.map { |f| f.class.to_s }
```

You can remove any filter:

```ruby
extractor.remove_filter Webtractor::Filters::RemoveComments
```

You can also create your own filter. A filter can be any class that
implements a *process* method, which takes the page as an argument and
returns the page:

```ruby
class RemoveBolds
  # Receives the page, removes all <b> elements, and returns the page.
  def process(page)
    page.css('b').remove
    page
  end
end

extractor.add_filter RemoveBolds.new
```
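To make the difference between the two calls concrete (`remove_filter` takes a class, `add_filter` takes an instance), here is a minimal sketch of how such a filter list might be managed. `TinyExtractor`, `Upcase`, and `Exclaim` are hypothetical names invented for this example, the page is a plain string, and Webtractor's real internals may differ.

```ruby
# Hypothetical sketch; TinyExtractor is not part of Webtractor.
class TinyExtractor
  attr_reader :filters

  def initialize(filters = [])
    @filters = filters.dup
  end

  # Append a filter instance; it only needs to respond to #process.
  def add_filter(filter)
    raise ArgumentError, 'filter must respond to #process' unless filter.respond_to?(:process)
    @filters << filter
    self
  end

  # Drop every filter that is an instance of the given class.
  def remove_filter(filter_class)
    @filters.reject! { |f| f.instance_of?(filter_class) }
    self
  end

  # Apply the filters in order, passing each result to the next filter.
  def extract(page)
    @filters.each { |f| page = f.process(page) }
    page
  end
end

class Upcase
  def process(page)
    page.upcase
  end
end

class Exclaim
  def process(page)
    "#{page}!"
  end
end

extractor = TinyExtractor.new([Upcase.new])
extractor.add_filter(Exclaim.new)
puts extractor.extract('hello')   # both filters applied
extractor.remove_filter(Upcase)
puts extractor.extract('hello')   # only Exclaim remains
```

In this sketch a new filter is appended to the end of the list, so it runs after the existing ones. Whether Webtractor appends or inserts elsewhere is not stated in this README.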

## License

This library is distributed under the Beerware license.