https://github.com/reneklacan/webtractor
- Host: GitHub
- URL: https://github.com/reneklacan/webtractor
- Owner: reneklacan
- License: other
- Created: 2014-05-25T17:11:30.000Z (over 10 years ago)
- Default Branch: master
- Last Pushed: 2014-05-26T18:21:42.000Z (over 10 years ago)
- Last Synced: 2024-09-17T23:23:07.095Z (about 2 months ago)
- Language: Ruby
- Size: 129 KB
- Stars: 1
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# Webtractor
Webtractor is a Ruby library that extracts the main content from web
pages such as news articles and blog posts. The result is the main
content without any boilerplate (menus, footers, comments, etc.).

## Installation

You can install it directly via gem:
```
gem install webtractor
```

Or you can put it in your Gemfile:

```ruby
gem 'webtractor'
```

Then run:

```
bundle install
```

## Basic usage

```ruby
extractor = Webtractor::Extractor.new
result = extractor.extract_from_url 'http://techcrunch.com/2014/05/24/dont-believe-anyone-who-tells-you-learning-to-code-is-easy/'
puts result.text
```

Or

```ruby
extractor = Webtractor::Extractor.new
result = extractor.extract '...'
```

Or

```ruby
page = Nokogiri::HTML(...)
extractor = Webtractor::Extractor.new
result = extractor.extract_from_xml page
```

You can also access the Nokogiri document from the result via the `xml` attribute:

```ruby
puts result.xml.xpath('...').text
```

## Advanced usage

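
As a rough picture of the filter chain this section describes, here is a minimal stand-alone sketch. It does not use Webtractor's own classes: `StripTags`, `SqueezeWhitespace`, and `apply_filters` are hypothetical names, and a plain string stands in for the Nokogiri document, on the assumption that a filter only needs a `process` method that takes a page and returns a page.

```ruby
# Hypothetical filter: drops anything that looks like an HTML tag.
# A plain String is used here as a stand-in for a Nokogiri document.
class StripTags
  def process(page)
    page.gsub(/<[^>]+>/, '')
  end
end

# Hypothetical filter: collapses runs of whitespace into single spaces.
class SqueezeWhitespace
  def process(page)
    page.gsub(/\s+/, ' ').strip
  end
end

# Applies the filters in order; each filter receives the output
# of the previously applied filter.
def apply_filters(page, filters)
  filters.reduce(page) { |current, filter| filter.process(current) }
end

html = "<p>Hello   <b>world</b></p>\n"
puts apply_filters(html, [StripTags.new, SqueezeWhitespace.new])
```

Because every filter returns the page, the order of the list is the order of application, and adding or removing a filter only changes that list.
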
The process of extracting the main content from a web page is simple. It
consists of applying multiple filters to the document, where every filter
receives as input the output of the previously applied filter.

You can look at the names of the default filters:
```ruby
p Webtractor::Filters::DefaultFilter.new.filters.map{|f| f.class.to_s}
```

You can remove any filter:

```ruby
extractor.remove_filter Webtractor::Filters::RemoveComments
```

Or you can create your own filter. It can be any class that
implements a *process* method, which takes the page as an argument and
returns the page:

```ruby
class RemoveBolds
  def process page
    page.css('b').remove
    page
  end
end

extractor.add_filter RemoveBolds.new
```

## License

This library is distributed under the Beerware license.