Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/tombenner/nikkou
Extract useful data from HTML and XML with ease!
- Host: GitHub
- URL: https://github.com/tombenner/nikkou
- Owner: tombenner
- License: mit
- Created: 2013-06-02T02:26:58.000Z (over 11 years ago)
- Default Branch: master
- Last Pushed: 2018-01-03T01:40:30.000Z (almost 7 years ago)
- Last Synced: 2024-07-17T12:55:54.769Z (4 months ago)
- Language: Ruby
- Size: 13.7 KB
- Stars: 57
- Watchers: 4
- Forks: 7
- Open Issues: 3
Metadata Files:
- Readme: README.md
- License: MIT-LICENSE
README
Nikkou
======
Extract useful data from HTML and XML with ease!

Description
-----------

Nikkou adds methods to Nokogiri that make extracting commonly used data from HTML and XML easier. It lets you transform HTML into structured data very quickly, and it integrates nicely with [Mechanize](https://github.com/sparklemotion/mechanize).

Installation
------------

Add Nikkou to your Gemfile:

```ruby
gem 'nikkou'
```

Method Overview
---------------

Here's a summary of the methods Nikkou provides (see "Methods" for details):

### Formatting
**parse_text** - Parses the node's text as XML and returns it as a Nokogiri::XML::NodeSet
**time(options={})** - Intelligently parses the time (relative or absolute) of either the text or a specified attribute; accepts a `time_zone` option
**url(attribute='href')** - Converts the href (or other specified attribute) into an absolute URL using the document's URI; for example, an `href` of `/p/1` on a page at `http://mysite.com/mypage` yields `http://mysite.com/p/1`
### Searching
**attr_equals(attribute, string)** - Finds nodes where the attribute equals the string
**attr_includes(attribute, string)** - Finds nodes where the attribute includes the string
**attr_matches(attribute, pattern)** - Finds nodes where the attribute matches the pattern
**drill(*methods)** - Nil-safe method chaining
**find(path)** - Same as `search` but returns the first matched node
**text_equals(string)** - Finds nodes where the text equals the string
**text_includes(string)** - Finds nodes where the text includes the string
**text_matches(pattern)** - Finds nodes where the text matches the pattern

## Methods

### Formatting
#### time(options={})
Returns a Time object (in UTC) by automatically parsing the text or specified attribute of the node.
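
For the relative case, the parsing step can be pictured with a small plain-Ruby sketch. This is not Nikkou's implementation: `parse_relative_time`, `UNIT_SECONDS`, and the short list of supported units are assumptions made for illustration, and only core Ruby is used.

```ruby
# Hypothetical illustration, not part of Nikkou.
# Seconds per supported unit (an assumed, deliberately small set).
UNIT_SECONDS = {
  'second' => 1,
  'minute' => 60,
  'hour'   => 3_600,
  'day'    => 86_400
}.freeze

# Turns text such as "3 hours ago" into a UTC Time relative to `now`.
# Returns nil when the text does not look like a relative time.
def parse_relative_time(text, now: Time.now.utc)
  match = /\A\s*(\d+)\s+(second|minute|hour|day)s?\s+ago\s*\z/i.match(text)
  return nil unless match

  now - (match[1].to_i * UNIT_SECONDS.fetch(match[2].downcase))
end

reference = Time.utc(2013, 5, 22, 12, 0, 0)
puts parse_relative_time('3 hours ago', now: reference) # prints 2013-05-22 09:00:00 UTC
```

Passing `now` explicitly keeps the sketch deterministic; absolute timestamps would need a separate parsing path.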
```ruby
# Given a link whose text is "3 hours ago"
doc.search('a').first.time
```

###### Options

`attribute`
The attribute to parse:
```ruby
# Given a link with a data-published-at attribute and the text "My link"
doc.search('a').first.time(attribute: 'data-published-at')
```

`time_zone`
The document's time zone (the time will be converted from that to UTC):
```ruby
# Given a link whose text is "3 hours ago"
doc.search('a').first.time(time_zone: 'America/New_York')
```

#### url(attribute='href')

Returns an absolute URL; useful for parsing relative hrefs. The document's `uri` needs to be set for Nikkou to know what domain to add to relative paths.
```ruby
# Given a link with a relative href and the text "My link"
doc.uri = 'http://mysite.com/mypage'
doc.search('a').first.url # "http://mysite.com/p/1"
```

If Mechanize is being used, the `uri` doesn't need to be set manually.
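
The resolution step itself can be sketched with Ruby's standard `URI.join`. Whether Nikkou uses `URI.join` internally is an assumption, and `absolute_url` is a hypothetical helper that is not part of the gem; the sketch only shows why the document's URI is needed.

```ruby
require 'uri'

# Hypothetical helper, not part of Nikkou: resolves an attribute value
# against the URI of the page it was found on.
def absolute_url(document_uri, attribute_value)
  URI.join(document_uri, attribute_value).to_s
end

page = 'http://mysite.com/mypage'

puts absolute_url(page, '/p/1')                 # prints http://mysite.com/p/1
puts absolute_url(page, '/p/1#comments')        # prints http://mysite.com/p/1#comments
puts absolute_url(page, 'http://other.example/x') # prints http://other.example/x
```

Values that are already absolute pass through unchanged, so the same call works for both kinds of link.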
###### Options
`attribute`
The attribute to parse:
```ruby
# Given a link with a data-comments-url attribute and the text "My Link"
doc.uri = 'http://mysite.com/mypage'
doc.search('a').first.url('data-comments-url') # "http://mysite.com/p/1#comments"
```

### Searching

#### attr_equals(attribute, string)
Selects nodes where the specified attribute equals the string.
```ruby
# Given an element with data-type="news" and the text "My Text"
doc.attr_equals('data-type', 'news').first.text # "My Text"
```

#### attr_includes(attribute, string)

Selects nodes where the specified attribute includes the string.
```ruby
# Given an element whose data-type attribute contains "news" and whose text is "My Text"
doc.attr_includes('data-type', 'news').first.text # "My Text"
```

#### attr_matches(attribute, pattern)

Selects nodes with an attribute matching a pattern. The pattern's matches are available in `Node#matches`.
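
A plain-Ruby sketch of the matching step, assuming `matches` mirrors `MatchData#to_a` (the full match first, then each capture group). `matches_for` is a hypothetical helper, not part of Nikkou.

```ruby
# Hypothetical helper, not part of Nikkou: returns the full match followed
# by the capture groups, or nil when the value does not match.
def matches_for(value, pattern)
  match = pattern.match(value)
  match ? match.to_a : nil
end

p matches_for('3 Comments', /(\d+) comments/i) # prints ["3 Comments", "3"]
p matches_for('No replies', /(\d+) comments/i) # prints nil
```

Captures come back as strings, so a count such as `"3"` still needs `to_i` before arithmetic.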
```ruby
# Given an element with data-tooltip="3 Comments" and the text "My Text"
doc.attr_matches('data-tooltip', /(\d+) comments/i).first.text # "My Text"
doc.attr_matches('data-tooltip', /(\d+) comments/i).first.matches # ["3 Comments", "3"]
```

#### drill(*methods)

Nil-safe method chaining. Replaces this:
```ruby
node = doc.find('.count')
if node
  attribute = node.attr('data-count')
  if attribute
    return attribute.to_i
  end
end
```

With this:

```ruby
return doc.drill([:find, '.count'], [:attr, 'data-count'], :to_i)
```

#### find(path)

Same as `search`, but returns the first matched node. Replaces this:
```ruby
nodes = node.search('h4')
if nodes
  return nodes.first
end
```

With this:

```ruby
return node.find('h4')
```

#### text_includes(string)

Selects nodes where the text includes the string.
```ruby
# Given an element with the text "My Text"
doc.text_includes('Text').first.text # "My Text"
```

#### text_matches(pattern)

Selects nodes with text matching a pattern. The pattern's matches are available in `Node#matches`.
```ruby
# Given a link with href="/p/1" and the text "3 Comments"
doc.text_matches(/^(\d+) comments$/i).first.attr('href') # "/p/1"
doc.text_matches(/^(\d+) comments$/i).first.matches # ["3 Comments", "3"]
```

License
-------

Nikkou is released under the MIT License. Please see the MIT-LICENSE file for details.