Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/namusyaka/gammo
A pure Ruby HTML5-compliant parser with CSS selector and XPath 1.0 traversal
- Host: GitHub
- URL: https://github.com/namusyaka/gammo
- Owner: namusyaka
- License: mit
- Created: 2020-02-11T07:48:17.000Z (almost 5 years ago)
- Default Branch: master
- Last Pushed: 2024-03-01T16:53:02.000Z (9 months ago)
- Last Synced: 2024-10-14T19:42:44.181Z (about 1 month ago)
- Language: Ruby
- Homepage:
- Size: 179 KB
- Stars: 194
- Watchers: 3
- Forks: 6
- Open Issues: 2
Metadata Files:
- Readme: README.md
- License: LICENSE.txt
# Gammo - A pure-Ruby HTML5 parser
[![Testing](https://github.com/namusyaka/gammo/actions/workflows/test.yml/badge.svg?branch=master)](https://github.com/namusyaka/gammo/actions/workflows/test.yml)
[![GitHub issues](https://img.shields.io/github/issues/namusyaka/gammo)](https://github.com/namusyaka/gammo/issues)
[![GitHub forks](https://img.shields.io/github/forks/namusyaka/gammo?color=brightgreen)](https://github.com/namusyaka/gammo/network)
[![GitHub stars](https://img.shields.io/github/stars/namusyaka/gammo?color=brightgreen)](https://github.com/namusyaka/gammo/stargazers)
[![GitHub license](https://img.shields.io/github/license/namusyaka/gammo?color=brightgreen)](https://github.com/namusyaka/gammo/blob/master/LICENSE.txt)
[![Documentation](http://img.shields.io/:yard-docs-38c800.svg)](http://www.rubydoc.info/gems/gammo/frames)

Gammo provides a pure Ruby HTML5-compliant parser and CSS selector / XPath support for traversing the DOM tree built by Gammo.
The implementation of the HTML5 parsing algorithm in Gammo conforms to [the WHATWG specification](https://html.spec.whatwg.org/multipage/parsing.html). Given an HTML string, Gammo parses it and builds a DOM tree based on the tokenization and tree-construction algorithms defined in the WHATWG parsing algorithm. These implementations are provided without any external dependencies.

The name Gammo is inspired by [Gumbo](https://github.com/google/gumbo-parser), but Gammo is a fried tofu fritter made with vegetables.
```ruby
require 'gammo'
require 'open-uri'

parser = URI.open('https://google.com') { |f| Gammo.new(f.read) }
document = parser.parse #=> the root Gammo::Node::Document

puts document.css('title').first.inner_text #=> 'Google'
```

* [Overview](#overview)
  * [Features](#features)
* [Tokenization](#tokenization)
  * [Token types](#token-types)
* [Parsing](#parsing)
  * [Notes](#notes)
* [Node](#node)
* [DOM Tree Traversal](#dom-tree-traversal)
  * [XPath 1.0 (experimental)](#xpath-10-experimental)
    * [Example](#example)
    * [Axis Specifiers](#axis-specifiers)
    * [Node Test](#node-test)
    * [Operators](#operators)
    * [Functions](#functions)
      * [Node set functions](#node-set-functions)
      * [String Functions](#string-functions)
      * [Boolean Functions](#boolean-functions)
      * [Number Functions](#number-functions)
  * [CSS Selector (experimental)](#css-selector-experimental)
    * [Example](#example)
    * [Groups of selectors](#groups-of-selectors)
    * [Simple selectors](#simple-selectors)
      * [Type selector & Universal selector](#type-selector--universal-selector)
      * [Attribute selectors](#attribute-selectors)
      * [Class selectors](#class-selectors)
      * [ID selectors](#id-selectors)
      * [Pseudo-classes](#pseudo-classes)
    * [Combinators](#combinators)
* [Performance](#performance)
* [References](#references)
* [License](#license)
* [Release History](#release-history)

## Overview
### Features
- [Tokenization](#tokenization): Gammo has a tokenizer that implements [the tokenization algorithm](https://html.spec.whatwg.org/multipage/parsing.html#tokenization).
- [Parsing](#parsing): Gammo provides a parser that implements the parsing algorithm using the above tokenization and [the tree-construction algorithm](https://html.spec.whatwg.org/multipage/parsing.html#tree-construction).
- [Node](#node): Gammo provides nodes that partially implement the [WHATWG DOM specification](https://dom.spec.whatwg.org/).
- [DOM Tree Traversal](#dom-tree-traversal): Gammo provides ways to traverse the DOM tree (CSS selector / XPath).
- [Performance](#performance): Gammo does not prioritize performance, and there are a few performance caveats to be aware of.

## Tokenization
`Gammo::Tokenizer` implements the tokenization algorithm defined by WHATWG. You can get tokens in order by calling `Gammo::Tokenizer#next_token`.
Here is a simple example that runs only the tokenizer.
```ruby
def dump_for(token)
  puts "data: #{token.data}, class: #{token.class}"
end

tokenizer = Gammo::Tokenizer.new('<!doctype html><input type="button"><frameset>')
dump_for tokenizer.next_token #=> data: html, class: Gammo::Tokenizer::DoctypeToken
dump_for tokenizer.next_token #=> data: input, class: Gammo::Tokenizer::StartTagToken
dump_for tokenizer.next_token #=> data: frameset, class: Gammo::Tokenizer::StartTagToken
dump_for tokenizer.next_token #=> data: end of string, class: Gammo::Tokenizer::ErrorToken
```

The parser described below depends on this tokenizer: it applies the WHATWG parsing algorithm, in order, to the tokens that the tokenizer extracts.
### Token types
The tokens generated by the tokenizer will be categorized into one of the following types:

| Token type | Description |
| --- | --- |
| `Gammo::Tokenizer::ErrorToken` | Represents an error token; it usually means end-of-string. |
| `Gammo::Tokenizer::TextToken` | Represents a text token like "foo", which is the inner text of an element. |
| `Gammo::Tokenizer::StartTagToken` | Represents a start tag token like `<a>`. |
| `Gammo::Tokenizer::EndTagToken` | Represents an end tag token like `</a>`. |
| `Gammo::Tokenizer::SelfClosingTagToken` | Represents a self-closing tag token like `<img />`. |
| `Gammo::Tokenizer::CommentToken` | Represents a comment token like `<!-- comment -->`. |
| `Gammo::Tokenizer::DoctypeToken` | Represents a doctype token like `<!doctype html>`. |

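
To make the token types listed above concrete, here is a deliberately simplified, regex-based sketch in plain Ruby. It is not Gammo's tokenizer and is far from WHATWG-conformant (no attribute parsing, no character references, no special handling of script content); `ToyToken` and `toy_tokenize` are hypothetical names used only for illustration.

```ruby
require 'strscan'

# Hypothetical token record: a type symbol plus the tag name or text payload.
ToyToken = Struct.new(:type, :data)

# Splits a small, well-formed HTML snippet into tokens that mirror the
# categories listed above. Illustrative only; not Gammo's implementation.
def toy_tokenize(html)
  scanner = StringScanner.new(html)
  tokens = []
  until scanner.eos?
    if scanner.scan(/<!--(.*?)-->/m)
      tokens << ToyToken.new(:comment, scanner[1].strip)
    elsif scanner.scan(/<!doctype\s+([^>]+)>/i)
      tokens << ToyToken.new(:doctype, scanner[1].strip)
    elsif scanner.scan(%r{</([a-zA-Z][a-zA-Z0-9]*)\s*>})
      tokens << ToyToken.new(:end_tag, scanner[1].downcase)
    elsif scanner.scan(%r{<([a-zA-Z][a-zA-Z0-9]*)[^>]*?(/?)>})
      type = scanner[2] == '/' ? :self_closing_tag : :start_tag
      tokens << ToyToken.new(type, scanner[1].downcase)
    elsif scanner.scan(/[^<]+/)
      tokens << ToyToken.new(:text, scanner[0])
    else
      # A stray "<" that starts no recognizable construct is kept as text.
      tokens << ToyToken.new(:text, scanner.getch)
    end
  end
  # Like the error token described above, a final token marks end-of-string.
  tokens << ToyToken.new(:error, 'end of string')
end
```

Calling `toy_tokenize('<p>foo</p>').map(&:type)` yields `[:start_tag, :text, :end_tag, :error]`, the same overall shape as repeatedly calling `next_token` until the error token appears.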
## Parsing
`Gammo::Parser` implements processing in [the tree-construction stage](https://html.spec.whatwg.org/multipage/parsing.html#tree-construction) based on the tokenization described above.
After a successful parse, the parser exposes the root document through the `document` accessor (the same object returned by `Gammo::Parser#parse`). From there, you can traverse the DOM tree constructed by the parser.
```ruby
require 'gammo'
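require 'pp'

document = Gammo.new('<!doctype html><input type="button"><frameset>').parse

# Recursively collects each node's hash representation, nesting children
# under the :children key of their parent.
def dump_for(node, strm)
  strm << node.to_h
  return strm unless node && (child = node.first_child)
  while child
    dump_for(child, (strm.last[:children] ||= []))
    child = child.next_sibling
  end
  strm
end

pp dump_for(document, [])
```

### Notes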
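Earlier versions of Gammo could not traverse the DOM tree with CSS selectors or XPath the way [Nokogiri](https://nokogiri.org/) does.
XPath 1.0 support was added in v0.2.0 and CSS selector support in v0.3.0, both as experimental features; see [DOM Tree Traversal](#dom-tree-traversal).

## Node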
The nodes generated by the parser will be categorized into one of the following types:

| Node type | Description |
| --- | --- |
| `Gammo::Node::Error` | Represents an error node; it usually means end-of-string. |
| `Gammo::Node::Text` | Represents a text node like "foo", which is the inner text of an element. |
| `Gammo::Node::Document` | Represents the root document type. It is always returned by `Gammo::Parser#document`. |
| `Gammo::Node::Element` | Represents any HTML element, like `<p>`. |
| `Gammo::Node::Comment` | Represents a comment like `<!-- foo -->`. |
| `Gammo::Node::Doctype` | Represents a doctype like `<!doctype html>`. |

Some nodes, such as `Gammo::Node::Element` and `Gammo::Node::Document`, hold pointers to related nodes, accessible through methods such as `Gammo::Node#next_sibling` and `Gammo::Node#first_child`. In addition, APIs such as `Gammo::Node#append_child` and `Gammo::Node#remove_child`, which perform operations defined in the DOM Living Standard, are also provided.
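
The pointer-based layout described above can be sketched in plain Ruby. The `ToyNode` class below is hypothetical and only borrows the method names mentioned in this section (`first_child`, `next_sibling`, `append_child`, `remove_child`); it is not Gammo's node implementation, and the `each_node` helper is an assumption added for the demonstration.

```ruby
# Hypothetical node class, not Gammo::Node. It keeps sibling and child
# pointers so that a tree can be walked without intermediate arrays.
class ToyNode
  attr_reader :name
  attr_accessor :parent, :first_child, :last_child,
                :previous_sibling, :next_sibling

  def initialize(name)
    @name = name
  end

  # Appends child as the last child of this node and links the siblings.
  def append_child(child)
    raise ArgumentError, 'child already has a parent' if child.parent
    child.parent = self
    if last_child
      last_child.next_sibling = child
      child.previous_sibling = last_child
    else
      self.first_child = child
    end
    self.last_child = child
    child
  end

  # Detaches child from this node and repairs the neighbouring pointers.
  def remove_child(child)
    raise ArgumentError, 'not a child of this node' unless child.parent.equal?(self)
    if child.previous_sibling
      child.previous_sibling.next_sibling = child.next_sibling
    else
      self.first_child = child.next_sibling
    end
    if child.next_sibling
      child.next_sibling.previous_sibling = child.previous_sibling
    else
      self.last_child = child.previous_sibling
    end
    child.parent = child.previous_sibling = child.next_sibling = nil
    child
  end

  # Yields this node and its descendants in document (depth-first) order.
  def each_node(&block)
    return enum_for(:each_node) unless block
    yield self
    child = first_child
    while child
      child.each_node(&block)
      child = child.next_sibling
    end
  end
end
```

The `while child ... child = child.next_sibling` loop in `each_node` is the same walking pattern used by the `dump_for` example in the Parsing section.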
## DOM Tree Traversal
CSS selectors and XPath 1.0 are the two ways to traverse the DOM tree built by Gammo.
### XPath 1.0 (experimental)
Gammo has its own lexer and parser for XPath 1.0, exposed as a helper on the DOM tree built by Gammo.
Here is a simple example:

```ruby
document = Gammo.new('<!doctype html><input type="button">').parse
node_set = document.xpath('//input[@type="button"]') #=> node set containing the input element

node_set.length #=> 1
node_set.first #=> the matching Gammo::Node::Element
```

**Because this is implemented from scratch, Gammo provides XPath support as a highly experimental feature.**
Please [file an issue](https://github.com/namusyaka/gammo/issues/new) if you find bugs.

#### Example
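Before going into the details of XPath support, let's look at a few simple examples.
Given the following sample HTML and its DOM tree:

```ruby
document = Gammo.new(<<-EOS).parse
<!DOCTYPE html>
<html>
<body>
<h1>namusyaka.com</h1>
<p>Here is a sample web site.</p>
<ul>
  <li>hello</li>
  <li>world</li>
</ul>
<ul id="links">
  <li>Google <a href="https://google.com/">google.com</a></li>
  <li>GitHub <a href="https://github.com/namusyaka">github.com/namusyaka</a></li>
</ul>
</body>
</html>
EOS
```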
The following XPath expression gets all `li` elements and prints their text content:
```ruby
document.xpath('//li').each do |elm|
  puts elm.inner_text
end
```
The following XPath expression gets all `li` elements under the `ul` element having the `id=links` attribute:
```ruby
document.xpath('//ul[@id="links"]/li').each do |elm|
  puts elm.inner_text
end
```
The following XPath expression gets each text node for each `li` element under the `ul` element having the `id=links` attribute:
```ruby
document.xpath('//ul[@id="links"]/li/text()').each do |elm|
  puts elm.data
end
```
#### Axis Specifiers
In Gammo, the axis specifier indicates the navigation direction within the DOM tree. Here is the list of axes; Gammo supports all of them.

| Full Syntax | Abbreviated Syntax | Supported | Notes |
| --- | --- | --- | --- |
| `ancestor` | | yes | |
| `ancestor-or-self` | | yes | |
| `attribute` | `@` | yes | `@abc` is the alias for `attribute::abc` |
| `child` | | yes | `abc` is short for `child::abc` |
| `descendant` | | yes | |
| `descendant-or-self` | `//` | yes | `//` is the alias for `/descendant-or-self::node()/` |
| `following` | | yes | |
| `following-sibling` | | yes | |
| `namespace` | | yes | |
| `parent` | `..` | yes | `..` is the alias for `parent::node()` |
| `preceding` | | yes | |
| `preceding-sibling` | | yes | |
| `self` | `.` | yes | `.` is the alias for `self::node()` |

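
To illustrate what "navigation direction" means for a few of these axes, here is a small plain-Ruby sketch over a toy tree. `AxisNode`, `AXES` and `axis_names` are hypothetical names for this illustration only; the sketch does not use or describe Gammo's internal XPath evaluator.

```ruby
# Hypothetical tree node for illustration; Gammo's own nodes are not used here.
class AxisNode
  attr_reader :name, :parent, :children

  def initialize(name, parent = nil)
    @name = name
    @parent = parent
    @children = []
  end

  # Creates a child with the given name, attaches it, and returns it.
  def add(name)
    child = AxisNode.new(name, self)
    @children << child
    child
  end
end

# Each lambda maps a context node to the nodes selected by that axis.
# Reverse axes (ancestor, preceding-sibling) list the nearest node first.
AXES = {
  'self'              => ->(n) { [n] },
  'child'             => ->(n) { n.children },
  'parent'            => ->(n) { [n.parent].compact },
  'descendant'        => ->(n) { n.children.flat_map { |c| [c] + AXES['descendant'].(c) } },
  'ancestor'          => ->(n) { n.parent ? [n.parent] + AXES['ancestor'].(n.parent) : [] },
  'following-sibling' => ->(n) { n.parent ? n.parent.children.drop_while { |c| !c.equal?(n) }.drop(1) : [] },
  'preceding-sibling' => ->(n) { n.parent ? n.parent.children.take_while { |c| !c.equal?(n) }.reverse : [] }
}.freeze

# Returns the names of the nodes selected by the axis from the context node.
def axis_names(node, axis)
  AXES.fetch(axis).call(node).map(&:name)
end
```

For a tree `html > body > ul > (li1, li2, li3)`, `axis_names(li2, 'ancestor')` walks upward through `ul`, `body` and `html`, while `axis_names(li2, 'following-sibling')` selects only `li3`.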
#### Node Test
Node tests consist of specific node names or more general expressions. Although XPath uses `:` to specify a namespace prefix, Gammo does not support it yet because namespaces are [not a core feature in HTML5](https://html.spec.whatwg.org/multipage/introduction.html#html-vs-xhtml).

| Full Syntax | Supported | Notes |
| --- | --- | --- |
| `text()` | yes | Finds a node of type text, e.g. `hello ` in `<p>hello <a href="https://hello">world</a></p>` |
| `comment()` | yes | Finds a node of type comment, e.g. `<!-- comment -->` |
| `node()` | yes | Finds any node at all. |

Also note that the `processing-instruction` node test is not supported, and there are no plans to support it.
#### Operators
- The `/`, `//` and `[]` are used in the path expression.
- The union operator `|` forms the union of two node sets.
- The boolean operators: `and`, `or`
- The arithmetic operators: `+`, `-`, `*`, `div` and `mod`
- Comparison operators: `=`, `!=`, `<`, `>`, `<=`, `>=`
#### Functions
XPath 1.0 defines four data types (node-set, string, number, boolean), and there are various functions based on those types. Gammo supports only some of these functions, so check the tables below before relying on one.
##### Node set functions

| Function Name | Supported | Specification |
| --- | --- | --- |
| `last()` | yes | https://www.w3.org/TR/1999/REC-xpath-19991116/#function-last |
| `position()` | yes | https://www.w3.org/TR/1999/REC-xpath-19991116/#function-position |
| `count(node-set)` | yes | https://www.w3.org/TR/1999/REC-xpath-19991116/#function-count |

##### String Functions

| Function Name | Supported | Specification |
| --- | --- | --- |
| `string(object?)` | yes | https://www.w3.org/TR/1999/REC-xpath-19991116/#function-string |
| `concat(string, string, string*)` | yes | https://www.w3.org/TR/1999/REC-xpath-19991116/#function-concat |
| `starts-with(string, string)` | yes | https://www.w3.org/TR/1999/REC-xpath-19991116/#function-starts-with |
| `contains(string, string)` | yes | https://www.w3.org/TR/1999/REC-xpath-19991116/#function-contains |
| `substring-before(string, string)` | yes | https://www.w3.org/TR/1999/REC-xpath-19991116/#function-substring-before |
| `substring-after(string, string)` | yes | https://www.w3.org/TR/1999/REC-xpath-19991116/#function-substring-after |
| `substring(string, number, number?)` | no | https://www.w3.org/TR/1999/REC-xpath-19991116/#function-substring |
| `string-length(string?)` | no | https://www.w3.org/TR/1999/REC-xpath-19991116/#function-string-length |
| `normalize-space(string?)` | no | https://www.w3.org/TR/1999/REC-xpath-19991116/#function-string-normalize-space |
| `translate(string, string, string)` | no | https://www.w3.org/TR/1999/REC-xpath-19991116/#function-string-translate |

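
For the string functions marked "no", one possible workaround is to select nodes with a supported expression and do the string work in Ruby on the resulting text. The helpers below are a minimal sketch of the XPath 1.0 definitions in plain Ruby; the `xpath_*` method names are hypothetical, they are not part of Gammo, and `xpath_substring` assumes finite numeric arguments.

```ruby
# XPath 1.0 whitespace is limited to space, tab, CR and LF.
XPATH_SPACE = /[ \t\r\n]+/

# normalize-space(): collapse whitespace runs, then trim both ends.
def xpath_normalize_space(str)
  str.gsub(XPATH_SPACE, ' ').delete_prefix(' ').delete_suffix(' ')
end

# string-length(): counts characters, not bytes.
def xpath_string_length(str)
  str.length
end

# translate(): replace each character found in `from` with the character at
# the same position in `to`; characters without a counterpart are removed.
def xpath_translate(str, from, to)
  mapping = {}
  from.each_char.with_index do |ch, i|
    mapping[ch] = to[i] unless mapping.key?(ch) # first occurrence wins
  end
  str.each_char.map { |ch| mapping.key?(ch) ? mapping[ch].to_s : ch }.join
end

# substring(): positions are 1-based and both numbers are rounded the XPath
# way (halves toward positive infinity). Assumes finite numbers.
def xpath_substring(str, start, length = nil)
  first = (start + 0.5).floor
  last = length ? first + (length + 0.5).floor : Float::INFINITY
  chars = str.chars
  chars.each_index.select { |i| (i + 1) >= first && (i + 1) < last }
       .map { |i| chars[i] }.join
end
```

For example, `xpath_substring('12345', 1.5, 2.6)` returns `'234'` and `xpath_translate('--aaa--', 'abc-', 'ABC')` returns `'AAA'`, matching the examples given in the XPath 1.0 recommendation.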
##### Boolean Functions

| Function Name | Supported | Specification |
| --- | --- | --- |
| `boolean(object)` | yes | https://www.w3.org/TR/1999/REC-xpath-19991116/#function-boolean |
| `not(object)` | yes | https://www.w3.org/TR/1999/REC-xpath-19991116/#function-not |
| `true()` | yes | https://www.w3.org/TR/1999/REC-xpath-19991116/#function-true |
| `false()` | yes | https://www.w3.org/TR/1999/REC-xpath-19991116/#function-false |
| `lang()` | no | https://www.w3.org/TR/1999/REC-xpath-19991116/#function-lang |

##### Number Functions

| Function Name | Supported | Specification |
| --- | --- | --- |
| `number(object?)` | no | https://www.w3.org/TR/1999/REC-xpath-19991116/#function-number |
| `sum(node-set)` | no | https://www.w3.org/TR/1999/REC-xpath-19991116/#function-sum |
| `floor(number)` | no | https://www.w3.org/TR/1999/REC-xpath-19991116/#function-floor |
| `ceiling(number)` | yes | https://www.w3.org/TR/1999/REC-xpath-19991116/#function-ceiling |
| `round(number)` | no | https://www.w3.org/TR/1999/REC-xpath-19991116/#function-round |

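
The number functions marked "no" can likewise be approximated in Ruby after selecting nodes with a supported expression. This is a rough sketch of the XPath 1.0 semantics, not Gammo code: the `xpath_*` names are hypothetical, `xpath_sum` takes an array of strings (for example the text of each selected node) rather than a node set, and the sign of zero is not preserved.

```ruby
# number(): surrounding XPath whitespace is ignored; anything that is not a
# plain decimal number (optionally negative) becomes NaN.
def xpath_number(str)
  text = str.gsub(/\A[ \t\r\n]+|[ \t\r\n]+\z/, '')
  text.match?(/\A-?(\d+(\.\d*)?|\.\d+)\z/) ? text.to_f : Float::NAN
end

# floor(): largest integer not greater than the argument.
def xpath_floor(number)
  value = number.to_f
  return value if value.nan? || value.infinite?
  value.floor.to_f
end

# round(): XPath rounds halves toward positive infinity, unlike Float#round,
# which rounds halves away from zero.
def xpath_round(number)
  value = number.to_f
  return value if value.nan? || value.infinite?
  (value + 0.5).floor.to_f
end

# sum(): converts each string value with number() and adds the results.
def xpath_sum(strings)
  strings.sum(0.0) { |s| xpath_number(s) }
end
```

The difference in rounding is easy to miss: `xpath_round(-2.5)` gives `-2.0`, whereas Ruby's `-2.5.round` gives `-3`.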
### CSS Selector (experimental)
Gammo has its own lexer and parser for CSS selectors, exposed as a helper on the DOM tree built by Gammo.
Here is a simple example:
```ruby
document = Gammo.new('<!doctype html><input type="button">').parse
node_set = document.css('input[type="button"]') #=> node set containing the input element
node_set.length #=> 1
node_set.first #=> the matching Gammo::Node::Element
```
Because this is implemented from scratch, Gammo provides CSS selector support as a highly experimental feature. Please file an issue if you find bugs.
#### Example
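Before going into the details of CSS selector support, let's look at a few simple examples. Given the following sample HTML and its DOM tree:
```ruby
document = Gammo.new(<<-EOS).parse
<!DOCTYPE html>
<html>
<body>
<h1>namusyaka.com</h1>
<p>Here is a sample web site.</p>
<ul>
  <li>hello</li>
  <li>world</li>
</ul>
<ul id="links">
  <li>Google <a href="https://google.com/">google.com</a></li>
  <li>GitHub <a href="https://github.com/namusyaka">github.com/namusyaka</a></li>
</ul>
</body>
</html>
EOS
```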
The following CSS selector gets all `li` elements and prints their text content:
```ruby
document.css('li').each do |elm|
  puts elm.inner_text
end
```
The following CSS selector gets all `li` elements under the `ul` element having the `id=links` attribute:
```ruby
document.css('ul#links li').each do |elm|
  puts elm.inner_text
end
```
#### Groups of selectors
Gammo supports [groups of selectors](https://www.w3.org/TR/2018/REC-selectors-3-20181106/#grouping), which means you can use `,` to traverse the DOM tree with multiple selectors at once.
```ruby
require 'gammo'
@doc = Gammo.new(<<-EOS).parse
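<!DOCTYPE html>
<html>
<head>
<title>hello</title>
</head>
<body>
<p id="hello">hello</p>
<p id="world">world</p>
</body>
</html>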
EOS
@doc.css('#hello, #world').map(&:inner_text).join(' ') #=> 'hello world'
```
#### Simple selectors
##### Type selector & Universal selector
Gammo supports the basic grammar of type selector and universal selector, but not namespaces.
##### Attribute selectors
See more details: [6.3. Attribute selectors](https://www.w3.org/TR/2018/REC-selectors-3-20181106/#attribute-selectors)

| Syntax | Supported |
| --- | --- |
| `[att]` | yes |
| `[att=val]` | yes |
| `[att~=val]` | yes |
| `[att\|=val]` | yes |

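
As a reminder of what the four attribute selector forms mean in Selectors Level 3, here is a minimal plain-Ruby sketch. The `attribute_match?` method is hypothetical and works on a plain Hash standing in for an element's attributes; it does not show how Gammo evaluates selectors.

```ruby
# attrs is a Hash of attribute name => value standing in for an element.
# operator is nil for [att], or one of '=', '~=' and '|='.
def attribute_match?(attrs, name, operator = nil, value = nil)
  return false unless attrs.key?(name)

  actual = attrs[name]
  case operator
  when nil
    true # [att]: the attribute only has to be present
  when '='
    actual == value # [att=val]: exact match
  when '~='
    # [att~=val]: val must be one of the whitespace-separated words.
    # An empty val, or a val containing whitespace, never matches.
    !value.empty? && !value.match?(/\s/) &&
      actual.split(/[ \t\r\n\f]+/).include?(value)
  when '|='
    # [att|=val]: exactly val, or val immediately followed by "-".
    actual == value || actual.start_with?("#{value}-")
  else
    raise ArgumentError, "unsupported operator: #{operator}"
  end
end
```

With `{ 'class' => 'btn primary', 'lang' => 'en-US' }`, the call `attribute_match?(attrs, 'class', '~=', 'primary')` is true while `attribute_match?(attrs, 'class', '~=', 'prim')` is false, and `attribute_match?(attrs, 'lang', '|=', 'en')` is true.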
##### Class selectors
Supported. See more details: [6.4. Class selectors](https://www.w3.org/TR/2018/REC-selectors-3-20181106/#class-html)
##### ID selectors
Supported. See more details: [6.5. ID selectors](https://www.w3.org/TR/2018/REC-selectors-3-20181106/#id-selectors)
##### Pseudo-classes
Partially supported. See the table below.
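
| Class name | Supported | Can support? |
| --- | --- | --- |
| `:link` | no | no |
| `:visited` | no | no |
| `:hover` | no | no |
| `:active` | no | no |
| `:focus` | no | no |
| `:target` | no | no |
| `:lang` | no | yes |
| `:enabled` | yes | yes |
| `:disabled` | yes | yes |
| `:checked` | yes | yes |
| `:root` | yes | yes |
| `:nth-child` | yes | yes |
| `:nth-last-child` | no | yes |
| `:nth-of-type` | no | yes |
| `:nth-last-of-type` | no | yes |
| `:first-child` | no | yes |
| `:last-child` | no | yes |
| `:first-of-type` | no | yes |
| `:last-of-type` | no | yes |
| `:only-child` | no | yes |
| `:only-of-type` | no | yes |
| `:empty` | no | yes |
| `:not` | yes | yes |
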
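
The argument of `:nth-child()` follows the `an+b` notation from Selectors Level 3. The sketch below shows, in plain Ruby, how such an argument can be parsed and tested against a 1-based child index. `parse_nth` and `nth_match?` are hypothetical helpers, not Gammo's parser, and the whitespace handling is more lenient than the specification's grammar.

```ruby
# Parses the argument of :nth-child() into [a, b] for the formula an+b.
def parse_nth(expr)
  text = expr.strip.downcase.delete(' ')
  return [2, 1] if text == 'odd'
  return [2, 0] if text == 'even'

  if (m = text.match(/\A([+-]?\d*)n([+-]\d+)?\z/))
    a = case m[1]
        when '', '+' then 1
        when '-' then -1
        else m[1].to_i
        end
    [a, m[2].to_i] # a missing "+b" part becomes 0
  elsif text.match?(/\A[+-]?\d+\z/)
    [0, text.to_i]
  else
    raise ArgumentError, "invalid nth expression: #{expr.inspect}"
  end
end

# True when the 1-based index equals a*n + b for some integer n >= 0.
def nth_match?(index, a, b)
  return index == b if a.zero?

  diff = index - b
  (diff % a).zero? && diff / a >= 0
end
```

For instance, `(1..6).select { |i| nth_match?(i, *parse_nth('2n+1')) }` returns `[1, 3, 5]`, and the same call with `'-n+3'` returns `[1, 2, 3]`, that is, the first three children.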
#### Combinators
See more details: [8. Combinators](https://www.w3.org/TR/2018/REC-selectors-3-20181106/#combinators)

| Syntax | Supported | Desc |
| --- | --- | --- |
| `h1 em` | yes | Descendant combinator |
| `h1 > em` | yes | Child combinator |
| `math + p` | yes | Next-sibling combinator |
| `h1 ~ pre` | yes | Subsequent-sibling combinator |

## Performance
As mentioned in the features at the beginning, Gammo does not prioritize performance.
It is therefore not suitable for very performance-sensitive applications (e.g. parsing with Gammo synchronously while handling an incoming request from an end user).
Instead, the goal is to work well with batch processing such as crawlers.
Gammo places the highest priority on making HTML easy to parse, without depending on native extensions or external gems.
## References
Gammo was developed with reference to the following software.
- [x/net/html](https://godoc.org/golang.org/x/net/html): I have been working on this package, which gave me a strong reason to build Gammo.
- [Blink](https://www.chromium.org/blink): Blink left a great impression on me regarding tree construction.
- [html5lib-tests](https://github.com/html5lib/html5lib-tests): Gammo relies on these tests.
## License
The gem is available as open source under the terms of the [MIT License](https://opensource.org/licenses/MIT).
## Release History
- v0.3.0
  - CSS selector support [#11](https://github.com/namusyaka/gammo/pull/11)
- v0.2.0
  - XPath 1.0 support [#4](https://github.com/namusyaka/gammo/pull/4)
- v0.1.0
  - Initial Release