# Crystal-HTML5
![CI](https://github.com/naqvis/crystal-html5/workflows/CI/badge.svg)
[![GitHub release](https://img.shields.io/github/release/naqvis/crystal-html5.svg)](https://github.com/naqvis/crystal-html5/releases)
[![Docs](https://img.shields.io/badge/docs-available-brightgreen.svg)](https://naqvis.github.io/crystal-html5/)

The Crystal-HTML5 shard is a **pure Crystal** implementation of an **HTML5-compliant** `Tokenizer` and `Parser`.
The relevant specifications are:
- [https://html.spec.whatwg.org/multipage/syntax.html](https://html.spec.whatwg.org/multipage/syntax.html)
- [https://html.spec.whatwg.org/multipage/syntax.html#tokenization](https://html.spec.whatwg.org/multipage/syntax.html#tokenization)

The shard also provides **CSS Selector support** by implementing the **W3C** [Selectors Level 3](http://www.w3.org/TR/css3-selectors/) specification.

Tokenization is done by creating a `Tokenizer` for an `IO`. It is the caller's
responsibility to ensure that the provided IO yields UTF-8 encoded HTML.
The tokenization algorithm implemented by this shard is not a line-by-line
transliteration of the relatively verbose state-machine in the **WHATWG**
specification. A more direct approach is used instead, where the program
counter implies the state, such as whether it is tokenizing a tag or a text
node. Specification compliance is verified by checking expected and actual
outputs over a test suite rather than aiming for algorithmic fidelity.
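
The idea that "the program counter implies the state" can be illustrated with a toy sketch. This is not the shard's tokenizer: `ToyKind`, `ToyToken` and `toy_tokenize` are hypothetical names, and the code only distinguishes plain text from simple tags. It keeps no state variable; whether it is reading text or a tag follows from which line of the loop is executing.

```crystal
# Hypothetical toy tokenizer, NOT the shard's implementation.
# It ignores comments, entities, quoted attribute values and raw-text elements.
enum ToyKind
  Text
  Tag
end

record ToyToken, kind : ToyKind, data : String

def toy_tokenize(io : IO) : Array(ToyToken)
  tokens = [] of ToyToken
  loop do
    # Reaching this line means "text state": read up to the next '<'.
    text = io.gets('<', chomp: true)
    break if text.nil? # end of input
    tokens << ToyToken.new(ToyKind::Text, text) unless text.empty?

    # Reaching this line means "tag state": read up to the closing '>'.
    tag = io.gets('>', chomp: true)
    break if tag.nil? # end of input
    tokens << ToyToken.new(ToyKind::Tag, tag)
  end
  tokens
end

toy_tokenize(IO::Memory.new("<p>Hi</p>")).each do |token|
  puts "#{token.kind}: #{token.data}"
end
# Tag: p
# Text: Hi
# Tag: /p
```

A table-driven state machine would instead store the current state in a variable and dispatch on it for every character; the direct style trades that literal correspondence with the specification for shorter code.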

Parsing is done by calling `HTML5.parse` with either a String containing HTML
or an IO instance. `HTML5.parse` returns the document root as an `HTML5::Node` instance.
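
As a minimal sketch of the two input forms, the snippet below parses the same markup once from a String and once from an in-memory IO. It assumes the shard is installed and uses only calls that appear elsewhere in this README (`HTML5.parse`, `first_child`, `render`); the variable names are illustrative.

```crystal
require "html5"

html = "<p>Hello, <b>World</b>!</p>"

# Parse from a String ...
doc_from_string = HTML5.parse(html)

# ... or from any IO, here an in-memory one.
doc_from_io = HTML5.parse(IO::Memory.new(html))

# Both calls return the document root; serialize one of them to STDOUT.
doc_from_string.render(STDOUT)
```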

Parsing a fragment is done by calling `HTML5.parse_fragment` with either a String containing a fragment of HTML5
or an IO instance. If the fragment is the inner HTML of an existing element, pass that element as the context.
`HTML5.parse_fragment` returns a list of the `HTML5::Node` instances that were found.

## Installation

1. Add the dependency to your `shard.yml`:

```yaml
dependencies:
  html5:
    github: naqvis/crystal-html5
```

2. Run `shards install`

## Usage

### Example 1: Process each anchor `<a>` node.
```crystal
require "html5"

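html = <<-HTML5
<html>
<head>
  <title>Hello,World!</title>
</head>
<body>
  <header>
    <h1>City Gallery</h1>
  </header>
  <nav>
    <ul>
      <li><a href="/London">London</a></li>
      <li><a href="/Paris">Paris</a></li>
      <li><a href="/Tokyo">Tokyo</a></li>
    </ul>
  </nav>
  <article>
    <h1>London</h1>
    <img alt="Mountain View">
    <p>London is the capital city of England. It is the most populous city in the United Kingdom, with a metropolitan area of over 13 million inhabitants.</p>
    <p>Standing on the River Thames, London has been a major settlement for two millennia, its history going back to its founding by the Romans, who named it Londinium.</p>
  </article>
  <footer>Copyright © W3Schools.com</footer>
</body>
</html>
HTML5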

# Recursively walk the tree and report every <a> element.
def process(node)
  if node.element? && node.data == "a"
    # Do something with node
    href = node["href"]?
    puts "#{node.first_child.try &.data} => #{href.try &.val}"

    # print all attributes
    node.attr.each do |a|
      # puts "#{a.key} = \"#{a.val}\""
    end
  end
  c = node.first_child
  while c
    process(c)
    c = c.next_sibling
  end
end

doc = HTML5.parse(html)
process(doc)

# Output
# London => /London
# Paris => /Paris
# Tokyo => /Tokyo
```

### Example 2: Parse an HTML document or a fragment of HTML
```crystal
require "html5"

def parse_html(html, context)
  if context.empty?
    doc = HTML5.parse(html)
  else
    # A context of the form "namespace name" carries an explicit namespace.
    namespace = ""
    if (i = context.index(' ')) && (i >= 0)
      namespace, context = context[...i], context[i + 1..]
    end
    cnode = HTML5::Node.new(
      type: HTML5::NodeType::Element,
      data_atom: HTML5::Atom.lookup(context.to_slice),
      data: context,
      namespace: namespace,
    )

    # Collect the parsed fragment nodes under a fresh document node.
    nodes = HTML5.parse_fragment(html, cnode)
    doc = HTML5::Node.new(type: HTML5::NodeType::Document)
    nodes.each do |n|
      doc.append_child(n)
    end
  end
  doc
end

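html = %(
<p>Links:</p>
<ul>
  <li><a href="foo">Foo</a></li>
  <li><a href="/bar/baz">BarBaz</a></li>
</ul>
)
doc = parse_html(html, "body")
process(doc) # `process` is the method defined in Example 1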

# Output
# Foo => foo
# BarBaz => /bar/baz
```

### Example 3: Render `HTML5::Node` to HTML
```crystal
require "html5"

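html = %(
<p>Links:</p>
<ul>
  <li><a href="foo">Foo</a></li>
  <li><a href="/bar/baz">BarBaz</a></li>
</ul>
)
doc = HTML5.parse(html)
doc.render(STDOUT)

# Output
# <html><head></head><body><p>Links:</p>
# <ul>
#   <li><a href="foo">Foo</a></li>
#   <li><a href="/bar/baz">BarBaz</a></li>
# </ul>
# </body></html>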
```

### Example 4: XPath Query
```crystal
require "html5"

html = %(
<p>Links:</p>
<ul>
  <li><a href="foo">Foo</a></li>
  <li><a href="/bar/baz">BarBaz</a></li>
</ul>
)
doc = HTML5.parse(html)

# Find all `a` elements.
list = doc.xpath_nodes("//a")

# Find all `a` elements that have an `href` attribute.
list = doc.xpath_nodes("//a[@href]")

# Find all `a` elements with an `href` attribute and only return the `href` value.
list = doc.xpath_nodes("//a/@href")
list.each { |a| pp a.inner_text }

# Find an `a` element that is the second `a` child of its parent.
a = doc.xpath("//a[2]")

# Count all `a` elements.
v = doc.xpath_float("count(//a)")
```

Refer to the specs for more sample usages, and to the [Crystal XPath2 Shard](https://github.com/naqvis/crystal-xpath2) for details of the functions and features supported by the XPath implementation.

### Example 5: CSS Selector
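```crystal
require "html5"

html = <<-HTML
<html>
  <body>
    <table id="t1"><tr><td>Hello</td></tr></table>
    <table id="t2">
      <tr><td>123</td><td>other</td></tr>
      <tr><td>foo</td><td>columns</td></tr>
      <tr><td>bar</td><td>are</td></tr>
      <tr><td>xyz</td><td>ignored</td></tr>
    </table>
  </body>
</html>
HTML

node = HTML5.parse(html)
p node.css("#t2 tr td:first-child").map(&.inner_text).to_a    # => ["123", "foo", "bar", "xyz"]
p node.css("#t2 tr td:first-child").map(&.to_html(true)).to_a # => ["<td>123</td>", "<td>foo</td>", "<td>bar</td>", "<td>xyz</td>"]

html = <<-HTML
<div>
  <h2 id="foo">a header</h2>
  <h2>another header</h2>
</div>
HTML
node = HTML5.parse(html)
p node.css("h2#foo").map(&.to_html(true)).to_a # => ["<h2 id=\"foo\">a header</h2>"]

html = <<-HTML
<html><body>
<div id="main1">
  <p id="p1">
  <p id="p2" class="jo">
  <p id="p3">
    <a href="some.html" id="a1">link1</a>
    <a href="some.png" id="a2">link2</a>
  <div id="bla">
    <p id="p4" class="jo">
    <p id="p5" class="bu">
    <p id="p6" class="jo">
  </div>
</div>
</body></html>
HTML

node = HTML5.parse(html)
# select all p nodes whose id contains "p"
p node.css("p[id*=p]").map(&.["id"].val).to_a # => ["p1", "p2", "p3", "p4", "p5", "p6"]

# select all nodes with class "jo"
p node.css("p.jo").map(&.["id"].val).to_a # => ["p2", "p4", "p6"]
p node.css(".jo").map(&.["id"].val).to_a  # => ["p2", "p4", "p6"]

# select a elements whose href ends with .png
p node.css(%q{a[href$=".png"]}).map(&.["id"].val).to_a # => ["a2"]

# find all a tags directly inside <p id="p3"> whose href contains "html"
p node.css(%q{p[id=p3] > a[href*="html"]}).map(&.["id"].val).to_a # => ["a1"]
```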
Refer to `spec/css` specs for more sample usages.

## Development

To run all tests:

```
crystal spec
```

## Contributing

1. Fork it (<https://github.com/naqvis/crystal-html5/fork>)
2. Create your feature branch (`git checkout -b my-new-feature`)
3. Commit your changes (`git commit -am 'Add some feature'`)
4. Push to the branch (`git push origin my-new-feature`)
5. Create a new Pull Request

## Contributors

- [Ali Naqvi](https://github.com/naqvis) - creator and maintainer