Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

https://github.com/kostya/myhtml

Fast HTML5 Parser with css selectors for Crystal language
https://github.com/kostya/myhtml

crystal fast html myhtml parser

Last synced: 28 days ago
JSON representation

Fast HTML5 Parser with css selectors for Crystal language

Awesome Lists containing this project

README

        

# MyHTML

[![Build Status](https://github.com/kostya/myhtml/actions/workflows/ci.yml/badge.svg)](https://github.com/kostya/myhtml/actions/workflows/ci.yml?query=branch%3Amaster+event%3Apush)

Fast HTML5 Parser (Crystal binding for awesome lexborisov's [myhtml](https://github.com/lexborisov/myhtml) and [Modest](https://github.com/lexborisov/Modest)). This shard used in production to parse millions of pages per day, very stable and fast.

## WARNING: original libraries (myhtml and Modest) not maintained since july 2020, i recommend switch to successor parser: [Lexbor](https://github.com/kostya/lexbor).

## Installation

Add this to your application's `shard.yml`:

```yaml
dependencies:
myhtml:
github: kostya/myhtml
```

And run `shards install`

## Usage example

```crystal
require "myhtml"

html = <<-HTML



O_o




HTML

myhtml = Myhtml::Parser.new(html)

myhtml.nodes(:div).each do |node|
id = node.attribute_by("id")

if first_link = node.scope.nodes(:a).first?
href = first_link.attribute_by("href")
link_text = first_link.inner_text

puts "div with id #{id} have link [#{link_text}](#{href})"
else
puts "div with id #{id} have no links"
end
end

# Output:
# div with id t1 have link [O_o](/#)
# div with id t2 have no links
```

## Css selectors example

```crystal
require "myhtml"

html = <<-HTML



Hello


123other
foocolumns
barare
xyzignored



HTML

myhtml = Myhtml::Parser.new(html)

p myhtml.css("#t2 tr td:first-child").map(&.inner_text).to_a
# => ["123", "foo", "bar", "xyz"]

p myhtml.css("#t2 tr td:first-child").map(&.to_html).to_a
# => ["123", "foo", "bar", "xyz"]
```

## More Examples

[examples](https://github.com/kostya/myhtml/tree/master/examples)

## Development Setup:

```shell
git clone https://github.com/kostya/myhtml.git
cd myhtml
make
crystal spec
```

## Benchmark

Parse 1000 times google page(600Kb), and 1000 times css select. [myhtml-program](https://github.com/kostya/myhtml/tree/master/bench/test-myhtml.cr), [crystagiri-program](https://github.com/kostya/myhtml/tree/master/bench/test-libxml.cr), [nokogiri-program](https://github.com/kostya/myhtml/tree/master/bench/test-libxml.rb)

| Lang | Shard | Lib | Parse time, s | Css time, s | Memory, MiB |
| -------- | ---------- | --------------- | ------------- | ----------- | ----------- |
| Crystal | lexbor | lexbor | 2.54 | 0.099 | 7.8 |
| Crystal | myhtml | myhtml(+modest) | 3.17 | 0.16 | 8.4 |
| Ruby 2.7 | Nokogiri | libxml2 | 9.19 | 10.76 | 139.8 |
| Crystal | Crystagiri | libxml2 | 11.27 | - | 25.0 |