Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/kostya/myhtml
Fast HTML5 Parser with css selectors for Crystal language
- Host: GitHub
- URL: https://github.com/kostya/myhtml
- Owner: kostya
- License: mit
- Created: 2016-03-18T22:46:08.000Z (over 8 years ago)
- Default Branch: master
- Last Pushed: 2022-10-19T17:15:27.000Z (about 2 years ago)
- Last Synced: 2024-10-25T01:22:22.645Z (about 2 months ago)
- Topics: crystal, fast, html, myhtml, parser
- Language: Crystal
- Homepage:
- Size: 441 KB
- Stars: 154
- Watchers: 7
- Forks: 12
- Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE
Awesome Lists containing this project
- awesome-crystal - myhtml - Fast HTML5 Parser that includes CSS selectors (HTML/XML Parsing)
- awesome-crystal - myhtml - Fast HTML5 Parser (HTML/XML Parsing)
README
# MyHTML
[![Build Status](https://github.com/kostya/myhtml/actions/workflows/ci.yml/badge.svg)](https://github.com/kostya/myhtml/actions/workflows/ci.yml?query=branch%3Amaster+event%3Apush)
Fast HTML5 Parser (Crystal binding for lexborisov's awesome [myhtml](https://github.com/lexborisov/myhtml) and [Modest](https://github.com/lexborisov/Modest)). This shard is used in production to parse millions of pages per day, and it is very stable and fast.
## WARNING: the original libraries (myhtml and Modest) have not been maintained since July 2020; I recommend switching to the successor parser: [Lexbor](https://github.com/kostya/lexbor).
## Installation
Add this to your application's `shard.yml`:
```yaml
dependencies:
  myhtml:
    github: kostya/myhtml
```

And run `shards install`
## Usage example
```crystal
require "myhtml"

# Sample markup consistent with the output below
html = <<-HTML
  <div id="t1"><a href="/#">O_o</a></div>
  <div id="t2"></div>
HTML

myhtml = Myhtml::Parser.new(html)

myhtml.nodes(:div).each do |node|
  id = node.attribute_by("id")

  if first_link = node.scope.nodes(:a).first?
    href = first_link.attribute_by("href")
    link_text = first_link.inner_text

    puts "div with id #{id} have link [#{link_text}](#{href})"
  else
    puts "div with id #{id} have no links"
  end
end

# Output:
#   div with id t1 have link [O_o](/#)
#   div with id t2 have no links
```

## CSS selectors example
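```crystal
require "myhtml"

html = <<-HTML
  <table id="t1"><tr><td>Hello</td></tr></table>
  <table id="t2">
    <tr><td>123</td><td>other</td></tr>
    <tr><td>foo</td><td>columns</td></tr>
    <tr><td>bar</td><td>are</td></tr>
    <tr><td>xyz</td><td>ignored</td></tr>
  </table>
HTML

myhtml = Myhtml::Parser.new(html)

p myhtml.css("#t2 tr td:first-child").map(&.inner_text).to_a
# => ["123", "foo", "bar", "xyz"]

p myhtml.css("#t2 tr td:first-child").map(&.to_html).to_a
# => ["<td>123</td>", "<td>foo</td>", "<td>bar</td>", "<td>xyz</td>"]
```

## More Examples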
[examples](https://github.com/kostya/myhtml/tree/master/examples)
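As one more small sketch, the snippet below collects the text and `href` of every link under a given element. It uses only calls that already appear in this README (`Myhtml::Parser.new`, `css`, `attribute_by`, `inner_text`). The markup and the `#menu` id are made up for illustration, and the code assumes `attribute_by` returns `nil` when the attribute is missing.

```crystal
require "myhtml"

# Hypothetical sample markup, not taken from the project.
html = <<-HTML
  <ul id="menu">
    <li><a href="/docs">Docs</a></li>
    <li><a href="/blog">Blog</a></li>
    <li><a>No target</a></li>
  </ul>
HTML

parser = Myhtml::Parser.new(html)

# Collect {text, href} pairs for every link inside #menu that has an href.
# Assumption: attribute_by returns nil for a missing attribute.
links = [] of {String, String}
parser.css("#menu a").each do |link|
  if href = link.attribute_by("href")
    links << {link.inner_text, href}
  end
end

links.each do |(text, href)|
  puts "#{text} -> #{href}"
end
```

Collecting into an array of tuples keeps the extraction step separate from whatever is done with the links afterwards (printing, queueing for a crawl, and so on).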
## Development Setup:
```shell
git clone https://github.com/kostya/myhtml.git
cd myhtml
make
crystal spec
```

## Benchmark
Parse the Google page (600 KB) 1000 times, and run a CSS select 1000 times. Programs: [myhtml-program](https://github.com/kostya/myhtml/tree/master/bench/test-myhtml.cr), [crystagiri-program](https://github.com/kostya/myhtml/tree/master/bench/test-libxml.cr), [nokogiri-program](https://github.com/kostya/myhtml/tree/master/bench/test-libxml.rb)
| Lang | Shard | Lib | Parse time, s | Css time, s | Memory, MiB |
| -------- | ---------- | --------------- | ------------- | ----------- | ----------- |
| Crystal | lexbor | lexbor | 2.54 | 0.099 | 7.8 |
| Crystal | myhtml | myhtml(+modest) | 3.17 | 0.16 | 8.4 |
| Ruby 2.7 | Nokogiri | libxml2 | 9.19 | 10.76 | 139.8 |
| Crystal | Crystagiri | libxml2 | 11.27 | - | 25.0 |
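The linked programs are the actual benchmark code. As a rough illustration of how such a measurement can be structured, here is a minimal sketch that times repeated parsing and repeated CSS selection with `Time.measure` from the Crystal standard library. The synthetic table, the `#t` id, and the variable names are assumptions made for this example; it does not use the 600 KB page, and its timings are not comparable to the table above.

```crystal
require "myhtml"

# Synthetic input standing in for the page used in the real benchmark.
rows = (1..2000).map { |i| "<tr><td>#{i}</td><td>row #{i}</td></tr>" }.join
html = "<html><body><table id=\"t\">#{rows}</table></body></html>"

iterations = 1000

# Time repeated parsing of the same document.
parse_time = Time.measure do
  iterations.times { Myhtml::Parser.new(html) }
end

# Time repeated CSS selection against one already-parsed document.
parser = Myhtml::Parser.new(html)
matched = 0
css_time = Time.measure do
  iterations.times do
    # to_a forces the selection to run to completion on every pass.
    matched = parser.css("#t tr td:first-child").to_a.size
  end
end

puts "parse: #{parse_time.total_seconds.round(2)} s"
puts "css:   #{css_time.total_seconds.round(3)} s (#{matched} nodes per pass)"
```

Parsing and selection are timed separately, mirroring the separate "Parse time" and "Css time" columns.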