https://github.com/html-extract/hext
Domain-specific language for extracting structured data from HTML documents
cpp data-extraction dsl html html-extraction node php python ruby scraping
Last synced: 18 days ago
- Host: GitHub
- URL: https://github.com/html-extract/hext
- Owner: html-extract
- License: apache-2.0
- Created: 2016-03-03T01:18:42.000Z (over 8 years ago)
- Default Branch: master
- Last Pushed: 2024-04-25T11:12:42.000Z (7 months ago)
- Last Synced: 2024-04-25T16:07:01.826Z (7 months ago)
- Topics: cpp, data-extraction, dsl, html, html-extraction, node, php, python, ruby, scraping
- Language: C++
- Homepage: https://hext.thomastrapp.com
- Size: 2.06 MB
- Stars: 51
- Watchers: 4
- Forks: 3
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
# Hext — Extract Data from HTML
[![PyPI Version](https://img.shields.io/pypi/v/hext.svg?color=blue)](https://pypi.org/project/hext/) [![npm version](https://img.shields.io/npm/v/hext.svg)](https://www.npmjs.com/package/hext)
Hext is a domain-specific language for extracting structured data from HTML documents.
Hext is written in C++ but language bindings are available for [Python](https://hext.thomastrapp.com/download#hext-for-python), [Node](https://hext.thomastrapp.com/download#hext-for-node), [JavaScript](https://hext.thomastrapp.com/download#hext-for-javascript), [Ruby](https://hext.thomastrapp.com/download#hext-for-ruby) and [PHP](https://hext.thomastrapp.com/download#hext-for-php).
![Hext Logo](https://raw.githubusercontent.com/html-extract/html-extract.github.io/master/hext-logo-x100.png)
See https://hext.thomastrapp.com for
[documentation](https://hext.thomastrapp.com/documentation),
[installation instructions](https://hext.thomastrapp.com/download) and a live demo.

The Hext project is released under the terms of the Apache License v2.0.
## Example
Suppose you want to extract all hyperlinks from a web page. A hyperlink is an
anchor element `<a>` with an `href` attribute and text that visitors can
click. The following Hext template produces a dictionary for every matched
element. Each dictionary contains the keys `link` and `title`, which refer
to the `href` attribute and the text content of the matched `<a>`.

    # Extract links and their text
    <a href:link @text:title />

[» Load example in editor](https://hext.thomastrapp.com/#attribute)
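
To make the shape of that result concrete, here is a minimal sketch in plain Python. It does not use Hext at all: it relies only on the standard library's `html.parser`, and the names `LinkCollector` and `extract_links` are hypothetical. It assumes that an `<a>` without an `href` produces no entry, and trimming the title is a choice of this sketch rather than documented Hext behavior.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects one {"link", "title"} dict per <a> element that has an href."""

    def __init__(self):
        super().__init__()
        self.results = []
        self._current = None  # dict being filled while inside an <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self._current = {"link": href, "title": ""}

    def handle_data(self, data):
        # Text inside the current <a> becomes part of its title.
        if self._current is not None:
            self._current["title"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            # Trimming whitespace is a choice made for this sketch.
            self._current["title"] = self._current["title"].strip()
            self.results.append(self._current)
            self._current = None


def extract_links(html):
    """Return a list of dicts with the keys "link" and "title"."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.results


sample = '<p><a href="/docs">Documentation</a> and <a href="/get">Download</a></p>'
print(extract_links(sample))
# [{'link': '/docs', 'title': 'Documentation'}, {'link': '/get', 'title': 'Download'}]
```

The sketch needs about thirty lines of event handling for what the description above expresses as a single template. Nested anchors are not handled here.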
Visit [Hext's project page](https://hext.thomastrapp.com) to learn more about
Hext. For examples that use the libhext C++ library check out `/libhext/examples`
and
[libhext's C++ library overview](https://hext.thomastrapp.com/libhext-overview).

## Components of this Project
* `htmlext`: Command line utility that applies Hext templates to an HTML document
and produces JSON.
* `libhext`: C++ library that contains a Hext parser but also allows for
customization.
* `libhext-test`: Unit tests for libhext.
* `Hext bindings`: Bindings for scripting languages. There are extensions for
Node.js, Python, Ruby and PHP that are able to parse Hext and extract values
from HTML.

## Project layout

    ├── build             Build directory for htmlext
    ├── cmake             CMake modules used by the project
    ├── htmlext           Source for the htmlext command line tool
    ├── libhext           The libhext project
    │   ├── bindings      Hext bindings for scripting languages
    │   ├── build         Build directory for libhext
    │   ├── doc           Doxygen documentation for libhext
    │   ├── examples      Examples making use of libhext
    │   ├── include       Public libhext API
    │   ├── ragel         Ragel input files
    │   ├── scripts       Helper scripts for libhext
    │   ├── src           libhext implementation files
    │   └── test          The libhext-test project
    │       ├── build     Build directory for libhext-test
    │       └── src       Source for libhext-test
    ├── man               Htmlext man page
    ├── scripts           Scripts for building and testing releases
    ├── syntaxhl          Syntax highlighters for Vim and ACE
    └── test              Blackbox tests for htmlext

## Dependencies for development
* [Ragel](http://www.colm.net/open-source/ragel/) generates the state machine
that is used to parse Hext
* The unit tests for libhext are written with
[Google Test](https://github.com/google/googletest)
* libhext's public API documentation is generated by
[Doxygen](http://www.stack.nl/~dimitri/doxygen/)
* libhext's scripting language bindings are generated by
[Swig](http://www.swig.org/)

## Tests
There are unit tests for libhext and blackbox tests for Hext as a language,
whose main purpose is to detect unwanted changes in syntax or behavior.
The libhext-test project is located in `/libhext/test` and depends on Google
Test. Nothing fancy: just build the project and run the `libhext-test`
executable. The
[Google Test primer](https://github.com/google/googletest/blob/master/googletest/docs/Primer.md)
describes how to write test cases.
The blackbox tests are located in `/test`. There you'll find a shell script
called `blackbox.sh`. This script applies Hext templates to HTML documents and
compares the result to a third file that contains the expected output. For
example, there is a test case `icase-quoted-regex` that consists of three files:
`icase-quoted-regex.hext`, `icase-quoted-regex.html`, and
`icase-quoted-regex.expected`. To run this test case you would do the following:

    $ ./blackbox.sh case/icase-quoted-regex.hext

`blackbox.sh` will then look for the corresponding `.html` and `.expected` files
of the same name in the directory of `icase-quoted-regex.hext`. Then it will
invoke `htmlext` with the given Hext template and HTML document and compare the
result to `icase-quoted-regex.expected`. To run all blackbox tests in
succession:

    $ ./blackbox.sh case/*.hext

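
The lookup-and-compare procedure described above can be sketched in Python. This is not a port of `blackbox.sh`: `run_case`, `run_all` and `demo` are hypothetical names, the convention of appending the template path and the HTML path to the command is an assumption, and so is the exact string comparison against the `.expected` file. The demo uses a stand-in command instead of `htmlext` so that it runs anywhere.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_case(command, hext_path):
    """Run one case; True if the command's output equals the .expected file."""
    hext_path = Path(hext_path)
    # Sibling files share the template's name and directory.
    html_path = hext_path.with_suffix(".html")
    expected_path = hext_path.with_suffix(".expected")
    completed = subprocess.run(
        # Assumed calling convention: template path, then HTML path.
        [*command, str(hext_path), str(html_path)],
        capture_output=True,
        text=True,
        check=False,
    )
    return completed.stdout == expected_path.read_text()


def run_all(command, case_dir):
    """Run every *.hext case in a directory; map case name to pass/fail."""
    return {
        path.stem: run_case(command, path)
        for path in sorted(Path(case_dir).glob("*.hext"))
    }


def demo():
    # Stand-in for htmlext: ignores the template and upper-cases the HTML.
    fake = [
        sys.executable,
        "-c",
        "import sys; sys.stdout.write(open(sys.argv[2]).read().upper())",
    ]
    with tempfile.TemporaryDirectory() as tmp:
        case_dir = Path(tmp)
        for name, expected in (("good", "<B>HI</B>"), ("bad", "unexpected")):
            (case_dir / f"{name}.hext").write_text("# placeholder template\n")
            (case_dir / f"{name}.html").write_text("<b>hi</b>")
            (case_dir / f"{name}.expected").write_text(expected)
        return run_all(fake, case_dir)


print(demo())
# {'bad': False, 'good': True}
```

Deriving the `.html` and `.expected` paths from the template path keeps each test case addressable by a single file name, which is what makes the `case/*.hext` glob sufficient to run the whole suite.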
By default, `blackbox.sh` looks for the `htmlext` binary in `$PATH`. Failing
that, it looks for the binary in the default build directory. You can tell
`blackbox.sh` which command to use by setting the `HTMLEXT` environment
variable. For example, to run all tests through valgrind you'd run the following:

    $ HTMLEXT="valgrind -q ../build/htmlext" ./blackbox.sh case/*.hext

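
As a rough illustration of that lookup order, the following Python sketch resolves the command in three steps. The function name `resolve_htmlext` is hypothetical, the precedence of `HTMLEXT` over `$PATH` is my reading of the paragraph above, and `../build` as the default build directory is inferred from the valgrind example rather than taken from the script itself.

```python
import os
import shlex
import shutil
from pathlib import Path


def resolve_htmlext(env=None, build_dir="../build"):
    """Return the htmlext command as an argument list."""
    env = os.environ if env is None else env

    # 1. An explicit override may hold a whole command line,
    #    e.g. "valgrind -q ../build/htmlext".
    override = env.get("HTMLEXT")
    if override:
        return shlex.split(override)

    # 2. Otherwise search the directories listed in PATH.
    on_path = shutil.which("htmlext", path=env.get("PATH"))
    if on_path:
        return [on_path]

    # 3. Finally fall back to the (assumed) default build directory.
    candidate = Path(build_dir) / "htmlext"
    if candidate.is_file():
        return [str(candidate)]

    raise FileNotFoundError("htmlext not found; set HTMLEXT to a command")


print(resolve_htmlext({"HTMLEXT": "valgrind -q ../build/htmlext"}))
# ['valgrind', '-q', '../build/htmlext']
```

Splitting the override with `shlex.split` is what allows a wrapper such as valgrind and its flags to be passed through a single variable.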
## Acknowledgements
* [Gumbo](https://github.com/google/gumbo-parser)
— **An HTML5 parsing library in pure C99**
Gumbo is used as the HTML parser behind `hext::Html`. It's fast, easy to
integrate and even fixes invalid HTML.
* [Ragel](http://www.colm.net/open-source/ragel/)
— **Ragel State Machine Compiler**
The state machine that is used to parse Hext templates is generated by Ragel.
You can find the definition of this machine in `/libhext/ragel/hext-machine.rl`.
* [RapidJSON](http://rapidjson.org/)
— **A fast JSON parser/generator for C++**
RapidJSON powers the JSON output of the `htmlext` command line utility.
* [jq](https://stedolan.github.io/jq/)
— **A lightweight and flexible command-line JSON processor**
An indispensable tool when dealing with JSON in the shell.
Piping the output of `htmlext` into `jq` lets you do all sorts of crazy things.
* [Ace](https://ace.c9.io/) — **A Code Editor for the Web**
Used as the code editor in the
"[Try Hext in your Browser!](https://hext.thomastrapp.com)" section and as a
highlighter for all code examples. The highlighting rules for Hext are
included in this project in `/syntaxhl/ace`. Also, there's a script in
`/libhext/scripts/syntax-hl-ace` that uses Ace to transform a code template
into highlighted HTML.
* [Boost.Beast](https://github.com/boostorg/beast)
— **HTTP and WebSocket built on Boost.Asio in C++11**
The WebSocket server behind the "[Try Hext in your Browser!](https://hext.thomastrapp.com)"
section is built with Beast. See [github.com/html-extract/hext-on-websockets](https://github.com/html-extract/hext-on-websockets) for more.