https://github.com/html-extract/hext

# Hext — Extract Data from HTML

[![PyPI Version](https://img.shields.io/pypi/v/hext.svg?color=blue)](https://pypi.org/project/hext/) [![npm version](https://img.shields.io/npm/v/hext.svg)](https://www.npmjs.com/package/hext)

Hext is a domain-specific language for extracting structured data from HTML documents.

Hext is written in C++ but language bindings are available for [Python](https://hext.thomastrapp.com/download#hext-for-python), [Node](https://hext.thomastrapp.com/download#hext-for-node), [JavaScript](https://hext.thomastrapp.com/download#hext-for-javascript), [Ruby](https://hext.thomastrapp.com/download#hext-for-ruby) and [PHP](https://hext.thomastrapp.com/download#hext-for-php).

![Hext Logo](https://raw.githubusercontent.com/html-extract/html-extract.github.io/master/hext-logo-x100.png)

See https://hext.thomastrapp.com for
[documentation](https://hext.thomastrapp.com/documentation),
[installation instructions](https://hext.thomastrapp.com/download) and a live demo.

The Hext project is released under the terms of the Apache License v2.0.

## Example
Suppose you want to extract all hyperlinks from a web page. Hyperlinks have an
anchor tag `<a>`, an attribute called `href` and a text that visitors can
click. The following Hext template will produce a dictionary for every matched
element. Each dictionary will contain the keys `link` and `title`, which refer
to the `href` attribute and the text content of the matched `<a>`.

    # Extract links and their text

[» Load example in editor](https://hext.thomastrapp.com/#attribute)
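
To make the shape of the result concrete, here is a minimal sketch in plain Python that produces the same kind of output: one dictionary per matched `<a>`, with the keys `link` and `title`. It does not use Hext at all; it relies only on the standard library's `html.parser`, and the names `LinkCollector` and `extract_links` are made up for this illustration. The Hext template it imitates is assumed to be along the lines of `<a href:link @text:title />`. Details such as whitespace handling may differ from what Hext actually does.

```python
from html.parser import HTMLParser
from typing import Dict, List, Optional


class LinkCollector(HTMLParser):
    """Build one dict per <a href=...> element, imitating the described output."""

    def __init__(self) -> None:
        super().__init__()
        self.results: List[Dict[str, str]] = []
        self._current: Optional[Dict[str, str]] = None  # the <a> being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            # Assumption: anchors without an href value do not match the template.
            self._current = None if href is None else {"link": href, "title": ""}

    def handle_data(self, data):
        # Text inside nested tags (e.g. <b>) is appended to the open anchor.
        if self._current is not None:
            self._current["title"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            self.results.append(self._current)
            self._current = None


def extract_links(html: str) -> List[Dict[str, str]]:
    """Return a list of {"link": ..., "title": ...} dicts in document order."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.results
```

For example, `extract_links('<a href="/docs">Docs</a>')` returns `[{'link': '/docs', 'title': 'Docs'}]`.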

Visit [Hext's project page](https://hext.thomastrapp.com) to learn more about
Hext. For examples that use the libhext C++ library check out `/libhext/examples`
and
[libhext's C++ library overview](https://hext.thomastrapp.com/libhext-overview).

## Components of this Project
* `htmlext`: Command line utility that applies Hext templates to an HTML document
and produces JSON.
* `libhext`: C++ library that contains a Hext parser but also allows for
customization.
* `libhext-test`: Unit tests for libhext.
* `Hext bindings`: Bindings for scripting languages. There are extensions for
Node.js, Python, Ruby and PHP that are able to parse Hext and extract values
from HTML.

## Project layout
    ├── build             Build directory for htmlext
    ├── cmake             CMake modules used by the project
    ├── htmlext           Source for the htmlext command line tool
    ├── libhext           The libhext project
    │   ├── bindings      Hext bindings for scripting languages
    │   ├── build         Build directory for libhext
    │   ├── doc           Doxygen documentation for libhext
    │   ├── examples      Examples making use of libhext
    │   ├── include       Public libhext API
    │   ├── ragel         Ragel input files
    │   ├── scripts       Helper scripts for libhext
    │   ├── src           libhext implementation files
    │   └── test          The libhext-test project
    │       ├── build     Build directory for libhext-test
    │       └── src       Source for libhext-test
    ├── man               Htmlext man page
    ├── scripts           Scripts for building and testing releases
    ├── syntaxhl          Syntax highlighters for Vim and ACE
    └── test              Blackbox tests for htmlext

## Dependencies for development
* [Ragel](http://www.colm.net/open-source/ragel/) generates the state machine
that is used to parse Hext
* The unit tests for libhext are written with
[Google Test](https://github.com/google/googletest)
* libhext's public API documentation is generated by
[Doxygen](http://www.stack.nl/~dimitri/doxygen/)
* libhext's scripting language bindings are generated by
[Swig](http://www.swig.org/)

## Tests
There are unit tests for libhext and blackbox tests for Hext as a language,
whose main purpose is to detect unwanted changes in syntax or behavior.

The libhext-test project is located in `/libhext/test` and depends on Google
Test. Nothing fancy, just build the project and run the executable
`libhext-test`. How to write test cases with Google Test is described in the
[Google Test Primer](https://github.com/google/googletest/blob/master/googletest/docs/Primer.md).

The blackbox tests are located in `/test`. There you'll find a shell script
called `blackbox.sh`. This script applies Hext templates to HTML documents and
compares the result to a third file that contains the expected output. For
example, there is a test case `icase-quoted-regex` that consists of three files:
`icase-quoted-regex.hext`, `icase-quoted-regex.html`, and
`icase-quoted-regex.expected`. To run this test case you would do the following:

    $ ./blackbox.sh case/icase-quoted-regex.hext

`blackbox.sh` will then look for the corresponding `.html` and `.expected` files
of the same name in the directory of `icase-quoted-regex.hext`. Then it will
invoke `htmlext` with the given Hext template and HTML document and compare the
result to `icase-quoted-regex.expected`. To run all blackbox tests in
succession:

    $ ./blackbox.sh case/*.hext
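
The per-case logic described above can be sketched in a few lines of Python. This is not the actual `blackbox.sh`; it only mirrors the described steps: derive the `.html` and `.expected` files from the `.hext` path, run the extraction, and compare. The `extract` callable is a hypothetical stand-in for however `htmlext` gets invoked, and the exact string comparison is an assumption about how results are compared.

```python
from pathlib import Path
from typing import Callable, Tuple


def companion_files(hext_path: Path) -> Tuple[Path, Path]:
    """Return the .html and .expected files that sit next to a .hext template."""
    return hext_path.with_suffix(".html"), hext_path.with_suffix(".expected")


def run_case(hext_path: Path, extract: Callable[[Path, Path], str]) -> bool:
    """Run one blackbox case and report whether the output matches.

    `extract` is a hypothetical callable that takes the template path and the
    HTML path and returns the produced output as a string.
    """
    html_path, expected_path = companion_files(hext_path)
    actual = extract(hext_path, html_path)
    expected = expected_path.read_text(encoding="utf-8")
    # Assumption: a case passes only if the output is identical to the file.
    return actual == expected
```

Running every case in a directory is then a loop over `Path("case").glob("*.hext")` that calls `run_case` for each template.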

By default `blackbox.sh` will look for the `htmlext` binary in `$PATH`. Failing
that, it looks for the binary in the default build directory. You can tell
`blackbox.sh` which command to use by setting the `HTMLEXT` environment
variable. For example, to run all tests through valgrind you'd run the
following:

    $ HTMLEXT="valgrind -q ../build/htmlext" ./blackbox.sh case/*.hext
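
The lookup order can be sketched roughly as follows. Again, this is an illustration in Python rather than the script's real code. The function name `resolve_htmlext` is hypothetical, the fallback path `../build/htmlext` is an assumption taken from the valgrind example, and so is the idea that `HTMLEXT` is split into words like a shell command.

```python
import os
import shlex
import shutil
from typing import List, Mapping, Optional

# Assumption: the default build directory, relative to /test.
DEFAULT_BUILD_BINARY = "../build/htmlext"


def resolve_htmlext(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return the command (as an argument list) used to run htmlext."""
    env = os.environ if env is None else env

    # 1. An explicit HTMLEXT setting wins and may contain a wrapper command.
    override = env.get("HTMLEXT")
    if override:
        return shlex.split(override)

    # 2. Otherwise look for an htmlext binary in PATH.
    found = shutil.which("htmlext", path=env.get("PATH"))
    if found:
        return [found]

    # 3. Fall back to the default build directory.
    return [DEFAULT_BUILD_BINARY]
```

With `HTMLEXT="valgrind -q ../build/htmlext"`, this yields `["valgrind", "-q", "../build/htmlext"]`, so the wrapper and its flags are kept as separate arguments.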

## Acknowledgements
* [Gumbo](https://github.com/google/gumbo-parser)
— **An HTML5 parsing library in pure C99**
Gumbo is used as the HTML parser behind `hext::Html`. It's fast, easy to
integrate and even fixes invalid HTML.
* [Ragel](http://www.colm.net/open-source/ragel/)
— **Ragel State Machine Compiler**
The state machine that is used to parse Hext templates is generated by Ragel.
You can find the definition of this machine in `/libhext/ragel/hext-machine.rl`.
* [RapidJSON](http://rapidjson.org/)
— **A fast JSON parser/generator for C++**
RapidJSON powers the JSON output of the `htmlext` command line utility.
* [jq](https://stedolan.github.io/jq/)
— **A lightweight and flexible command-line JSON processor**
An indispensable tool when dealing with JSON in the shell.
Piping the output of `htmlext` into `jq` lets you do all sorts of crazy things.
* [Ace](https://ace.c9.io/) — **A Code Editor for the Web**
Used as the code editor in the
"[Try Hext in your Browser!](https://hext.thomastrapp.com)" section and as a
highlighter for all code examples. The highlighting rules for Hext are
included in this project in `/syntaxhl/ace`. Also, there's a script in
`/libhext/scripts/syntax-hl-ace` that uses Ace to transform a code template
into highlighted HTML.
* [Boost.Beast](https://github.com/boostorg/beast)
— **HTTP and WebSocket built on Boost.Asio in C++11**
The WebSocket server behind the "[Try Hext in your Browser!](https://hext.thomastrapp.com)"
section is built with Beast. See [github.com/html-extract/hext-on-websockets](https://github.com/html-extract/hext-on-websockets) for more.