https://github.com/html-extract/hext.js
Use Hext in a browser or with node. Hext is a domain-specific language for extracting structured data from HTML documents.
javascript nodejs wasm webassembly
- Host: GitHub
- URL: https://github.com/html-extract/hext.js
- Owner: html-extract
- License: apache-2.0
- Created: 2019-01-25T01:39:10.000Z (over 7 years ago)
- Default Branch: master
- Last Pushed: 2024-04-25T12:31:32.000Z (about 2 years ago)
- Last Synced: 2024-04-26T12:38:21.962Z (about 2 years ago)
- Topics: javascript, nodejs, wasm, webassembly
- Language: C++
- Homepage: https://hext.thomastrapp.com
- Size: 3.2 MB
- Stars: 5
- Watchers: 3
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# Hext.js - Hext for JavaScript
[Hext](https://github.com/html-extract/hext) is a domain-specific language for extracting structured data from HTML documents.
Visit [Hext's documentation](https://hext.thomastrapp.com/) to learn more about Hext.
Hext.js is a JavaScript/WebAssembly module that can be used in a browser.
```html
<!-- The src path is an assumption; point it at your copy of dist/hext.js. -->
<script src="dist/hext.js"></script>
<script>
(function() {
  // loadHext() returns a promise
  loadHext().then(hext => {
    // hext.Html's constructor expects a single argument
    // containing a UTF-8 encoded string of HTML.
    const html = new hext.Html(
      '<a href="one.html">  <img src="one.jpg" />  </a>' +
      '<a href="two.html">  <img src="two.jpg" />  </a>' +
      '<a href="three.html"><img src="three.jpg" /></a>');

    // hext.Rule's constructor expects a single argument
    // containing a Hext snippet.
    // Throws an Error on invalid syntax, with
    // Error.message containing the error description.
    const rule = new hext.Rule('<a href:link>' +
                               '  <img src:image />' +
                               '</a>');

    // hext.Rule.extract expects an argument of type
    // hext.Html. Returns an Array containing Objects
    // which contain key-value pairs of type String.
    const result = rule.extract(html);

    // hext.Rule.extract has a second, optional parameter
    // of type unsigned int, called max_searches.
    // The search for matching elements is aborted by
    // throwing an exception after this limit is reached.
    // The default is 0, which never aborts. If running
    // untrusted Hext templates, it is recommended to set
    // max_searches to some high value, like 10000, to
    // protect against resource exhaustion.
    // const result = rule.extract(html, 10000);

    // print each key-value pair, with a blank line between matches
    for (const row of result) {
      for (const key in row)
        console.log(key, "->", row[key]);
      console.log();
    }
  });
})();
</script>
```
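The comments above describe two failure modes: the `hext.Rule` constructor throws on invalid syntax, and `extract` throws once `max_searches` is exceeded. A minimal sketch of handling both in one place, which may be useful when running untrusted templates, could look like the following. The helper name `extractSafely` and its return shape are hypothetical and not part of Hext.js; only `hext.Html`, `hext.Rule`, and `rule.extract(html, max_searches)` are taken from the example above.

```javascript
// Hypothetical helper, not part of Hext.js. `hext` is the module object
// that loadHext() resolves to.
function extractSafely(hext, htmlString, ruleString, maxSearches = 10000) {
  // Thrown values are not guaranteed to be Error objects, so normalize them.
  const describe = (err) =>
    err && err.message ? String(err.message) : String(err);

  let rule;
  try {
    // The Rule constructor throws an Error on invalid Hext syntax.
    rule = new hext.Rule(ruleString);
  } catch (err) {
    return { ok: false, stage: "parse", message: describe(err), rows: [] };
  }

  try {
    const html = new hext.Html(htmlString);
    // extract() throws once more than maxSearches searches were needed.
    const rows = rule.extract(html, maxSearches);
    return { ok: true, stage: "done", message: "", rows };
  } catch (err) {
    return { ok: false, stage: "extract", message: describe(err), rows: [] };
  }
}
```

Inside the `loadHext().then(hext => { ... })` callback, a call such as `extractSafely(hext, pageSource, userTemplate)` would then return either the extracted rows or a description of which step failed, instead of letting the exception propagate.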
The current development version is found in [dist/hext.js](./dist/hext.js).
Hext.js also works in Node ([example](./htmlext.wasm.js)). If performance is important, you may prefer [Hext for Node](https://hext.thomastrapp.com/download), a native Node addon for Linux and macOS. For other language bindings, visit [hext.thomastrapp.com/download](https://hext.thomastrapp.com/download).
# Building Hext.js from source
[Hext](https://github.com/html-extract/hext) is written in C++. This repo contains a full build process for compiling Hext and all its dependencies to JavaScript/WebAssembly.
In order to build this, you need [Emscripten](https://emscripten.org/docs/getting_started/downloads.html) and the following packages:
`wget git python3 build-essential libxml2 libtool autoconf rapidjson-dev cmake`.
Compilation is then done with a single command:

    make

This will download and build all of Hext's dependencies. Then it will build libhext itself and a Hext wrapper which compiles to a JavaScript/WebAssembly module for use in browsers.
## Testing
Running `make test` will run libhext's [blackbox tests](https://github.com/html-extract/hext/tree/master/test/case) through [htmlext.wasm.js](./htmlext.wasm.js), which uses node.
To test the latest version of hext.js in your browser, visit [hext.thomastrapp.com/hext.js-test/](https://hext.thomastrapp.com/hext.js-test/).