{"id":13435522,"url":"https://github.com/html-extract/hext","last_synced_at":"2025-04-15T00:35:46.813Z","repository":{"id":38533826,"uuid":"53009532","full_name":"html-extract/hext","owner":"html-extract","description":"Domain-specific language for extracting structured data from HTML documents","archived":false,"fork":false,"pushed_at":"2025-03-27T13:53:57.000Z","size":2223,"stargazers_count":52,"open_issues_count":5,"forks_count":3,"subscribers_count":3,"default_branch":"master","last_synced_at":"2025-04-12T09:05:29.342Z","etag":null,"topics":["cpp","data-extraction","dsl","html","html-extraction","node","php","python","ruby","scraping"],"latest_commit_sha":null,"homepage":"https://hext.thomastrapp.com","language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/html-extract.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2016-03-03T01:18:42.000Z","updated_at":"2025-03-27T13:54:02.000Z","dependencies_parsed_at":"2024-01-16T01:25:56.716Z","dependency_job_id":"efafeaf4-da64-4a84-a027-d995d5ceffe9","html_url":"https://github.com/html-extract/hext","commit_stats":null,"previous_names":[],"tags_count":25,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/html-extract%2Fhext","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/html-extract%2Fhext/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/html-extract%2Fhext/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/html-ex
tract%2Fhext/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/html-extract","download_url":"https://codeload.github.com/html-extract/hext/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248985996,"owners_count":21194020,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cpp","data-extraction","dsl","html","html-extraction","node","php","python","ruby","scraping"],"created_at":"2024-07-31T03:00:36.509Z","updated_at":"2025-04-15T00:35:46.789Z","avatar_url":"https://github.com/html-extract.png","language":"C++","readme":"# Hext — Extract Data from HTML\n\n[![PyPI Version](https://img.shields.io/pypi/v/hext.svg?color=blue)](https://pypi.org/project/hext/) [![npm version](https://img.shields.io/npm/v/hext.svg)](https://www.npmjs.com/package/hext)\n\nHext is a domain-specific language for extracting structured data from HTML documents.\n\nHext is written in C++ but language bindings are available for [Python](https://hext.thomastrapp.com/download#hext-for-python), [Node](https://hext.thomastrapp.com/download#hext-for-node), [JavaScript](https://hext.thomastrapp.com/download#hext-for-javascript), [Ruby](https://hext.thomastrapp.com/download#hext-for-ruby) and [PHP](https://hext.thomastrapp.com/download#hext-for-php).\n\n![Hext Logo](https://raw.githubusercontent.com/html-extract/html-extract.github.io/master/hext-logo-x100.png)\n\nSee https://hext.thomastrapp.com for\n[documentation](https://hext.thomastrapp.com/documentation), \n[installation 
instructions](https://hext.thomastrapp.com/download) and a live demo.\n\nThe Hext project is released under the terms of the Apache License v2.0.\n\n## Example\nSuppose you want to extract all hyperlinks from a web page. Hyperlinks have an\nanchor tag \u0026lt;a\u0026gt;, an attribute called href and a text that visitors can\nclick. The following Hext template will produce a dictionary for every matched\nelement. Each dictionary will contain the keys `link` and `title` which refer\nto the href attribute and the text content of the matched \u0026lt;a\u0026gt;.\n\n    # Extract links and their text\n    \u003ca href:link @text:title /\u003e\n\n[\u0026raquo; Load example in editor](https://hext.thomastrapp.com/#attribute)\n\nVisit [Hext's project page](https://hext.thomastrapp.com) to learn more about\nHext. For examples that use the libhext C++ library check out `/libhext/examples`\nand\n[libhext's C++ library overview](https://hext.thomastrapp.com/libhext-overview).\n\n\n## Components of this Project\n* `htmlext`: Command line utility that applies Hext templates to an HTML document\n  and produces JSON.\n* `libhext`: C++ library that contains a Hext parser but also allows for\n  customization.\n* `libhext-test`: Unit tests for libhext.\n* `Hext bindings`: Bindings for scripting languages. 
There are extensions for\n  Node.js, Python, Ruby and PHP that are able to parse Hext and extract values\n  from HTML.\n\n## Project layout\n    ├── build             Build directory for htmlext\n    ├── cmake             CMake modules used by the project\n    ├── htmlext           Source for the htmlext command line tool\n    ├── libhext           The libhext project\n    │   ├── bindings      Hext bindings for scripting languages\n    │   ├── build         Build directory for libhext\n    │   ├── doc           Doxygen documentation for libhext\n    │   ├── examples      Examples making use of libhext\n    │   ├── include       Public libhext API\n    │   ├── ragel         Ragel input files\n    │   ├── scripts       Helper scripts for libhext\n    │   ├── src           libhext implementation files\n    │   └── test          The libhext-test project\n    │       ├── build     Build directory for libhext-test\n    │       └── src       Source for libhext-test\n    ├── man               Htmlext man page\n    ├── scripts           Scripts for building and testing releases\n    ├── syntaxhl          Syntax highlighters for Vim and ACE\n    └── test              Blackbox tests for htmlext\n\n## Dependencies for development\n* [Ragel](http://www.colm.net/open-source/ragel/) generates the state machine\n  that is used to parse Hext\n* The unit tests for libhext are written with\n  [Google Test](https://github.com/google/googletest)\n* libhext's public API documentation is generated by\n  [Doxygen](http://www.stack.nl/~dimitri/doxygen/)\n* libhext's scripting language bindings are generated by\n  [Swig](http://www.swig.org/)\n\n## Tests\nThere are unit tests for libhext and blackbox tests for Hext as a language,\nwhose main purpose is to detect unwanted change in syntax or behavior.  \nThe libhext-test project is located in `/libhext/test` and depends on Google\nTest. Nothing fancy, just build the project and run the executable\n`libhext-test`. 
How to write test cases with Google Test is described\n[here](https://github.com/google/googletest/blob/master/googletest/docs/Primer.md).  \nThe blackbox tests are located in `/test`. There you'll find a shell script\ncalled `blackbox.sh`. This script applies Hext templates to HTML documents and\ncompares the result to a third file that contains the expected output. For\nexample, there is a test case `icase-quoted-regex` that consists of three files:\n`icase-quoted-regex.hext`, `icase-quoted-regex.html`, and\n`icase-quoted-regex.expected`. To run this test case you would do the following:\n\n    $ ./blackbox.sh case/icase-quoted-regex.hext\n\n`blackbox.sh` will then look for the corresponding `.html` and `.expected` files\nof the same name in the directory of `icase-quoted-regex.hext`. Then it will\ninvoke `htmlext` with the given Hext template and HTML document and compare the\nresult to `icase-quoted-regex.expected`. To run all blackbox tests in\nsuccession:\n\n    $ ./blackbox.sh case/*.hext\n\nBy default `blackbox.sh` will look for the `htmlext` binary in `$PATH`. Failing\nthat, it looks for the binary in the default build directory. You can tell\n`blackbox.sh` which command to use by setting HTMLEXT. For example, to run all\ntests through valgrind you'd run the following:\n\n    $ HTMLEXT=\"valgrind -q ../build/htmlext\" ./blackbox.sh case/*.hext\n\n## Acknowledgements\n* [Gumbo](https://github.com/google/gumbo-parser)\n  — **An HTML5 parsing library in pure C99**  \n  Gumbo is used as the HTML parser behind `hext::Html`. 
It's fast, easy to\n  integrate and even fixes invalid HTML.\n* [Ragel](http://www.colm.net/open-source/ragel/)\n  — **Ragel State Machine Compiler**  \n  The state machine that is used to parse Hext templates is generated by Ragel.\n  You can find the definition of this machine in `/libhext/ragel/hext-machine.rl`.\n* [RapidJSON](http://rapidjson.org/)\n  — **A fast JSON parser/generator for C++**  \n  RapidJSON powers the JSON output of the `htmlext` command line utility.\n* [jq](https://stedolan.github.io/jq/)\n  — **A lightweight and flexible command-line JSON processor**  \n  An indispensable tool when dealing with JSON in the shell.\n  Piping the output of `htmlext` into `jq` lets you do all sorts of crazy things.\n* [Ace](https://ace.c9.io/) — **A Code Editor for the Web**  \n  Used as the code editor in the\n  \"[Try Hext in your Browser!](https://hext.thomastrapp.com)\" section and as a\n  highlighter for all code examples. The highlighting rules for Hext are\n  included in this project in `/syntaxhl/ace`. Also, there's a script in\n  `/libhext/scripts/syntax-hl-ace` that uses Ace to transform a code template\n  into highlighted HTML.\n* [Boost.Beast](https://github.com/boostorg/beast)\n  — **HTTP and WebSocket built on Boost.Asio in C++11**  \n  The WebSocket server behind the \"[Try Hext in your Browser!](https://hext.thomastrapp.com)\"\n  section is built with Beast. See [github.com/html-extract/hext-on-websockets](https://github.com/html-extract/hext-on-websockets) for more.\n\n","funding_links":[],"categories":["C++"],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhtml-extract%2Fhext","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fhtml-extract%2Fhext","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhtml-extract%2Fhext/lists"}