{"id":23265664,"url":"https://github.com/html-extract/hext.js","last_synced_at":"2025-08-20T21:32:38.530Z","repository":{"id":57263835,"uuid":"167465925","full_name":"html-extract/hext.js","owner":"html-extract","description":"Use Hext in a browser or with node. Hext is a domain-specific language for extracting structured data from HTML documents.","archived":false,"fork":false,"pushed_at":"2024-04-25T12:31:32.000Z","size":3357,"stargazers_count":5,"open_issues_count":0,"forks_count":1,"subscribers_count":3,"default_branch":"master","last_synced_at":"2024-04-26T12:38:21.962Z","etag":null,"topics":["javascript","nodejs","wasm","webassembly"],"latest_commit_sha":null,"homepage":"https://hext.thomastrapp.com","language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/html-extract.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2019-01-25T01:39:10.000Z","updated_at":"2024-04-25T12:25:31.000Z","dependencies_parsed_at":"2023-11-09T16:28:51.783Z","dependency_job_id":"6ca77d88-4c35-4a6c-a345-784971731828","html_url":"https://github.com/html-extract/hext.js","commit_stats":null,"previous_names":[],"tags_count":15,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/html-extract%2Fhext.js","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/html-extract%2Fhext.js/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/html-extract%2Fhext.js/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/reposit
ories/html-extract%2Fhext.js/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/html-extract","download_url":"https://codeload.github.com/html-extract/hext.js/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":230457463,"owners_count":18229028,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["javascript","nodejs","wasm","webassembly"],"created_at":"2024-12-19T15:31:11.853Z","updated_at":"2024-12-19T15:31:12.634Z","avatar_url":"https://github.com/html-extract.png","language":"C++","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Hext.js - Hext for JavaScript\n\n[Hext](https://github.com/html-extract/hext) is a domain-specific language for extracting structured data from HTML documents.\nVisit [Hext's documentation](https://hext.thomastrapp.com/) to learn more about Hext.\n\nHext.js is a JavaScript/WebAssembly module that can be used in a browser.\n\n```html\n\u003c!-- latest hext.js release --\u003e\n\u003cscript src=\"https://cdn.jsdelivr.net/gh/html-extract/hext.js/dist/hext.js\"\u003e\u003c/script\u003e\n\u003cscript\u003e\n(function() {\n  // loadHext() returns a promise\n  loadHext().then(hext =\u003e {\n    // hext.Html's constructor expects a single argument\n    // containing a UTF-8 encoded string of HTML.\n    const html = new hext.Html(\n      '\u003ca href=\"one.html\"\u003e  \u003cimg src=\"one.jpg\" /\u003e  \u003c/a\u003e' +\n      '\u003ca href=\"two.html\"\u003e  \u003cimg src=\"two.jpg\" /\u003e  \u003c/a\u003e' +\n      '\u003ca 
href=\"three.html\"\u003e\u003cimg src=\"three.jpg\" /\u003e\u003c/a\u003e');\n\n    // hext.Rule's constructor expects a single argument\n    // containing a Hext snippet.\n    // Throws an Error on invalid syntax, with\n    // Error.message containing the error description.\n    const rule = new hext.Rule('\u003ca href:link\u003e' +\n                               '  \u003cimg src:image /\u003e' +\n                               '\u003c/a\u003e');\n\n    // hext.Rule.extract expects an argument of type\n    // hext.Html. Returns an Array containing Objects\n    // which contain key-value pairs of type String.\n    const result = rule.extract(html);\n\n    // hext.Rule.extract has a second, optional parameter\n    // of type unsigned int, called max_searches.\n    // The search for matching elements is aborted by\n    // throwing an exception after this limit is reached.\n    // The default is 0, which never aborts. If running\n    // untrusted Hext templates, it is recommended to set\n    // max_searches to some high value, like 10000, to\n    // protect against resource exhaustion.\n    // const result = rule.extract(html, 10000);\n\n    // print each key-value pair\n    for(var i in result)\n    {\n      for(var key in result[i])\n        console.log(key, \"-\u003e\", result[i][key]);\n      console.log()\n    }\n  });\n})();\n\u003c/script\u003e\n```\n\nThe current development version is found in [dist/hext.js](./dist/hext.js).\n\nHext.js also works in Node ([example](./htmlext.wasm.js)). If performance is important, you may prefer using [Hext for Node](https://hext.thomastrapp.com/download) instead. Hext for Node is a native node addon for Linux and Mac OS. For other language bindings visit [hext.thomastrapp.com/download](https://hext.thomastrapp.com/download).\n\n\n# Building Hext.js from source\n\n[Hext](https://github.com/html-extract/hext) is written in C++. 
This repo contains a full build process for compiling Hext and all its dependencies to JavaScript/WebAssembly.\n\nIn order to build this, you need [Emscripten](https://emscripten.org/docs/getting_started/downloads.html) and the following packages:\n`wget git python3 build-essential libxml2 libtool autoconf rapidjson-dev cmake`.\n\nThen compilation is done with a single command:\n\n    make\n\nThis will download and build all of Hext's dependencies. Then it will build libhext itself and a Hext wrapper which compiles to a JavaScript/WebAssembly module for use in browsers.\n\n\n## Testing\n\nRunning `make test` will run libhext's [blackbox tests](https://github.com/html-extract/hext/tree/master/test/case) through [htmlext.wasm.js](./htmlext.wasm.js), which uses node.\nTo test the latest version of hext.js in your browser, visit [hext.thomastrapp.com/hext.js-test/](https://hext.thomastrapp.com/hext.js-test/).\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhtml-extract%2Fhext.js","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fhtml-extract%2Fhext.js","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhtml-extract%2Fhext.js/lists"}