{"id":16389618,"url":"https://github.com/zadean/htmerl","last_synced_at":"2025-10-30T00:39:12.365Z","repository":{"id":50234046,"uuid":"191927554","full_name":"zadean/htmerl","owner":"zadean","description":"HTML Parser in Erlang","archived":false,"fork":false,"pushed_at":"2023-02-12T12:42:32.000Z","size":62,"stargazers_count":14,"open_issues_count":0,"forks_count":2,"subscribers_count":2,"default_branch":"main","last_synced_at":"2024-10-12T04:33:45.245Z","etag":null,"topics":["erlang","html-parser","html5"],"latest_commit_sha":null,"homepage":null,"language":"Erlang","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/zadean.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2019-06-14T10:54:13.000Z","updated_at":"2024-03-28T14:21:51.000Z","dependencies_parsed_at":"2022-09-16T16:26:46.514Z","dependency_job_id":null,"html_url":"https://github.com/zadean/htmerl","commit_stats":null,"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zadean%2Fhtmerl","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zadean%2Fhtmerl/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zadean%2Fhtmerl/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zadean%2Fhtmerl/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/zadean","download_url":"https://codeload.github.com/zadean/htmerl/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":221811386,"owners_count":16884305,"icon_url"
:"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["erlang","html-parser","html5"],"created_at":"2024-10-11T04:33:48.359Z","updated_at":"2025-10-30T00:39:12.357Z","avatar_url":"https://github.com/zadean.png","language":"Erlang","funding_links":[],"categories":[],"sub_categories":[],"readme":"htmerl\n=====\n\nAn OTP library for parsing HTML documents.\n\nThis library attempts to follow the [HTML 5.2 specification](https://www.w3.org/TR/html52/)\nfor tokenizing and parsing the HTML syntax as closely as possible.\nThis means that common errors that browsers accept are also accepted here and sanitized.\n\nThe output from `htmerl:sax/2` is identical to the XML SAX events produced\nby `xmerl_sax_parser` except that here all values and names are UTF-8 binary\nand not lists.\n\nUsage\n-----\n\nThere are two ways to use `htmerl`.\nFirstly, to build a tree directly from the parsed input. 
Notice here that the missing \"head\" element was added.\n\n```erlang\n1\u003e htmerl:simple(\u003c\u003c\"\u003c!DOCTYPE html\u003e\u003chtml\u003e\u003cbody\u003eHello\u003c/body\u003e\u003c/html\u003e\"\u003e\u003e).\n{htmlDocument,\u003c\u003c\"html\"\u003e\u003e,\u003c\u003c\u003e\u003e,\u003c\u003c\u003e\u003e,\n    [{htmlElement,\u003c\u003c\"html\"\u003e\u003e,\u003c\u003c\"http://www.w3.org/1999/xhtml\"\u003e\u003e,\n         [],\n         [{htmlElement,\u003c\u003c\"head\"\u003e\u003e,\u003c\u003c\"http://www.w3.org/1999/xhtml\"\u003e\u003e,\n              [],[]},\n          {htmlElement,\u003c\u003c\"body\"\u003e\u003e,\u003c\u003c\"http://www.w3.org/1999/xhtml\"\u003e\u003e,\n              [],\n              [{htmlText,\u003c\u003c\"Hello\"\u003e\u003e,text}]}]}]}\n```\n\nSecondly, as a SAX parser. Calling `htmerl:sax/1` returns a list of SAX events.\n`htmerl:sax/2` calls a user-defined function.\n\nOptions for `htmerl:sax/2` are as follows:\n\n- `preserve_ws`: Whether all text nodes, including pure whitespace, should be preserved (default `false`).\n- `user_state`: A term to hold any user-defined state. 
Will be passed to the `EventFun`.\n- `event_fun`: Arity 3 function that takes `Event`, `Position`, `UserState` and returns the new `UserState`.\n\n```erlang\n2\u003e htmerl:sax(\u003c\u003c\"\u003c!DOCTYPE html\u003e\u003chtml\u003e\u003cbody\u003eHello\u003c/body\u003e\u003c/html\u003e\"\u003e\u003e).\n{ok,[startDocument,\n     {startDTD,\u003c\u003c\"html\"\u003e\u003e,\u003c\u003c\u003e\u003e,\u003c\u003c\u003e\u003e},\n     endDTD,\n     {startPrefixMapping,\u003c\u003c\u003e\u003e,\u003c\u003c\"http://www.w3.org/1999/xhtml\"\u003e\u003e},\n     {startElement,\u003c\u003c\"http://www.w3.org/1999/xhtml\"\u003e\u003e,\u003c\u003c\"html\"\u003e\u003e,\n                   {\u003c\u003c\u003e\u003e,\u003c\u003c\"html\"\u003e\u003e},\n                   []},\n     {startElement,\u003c\u003c\"http://www.w3.org/1999/xhtml\"\u003e\u003e,\u003c\u003c\"head\"\u003e\u003e,\n                   {\u003c\u003c\u003e\u003e,\u003c\u003c\"head\"\u003e\u003e},\n                   []},\n     {endElement,\u003c\u003c\"http://www.w3.org/1999/xhtml\"\u003e\u003e,\u003c\u003c\"head\"\u003e\u003e,\n                 {\u003c\u003c\u003e\u003e,\u003c\u003c\"head\"\u003e\u003e}},\n     {startElement,\u003c\u003c\"http://www.w3.org/1999/xhtml\"\u003e\u003e,\u003c\u003c\"body\"\u003e\u003e,\n                   {\u003c\u003c\u003e\u003e,\u003c\u003c\"body\"\u003e\u003e},\n                   []},\n     {characters,\u003c\u003c\"Hello\"\u003e\u003e},\n     {endElement,\u003c\u003c\"http://www.w3.org/1999/xhtml\"\u003e\u003e,\u003c\u003c\"body\"\u003e\u003e,\n                 {\u003c\u003c\u003e\u003e,\u003c\u003c\"body\"\u003e\u003e}},\n     {endElement,\u003c\u003c\"http://www.w3.org/1999/xhtml\"\u003e\u003e,\u003c\u003c\"html\"\u003e\u003e,\n                 {\u003c\u003c\u003e\u003e,\u003c\u003c\"html\"\u003e\u003e}},\n     {endPrefixMapping,\u003c\u003c\u003e\u003e},\n     endDocument],\n    []}\n```\n\nor with a user-defined function and state:\n\n```erlang\n3\u003e F = fun(E, _, S) -\u003e 
io:format(\"Event: ~p~n\", [E]), S end,\nOpts = [{event_fun, F}, {user_state, []}],\nhtmerl:sax(\u003c\u003c\"\u003c!DOCTYPE html\u003e\u003chtml\u003e\u003cbody\u003eHello\u003c/body\u003e\u003c/html\u003e\"\u003e\u003e, Opts).\nEvent: startDocument\nEvent: {startDTD,\u003c\u003c\"html\"\u003e\u003e,\u003c\u003c\u003e\u003e,\u003c\u003c\u003e\u003e}\nEvent: endDTD\nEvent: {startPrefixMapping,\u003c\u003c\u003e\u003e,\u003c\u003c\"http://www.w3.org/1999/xhtml\"\u003e\u003e}\nEvent: {startElement,\u003c\u003c\"http://www.w3.org/1999/xhtml\"\u003e\u003e,\u003c\u003c\"html\"\u003e\u003e,\n                     {\u003c\u003c\u003e\u003e,\u003c\u003c\"html\"\u003e\u003e},\n                     []}\nEvent: {startElement,\u003c\u003c\"http://www.w3.org/1999/xhtml\"\u003e\u003e,\u003c\u003c\"head\"\u003e\u003e,\n                     {\u003c\u003c\u003e\u003e,\u003c\u003c\"head\"\u003e\u003e},\n                     []}\nEvent: {endElement,\u003c\u003c\"http://www.w3.org/1999/xhtml\"\u003e\u003e,\u003c\u003c\"head\"\u003e\u003e,\n                   {\u003c\u003c\u003e\u003e,\u003c\u003c\"head\"\u003e\u003e}}\nEvent: {startElement,\u003c\u003c\"http://www.w3.org/1999/xhtml\"\u003e\u003e,\u003c\u003c\"body\"\u003e\u003e,\n                     {\u003c\u003c\u003e\u003e,\u003c\u003c\"body\"\u003e\u003e},\n                     []}\nEvent: {characters,\u003c\u003c\"Hello\"\u003e\u003e}\nEvent: {endElement,\u003c\u003c\"http://www.w3.org/1999/xhtml\"\u003e\u003e,\u003c\u003c\"body\"\u003e\u003e,\n                   {\u003c\u003c\u003e\u003e,\u003c\u003c\"body\"\u003e\u003e}}\nEvent: {endElement,\u003c\u003c\"http://www.w3.org/1999/xhtml\"\u003e\u003e,\u003c\u003c\"html\"\u003e\u003e,\n                   {\u003c\u003c\u003e\u003e,\u003c\u003c\"html\"\u003e\u003e}}\nEvent: {endPrefixMapping,\u003c\u003c\u003e\u003e}\nEvent: endDocument\n{ok,[],[]}\n```\n\nor extracting values using the SAX events in a module:\n\n```erlang\n-module(htmerl_example).\n\n-export([run/0]).\n\nrun() 
-\u003e\n    Html =\n        \u003c\u003c\"\u003chtml\u003e\u003cbody\u003e\u003cp\u003eCheck\u003c/p\u003enothing here\u003cp\u003ethis \u003cb\u003ebold garbage\u003c/b\u003e\u003c/p\u003eg\"\n          \"arbage\u003cp\u003eout!\u003c/p\u003e\u003c/body\u003e\u003c/html\u003e\"\u003e\u003e,\n    XPath = \u003c\u003c\"html/body/p\"\u003e\u003e,\n    Path =\n        lists:reverse(\n            binary:split(XPath, \u003c\u003c\"/\"\u003e\u003e, [global])),\n    Opts = [{event_fun, fun xpath/3}, {user_state, {[], Path, []}}],\n    {ok, TextList, []} = htmerl:sax(Html, Opts),\n    TextList.\n\nxpath({characters, Text}, _LineNum, {Path, Path, Acc}) -\u003e\n    {Path, Path, [Text | Acc]};\nxpath({endElement, _Ns, Ln, _}, _LineNum, {[Ln | Path], XPath, Acc}) -\u003e\n    {Path, XPath, Acc};\nxpath({startElement, _Ns, Ln, _, _Atts}, _LineNum, {Path, XPath, Acc}) -\u003e\n    {[Ln | Path], XPath, Acc};\nxpath(endDocument, _LineNum, {_Path, _XPath, Acc}) -\u003e\n    lists:reverse(Acc);\nxpath(_Event, _LineNum, State) -\u003e\n    State.\n```\n\n```erlang\n4\u003e htmerl_example:run().\n[\u003c\u003c\"Check\"\u003e\u003e,\u003c\u003c\"this\"\u003e\u003e,\u003c\u003c\"out!\"\u003e\u003e]\n```\n\nPreserve all whitespaces in the document body of an incomplete document:\n\n```erlang\n5\u003e htmerl:sax(\u003c\u003c\"\u003cp\u003e   Well,\\t\\n Hello!!   
\"\u003e\u003e, [{preserve_ws, true}]).\n{ok,[startDocument,\n     {startPrefixMapping,\u003c\u003c\u003e\u003e,\u003c\u003c\"http://www.w3.org/1999/xhtml\"\u003e\u003e},\n     {startElement,\u003c\u003c\"http://www.w3.org/1999/xhtml\"\u003e\u003e,\u003c\u003c\"html\"\u003e\u003e,\n                   {\u003c\u003c\u003e\u003e,\u003c\u003c\"html\"\u003e\u003e},\n                   []},\n     {startElement,\u003c\u003c\"http://www.w3.org/1999/xhtml\"\u003e\u003e,\u003c\u003c\"head\"\u003e\u003e,\n                   {\u003c\u003c\u003e\u003e,\u003c\u003c\"head\"\u003e\u003e},\n                   []},\n     {endElement,\u003c\u003c\"http://www.w3.org/1999/xhtml\"\u003e\u003e,\u003c\u003c\"head\"\u003e\u003e,\n                 {\u003c\u003c\u003e\u003e,\u003c\u003c\"head\"\u003e\u003e}},\n     {startElement,\u003c\u003c\"http://www.w3.org/1999/xhtml\"\u003e\u003e,\u003c\u003c\"body\"\u003e\u003e,\n                   {\u003c\u003c\u003e\u003e,\u003c\u003c\"body\"\u003e\u003e},\n                   []},\n     {startElement,\u003c\u003c\"http://www.w3.org/1999/xhtml\"\u003e\u003e,\u003c\u003c\"p\"\u003e\u003e,\n                   {\u003c\u003c\u003e\u003e,\u003c\u003c\"p\"\u003e\u003e},\n                   []},\n     {characters,\u003c\u003c\"   Well,\\t\\n Hello!!   
\"\u003e\u003e},\n     {endElement,\u003c\u003c\"http://www.w3.org/1999/xhtml\"\u003e\u003e,\u003c\u003c\"p\"\u003e\u003e,\n                 {\u003c\u003c\u003e\u003e,\u003c\u003c\"p\"\u003e\u003e}},\n     {endElement,\u003c\u003c\"http://www.w3.org/1999/xhtml\"\u003e\u003e,\u003c\u003c\"body\"\u003e\u003e,\n                 {\u003c\u003c\u003e\u003e,\u003c\u003c\"body\"\u003e\u003e}},\n     {endElement,\u003c\u003c\"http://www.w3.org/1999/xhtml\"\u003e\u003e,\u003c\u003c\"html\"\u003e\u003e,\n                 {\u003c\u003c\u003e\u003e,\u003c\u003c\"html\"\u003e\u003e}},\n     {endPrefixMapping,\u003c\u003c\u003e\u003e},\n     endDocument],\n    []}\n```\n\nBuild\n-----\n\n```shell\nrebar3 compile\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fzadean%2Fhtmerl","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fzadean%2Fhtmerl","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fzadean%2Fhtmerl/lists"}