{"id":50284600,"url":"https://github.com/f34nk/lexbor_erl","last_synced_at":"2026-05-28T01:32:05.551Z","repository":{"id":325654994,"uuid":"1070784589","full_name":"f34nk/lexbor_erl","owner":"f34nk","description":"An Erlang port based binding for the Lexbor HTML parser and DOM C library","archived":false,"fork":false,"pushed_at":"2025-12-04T07:23:30.000Z","size":187,"stargazers_count":6,"open_issues_count":0,"forks_count":1,"subscribers_count":1,"default_branch":"main","last_synced_at":"2026-03-19T18:44:56.657Z","etag":null,"topics":["dom-manipulation","erlang","html-parser","lexbor"],"latest_commit_sha":null,"homepage":"","language":"Erlang","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"lgpl-2.1","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/f34nk.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-10-06T12:42:07.000Z","updated_at":"2026-03-16T16:41:13.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/f34nk/lexbor_erl","commit_stats":null,"previous_names":["f34nk/lexbor_erl"],"tags_count":3,"template":false,"template_full_name":null,"purl":"pkg:github/f34nk/lexbor_erl","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/f34nk%2Flexbor_erl","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/f34nk%2Flexbor_erl/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/f34nk%2Flexbor_erl/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/reposi
tories/f34nk%2Flexbor_erl/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/f34nk","download_url":"https://codeload.github.com/f34nk/lexbor_erl/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/f34nk%2Flexbor_erl/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33590884,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-05-27T02:00:06.184Z","response_time":53,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["dom-manipulation","erlang","html-parser","lexbor"],"created_at":"2026-05-28T01:32:04.841Z","updated_at":"2026-05-28T01:32:05.542Z","avatar_url":"https://github.com/f34nk.png","language":"Erlang","funding_links":[],"categories":[],"sub_categories":[],"readme":"# lexbor_erl\n\n[![CI](https://github.com/f34nk/lexbor_erl/actions/workflows/ci.yml/badge.svg)](https://github.com/f34nk/lexbor_erl/actions/workflows/ci.yml)\n[![lexbor_erl version](https://img.shields.io/hexpm/v/lexbor_erl.svg)](https://hex.pm/packages/lexbor_erl)\n[![Hex.pm](https://img.shields.io/hexpm/dt/lexbor_erl.svg)](https://hex.pm/packages/lexbor_erl)\n\nAn Erlang wrapper for the [Lexbor](https://github.com/lexbor/lexbor) HTML parser and DOM library via a port-based architecture.\n\n## Overview\n\n`lexbor_erl` provides safe, fast HTML parsing, CSS selector querying, DOM 
manipulation, and streaming parser capabilities for Erlang applications. It wraps the high-performance Lexbor C library using a port-based worker pool architecture for isolation, safety, and parallel processing.\n\n## Features\n\n- **HTML5-tolerant parsing** with automatic error recovery\n- **CSS selector queries** (class, ID, tag, attributes, combinators, pseudo-classes)\n- **DOM manipulation** - modify attributes, text content, and tree structure\n- **Streaming parser** - parse large HTML documents incrementally\n- **Stateless operations** for quick one-off tasks\n- **Stateful document management** for complex workflows\n- **Parallel processing** - worker pool architecture for concurrent operations\n- **Safe for the BEAM** - crashes in native code don't bring down the VM\n- **No atom leaks** - all user input stays as binaries\n\n## Prerequisites\n\n- Erlang/OTP (tested with OTP 24+)\n- CMake 3.10+\n- [Lexbor library](https://github.com/lexbor/lexbor) installed on your system\n\n### Installing Lexbor\n\nOn macOS with Homebrew:\n```bash\nbrew install lexbor\n```\n\nOn Ubuntu/Debian:\n```bash\nsudo apt-get install liblexbor-dev\n```\n\nOr build from source:\n```bash\ngit clone https://github.com/lexbor/lexbor.git\ncd lexbor\nmkdir build \u0026\u0026 cd build\ncmake ..\nmake\nsudo make install\n```\n\n## Building\n\n```bash\nmake\n```\n\n## Quick Start\n\n```erlang\n1\u003e lexbor_erl:start().\nok\n\n%% Stateless: parse and serialize\n2\u003e {ok, Html} = lexbor_erl:parse_serialize(\u003c\u003c\"\u003cdiv\u003eHello\u003cspan\u003eWorld\"\u003e\u003e).\n{ok,\u003c\u003c\"\u003chtml\u003e\u003chead\u003e\u003c/head\u003e\u003cbody\u003e\u003cdiv\u003eHello\u003cspan\u003eWorld\u003c/span\u003e\u003c/div\u003e\u003c/body\u003e\u003c/html\u003e\"\u003e\u003e}\n\n%% Stateless: select elements\n3\u003e {ok, List} = lexbor_erl:select_html(\n     \u003c\u003c\"\u003cul\u003e\u003cli class=a\u003eA\u003c/li\u003e\u003cli 
class=b\u003eB\u003c/li\u003e\u003c/ul\u003e\"\u003e\u003e, \n     \u003c\u003c\"li.b\"\u003e\u003e).\n{ok,[\u003c\u003c\"\u003cli class=\\\"b\\\"\u003eB\u003c/li\u003e\"\u003e\u003e]}\n\n%% Stateful: parse document\n4\u003e {ok, Doc} = lexbor_erl:parse(\n     \u003c\u003c\"\u003cdiv id=app\u003e\u003cul\u003e\u003cli class=a\u003eA\u003c/li\u003e\u003cli class=b\u003eB\u003c/li\u003e\u003c/ul\u003e\u003c/div\u003e\"\u003e\u003e).\n{ok,1}\n\n%% Select nodes\n5\u003e {ok, Nodes} = lexbor_erl:select(Doc, \u003c\u003c\"#app li\"\u003e\u003e).\n{ok,[{node,140735108544752},{node,140735108544896}]}\n\n%% Get node HTML\n6\u003e [lexbor_erl:outer_html(Doc, N) || N \u003c- Nodes].\n[{ok,\u003c\u003c\"\u003cli class=\\\"a\\\"\u003eA\u003c/li\u003e\"\u003e\u003e},{ok,\u003c\u003c\"\u003cli class=\\\"b\\\"\u003eB\u003c/li\u003e\"\u003e\u003e}]\n\n%% DOM manipulation: modify attributes\n7\u003e {ok, [Li]} = lexbor_erl:select(Doc, \u003c\u003c\"li.a\"\u003e\u003e).\n{ok,[{node,140735108544752}]}\n\n8\u003e lexbor_erl:set_attribute(Doc, Li, \u003c\u003c\"class\"\u003e\u003e, \u003c\u003c\"modified\"\u003e\u003e).\nok\n\n9\u003e lexbor_erl:get_attribute(Doc, Li, \u003c\u003c\"class\"\u003e\u003e).\n{ok,\u003c\u003c\"modified\"\u003e\u003e}\n\n%% DOM manipulation: modify text content\n10\u003e lexbor_erl:set_text(Doc, Li, \u003c\u003c\"New Text\"\u003e\u003e).\nok\n\n11\u003e lexbor_erl:get_text(Doc, Li).\n{ok,\u003c\u003c\"New Text\"\u003e\u003e}\n\n%% Content manipulation: append HTML to matching elements\n12\u003e {ok, NumModified} = lexbor_erl:append_content(Doc, \u003c\u003c\"ul\"\u003e\u003e, \u003c\u003c\"\u003cli\u003eNew Item\u003c/li\u003e\"\u003e\u003e).\n{ok,1}\n\n13\u003e {ok, Html} = lexbor_erl:serialize(Doc).\n{ok,\u003c\u003c\"\u003c!DOCTYPE html\u003e\u003chtml\u003e\u003chead\u003e\u003c/head\u003e\u003cbody\u003e\u003cdiv id=\\\"app\\\"\u003e\u003cul\u003e\u003cli class=\\\"modified\\\"\u003eNew Text\u003c/li\u003e\u003cli 
class=\\\"b\\\"\u003eB\u003c/li\u003e\u003cli\u003eNew Item\u003c/li\u003e\u003c/ul\u003e\u003c/div\u003e\u003c/body\u003e\u003c/html\u003e\"\u003e\u003e}\n\n%% Streaming parser: parse incrementally\n14\u003e {ok, Session} = lexbor_erl:parse_stream_begin().\n{ok,72057594037927937}\n\n15\u003e ok = lexbor_erl:parse_stream_chunk(Session, \u003c\u003c\"\u003cdiv\u003e\u003cp\u003eHe\"\u003e\u003e).\nok\n\n16\u003e ok = lexbor_erl:parse_stream_chunk(Session, \u003c\u003c\"llo\u003c/p\u003e\u003c/div\u003e\"\u003e\u003e).\nok\n\n17\u003e {ok, StreamDoc} = lexbor_erl:parse_stream_end(Session).\n{ok,72057594037927938}\n\n%% Release documents\n18\u003e ok = lexbor_erl:release(Doc).\nok\n\n19\u003e ok = lexbor_erl:release(StreamDoc).\nok\n\n20\u003e lexbor_erl:stop().\nok\n```\n\nAlso check out [examples/](https://github.com/f34nk/lexbor_erl/tree/main/examples) directory.\n\n## Supported Operations\n\n### Document Lifecycle\n- `parse/1` - Parse HTML document, returns document handle\n- `release/1` - Release document and free resources\n- `serialize/1` - Serialize document to HTML5 binary\n\n### Stateless Operations\n- `parse_serialize/1` - Parse and serialize in one call (convenience function)\n- `select_html/2` - Parse, select elements, return HTML fragments\n\n### CSS Selectors\n- `select/2` - Find elements using CSS selectors\n- Supports: ID (`#id`), class (`.class`), tag (`div`), attributes (`[attr=value]`)\n- Supports: combinators (Descendant ` `, Child `\u003e`, Adjacent sibling `+`, General sibling `~`), pseudo-classes (`:first-child`, `:nth-child()`, etc.)\n\n### DOM Queries\n- `outer_html/2` - Get outer HTML of element (including the element tag)\n- `inner_html/2` - Get inner HTML of element (children only)\n\n### Attribute Manipulation\n- `get_attribute/3` - Get attribute value\n- `set_attribute/4` - Set attribute value\n- `remove_attribute/3` - Remove attribute\n\n### Text Content\n- `get_text/2` - Get text content recursively\n- `set_text/3` - Set text content 
(removes all children, replaces with text)\n\n### HTML Content Manipulation\n- `set_inner_html/3` - Replace element's children with parsed HTML\n- `append_content/3` - Append HTML content to all elements matching selector\n- `prepend_content/3` - Prepend as first child\n- `insert_before_content/3` - Insert HTML as sibling before matched elements\n- `insert_after_content/3` - Insert HTML as sibling after matched elements\n- `replace_content/3` - Replace matched elements with HTML content\n\n### DOM Tree Manipulation\n- `create_element/2` - Create new element\n- `append_child/3` - Append child node to parent\n- `insert_before/4` - Insert node before reference node\n- `remove_node/2` - Remove node from tree\n\n### Streaming Parser\n- `parse_stream_begin/0` - Start streaming parse session\n- `parse_stream_chunk/2` - Add HTML chunk to stream\n- `parse_stream_end/1` - Finalize stream and get document\n\n### Application Management\n- `start/0` - Start lexbor_erl application\n- `stop/0` - Stop lexbor_erl application\n- `alive/0` - Check if service is running\n\n## How to use it in your application?\n\nAdd to your `rebar.config`:\n\n```erlang\n{deps, [\n    {lexbor_erl, \"0.3.0\"}\n]}.\n```\n\nThen run:\n\n```shell\nrebar3 get-deps\nrebar3 compile\n```\n\n**Note**: lexbor_erl is a port-based application and cannot be packaged as an escript. 
\nIt must be used as a library dependency with access to the compiled C port executable.\n\nSee the [demo/](https://github.com/f34nk/lexbor_erl/tree/main/demo) directory for a complete working application.\n\n## Additional configuration\n\nIn your `sys.config`:\n\n```erlang\n{lexbor_erl, [\n  {port_cmd, \"priv/lexbor_port\"},\n  {op_timeout_ms, 3000}\n]}.\n```\n\n## Parallelism and Concurrency\n\n`lexbor_erl` uses a **worker pool architecture** to enable true parallel processing of HTML operations:\n\n### Architecture\n\n- **Multiple port workers**: Configurable pool of independent C port processes\n- **Smart routing**: \n  - Stateless operations (e.g., `parse_serialize/1`, `select_html/2`) use time-based hash distribution for load balancing\n  - Stateful operations route by `DocId` to ensure the same worker handles all operations for a given document\n- **Isolation**: Each worker process is independent with its own document registry\n- **Individual supervision**: Each worker is supervised independently - if one crashes, only that worker restarts\n- **Fault tolerance**: Worker crashes don't affect other workers or the BEAM VM; documents on a crashed worker are lost, but other workers continue serving\n\n### Configuration\n\nSet the pool size in your `sys.config`:\n\n```erlang\n{lexbor_erl, [\n  {pool_size, 8},              % Number of parallel workers (default: scheduler count)\n  {op_timeout_ms, 3000}        % Timeout per operation\n]}.\n```\n\nOr set it in the application environment before starting the application:\n\n```erlang\napplication:set_env(lexbor_erl, pool_size, 8).\n```\n\n### Thread Safety and Fault Tolerance\n\n- **Safe by design**: Each worker is single-threaded, processing one request at a time\n- **No shared state**: Documents are isolated to their respective workers\n- **Concurrent operations**: Multiple workers can process different documents simultaneously\n- **Deterministic routing**: A document always routes to the same worker via the worker ID encoded in the 
`DocId`\n- **Individual worker restart**: If a worker crashes, only that worker is restarted by the supervisor\n- **Limited blast radius**: Worker crashes only affect documents on that specific worker\n- **Automatic recovery**: Crashed workers are automatically restarted and can accept new documents\n\n### Performance Characteristics\n\n- **Parallelism**: Leverages all CPU cores for concurrent HTML parsing and manipulation\n- **No contention**: No locks or shared mutable state between workers\n- **Linear scaling**: Performance scales linearly with the number of workers (up to CPU core count)\n- **Stateless optimization**: Stateless operations (`parse_serialize`, `select_html`) can use any available worker\n\n## License\n\nLGPL-2.1-or-later\n\n## Credits\n\nBuilt on top of the [Lexbor](https://github.com/lexbor/lexbor) HTML parser library.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ff34nk%2Flexbor_erl","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ff34nk%2Flexbor_erl","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ff34nk%2Flexbor_erl/lists"}