{"id":13424687,"url":"https://github.com/mylogin/htmlparser","last_synced_at":"2025-03-15T18:35:43.658Z","repository":{"id":179744512,"uuid":"430797223","full_name":"mylogin/htmlparser","owner":"mylogin","description":"Fast and lightweight C++ HTML parser","archived":false,"fork":false,"pushed_at":"2024-01-28T22:55:41.000Z","size":164,"stargazers_count":21,"open_issues_count":2,"forks_count":6,"subscribers_count":3,"default_branch":"main","last_synced_at":"2024-10-26T23:55:39.049Z","etag":null,"topics":["cpp11","css-selectors","formatter","html","parser","whatwg"],"latest_commit_sha":null,"homepage":"","language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/mylogin.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null}},"created_at":"2021-11-22T17:09:33.000Z","updated_at":"2024-10-15T10:32:54.000Z","dependencies_parsed_at":null,"dependency_job_id":"36c046f7-e49f-4c54-ab7e-0072873b8011","html_url":"https://github.com/mylogin/htmlparser","commit_stats":null,"previous_names":["mylogin/htmlparser"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mylogin%2Fhtmlparser","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mylogin%2Fhtmlparser/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mylogin%2Fhtmlparser/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mylogin%2Fhtmlparser/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/mylogin","download_url":"https://codeload.github.com/mylogi
n/htmlparser/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":243775913,"owners_count":20346285,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cpp11","css-selectors","formatter","html","parser","whatwg"],"created_at":"2024-07-31T00:00:57.864Z","updated_at":"2025-03-15T18:35:38.640Z","avatar_url":"https://github.com/mylogin.png","language":"C++","readme":"## Usage\r\n\r\n### Access nodes\r\n```cpp\r\nhtml::parser p;\r\nhtml::node_ptr node = p.parse(R\"(\u003c!DOCTYPE html\u003e\u003cbody\u003e\u003cdiv attr=\"val\"\u003etext\u003c/div\u003e\u003c!--comment--\u003e\u003c/body\u003e)\");\r\n\r\n// `parse` method returns root node of type html::node_t::none\r\nassert(node-\u003etype_node == html::node_t::none);\r\nassert(node-\u003eat(0)-\u003etype_node == html::node_t::doctype);\r\nassert(node-\u003eat(1)-\u003etype_node == html::node_t::tag);\r\nassert(node-\u003eat(1)-\u003eat(0)-\u003eat(0)-\u003etype_node == html::node_t::text);\r\nassert(node-\u003eat(1)-\u003eat(1)-\u003etype_node == html::node_t::comment);\r\n\r\nstd::cout \u003c\u003c \"Number of child elements: \" \u003c\u003c node-\u003esize() \u003c\u003c std::endl \u003c\u003c std::endl; // 2\r\n\r\nstd::cout \u003c\u003c \"Loop through child nodes: \" \u003c\u003c std::endl;\r\nfor(auto\u0026 n : *(node-\u003eat(1))) {\r\n\tstd::cout \u003c\u003c n-\u003eto_html() \u003c\u003c std::endl;\r\n}\r\nstd::cout \u003c\u003c std::endl;\r\n\r\nstd::cout \u003c\u003c \"Get node properties: \" \u003c\u003c std::endl;\r\nstd::cout \u003c\u003c \"DOCTYPE name: 
\" \u003c\u003c node-\u003eat(0)-\u003econtent \u003c\u003c std::endl; // html\r\nstd::cout \u003c\u003c \"BODY tag: \" \u003c\u003c node-\u003eat(1)-\u003etag_name \u003c\u003c std::endl; // body\r\nstd::cout \u003c\u003c \"Attr value: \" \u003c\u003c node-\u003eat(1)-\u003eat(0)-\u003eget_attr(\"attr\") \u003c\u003c std::endl; // val\r\nstd::cout \u003c\u003c \"Text node: \" \u003c\u003c node-\u003eat(1)-\u003eat(0)-\u003eat(0)-\u003econtent \u003c\u003c std::endl; // text\r\nstd::cout \u003c\u003c \"Comment: \" \u003c\u003c node-\u003eat(1)-\u003eat(1)-\u003econtent \u003c\u003c std::endl; // comment\r\n```\r\n\r\n### Find nodes using `select` method\r\n[List of available selectors](#selectors)\r\n```cpp\r\nhtml::parser p;\r\nhtml::node_ptr node = p.parse(R\"(\u003cdiv id=\"my_id\"\u003e\u003cp class=\"my_class\"\u003e\u003c/p\u003e\u003c/div\u003e)\");\r\nstd::vector\u003chtml::node*\u003e selected = node-\u003eselect(\"div#my_id p.my_class\");\r\nfor(auto elem : selected) {\r\n\tstd::cout \u003c\u003c elem-\u003eto_html() \u003c\u003c std::endl;\r\n}\r\n```\r\n\r\n### Access nodes using callback (called when the document is parsed)\r\n```cpp\r\nhtml::parser p;\r\np.set_callback(\"meta[http-equiv='Content-Type'][content*='charset=']\", [](html::node\u0026 n) {\r\n\tif (n.type_node == html::node_t::tag \u0026\u0026 n.type_tag == html::tag_t::open) {\r\n\t\tstd::cout \u003c\u003c \"Callback with selector to filter elements:\" \u003c\u003c std::endl;\r\n\t\tstd::cout \u003c\u003c n.to_html() \u003c\u003c std::endl \u003c\u003c std::endl;\r\n\t}\r\n});\r\np.set_callback([](html::node\u0026 n) {\r\n\tif(n.type_node == html::node_t::tag \u0026\u0026 n.type_tag == html::tag_t::open \u0026\u0026 n.tag_name == \"meta\") {\r\n\t\tif(n.get_attr(\"http-equiv\") == \"Content-Type\" \u0026\u0026 n.get_attr(\"content\").find(\"charset=\") != std::string::npos) {\r\n\t\t\tstd::cout \u003c\u003c \"Callback without selector:\" \u003c\u003c std::endl;\r\n\t\t\tstd::cout 
\u003c\u003c n.to_html() \u003c\u003c std::endl;\r\n\t\t}\r\n\t}\r\n});\r\np.parse(R\"(\u003chead\u003e\u003ctitle\u003eTitle\u003c/title\u003e\u003cmeta http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\" /\u003e\u003c/head\u003e)\");\r\n```\r\n\r\n### Manual search\r\n```cpp\r\nstd::cout \u003c\u003c \"Search for `li` tags that are not inside `ol`:\" \u003c\u003c std::endl;\r\nhtml::parser p;\r\nhtml::node_ptr node = p.parse(\"\u003cul\u003e\u003cli\u003eli1\u003c/li\u003e\u003cli\u003eli2\u003c/li\u003e\u003c/ul\u003e\u003col\u003e\u003cli\u003eli\u003c/li\u003e\u003c/ol\u003e\");\r\nnode-\u003ewalk([](html::node\u0026 n) {\r\n\tif(n.type_node == html::node_t::tag \u0026\u0026 n.tag_name == \"ol\") {\r\n\t\treturn false; // do not scan child tags\r\n\t}\r\n\tif(n.type_node == html::node_t::tag \u0026\u0026 n.tag_name == \"li\") {\r\n\t\tstd::cout \u003c\u003c n.to_html() \u003c\u003c std::endl;\r\n\t}\r\n\treturn true; // scan child tags\r\n});\r\n```\r\n\r\n### Finding unclosed tags\r\n```cpp\r\nhtml::parser p;\r\n\r\n// Callback to handle errors\r\np.set_callback([](html::err_t e, html::node\u0026 n) {\r\n\tif(e == html::err_t::tag_not_closed) {\r\n\t\tstd::cout \u003c\u003c \"Tag not closed: \" \u003c\u003c n.to_html(' ', false);\r\n\t\tstd::string msg;\r\n\t\thtml::node* current = \u0026n;\r\n\t\twhile(current-\u003eget_parent()) {\r\n\t\t\tmsg.insert(0, \" \" + current-\u003etag_name);\r\n\t\t\tcurrent = current-\u003eget_parent();\r\n\t\t}\r\n\t\tmsg.insert(0, \"\\nPath:\");\r\n\t\tstd::cout \u003c\u003c msg \u003c\u003c std::endl;\r\n\t}\r\n});\r\np.parse(\"\u003cdiv\u003e\u003cp\u003e\u003ca\u003e\u003c/p\u003e\u003c/div\u003e\");\r\n```\r\n\r\n### Print document formatted\r\n```cpp\r\nhtml::parser p;\r\nhtml::node_ptr node = p.parse(\"\u003cul\u003e\u003cli\u003eli1\u003c/li\u003e\u003cli\u003eli2\u003c/li\u003e\u003c/ul\u003e\u003col\u003e\u003cli\u003eli\u003c/li\u003e\u003c/ol\u003e\");\r\n\r\n// `to_html` takes two arguments: the indentation character 
and whether to output child elements (a tab character and true by default)\r\nstd::cout \u003c\u003c node-\u003eto_html(' ', true) \u003c\u003c std::endl;\r\n```\r\n\r\n### Print text content of a node\r\n```cpp\r\nhtml::parser p;\r\nhtml::node_ptr node = p.parse(\"\u003cdiv\u003e\u003cp\u003e\u003cb\u003eFirst\u003c/b\u003e p\u003c/p\u003e\u003cp\u003e\u003ci\u003eSecond\u003c/i\u003e p\u003c/p\u003eText\u003cbr /\u003eText\u003c/div\u003e\");\r\n\r\nstd::cout \u003c\u003c \"Print text with line breaks preserved:\" \u003c\u003c std::endl;\r\nstd::cout \u003c\u003c node-\u003eto_text() \u003c\u003c std::endl \u003c\u003c std::endl;\r\n\r\nstd::cout \u003c\u003c \"Print text with line breaks replaced with spaces:\" \u003c\u003c std::endl;\r\nstd::cout \u003c\u003c node-\u003eto_text(true) \u003c\u003c std::endl;\r\n```\r\n\r\n### Build document\r\n```cpp\r\nstd::cout \u003c\u003c \"Using helpers:\" \u003c\u003c std::endl;\r\n\r\nhtml::node hdiv = html::utils::make_node(html::node_t::tag, \"div\");\r\nhdiv.append(html::utils::make_node(html::node_t::text, \"Link:\"));\r\nhdiv.append(html::utils::make_node(html::node_t::tag, \"br\"));\r\nhtml::node ha = html::utils::make_node(html::node_t::tag, \"a\", {{\"href\", \"https://github.com/\"}, {\"class\", \"a_class\"}});\r\nha.append(html::utils::make_node(html::node_t::text, \"Github.com\"));\r\nstd::cout \u003c\u003c hdiv.append(ha).to_html() \u003c\u003c std::endl \u003c\u003c std::endl;\r\n\r\nstd::cout \u003c\u003c \"Without helpers:\" \u003c\u003c std::endl;\r\n\r\nhtml::node div;\r\ndiv.type_node = html::node_t::tag;\r\ndiv.tag_name = \"div\";\r\n\r\nhtml::node text;\r\ntext.type_node = html::node_t::text;\r\ntext.content = \"Link:\";\r\ndiv.append(text);\r\n\r\nhtml::node br;\r\nbr.type_node = html::node_t::tag;\r\nbr.tag_name = \"br\";\r\nbr.self_closing = true;\r\ndiv.append(br);\r\n\r\nhtml::node a;\r\na.type_node = html::node_t::tag;\r\na.tag_name = \"a\";\r\na.set_attr(\"href\", 
\"https://github.com/\");\r\na.set_attr(\"class\", \"a_class\");\r\n\r\nhtml::node a_text;\r\na_text.type_node = html::node_t::text;\r\na_text.content = \"Github.com\";\r\na.append(a_text);\r\n\r\ndiv.append(a);\r\n\r\nstd::cout \u003c\u003c div.to_html() \u003c\u003c std::endl;\r\n```\r\n\r\n## Selectors\r\n| Selector example | Description | select | callback |\r\n|-|-|-|-|\r\n| * | all elements | √ | √ |\r\n| div | tag name | √ | √ |\r\n| #id1 | id=\"id1\" | √ | √ |\r\n| .class1 | class=\"class1\" | √ | √ |\r\n| .class1.class2 | class=\"class1 class2\" | √ | √ |\r\n| :first | first element | √ | √ |\r\n| :last | last element | √ | - |\r\n| :eq(3) | element index = 3 (starts from 0) | √ | √ |\r\n| :gt(3) | element index \u003e 3 (starts from 0) | √ | √ |\r\n| :lt(3) | element index \u003c 3 (starts from 0) | √ | √ |\r\n| [attr] | element that have attribute \"attr\" | √ | √ |\r\n| [attr='val'] | attribute is equal to \"val\" | √ | √ |\r\n| [attr!='val'] | attribute is not equal to \"val\" or does not exist | √ | √ |\r\n| [attr^='http:'] | attribute starts with \"http:\" | √ | √ |\r\n| [attr$='.jpeg'] | attribute ends with \".jpeg\" | √ | √ |\r\n| [attr*='/path/'] | attribute contains \"/path/\" | √ | √ |\r\n| [attr~='flower'] | attribute contains word \"flower\" | √ | √ |\r\n| [attr\u0026vert;='en'] | attribute equal to \"en\" or starting with \"en-\" | √ | √ |\r\n| div#id1.class1[attr='val'] | element that matches all of these selectors | √ | √ |\r\n| p,div | element that matches any of these selectors | √ | √ |\r\n| div p | all `\u003cp\u003e` elements inside `\u003cdiv\u003e` elements | √ | - |\r\n| div\u003ep | all `\u003cp\u003e` elements where the parent is a `\u003cdiv\u003e` element | √ | - |\r\n| div div\u003ep\u003ei | combination of nested selectors  | √ | - |","funding_links":[],"categories":["Text 
Handling","C++"],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmylogin%2Fhtmlparser","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmylogin%2Fhtmlparser","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmylogin%2Fhtmlparser/lists"}