{"id":22764608,"url":"https://github.com/kuria/simple-html-parser","last_synced_at":"2025-04-14T23:13:30.627Z","repository":{"id":57009959,"uuid":"117297944","full_name":"kuria/simple-html-parser","owner":"kuria","description":"Minimalistic HTML parser","archived":false,"fork":false,"pushed_at":"2023-04-22T14:38:46.000Z","size":21,"stargazers_count":3,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-04-14T23:13:09.943Z","etag":null,"topics":["encoding","html","parser","php"],"latest_commit_sha":null,"homepage":"","language":"PHP","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/kuria.png","metadata":{"files":{"readme":"README.rst","changelog":"CHANGELOG.rst","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-01-12T23:32:28.000Z","updated_at":"2023-09-19T17:45:25.000Z","dependencies_parsed_at":"2024-12-11T12:19:35.617Z","dependency_job_id":null,"html_url":"https://github.com/kuria/simple-html-parser","commit_stats":{"total_commits":16,"total_committers":1,"mean_commits":16.0,"dds":0.0,"last_synced_commit":"d0150b6a25b9230ddb500c4a4b854f9bdab4899b"},"previous_names":[],"tags_count":2,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kuria%2Fsimple-html-parser","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kuria%2Fsimple-html-parser/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kuria%2Fsimple-html-parser/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kuria%2Fsimple-html-p
arser/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/kuria","download_url":"https://codeload.github.com/kuria/simple-html-parser/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248975328,"owners_count":21192210,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["encoding","html","parser","php"],"created_at":"2024-12-11T12:09:28.760Z","updated_at":"2025-04-14T23:13:30.604Z","avatar_url":"https://github.com/kuria.png","language":"PHP","funding_links":[],"categories":[],"sub_categories":[],"readme":"Simple HTML parser\n##################\n\nMinimalistic HTML parser.\n\n.. image:: https://travis-ci.com/kuria/simple-html-parser.svg?branch=master\n   :target: https://travis-ci.com/kuria/simple-html-parser\n\n.. NOTE::\n\n   If you need advanced DOM manipulation, consider using ``kuria/dom`` instead.\n\n.. contents::\n\n\nFeatures\n********\n\n- parsing opening tags\n- parsing closing tags\n- parsing comments\n- parsing DTDs\n- extracting parts of HTML content\n- determining encoding of HTML documents\n- handling \"raw text\" tags (``\u003cstyle\u003e``, ``\u003cscript\u003e``, ``\u003cnoscript\u003e``, etc.)\n\n\nRequirements\n************\n\n- PHP 7.1+\n\n\nUsage\n*****\n\nCreating the parser\n===================\n\n.. 
code:: php\n\n   \u003c?php\n\n   use Kuria\\SimpleHtmlParser\\SimpleHtmlParser;\n\n   $parser = new SimpleHtmlParser($html);\n\n\nIterating elements\n==================\n\nThe parser implements ``Iterator`` so it can be traversed using the standard\niterator methods.\n\n.. code:: php\n\n   \u003c?php\n\n   foreach ($parser as $element) {\n       print_r($element);\n   }\n\n.. code:: php\n\n   \u003c?php\n\n   $parser-\u003erewind();\n\n   if ($parser-\u003evalid()) {\n       print_r($parser-\u003ecurrent());\n   }\n\n\nElement array structure\n=======================\n\nEach element is an array with the following keys. Some keys are available only\nfor specific element types.\n\n========== ================================== ==============================================\nKey        Types                              Description\n========== ================================== ==============================================\n``type``   *any*                              The type of the element, see `Element types`_\n``start``  *any*                              Byte offset at which the element begins\n``end``    *any*                              Byte offset at which the element ends\n``name``   ``SimpleHtmlParser::OPENING_TAG``, Tag name (e.g. ``div``),\n                                              see `Tag name and attribute normalization`_\n           ``SimpleHtmlParser::CLOSING_TAG``\n``attrs``  ``SimpleHtmlParser::OPENING_TAG``  Array with the opening tag's attributes,\n                                              see `Tag name and attribute normalization`_\n``symbol`` ``SimpleHtmlParser::OTHER``        ``!``, ``?`` or ``/``\n========== ================================== ==============================================\n\n\nElement types\n-------------\n\n- ``SimpleHtmlParser::COMMENT`` - a comment, e.g. ``\u003c!-- foo --\u003e``\n- ``SimpleHtmlParser::OPENING_TAG`` - an opening tag, e.g. 
``\u003cspan class=\"bar\"\u003e``\n- ``SimpleHtmlParser::CLOSING_TAG`` - a closing tag, e.g. ``\u003c/span\u003e``\n- ``SimpleHtmlParser::OTHER`` - special element, e.g. doctype, XML header\n- ``SimpleHtmlParser::INVALID`` - invalid or incomplete tags\n\n\nTag name and attribute normalization\n====================================\n\nTag and attribute names that contain only ASCII characters are lowercased.\n\n\nManaging parser state\n=====================\n\nThe state methods can be used to temporarily store and/or revert state of the\nparser.\n\n- ``pushState()`` - push current state of the parser onto the stack\n- ``popState()`` - pop (discard) state stored on top of the stack\n- ``revertState()`` - pop and restore state stored on top of the stack\n- ``countStates()`` - count the number of states currently on the stack\n- ``clearStates()`` - discard all states\n\n\n``getHtml()`` - get HTML content\n================================\n\nThe ``getHtml()`` method may be used to get the entire HTML content or HTML\nof a single element.\n\n.. code:: php\n\n   \u003c?php\n\n   $parser-\u003egetHtml(); // get entire document\n   $parser-\u003egetHtml($element); // get single element\n\n\n``getSlice()`` - get part of the HTML\n=====================================\n\nThe ``getSlice()`` method returns a part of the HTML content.\n\nReturns an empty string for negative or out-of-bounds ranges.\n\n.. code:: php\n\n   \u003c?php\n\n   $slice = $parser-\u003egetSlice(100, 200);\n\n\n``getSliceBetween()`` - get content between 2 elements\n======================================================\n\nThe ``getSliceBetween()`` method returns a part of the HTML content that is between\n2 elements (usually opening and closing tag).\n\n.. 
code:: php\n\n   \u003c?php\n\n   $slice = $parser-\u003egetSliceBetween($openingTag, $closingTag);\n\n\n``getLength()`` - get length of the HTML\n========================================\n\nThe ``getLength()`` method returns the total length of the HTML content.\n\n\n``getEncoding()`` - determine encoding of the HTML document\n===========================================================\n\nThe ``getEncoding()`` method attempts to determine the encoding of the HTML document.\n\nIf the encoding cannot be determined or is not supported, the fallback encoding\nwill be used instead.\n\nThis method does not alter the parser's state.\n\n\n``getEncodingTag()`` - find the encoding-specifying meta tag\n============================================================\n\nThe ``getEncodingTag()`` method attempts to find the ``\u003cmeta charset=\"...\"\u003e``\nor ``\u003cmeta http-equiv=\"Content-Type\" content=\"...\"\u003e`` tag in the first 1024\nbytes of the HTML document.\n\nReturns ``NULL`` if the tag was not found.\n\nThis method does not alter the parser's state.\n\n\n``usesFallbackEncoding()`` - see if the fallback encoding is being used\n=======================================================================\n\nThe ``usesFallbackEncoding()`` method indicates whether the fallback encoding\nis being used. This is the case when the encoding is not specified or\nis not supported.\n\nThis method does not alter the parser's state.\n\n\n``setFallbackEncoding()`` - set fallback encoding\n=================================================\n\nThe ``setFallbackEncoding()`` method specifies an encoding to be used in case\nthe document has no encoding specified or specifies an unsupported encoding.\n\nThe fallback encoding must be supported by ``htmlspecialchars()``.\n\n\n``getDoctypeElement()`` - find the doctype element\n==================================================\n\nThe ``getDoctypeElement()`` method attempts to find the doctype in the first 1024\nbytes of the HTML document.\n\nReturns ``NULL`` if no doctype was found.\n\n\n``escape()`` - escape a string\n==============================\n\nThe ``escape()`` method escapes a string with ``htmlspecialchars()``, using\nthe HTML document's encoding.\n\n\n``find()`` - match a specific element\n=====================================\n\nThe ``find()`` method attempts to find a specific element starting from the\ncurrent position, optionally stopping after a given number of bytes.\n\nReturns ``NULL`` if no element was matched.\n\n.. code:: php\n\n   \u003c?php\n\n   $element = $parser-\u003efind(SimpleHtmlParser::OPENING_TAG, 'title');\n\n\n``getOffset()`` - get current offset\n====================================\n\nThe ``getOffset()`` method returns the current parser offset in bytes.\n\n\nExample: Reading document's title\n*********************************\n\n.. code:: php\n\n   \u003c?php\n\n   use Kuria\\SimpleHtmlParser\\SimpleHtmlParser;\n\n   $html = \u003c\u003c\u003cHTML\n   \u003c!doctype html\u003e\n   \u003cmeta charset=\"utf-8\"\u003e\n   \u003ctitle\u003eFoo bar\u003c/title\u003e\n   \u003ch1\u003eBaz qux\u003c/h1\u003e\n   HTML;\n\n   $parser = new SimpleHtmlParser($html);\n\n   $titleOpen = $parser-\u003efind(SimpleHtmlParser::OPENING_TAG, 'title');\n\n   if ($titleOpen) {\n       $titleClose = $parser-\u003efind(SimpleHtmlParser::CLOSING_TAG, 'title');\n\n       if ($titleClose) {\n           $title = $parser-\u003egetSliceBetween($titleOpen, $titleClose);\n\n           var_dump($title);\n       }\n   }\n\nOutput:\n\n::\n\n  string(7) \"Foo bar\"\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkuria%2Fsimple-html-parser","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fkuria%2Fsimple-html-parser","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkuria%2Fsimple-html-parser/lists"}