{"id":22894885,"url":"https://github.com/duzun/hquery.php","last_synced_at":"2025-05-14T18:05:26.917Z","repository":{"id":22597179,"uuid":"25939194","full_name":"duzun/hQuery.php","owner":"duzun","description":"An extremely fast web scraper that parses megabytes of invalid HTML in a blink of an eye. PHP5.3+, no dependencies.","archived":false,"fork":false,"pushed_at":"2025-03-22T20:11:16.000Z","size":4056,"stargazers_count":362,"open_issues_count":17,"forks_count":74,"subscribers_count":23,"default_branch":"master","last_synced_at":"2025-05-14T18:05:20.056Z","etag":null,"topics":["broken-html","crawler","css-selectors","domcrawler","fast","hquery","html","html-parser","invalid-html","jquery-like","jquery-selectors","parser","php","psr-0","psr-4","scraper","selectors","xml","xml-parser"],"latest_commit_sha":null,"homepage":"https://duzun.me/playground/hquery","language":"PHP","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/duzun.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2014-10-29T20:07:13.000Z","updated_at":"2025-03-30T20:14:05.000Z","dependencies_parsed_at":"2023-12-12T00:54:35.682Z","dependency_job_id":"147578d5-9fe7-4f8e-bf20-885f4fa770ae","html_url":"https://github.com/duzun/hQuery.php","commit_stats":{"total_commits":217,"total_committers":7,"mean_commits":31.0,"dds":0.03686635944700456,"last_synced_commit":"591542a1626d30e9644be45379e4bed6e80d1346"},"previous_names":[],"tags_count":45,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/rep
ositories/duzun%2FhQuery.php","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/duzun%2FhQuery.php/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/duzun%2FhQuery.php/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/duzun%2FhQuery.php/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/duzun","download_url":"https://codeload.github.com/duzun/hQuery.php/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254198514,"owners_count":22030965,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["broken-html","crawler","css-selectors","domcrawler","fast","hquery","html","html-parser","invalid-html","jquery-like","jquery-selectors","parser","php","psr-0","psr-4","scraper","selectors","xml","xml-parser"],"created_at":"2024-12-13T23:27:17.593Z","updated_at":"2025-05-14T18:05:21.908Z","avatar_url":"https://github.com/duzun.png","language":"PHP","funding_links":["https://www.paypal.me/duzuns"],"categories":[],"sub_categories":[],"readme":"hQuery.php  [![Donate](https://img.shields.io/badge/Donate-PayPal-green.svg)](https://www.paypal.me/duzuns)\n==========\n\nAn extremely fast and efficient web scraper that can parse megabytes of invalid HTML in a blink of an eye.\n\nYou can use the familiar jQuery/CSS selector syntax to easily find the data you need.\n\nIn my unit tests, I demand it be at least 10 times faster than Symfony's DOMCrawler on a 3Mb HTML document.\nIn reality, according to my humble tests, it is 
two-three orders of magnitude faster than DOMCrawler in some cases, especially when\nselecting thousands of elements, and on average uses x2 less RAM.\n\nSee [tests/README.md](https://github.com/duzun/hQuery.php/blob/master/tests/README.md).\n\n[API Documentation](https://duzun.github.io/hQuery.php/docs/class-hQuery.html)\n\n## 💡 Features\n\n- Very fast parsing and lookup\n- Parses broken HTML\n- jQuery-like style of DOM traversal\n- Low memory usage\n- Can handle big HTML documents (I have tested up to 20Mb, but the limit is the amount of RAM you have)\n- Doesn't require cURL to be installed and automatically handles redirects (see [hQuery::fromUrl()](https://duzun.github.io/hQuery.php/docs/class-hQuery.html#_fromURL))\n- Caches response for multiple processing tasks\n- [PSR-7](https://www.php-fig.org/psr/psr-7/) friendly (see hQuery::fromHTML($message))\n- PHP 5.3+\n- No dependencies\n\n## 🛠 Install\n\nJust add this folder to your project and `include_once 'hquery.php';` and you are ready to `hQuery`.\n\nAlternatively `composer require duzun/hquery`\n\nor using `npm install hquery.php`, `require_once 'node_modules/hquery.php/hquery.php';`.\n\n## ⚙ Usage\n\n### Basic setup:\n\n```php\n// Optionally use namespaces\nuse duzun\\hQuery;\n\n// Either use composer, or include this file:\ninclude_once '/path/to/libs/hquery.php';\n\n// Set the cache path - must be a writable folder\n// If not set, hQuery::fromURL() would make a new request on each call\nhQuery::$cache_path = \"/path/to/cache\";\n\n// Time to keep request data in cache, seconds\n// A value of 0 disables cache\nhQuery::$cache_expires = 3600; // default one hour\n```\n\nI would recommend using [php-http/cache-plugin](http://docs.php-http.org/en/latest/plugins/cache.html)\nwith a [PSR-7 client](http://docs.php-http.org/en/latest/clients.html) for better flexibility.\n\n### Load HTML from a file\n###### [hQuery::fromFile](https://duzun.github.io/hQuery.php/docs/class-hQuery.html#_fromFile)( string `$filename`, 
boolean `$use_include_path` = false, resource `$context` = NULL )\n\n```php\n// Local\n$doc = hQuery::fromFile('/path/to/filesystem/doc.html');\n\n// Remote\n$doc = hQuery::fromFile('https://example.com/', false, $context);\n```\n\nWhere `$context` is created with [stream_context_create()](https://secure.php.net/manual/en/function.stream-context-create.php).\n\nFor an example of using `$context` to make a HTTP request with proxy see [#26](https://github.com/duzun/hQuery.php/issues/26#issuecomment-351032382).\n\n### Load HTML from a string\n###### [hQuery::fromHTML](https://duzun.github.io/hQuery.php/docs/class-hQuery.html#_fromHTML)( string `$html`, string `$url` = NULL )\n\n```php\n$doc = hQuery::fromHTML('\u003chtml\u003e\u003chead\u003e\u003ctitle\u003eSample HTML Doc\u003c/title\u003e\u003cbody\u003eContents...\u003c/body\u003e\u003c/html\u003e');\n\n// Set base_url, in case the document is loaded from local source.\n// Note: The base_url property is used to retrieve absolute URLs from relative ones.\n$doc-\u003ebase_url = 'http://desired-host.net/path';\n```\n\n### Load a remote HTML document\n###### [hQuery::fromUrl](https://duzun.github.io/hQuery.php/docs/class-hQuery.html#_fromURL)( string `$url`, array `$headers` = NULL, array|string `$body` = NULL, array `$options` = NULL )\n\n```php\nuse duzun\\hQuery;\n\n// GET the document\n$doc = hQuery::fromUrl('http://example.com/someDoc.html', ['Accept' =\u003e 'text/html,application/xhtml+xml;q=0.9,*/*;q=0.8']);\n\nvar_dump($doc-\u003eheaders); // See response headers\nvar_dump(hQuery::$last_http_result); // See response details of last request\n\n// with POST\n$doc = hQuery::fromUrl(\n    'http://example.com/someDoc.html', // url\n    ['Accept' =\u003e 'text/html,application/xhtml+xml;q=0.9,*/*;q=0.8'], // headers\n    ['username' =\u003e 'Me', 'fullname' =\u003e 'Just Me'], // request body - could be a string as well\n    ['method' =\u003e 'POST', 'timeout' =\u003e 7, 'redirect' =\u003e 7, 'decode' =\u003e 
'gzip'] // options\n);\n\n```\n\nFor building advanced requests (POST, parameters etc) see [hQuery::http_wr()](https://duzun.github.io/hQuery.php/docs/class-hQuery.html#_http_wr),\nthough I recommend using a specialized ([PSR-7](https://www.php-fig.org/psr/psr-7/)?) library for making requests\nand `hQuery::fromHTML($html, $url=NULL)` for processing results.\nSee [Guzzle](http://docs.guzzlephp.org/en/stable/) for eg.\n\n#### [PSR-7](https://www.php-fig.org/psr/psr-7/) example:\n\n```sh\ncomposer require php-http/message php-http/discovery php-http/curl-client\n```\n\nIf you don't have [cURL PHP extension](https://secure.php.net/curl),\njust replace `php-http/curl-client` with `php-http/socket-client` in the above command.\n\n```php\nuse duzun\\hQuery;\n\nuse Http\\Discovery\\HttpClientDiscovery;\nuse Http\\Discovery\\MessageFactoryDiscovery;\n\n$client = HttpClientDiscovery::find();\n$messageFactory = MessageFactoryDiscovery::find();\n\n$request = $messageFactory-\u003ecreateRequest(\n  'GET',\n  'http://example.com/someDoc.html',\n  ['Accept' =\u003e 'text/html,application/xhtml+xml;q=0.9,*/*;q=0.8']\n);\n\n$response = $client-\u003esendRequest($request);\n\n$doc = hQuery::fromHTML($response, $request-\u003egetUri());\n\n```\n\nAnother option is to use [stream_context_create()](https://secure.php.net/manual/en/function.stream-context-create.php)\nto create a `$context`, then call `hQuery::fromFile($url, false, $context)`.\n\n### Processing the results\n###### [hQuery::find](https://duzun.github.io/hQuery.php/docs/class-hQuery.html#_find)( string `$sel`, array|string `$attr` = NULL, hQuery\\Node `$ctx` = NULL )\n\n```php\n// Find all banners (images inside anchors)\n$banners = $doc-\u003efind('a[href] \u003e img[src]:parent');\n\n// Extract links and images\n$links  = array();\n$images = array();\n$titles = array();\n\n// If the result of find() is not empty\n// $banners is a collection of elements (hQuery\\Element)\nif ( $banners ) {\n\n    // Iterate over the 
result\n    foreach($banners as $pos =\u003e $a) {\n        // $a-\u003ehref property is the resolved $a-\u003eattr('href') relative to the\n        // documents \u003cbase href=...\u003e, if present, or $doc-\u003ebaseURL.\n        $links[$pos] = $a-\u003ehref; // get absolute URL from href property\n        $titles[$pos] = trim($a-\u003etext()); // strip all HTML tags and leave just text\n\n        // Filter the result\n        if ( !$a-\u003ehasClass('logo') ) {\n            // $a-\u003estyle property is the parsed $a-\u003eattr('style'), same as $a-\u003eattr('style', true)\n            if ( strtolower($a-\u003estyle['position']) == 'fixed' ) continue;\n\n            $img = $a-\u003efind('img')[0]; // ArrayAccess\n            if ( $img ) $images[$pos] = $img-\u003esrc; // short for $img-\u003eattr('src', true)\n        }\n    }\n\n    // If at least one element has the class .home\n    if ( $banners-\u003ehasClass('home') ) {\n        echo 'There is .home button!', PHP_EOL;\n\n        // ArrayAccess for elements and properties.\n        if ( $banners[0]['href'] == '/' ) {\n            echo 'And it is the first one!';\n        }\n    }\n}\n\n// Read charset of the original document (internally it is converted to UTF-8)\n$charset = $doc-\u003echarset;\n\n// Get the size of the document ( strlen($html) )\n$size = $doc-\u003esize;\n\n// The URL at which the document was requested\n$requestUri = $doc-\u003ehref;\n\n// \u003cbase href=...\u003e, if present, or the origin + dir path part from $doc-\u003ehref.\n// The .href and .src props are resolved using this value.\n$baseURL = $doc-\u003ebaseURL;\n```\n\nNote: In case the charset meta attribute has a wrong value or the internal conversion fails for any other reason, `hQuery` would ignore the error and continue processing with the original HTML, but would register an error message on `$doc-\u003ehtml_errors['convert_encoding']`.\n\n## 🖧 Live Demo\n\nOn 
[DUzun.Me](https://duzun.me/playground/hquery#sel=%20a%20%3E%20img%3Aparent\u0026url=https%3A%2F%2Fgithub.com%2Fduzun)\n\nA lot of people ask for the source of my **Live Demo** page. Here we go:\n\n[view-source:https://duzun.me/playground/hquery](https://github.com/duzun/hQuery.php/blob/master/examples/duzun.me_playground_hquery.php)\n\n### 🏃 Run the playground\n\nYou can easily run any of the `examples/` on your local machine.\nAll you need is PHP installed on your system.\nAfter you clone the repo with `git clone https://github.com/duzun/hQuery.php.git`,\nyou have several options to start a web server.\n\n###### Option 1:\n\n```sh\ncd hQuery.php/examples\nphp -S localhost:8000\n\n# open browser http://localhost:8000/\n```\n\n###### Option 2 (browser-sync):\n\nThis option starts a live-reload server and is good for playing with the code.\n\n```sh\nnpm install\ngulp\n\n# open browser http://localhost:8080/\n```\n\n###### Option 3 (VSCode):\n\nIf you are using VSCode, simply open the project and run the debugger (`F5`).\n\n## 🔧 TODO\n\n- Unit test everything\n- Document everything\n- ~~Cookie support~~ (implemented in memory for redirects)\n- ~~Improve selectors to be able to select by attributes~~\n- Add more selectors\n- Use [HTTPlug](http://httplug.io/) internally\n\n## 💖 Support my projects\n\nI love Open Source. 
Whenever possible I share cool things with the world (check out [NPM](https://duzun.me/npm) and [GitHub](https://github.com/duzun/)).\n\nIf you like what I'm doing and this project helps you reduce development time, please consider:\n\n- ★ Star and Share the projects you like (and use)\n- ☕ Give me a cup of coffee - [PayPal.me/duzuns](https://www.paypal.me/duzuns) (contact at duzun.me)\n- ₿ Send me some **Bitcoin** at this address: `bitcoin:3MVaNQocuyRUzUNsTbmzQC8rPUQMC9qafa` (or use the QR code below)\n![bitcoin:3MVaNQocuyRUzUNsTbmzQC8rPUQMC9qafa](https://cdn.duzun.me/files/qr_bitcoin-3MVaNQocuyRUzUNsTbmzQC8rPUQMC9qafa.png)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fduzun%2Fhquery.php","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fduzun%2Fhquery.php","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fduzun%2Fhquery.php/lists"}