{"id":18616000,"url":"https://github.com/bjoern-hempel/php-web-crawler","last_synced_at":"2025-04-11T01:31:36.636Z","repository":{"id":110076776,"uuid":"124810205","full_name":"bjoern-hempel/php-web-crawler","owner":"bjoern-hempel","description":"A php class that crawls a given url and collects recursively some data from it. The final representation will be a json object.","archived":false,"fork":false,"pushed_at":"2024-02-24T14:20:37.000Z","size":228,"stargazers_count":9,"open_issues_count":0,"forks_count":1,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-03-25T07:12:30.572Z","etag":null,"topics":["crawler","mit-license","php","recursive","webcrawler","webscraper","xpath"],"latest_commit_sha":null,"homepage":"","language":"PHP","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/bjoern-hempel.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.md","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-03-11T23:43:52.000Z","updated_at":"2024-02-24T14:20:41.000Z","dependencies_parsed_at":"2024-11-07T03:37:23.074Z","dependency_job_id":"e1523016-c130-43ae-a077-bca1b006e338","html_url":"https://github.com/bjoern-hempel/php-web-crawler","commit_stats":null,"previous_names":[],"tags_count":6,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bjoern-hempel%2Fphp-web-crawler","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bjoern-hempel%2Fphp-web-crawler/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bjoern-hempel%2Fphp-web-crawler/releases","manifest
s_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bjoern-hempel%2Fphp-web-crawler/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/bjoern-hempel","download_url":"https://codeload.github.com/bjoern-hempel/php-web-crawler/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248325185,"owners_count":21084882,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["crawler","mit-license","php","recursive","webcrawler","webscraper","xpath"],"created_at":"2024-11-07T03:33:45.785Z","updated_at":"2025-04-11T01:31:31.628Z","avatar_url":"https://github.com/bjoern-hempel.png","language":"PHP","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003e **Attention**: This package is outdated and is no longer maintained. Use [PHP Web Crawler](https://github.com/ixnode/php-web-crawler) instead.\n\n# WebCrawler\n\nThis php class allows you to crawl recursively a given webpage (or a given html file) and collect some data from it. Simply define the url (or a html file) and a set of xpath expressions which should map with the output data object. The final representation will be a php array which can be easily converted into the json format for further processing.\n\n## 0. Introduction\n\n### 0.1 Installation\n\n```bash\nuser$ git clone git@github.com:bjoern-hempel/php-web-crawler.git .\n```\n\n### 0.2 requirements\n\nTODO...\n\n## 1. 
Execute the examples\n\n```bash\nuser$ php examples/simple.php \n{\n    \"version\": \"1.0.0\",\n    \"title\": \"Test Title\",\n    \"paragraph\": \"Test Paragraph\"\n}\n```\n\n## 2. How to use\n\n### 2.1 Basic usage [simple.php](examples/simple.php) (simple html page)\n\n[basic.html](examples/html/basic.html)\n\n```html\n\u003chtml\u003e\n    \u003chead\u003e\n        \u003ctitle\u003eTest Page\u003c/title\u003e\n    \u003c/head\u003e\n    \u003cbody\u003e\n        \u003ch1\u003eTest Title\u003c/h1\u003e\n        \u003cp\u003eTest Paragraph\u003c/p\u003e\n    \u003c/body\u003e\n\u003c/html\u003e\n```\n\n[simple.php](examples/simple.php)\n\n```php5\n\u003c?php\n\ninclude dirname(__FILE__).'/../autoload.php';\n\nuse Ixno\\WebCrawler\\Output\\Field;\nuse Ixno\\WebCrawler\\Value\\Text;\nuse Ixno\\WebCrawler\\Value\\XpathTextnode;\nuse Ixno\\WebCrawler\\Source\\File;\n\n$file = dirname(__FILE__).'/html/basic.html';\n\n$html = new File(\n    $file,\n    new Field('version', new Text('1.0.0')),\n    new Field('title', new XpathTextnode('//h1')),\n    new Field('paragraph', new XpathTextnode('//p'))\n);\n\n$data = json_encode($html-\u003eparse(), JSON_PRETTY_PRINT);\n\nprint_r($data);\n\necho \"\\n\";\n```\n\nIt returns:\n\n```json\n{\n    \"version\": \"1.0.0\",\n    \"title\": \"Test Title\",\n    \"paragraph\": \"Test Paragraph\"\n}\n```\n\n#### 2.2 More examples\n\n* [examples/simple-wiki-page.php](examples/simple-wiki-page.php)\n* [examples/group.php](examples/group.php)\n* [examples/section.php](examples/section.php)\n* [examples/sections.php](examples/sections.php)\n* [examples/url.php](examples/url.php)\n\n\n### 2.2 Complex examples\n\nTODO...\n\n### 2.3 Converter\n\nTODO...\n\n### 2.4 Filters\n\nTODO...\n\n## 3. Running the tests\n\n```bash\nuser$ phpunit tests/Basic.php \nPHPUnit 7.0.2 by Sebastian Bergmann and contributors.\n\n..                                                                  
2 / 2 (100%)\n\nTime: 126 ms, Memory: 8.00MB\n\nOK (2 tests, 16 assertions)\n```\n\n## Using `composer`'s autoload (manual installation)\n\nWith `composer`'s autoloader you can use these classes without including the source files yourself.\n\nMake some changes to your `composer.json`:\n\n```javascript\n\"autoload\": {\n    \"psr-0\": {\n        ...\n        \"Ixno\\\\WebCrawler\\\\\":\"vendor/ixno/webcrawler/\",\n        ...\n    }\n},\n```\n\nAdd this project to your `vendor` directory:\n\n```bash\nuser$ cd /path/to/root/of/project\nuser$ mkdir -p vendor/ixno/webcrawler \u0026\u0026 cd vendor/ixno/webcrawler\nuser$ git clone git@github.com:bjoern-hempel/php-web-crawler.git . \u0026\u0026 cd ../../..\n```\n\nRun `composer` to generate the autoload mappings:\n\n```bash\nuser$ composer.phar dumpautoload -o\n```\n\nCheck the result:\n\n```bash\nuser$ grep -r Ixno vendor/composer/.\n```\n\nYou will see something like the following lines:\n\n```php\nvendor/composer/./autoload_classmap.php:    'Ixno\\\\WebCrawler\\\\Converter\\\\Converter' =\u003e $vendorDir . '/ixno/webcrawler/lib/Ixno/WebCrawler/Converter/Converter.php',\nvendor/composer/./autoload_classmap.php:    'Ixno\\\\WebCrawler\\\\Converter\\\\DateParser' =\u003e $vendorDir . '/ixno/webcrawler/lib/Ixno/WebCrawler/Converter/DateParser.php',\nvendor/composer/./autoload_classmap.php:    'Ixno\\\\WebCrawler\\\\CrawlRule' =\u003e $vendorDir . '/ixno/webcrawler/lib/Ixno/WebCrawler/Crawler.php',\nvendor/composer/./autoload_classmap.php:    'Ixno\\\\WebCrawler\\\\Crawler' =\u003e $vendorDir . '/ixno/webcrawler/lib/Ixno/WebCrawler/Crawler.php',\nvendor/composer/./autoload_classmap.php:    'Ixno\\\\WebCrawler\\\\Page' =\u003e $vendorDir . '/ixno/webcrawler/lib/Ixno/WebCrawler/Crawler.php',\nvendor/composer/./autoload_classmap.php:    'Ixno\\\\WebCrawler\\\\PageGroup' =\u003e $vendorDir . 
'/ixno/webcrawler/lib/Ixno/WebCrawler/Crawler.php',\nvendor/composer/./autoload_classmap.php:    'Ixno\\\\WebCrawler\\\\PageList' =\u003e $vendorDir . '/ixno/webcrawler/lib/Ixno/WebCrawler/Crawler.php',\n...\n```\n\nNow you can simply use all classes without including the source files.\n\n## A. Authors\n\n* **Björn Hempel** - *Initial work* - [Björn Hempel](https://github.com/bjoern-hempel)\n\n## B. License\n\nThis project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details\n\n## C. Closing words\n\nHave fun! :)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbjoern-hempel%2Fphp-web-crawler","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbjoern-hempel%2Fphp-web-crawler","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbjoern-hempel%2Fphp-web-crawler/lists"}