{"id":15025283,"url":"https://github.com/mcstreetguy/crawler","last_synced_at":"2025-04-09T20:04:05.657Z","repository":{"id":62526201,"uuid":"177111827","full_name":"MCStreetguy/Crawler","owner":"MCStreetguy","description":"An advanced web-crawler written in PHP.","archived":false,"fork":false,"pushed_at":"2019-04-05T14:05:48.000Z","size":229,"stargazers_count":5,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-04-09T20:04:02.022Z","etag":null,"topics":["composer","composer-library","crawler","crawler-engine","guzzle","http-requests","php","php-7","php-library","web-crawler","webcrawler"],"latest_commit_sha":null,"homepage":null,"language":"PHP","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/MCStreetguy.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2019-03-22T09:33:34.000Z","updated_at":"2025-04-08T17:12:09.000Z","dependencies_parsed_at":"2022-11-02T15:31:41.761Z","dependency_job_id":null,"html_url":"https://github.com/MCStreetguy/Crawler","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MCStreetguy%2FCrawler","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MCStreetguy%2FCrawler/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MCStreetguy%2FCrawler/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MCStreetguy%2FCrawler/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/MCStreetguy","download_url":"https://codeload.github.com/MCStreetguy/Crawler/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248103865,"owners_count":21048245,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["composer","composer-library","crawler","crawler-engine","guzzle","http-requests","php","php-7","php-library","web-crawler","webcrawler"],"created_at":"2024-09-24T20:01:58.045Z","updated_at":"2025-04-09T20:04:05.620Z","avatar_url":"https://github.com/MCStreetguy.png","language":"PHP","readme":"# MCStreetguy/Crawler\n\n**A highly configurable, modern web crawler for PHP.**\n\nThis library provides a very dynamic environment for all kinds of tasks based on recursive browsing of web pages.\nInternally, [Guzzle](http://guzzlephp.org) is used to send requests, and [paquettg's HTML parser](https://github.com/paquettg/php-html-parser) to search server responses for follow-up links.  \nThe rest of the crawl process is entirely up to you. Apart from the initial configuration of the crawler, this library relies solely on user-defined classes, for example, to check whether a link should be crawled at all or to process the server's response.\nThese classes are only very roughly pre-defined by interfaces, but usually these only require one function that is used to invoke them.\nThe crawler does not care about the inner workings of these classes, in case of a processor it doesn't even require any return value.  \nDue to this conception the library should be able to be integrated almost seamlessly into most frameworks.\n\n**TOC:**\n- [Installation](#installation)\n- [Getting Started](#getting-started)\n- [Reference](#reference)\n  - [Configuration](#configuration)\n  - [Validators](#validators)\n\n## Installation\n\nRequire the library through Composer:\n\n```\n$ composer require mcstreetguy/crawler\n```\n\n## Getting Started\n\n**Note:** _This example only covers the very basic requirements to get started. Have a look at the full documentation for more information if you wish._\n\n-----------\n\nFirst, you need an instance of the crawler class.\nThis is the \"root node\" of the library since most interaction takes place with an object of this class.\n\n``` php\n$crawler = new \\MCStreetguy\\Crawler\\Crawler();\n```\n\nThis already suffices to start crawling a webpage, but the crawler would not do anything at the moment.\nWe still need a processor to do something with the received responses.\n\n### Processor\n\nThe corresponding interface is not very complex and only defines one method.\nFor now let's just create one that simply echoes out each found url:\n\n``` php\nuse MCStreetguy\\Crawler\\Processing\\ProcessorInterface;\n\nclass DebugProcessor implements ProcessorInterface\n{\n    public function invoke(\\MCStreetguy\\Crawler\\Result\\CrawlResult $result)\n    {\n        echo 'Crawled: ' . $result-\u003egetUri() . PHP_EOL;\n    }\n}\n```\n\nThe `invoke` method receives an instance of `CrawlResult`, which holds several data concerning the crawled page.\nThis includes for example it's uri, the server response and further links found on that page.\n\nNow if we add an object of our new class to the crawler, we can already execute it against some url:\n\n``` php\n$crawler = new \\MCStreetguy\\Crawler\\Crawler();\n$crawler-\u003eaddProcessor(new DebugProcessor);\n$crawler-\u003eexecute('http://example.com/');\n```\n\n### Testing\n\nCopying the above parts together into a script file and executing it on the command line now produces the following output:\n\n\u003e machine:~ testuser$ php test.php  \n\u003e Crawled: http://example.com/  \n\u003e Crawled: http://www.iana.org/domains/example  \n\u003e Crawled: http://www.iana.org/_css/2015.1/screen.css  \n\u003e Crawled: http://www.iana.org/_css/2015.1/print.css  \n\u003e Crawled: http://www.iana.org/_img/bookmark_icon.ico  \n\u003e Crawled: http://www.iana.org/  \n\u003e Crawled: http://www.iana.org/domains  \n\u003e Crawled: http://www.iana.org/numbers  \n\u003e ^C  \n\u003e machine:~ testuser$  \n\n_Wait! Why is this even working?_\n\nWell, `example.com` is actually an existing website and it contains exactly one link.\nThat link leads directly to the Internet Assigned Numbers Authority (IANA), explaining the purpose of the example page.\nSo we can say for sure that our small test succeeded and the crawler works, as it reached `example.com` and found the link on it.\nBut is that intentional behavior? Not necessarily, but we have a solution for that, too.\n\n### Validation\n\nTo prevent our crawler from happily jumping across webpages and discovering the whole internet we need another custom class: a validator.\nA validator works nearly the same as a processor, but it's invoked far earlier in the process loop.\nIt get's the pending uri handed over as argument and is expected to return a boolean value, indicating if the uri shall be crawled.  \nYou may define as many validators as you like so you can split complex decisions up in several parts, but keep in mind that this works like a blacklist (i.e. if one validator returns false, the uri is dropped immediately).\n\n``` php\nuse MCStreetguy\\Crawler\\Processing\\Validation\\ValidatorInterface;\n\nclass DebugValidator implements ValidatorInterface\n{\n    protected $baseUri;\n\n    public function __construct(string $baseUri)\n    {\n        $this-\u003ebaseUri = $baseUri;\n    }\n\n    public function isValid(\\Psr\\Http\\Message\\UriInterface $target)\n    {\n        return (substr_compare((string) $target, $this-\u003ebaseUri, 0, strlen($this-\u003ebaseUri)) === 0);\n    }\n}\n```\n\nIf we now add this validator to our crawler before invocation, it should stop immediately after processing the first page as the link on it leads to another domain.\n\n``` php\n$crawler-\u003eaddValidator(new DebugValidator('http://example.com/'));\n```\n\n\u003e machine:~ testuser$ php test.php  \n\u003e Crawled: http://example.com/  \n\u003e machine:~ testuser$  \n\n### Summary\n\nThe basic usage of this library shall be clear at this point.\nHave a close look on the documentation, the source code and the `Tests/` folder for more information and some more advanced examples.\n\n## Reference\n\n### Configuration\n\nThe crawler can be configured through a configuration class.\n\n(_to be written_)\n\n### Validators\n\nThis package ships with several predefined validators that are commonly used.\n\n#### `RobotsTxtValidator`\n\nThis validator loads the `robots.txt` from the server to be accessed and matches received uris against these restrictions.\nIt returns false only if access to the uri is forbidden by these restrictions.\nIf no `robots.txt`-file could be loaded all uris are considered valid.\n\nSee http://www.robotstxt.org/ for more information on `robots.txt`-files.\n\n##### Usage\n\nThis validator is not meant to be used directly.\nInstead enable it by setting the `$ignoreRobots` property in your configuration.  \n(see the section above for more information)\n\n#### `DomainWhitelistValidator`\n\nThis validator only allows uris, that are on exactly the same domain as the base uri used to start the crawl.\n\n##### Example\n\nIf `http://www.example.com/` was our base uri to crawl, the following uris would be considered valid:\n\n- `http://www.example.com/some-page.html`\n- `http://www.example.com/assets/img/my-cool-image.jpg`\n- `https://www.example.com/`\n- `https://www.example.com/#section`\n- `https://www.example.com/?q=somequery`\n\nCounterwise to the before examples, the following will be considered invalid:\n\n- `http://example.com/`\n- `http://subdomain.example.com/`\n- `http://www.google.com/`\n\n##### Usage\n\nTo make use of this validator, create an instance of it and add it to the crawler:\n\n``` php\nuse MCStreetguy\\Crawler\\Processing\\Validation\\Core\\DomainWhitelistValidator;\n\n$baseUri = 'http://www.example.com/';\n$domainValidator = new DomainWhitelistValidator($baseUri);\n\n$crawler-\u003eaddValidator($domainValidator);\n```\n\n#### `SubDomainWhitelistValidator`\n\nThis validator only allows uris, that are on exactly the same domain as the base uri used to start the crawl, or at least on a subdomain of it.\n\n**Please note** that the `www.` preceding most urls is considered not to be part of the domain (in contrast to RFC regulations).\nThis is due to the fact, that most pages use it as a scheme-designator, even though it's actually a subdomain.\nIt get's removed from the base uri for host comparison, if it is present.  \nThis has no reductive effect on uri filtering, instead it enhances the probability that the crawler validates all subdomains properly.\n\n##### Example\n\nIf `http://www.example.com/` was our base uri to crawl, the following uris would be considered valid:\n\n- `http://www.example.com/some-page.html`\n- `http://www.example.com/assets/img/my-cool-image.jpg`\n- `https://www.example.com/`\n- `https://www.example.com/#section`\n- `https://www.example.com/?q=somequery`\n- `https://subdomain.www.example.com/`\n- `https://another.subdomain.www.example.com/`\n- `https://sub.www.example.com/my/path`\n\nCounterwise to the before examples, the following will be considered invalid:\n\n- `http://example.com/`\n- `http://subdomain.example.com/`\n- `http://www.subdomain.example.com/`\n- `http://www.google.com/`\n\n##### Usage\n\nTo make use of this validator, create an instance of it and add it to the crawler (as with the `DomainWhitelistValidator` before):\n\n``` php\nuse MCStreetguy\\Crawler\\Processing\\Validation\\Core\\SubDomainWhitelistValidator;\n\n$baseUri = 'http://www.example.com/';\n$subdomainValidator = new SubDomainWhitelistValidator($baseUri);\n\n$crawler-\u003eaddValidator($subdomainValidator);\n```\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmcstreetguy%2Fcrawler","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmcstreetguy%2Fcrawler","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmcstreetguy%2Fcrawler/lists"}