{"id":13464688,"url":"https://github.com/crawlzone/crawlzone","last_synced_at":"2026-01-11T17:59:31.424Z","repository":{"id":37587677,"uuid":"158474447","full_name":"crawlzone/crawlzone","owner":"crawlzone","description":"Crawlzone is a fast asynchronous internet crawling framework for PHP. ","archived":false,"fork":false,"pushed_at":"2023-04-19T19:06:22.000Z","size":295,"stargazers_count":78,"open_issues_count":9,"forks_count":10,"subscribers_count":11,"default_branch":"master","last_synced_at":"2025-02-24T10:19:50.016Z","etag":null,"topics":["automated-testing","crawler","crawling-framework","middleware","php","web-scraping","web-search"],"latest_commit_sha":null,"homepage":"","language":"PHP","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/crawlzone.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2018-11-21T01:40:22.000Z","updated_at":"2024-12-05T23:44:48.000Z","dependencies_parsed_at":"2024-01-18T19:11:35.632Z","dependency_job_id":null,"html_url":"https://github.com/crawlzone/crawlzone","commit_stats":{"total_commits":34,"total_committers":2,"mean_commits":17.0,"dds":"0.20588235294117652","last_synced_commit":"5b264d58054ddcd061cf2acdc1e862fa29541279"},"previous_names":[],"tags_count":8,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/crawlzone%2Fcrawlzone","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/crawlzone%2Fcrawlzone/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/crawlzone%2Fcrawlzone/releases","manifests_url":"https://repos.ecosyste.ms/
api/v1/hosts/GitHub/repositories/crawlzone%2Fcrawlzone/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/crawlzone","download_url":"https://codeload.github.com/crawlzone/crawlzone/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":245454118,"owners_count":20617982,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["automated-testing","crawler","crawling-framework","middleware","php","web-scraping","web-search"],"created_at":"2024-07-31T14:00:48.672Z","updated_at":"2026-01-11T17:59:31.415Z","avatar_url":"https://github.com/crawlzone.png","language":"PHP","readme":"[![Build Status](https://travis-ci.org/crawlzone/crawlzone.svg?branch=master)](https://travis-ci.org/crawlzone/crawlzone)\n[![Coverage Status](https://coveralls.io/repos/github/crawlzone/crawlzone/badge.svg?branch=master)](https://coveralls.io/github/crawlzone/crawlzone?branch=master)\n\n# Overview\n\nCrawlZone is a fast asynchronous internet crawling framework aiming to provide an open source web scraping and testing solution. It can be used for a wide range of purposes, from extracting and indexing structured data to monitoring and automated testing. 
Available for PHP 7.4, 8.0, 8.1.\n\n## Installation\n\n`composer require crawlzone/crawlzone`\n\n## Key Features\n\n- Asynchronous crawling with customizable concurrency.\n- Automatic throttling of the crawling speed based on the load of the website being crawled.\n- If configured, automatically filters out requests forbidden by the `robots.txt` exclusion standard.\n- A straightforward middleware system that allows you to append headers, extract data, filter, or plug in any custom functionality to process the request and response.\n- Rich filtering capabilities.\n- Ability to set the crawling depth.\n- Easy to extend the core by hooking into the crawling process using events.\n- Shut down the crawler at any time and start over without losing progress.\n\n## Architecture\n\n![Architecture](https://github.com/crawlzone/crawlzone/blob/master/resources/Web%20Crawler%20Architecture.svg)\n\nHere is what happens for a single request when you run the client:\n\n1. The client queues the initial request (start_uri).\n2. The engine looks at the queue and checks if there are any requests.\n3. The engine gets the request from the queue and emits the `BeforeRequestSent` event. If the depth option is set in the config, then the `RequestDepth` extension validates the depth of the request. If the obey robots.txt option is set in the config, then the `RobotTxt` extension checks if the request complies with the rules. If the request doesn't comply, the engine emits the `RequestFailed` event and gets the next request from the queue.\n4. The engine uses the request middleware stack to pass the request through it.\n5. The engine sends an asynchronous request using the Guzzle HTTP client.\n6. The engine emits the `AfterRequestSent` event and stores the request in the history to avoid crawling the same request again.\n7. When response headers are received, but the body has not yet begun to download, the engine emits the `ResponseHeadersReceived` event.\n8. 
The engine emits the `TransferStatisticReceived` event. If the autothrottle option is set in the config, then the `AutoThrottle` extension is executed.\n9. The engine uses the response middleware stack to pass the response through it.\n10. The engine emits the `ResponseReceived` event. Additionally, if the response status code is greater than or equal to 400, the engine emits the `RequestFailed` event.\n11. The `ResponseReceived` event triggers the `ExtractAndQueueLinks` extension, which extracts and queues the links. The process starts over until the queue is empty.\n\n\n## Quick Start\n```php\n\u003c?php\n\nuse Psr\\Http\\Message\\RequestInterface;\nuse Psr\\Http\\Message\\ResponseInterface;\nuse Crawlzone\\Middleware\\BaseMiddleware;\nuse Crawlzone\\Client;\nuse Crawlzone\\Middleware\\ResponseMiddleware;\n\nrequire_once __DIR__ . '/../vendor/autoload.php';\n\n$config = [\n    'start_uri' =\u003e ['https://httpbin.org/'],\n    'concurrency' =\u003e 3,\n    'filter' =\u003e [\n        // A list of strings containing domains which will be considered for extracting the links.\n        'allow_domains' =\u003e ['httpbin.org'],\n        // A list of regular expressions that the URLs must match in order to be extracted.\n        'allow' =\u003e ['/get','/ip','/anything']\n    ]\n];\n\n$client = new Client($config);\n\n$client-\u003eaddResponseMiddleware(\n    new class implements ResponseMiddleware {\n        public function processResponse(ResponseInterface $response, RequestInterface $request): ResponseInterface\n        {\n            printf(\"Process Response: %s %s \\n\", $request-\u003egetUri(), $response-\u003egetStatusCode());\n\n            return $response;\n        }\n    }\n);\n\n$client-\u003erun();\n```\n\n## Middlewares\n\nMiddleware can be written to perform a variety of tasks, including authentication, filtering, headers, logging, etc.\nTo create a middleware, simply implement `Crawlzone\\Middleware\\RequestMiddleware` or `Crawlzone\\Middleware\\ResponseMiddleware` 
and\nthen add it to a client:\n\n\n```php\n...\n\n$config = [\n    'start_uri' =\u003e ['https://httpbin.org/ip']\n];\n\n$client = new Client($config);\n\n$client-\u003eaddRequestMiddleware(\n    new class implements RequestMiddleware {\n        public function processRequest(RequestInterface $request): RequestInterface\n        {\n            printf(\"Middleware 1 Request: %s \\n\", $request-\u003egetUri());\n            return $request;\n        }\n    }\n);\n\n$client-\u003eaddResponseMiddleware(\n    new class implements ResponseMiddleware {\n        public function processResponse(ResponseInterface $response, RequestInterface $request): ResponseInterface\n        {\n            printf(\"Middleware 2 Response: %s %s \\n\", $request-\u003egetUri(), $response-\u003egetStatusCode());\n            return $response;\n        }\n    }\n);\n\n$client-\u003erun();\n\n/*\nOutput:\nMiddleware 1 Request: https://httpbin.org/ip\nMiddleware 2 Response: https://httpbin.org/ip 200\n*/\n\n```\n\nTo skip the request, you can `throw new \\Crawlzone\\Exception\\InvalidRequestException` from any middleware. \nThe scheduler will catch the exception, notify all subscribers, and ignore the request.  \n\n## Processing server errors\n\nYou can use middlewares to handle 4xx or 5xx responses.\n\n```php\n...\n$config = [\n    'start_uri' =\u003e ['https://httpbin.org/status/500','https://httpbin.org/status/404'],\n    'concurrency' =\u003e 1,\n];\n\n$client = new Client($config);\n\n$client-\u003eaddResponseMiddleware(\n    new class implements ResponseMiddleware {\n        public function processResponse(ResponseInterface $response, RequestInterface $request): ResponseInterface\n        {\n            printf(\"Process Failure: %s %s \\n\", $request-\u003egetUri(), $response-\u003egetStatusCode());\n\n            return $response;\n        }\n    }\n);\n\n$client-\u003erun();\n```\n\n## Filtering\n\nUse regular expressions to allow or deny specific links. 
You can also pass arrays of allowed or denied domains. \nUse the `robotstxt_obey` option to filter out requests forbidden by the `robots.txt` exclusion standard.\n\n```php\n\n...\n$config = [\n    'start_uri' =\u003e ['http://site.local/'],\n    'concurrency' =\u003e 1,\n    'filter' =\u003e [\n        'robotstxt_obey' =\u003e true,\n        'allow' =\u003e ['/page\\d+','/otherpage'],\n        'deny' =\u003e ['/logout'],\n        'allow_domains' =\u003e ['site.local'],\n        'deny_domains' =\u003e ['othersite.local'],\n    ]\n];\n$client = new Client($config);\n\n```\n\n## Autothrottle\n\nAutothrottle is enabled by default (use `autothrottle.enabled =\u003e false` to disable). It automatically adjusts the scheduler to the optimum crawling speed, trying to be nicer to the sites you crawl.\n\n\n**Throttling algorithm**\n\nThe AutoThrottle algorithm adjusts download delays based on the following rules:\n\n1. When a response is received, the target download delay is calculated as `latency / N`, where `latency` is the latency of the response and `N` is the concurrency.\n2. The delay for the next requests is set to the average of the previous delay and the current delay.\n3. Latencies of non-200 responses are not allowed to decrease the delay.\n4. The delay can’t become less than `min_delay` or greater than `max_delay`.\n\n\n```php\n\n...\n$config = [\n    'start_uri' =\u003e ['http://site.local/'],\n    'concurrency' =\u003e 3,\n    'autothrottle' =\u003e [\n        'enabled' =\u003e true,\n        'min_delay' =\u003e 0, // Sets the minimum delay between the requests (default 0).\n        'max_delay' =\u003e 60, // Sets the maximum delay between the requests (default 60).\n    ]\n];\n\n$client = new Client($config);\n...\n\n```\n\n## Extension\n\nExtensions are essentially event listeners based on the Symfony Event Dispatcher component.\nTo create an extension, simply extend `Crawlzone\\Extension\\Extension` and add it to a client. 
All extensions have access to a \n`Crawlzone\\Config\\Config` and `Crawlzone\\Session` object, which holds `GuzzleHttp\\Client`. This might be helpful if you want to \nmake some additional requests or reuse cookie headers for authentication.\n\n```php\n...\n\nuse GuzzleHttp\\Psr7\\Request;\nuse Psr\\Http\\Message\\RequestInterface;\nuse Psr\\Http\\Message\\ResponseInterface;\nuse Crawlzone\\Client;\nuse Crawlzone\\Event\\BeforeEngineStarted;\nuse Crawlzone\\Extension\\Extension;\nuse Crawlzone\\Middleware\\ResponseMiddleware;\n\n$config = [\n    'start_uri' =\u003e ['http://site.local/admin/']\n];\n\n$client = new Client($config);\n\n$loginUri = 'http://site.local/admin/';\n$username = 'test';\n$password = 'password';\n\n$client-\u003eaddExtension(new class($loginUri, $username, $password) extends Extension {\n    private $loginUri;\n    private $username;\n    private $password;\n\n    public function __construct(string $loginUri, string $username, string $password)\n    {\n        $this-\u003eloginUri = $loginUri;\n        $this-\u003eusername = $username;\n        $this-\u003epassword = $password;\n    }\n\n    public function authenticate(BeforeEngineStarted $event): void\n    {\n        $this-\u003elogin($this-\u003eloginUri, $this-\u003eusername, $this-\u003epassword);\n    }\n\n    private function login(string $loginUri, string $username, string $password)\n    {\n        $formParams = ['username' =\u003e $username, 'password' =\u003e $password];\n        $body = http_build_query($formParams, '', '\u0026');\n        $request = new Request('POST', $loginUri, ['content-type' =\u003e 'application/x-www-form-urlencoded'], $body);\n        $this-\u003egetSession()-\u003egetHttpClient()-\u003esendAsync($request)-\u003ewait();\n    }\n\n    public static function getSubscribedEvents(): array\n    {\n        return [\n            BeforeEngineStarted::class =\u003e 'authenticate'\n        ];\n    }\n});\n\n$client-\u003erun();\n\n```\n\n**List of supported events 
`Crawlzone\\Event`:**\n\n| Event                     | When?                                         |\n| ------------------------- | --------------------------------------------- |\n| BeforeEngineStarted       | Right before the engine starts crawling       |\n| BeforeRequestSent         | Before the request is scheduled to be sent    |\n| AfterRequestSent          | After the request is scheduled                |\n| TransferStatisticReceived | When a handler has finished sending a request. Allows you to access the transfer statistics of a request and the lower-level transfer details. |\n| ResponseHeadersReceived   | When the HTTP headers of the response have been received but the body has not yet begun to download. Useful if you want to reject responses that are greater than a certain size, for example. |\n| RequestFailed             | When the request fails or the `InvalidRequestException` exception has been thrown from a middleware. |\n| ResponseReceived          | When the response is received                 |\n| AfterEngineStopped        | After the engine has stopped crawling         |\n\n\n## Command Line Tool\n\nYou can use a simple command line tool to crawl your site quickly.\nFirst, create a configuration file:\n\n```bash\n./crawler init \n\n```\n\nThen configure `crawler.yml` and run the crawler with the command:\n\n```bash\n./crawler start --config=./crawler.yml \n\n```\nTo get more details about requests and responses, use the `-vvv` option:\n\n```bash\n./crawler start --config=./crawler.yml -vvv \n\n```\n\n## Configuration\n\n```php\n\n$fullConfig = [\n    // A list of URIs to crawl. Required parameter. \n    'start_uri' =\u003e ['http://test.com', 'http://test1.com'],\n    \n    // The number of concurrent requests. Default is 10.\n    'concurrency' =\u003e 10,\n    \n    // The maximum depth that will be allowed to crawl (Minimum 1, unlimited if not set). 
\n    'depth' =\u003e 1,\n    \n    // The path to a local file where the progress will be stored. Use \"memory\" to store the progress in memory (default behavior).\n    // The crawler uses an SQLite database to store the progress.\n    'save_progress_in' =\u003e '/path/to/my/sqlite.db',\n    \n    'filter' =\u003e [\n        // If enabled, the crawler will respect robots.txt policies. Default is false.\n        'robotstxt_obey' =\u003e false,\n        \n        // A list of regular expressions that the URLs must match in order to be extracted. If not given (or empty), it will match all links.\n        'allow' =\u003e ['test','test1'],\n        \n        // A list of strings containing domains which will be considered for extracting the links.\n        'allow_domains' =\u003e ['test.com','test1.com'],\n        \n        // A list of strings containing domains which won’t be considered for extracting the links. It has precedence over the allow_domains parameter.\n        'deny_domains' =\u003e ['test2.com','test3.com'],\n        \n        // A list of regular expressions that the URLs must match in order to be excluded (i.e. not extracted). 
It has precedence over the allow parameter.\n        'deny' =\u003e ['test2','test3'],\n    ],\n    // The crawler uses the Guzzle HTTP client, so most of the Guzzle request options are supported.\n    // For more info go to http://docs.guzzlephp.org/en/stable/request-options.html\n    'request_options' =\u003e [\n        // Describes the SSL certificate verification behavior of a request.\n        'verify' =\u003e false,\n        \n        // Specifies whether or not cookies are used in a request or what cookie jar to use or what cookies to send.\n        'cookies' =\u003e CookieJar::fromArray(['name' =\u003e 'test', 'value' =\u003e 'test-value'],'localhost'),\n        \n        // Describes the redirect behavior of a request.\n        'allow_redirects' =\u003e false,\n        \n        // Set to true to enable debug output with the handler used to send a request.\n        'debug' =\u003e true,\n        \n        // Float describing the number of seconds to wait while trying to connect to a server. Use 0 to wait indefinitely (the default behavior).\n        'connect_timeout' =\u003e 0,\n        \n        // Float describing the timeout of the request in seconds. Use 0 to wait indefinitely (the default behavior).\n        'timeout' =\u003e 0,\n        \n        // Float describing the timeout to use when reading a streamed body. Defaults to the value of the default_socket_timeout PHP ini setting.\n        'read_timeout' =\u003e 60,\n        \n        // Specifies whether or not Content-Encoding responses (gzip, deflate, etc.) 
are automatically decoded.\n        'decode_content' =\u003e true,\n        \n        // Set to \"v4\" if you want the HTTP handlers to use only the IPv4 protocol, or \"v6\" for IPv6.\n        'force_ip_resolve' =\u003e null,\n        \n        // Pass an array to specify different proxies for different protocols.\n        'proxy' =\u003e [\n            'http'  =\u003e 'tcp://localhost:8125', // Use this proxy with \"http\"\n            'https' =\u003e 'tcp://localhost:9124', // Use this proxy with \"https\"\n            'no' =\u003e ['.mit.edu', 'foo.com']    // Don't use a proxy with these\n        ],\n        \n        // Set to true to stream a response rather than download it all up-front.\n        'stream' =\u003e false,\n        \n        // Protocol version to use with the request.\n        'version' =\u003e '1.1',\n        \n        // Set to a string or an array to specify the path to a file containing a PEM formatted client-side certificate and password.\n        'cert' =\u003e '/path/server.pem',\n        \n        // Specify the path to a file containing a private SSL key in PEM format.\n        'ssl_key' =\u003e ['/path/key.pem', 'password']\n    ],\n    \n    'autothrottle' =\u003e [\n        // Enables the autothrottle extension. Default is true.\n        'enabled' =\u003e true,\n        \n        // Sets the minimum delay between the requests.\n        'min_delay' =\u003e 0,\n        \n        // Sets the maximum delay between the requests.\n        'max_delay' =\u003e 60\n    ]\n];\n\n```\n\n## Thanks for Inspiration\n\nhttps://scrapy.org/\n\nhttp://docs.guzzlephp.org/\n\nIf you feel that this project is helpful, please give it a star or leave some feedback. 
This will help me understand your needs and guide future library updates.\n\n\n\n","funding_links":[],"categories":["All","Crawlers","PHP"],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcrawlzone%2Fcrawlzone","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcrawlzone%2Fcrawlzone","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcrawlzone%2Fcrawlzone/lists"}