{"id":24455711,"url":"https://github.com/baqend/php-spider","last_synced_at":"2025-10-05T16:55:41.477Z","repository":{"id":62491730,"uuid":"125547025","full_name":"Baqend/PHP-Spider","owner":"Baqend","description":"URL spider which crawls a page and all its subpages","archived":false,"fork":false,"pushed_at":"2018-03-16T18:08:24.000Z","size":38,"stargazers_count":6,"open_issues_count":0,"forks_count":0,"subscribers_count":12,"default_branch":"master","last_synced_at":"2025-10-01T17:44:00.757Z","etag":null,"topics":["composer-package","crawler","spider"],"latest_commit_sha":null,"homepage":null,"language":"PHP","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Baqend.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2018-03-16T17:11:23.000Z","updated_at":"2020-01-20T18:36:27.000Z","dependencies_parsed_at":"2022-11-02T09:31:34.071Z","dependency_job_id":null,"html_url":"https://github.com/Baqend/PHP-Spider","commit_stats":null,"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/Baqend/PHP-Spider","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Baqend%2FPHP-Spider","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Baqend%2FPHP-Spider/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Baqend%2FPHP-Spider/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Baqend%2FPHP-Spider/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Baqend","download_url":"https://codeload.github.com/Baqend/PHP-Spider/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosy
ste.ms/api/v1/hosts/GitHub/repositories/Baqend%2FPHP-Spider/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":278486278,"owners_count":25994941,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-05T02:00:06.059Z","response_time":54,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["composer-package","crawler","spider"],"created_at":"2025-01-21T02:14:12.513Z","updated_at":"2025-10-05T16:55:41.433Z","avatar_url":"https://github.com/Baqend.png","language":"PHP","funding_links":[],"categories":[],"sub_categories":[],"readme":"PHP Spider\n==========\n_URL spider which crawls a page and all its subpages_\n\n* [Installation](#installation)\n* [Usage](#usage)\n* [Processors](#processors)\n* [URL Handlers](#url-handlers)\n* [Alternatives](#alternatives)\n\nInstallation\n------------\n\nMake sure you have [Composer] installed. Then execute:\n\n    composer require baqend/spider\n    \nThis package requires at least **PHP 5.5.9** and has **no package dependencies!**\n\n\nUsage\n-----\n\nThe entry point is the `Spider` class. For it to work, it requires the following services:\n\n* **Queue:** Collects URLs to be processed. This package comes with a breadth-first and a depth-first implementation.\n* **URL Handler:** Checks if a URL should be processed. If no URL handler is provided, every URL is processed. 
[More about URL handlers](#url-handlers) \n* **Downloader:** Takes URLs and downloads them. To avoid a dependency on an HTTP client library like [Guzzle], you have to implement this interface yourself.\n* **Processor:** Receives downloaded assets and performs operations on them. [More about Processors](#processors) \n\nYou initialize the spider in the following way:\n\n```php\n\u003c?php\nuse Baqend\\Component\\Spider\\Processor;\nuse Baqend\\Component\\Spider\\Queue\\BreadthQueue;\nuse Baqend\\Component\\Spider\\Spider;\nuse Baqend\\Component\\Spider\\UrlHandler\\BlacklistUrlHandler;\n\n// Use the breadth-first queue\n$queue = new BreadthQueue();\n\n// Implement the DownloaderInterface\n$downloader /* your downloader implementation */;\n\n// Create a URL handler, e.g. the provided blacklist URL handler\n$urlHandler = new BlacklistUrlHandler(['**.php']);\n\n// Create some processors which will be executed one after another\n// More details on the processors below!\n$processor = new Processor\\Processor();\n$processor-\u003eaddProcessor(new Processor\\UrlRewriteProcessor('https://example.org', 'https://example.com/archive'));\n$processor-\u003eaddProcessor($cssProcessor = new Processor\\CssProcessor());\n$processor-\u003eaddProcessor(new Processor\\HtmlProcessor($cssProcessor));\n$processor-\u003eaddProcessor(new Processor\\ReplaceProcessor('https://example.org', 'https://example.com/archive'));\n$processor-\u003eaddProcessor(new Processor\\StoreProcessor('https://example.com/archive', '/tmp/output'));\n\n// Create the spider instance\n$spider = new Spider($queue, $downloader, $urlHandler, $processor);\n\n// Enqueue some URLs\n$spider-\u003equeue('https://example.org/index.html');\n$spider-\u003equeue('https://example.org/news/other-landingpage.html');\n\n// Execute the crawling\n$spider-\u003ecrawl();\n``` \n\n\nProcessors\n----------\n\nThis package comes with the following built-in processors.\n\n### `Processor`\n\nThis is an aggregate processor which allows adding and 
removing other processors which it will execute one after the other.\n\n```php\n\u003c?php\nuse Baqend\\Component\\Spider\\Processor\\Processor;\n\n$processor = new Processor();\n$processor-\u003eaddProcessor($firstProcessor);\n$processor-\u003eaddProcessor($secondProcessor);\n$processor-\u003eaddProcessor($thirdProcessor);\n\n// This will call `process` on $firstProcessor, $secondProcessor, and finally on $thirdProcessor:\n$processor-\u003eprocess($asset, $queue);\n```\n\n### `HtmlProcessor`\n\nThis processor processes HTML assets and enqueues the URLs they contain.\nIt also rewrites all relative URLs as absolute ones.\nIf you provide a [CssProcessor](#cssprocessor), URLs inside `style` attributes are resolved as well.\n \n### `CssProcessor`\n\nThis processor processes CSS assets and enqueues the URLs found in `@import` rules and `url(...)` statements.\n\n### `ReplaceProcessor`\n\nPerforms simple `str_replace` operations on asset contents:\n\n```php\n\u003c?php\nuse Baqend\\Component\\Spider\\Processor\\ReplaceProcessor;\n\n$processor = new ReplaceProcessor('Hello World', 'Hallo Welt');\n\n// This will replace all occurrences of\n// \"Hello World\" in the asset with \"Hallo Welt\":\n$processor-\u003eprocess($asset, $queue);\n```\n\nThe `ReplaceProcessor` does not enqueue other URLs.\n\n### `StoreProcessor`\n\nTakes a URL _prefix_ and a _directory_ and stores all assets relative to the _prefix_ in the corresponding file structure in _directory_.\n\nThe `StoreProcessor` does not enqueue other URLs.\n\n### `UrlRewriteProcessor`\n\nChanges the URL of an asset to another prefix.\nUse this to let [HtmlProcessor](#htmlprocessor) and [CssProcessor](#cssprocessor) resolve relative URLs from a different origin.\n\nThe `UrlRewriteProcessor` does not enqueue other URLs.\nAlso, it does not modify the asset's content – only its URL.\n\n\nURL Handlers\n------------\n\nURL handlers tell the spider whether to download and process a URL.\nThere are 
the following built-in URL handlers:\n\n### `OriginUrlHandler`\n\nHandles only URLs from a given origin, e.g. \"https://example.org\". \n\n### `BlacklistUrlHandler`\n\nDoes not handle URLs that match an entry in a blacklist.\nYou can use glob patterns to provide a blacklist:\n\n```php\n\u003c?php\nuse Baqend\\Component\\Spider\\UrlHandler\\BlacklistUrlHandler;\n\n$blacklist = [\n    'https://other.org/**',     // Don't handle anything from other.org over HTTPS    \n    'http{,s}://other.org/**',  // Don't handle anything from other.org over HTTP or HTTPS    \n    '**.{png,gif,jpg,jpeg}',    // Don't handle any image files    \n];\n\n$urlHandler = new BlacklistUrlHandler($blacklist);\n```\n \n\nAlternatives\n------------\n\nIf this project does not match your needs, check the following other projects:\n\n* [spatie/crawler](https://packagist.org/packages/spatie/crawler) (Requires PHP 7)\n* [vdb/php-spider](https://packagist.org/packages/vdb/php-spider)\n\n\n[Composer]: https://getcomposer.org/\n[Guzzle]: https://packagist.org/packages/guzzlehttp/guzzle\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbaqend%2Fphp-spider","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbaqend%2Fphp-spider","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbaqend%2Fphp-spider/lists"}