{"id":13464591,"url":"https://github.com/mvdbos/php-spider","last_synced_at":"2025-05-14T13:05:58.938Z","repository":{"id":7121772,"uuid":"8416378","full_name":"mvdbos/php-spider","owner":"mvdbos","description":"A configurable and extensible PHP web spider","archived":false,"fork":false,"pushed_at":"2024-06-15T12:41:57.000Z","size":552,"stargazers_count":1333,"open_issues_count":6,"forks_count":233,"subscribers_count":87,"default_branch":"master","last_synced_at":"2024-10-29T15:33:01.565Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"PHP","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/mvdbos.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2013-02-25T19:27:52.000Z","updated_at":"2024-10-28T12:23:25.000Z","dependencies_parsed_at":"2024-11-26T07:16:46.942Z","dependency_job_id":null,"html_url":"https://github.com/mvdbos/php-spider","commit_stats":{"total_commits":193,"total_committers":14,"mean_commits":"13.785714285714286","dds":"0.17616580310880825","last_synced_commit":"6e4a3a55442858ee173bc5797a8565b556489007"},"previous_names":[],"tags_count":18,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mvdbos%2Fphp-spider","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mvdbos%2Fphp-spider/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mvdbos%2Fphp-spider/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mvdbos%2Fphp-spider/manifests","owner_url":"
https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/mvdbos","download_url":"https://codeload.github.com/mvdbos/php-spider/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254149948,"owners_count":22022851,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-07-31T14:00:46.785Z","updated_at":"2025-05-14T13:05:58.912Z","avatar_url":"https://github.com/mvdbos.png","language":"PHP","readme":"![Build Status](https://github.com/mvdbos/php-spider/workflows/PHP-Spider/badge.svg?branch=master)\n[![Latest Stable Version](https://poser.pugx.org/vdb/php-spider/v)](https://packagist.org/packages/vdb/php-spider)\n[![Total Downloads](https://poser.pugx.org/vdb/php-spider/downloads)](https://packagist.org/packages/vdb/php-spider)\n[![License](https://poser.pugx.org/vdb/php-spider/license)](https://packagist.org/packages/vdb/php-spider)\n\n\nPHP-Spider Features\n======\n- supports two traversal algorithms: breadth-first and depth-first\n- supports crawl depth limiting, queue size limiting and max downloads limiting\n- supports adding custom URI discovery logic, based on XPath, CSS selectors, or plain old PHP\n- comes with a useful set of URI filters, such as robots.txt and Domain limiting\n- supports custom URI filters, both prefetch (URI) and postfetch (Resource content)\n- supports custom request handling logic\n- supports Basic, Digest and NTLM HTTP authentication. 
See [example](example/example_basic_auth.php).\n- comes with a useful set of persistence handlers (memory, file)\n- supports custom persistence handlers\n- collects statistics about the crawl for reporting\n- dispatches useful events, allowing developers to add even more custom behavior\n- supports a politeness policy\n\nThis Spider does not support JavaScript.\n\nInstallation\n------------\nThe easiest way to install PHP-Spider is with [Composer](https://getcomposer.org/).  Find it on [Packagist](https://packagist.org/packages/vdb/php-spider).\n\n```bash\n$ composer require vdb/php-spider\n```\n\nUsage\n-----\nThis is a very simple example. This code can be found in [example/example_simple.php](example/example_simple.php). For a more complete, real-world example with logging, caching and filters, see [example/example_complex.php](example/example_complex.php).\n\n\u003e\u003e Note that by default, the spider stops processing when it encounters a 4XX or 5XX error response. To set the spider up to keep processing, please see [the link checker example](https://github.com/mvdbos/php-spider/blob/master/example/example_link_check.php). It uses a custom request handler that configures the default Guzzle request handler to not fail on 4XX and 5XX responses. \n\nFirst, create the spider:\n```php\n$spider = new Spider('http://www.dmoz.org');\n```\nAdd a URI discoverer. Without it, the spider does nothing. In this case, we want all `\u003ca\u003e` nodes from a certain `\u003cdiv\u003e`:\n\n```php\n$spider-\u003egetDiscovererSet()-\u003eset(new XPathExpressionDiscoverer(\"//div[@id='catalogs']//a\"));\n```\nSet some sane options for this example. 
In this case, we only get the first 10 items from the start page.\n\n```php\n$spider-\u003egetDiscovererSet()-\u003emaxDepth = 1;\n$spider-\u003egetQueueManager()-\u003emaxQueueSize = 10;\n```\nAdd a listener to collect stats from the Spider and the QueueManager.\nThere are more components that dispatch events you can use.\n\n```php\n$statsHandler = new StatsHandler();\n$spider-\u003egetQueueManager()-\u003egetDispatcher()-\u003eaddSubscriber($statsHandler);\n$spider-\u003egetDispatcher()-\u003eaddSubscriber($statsHandler);\n```\nExecute the crawl\n\n```php\n$spider-\u003ecrawl();\n```\nWhen crawling is done, we could get some info about the crawl\n```php\necho \"\\n  ENQUEUED:  \" . count($statsHandler-\u003egetQueued());\necho \"\\n  SKIPPED:   \" . count($statsHandler-\u003egetFiltered());\necho \"\\n  FAILED:    \" . count($statsHandler-\u003egetFailed());\necho \"\\n  PERSISTED:    \" . count($statsHandler-\u003egetPersisted());\n```\nFinally we could do some processing on the downloaded resources. In this example, we will echo the title of all resources\n```php\necho \"\\n\\nDOWNLOADED RESOURCES: \";\nforeach ($spider-\u003egetDownloader()-\u003egetPersistenceHandler() as $resource) {\n    echo \"\\n - \" . 
$resource-\u003egetCrawler()-\u003efilterXpath('//title')-\u003etext();\n}\n\n```\nContributing\n------------\nContributing to PHP-Spider is as easy as forking the repository on GitHub and submitting a Pull Request.\nThe Symfony documentation contains an excellent guide on how to do that properly: [Submitting a Patch](http://symfony.com/doc/current/contributing/code/patches.html#step-1-setup-your-environment).\n\nThere are a few requirements for a Pull Request to be accepted:\n- Follow the coding standards: PHP-Spider follows the coding standards defined in the [PSR-0](https://github.com/php-fig/fig-standards/blob/master/accepted/PSR-0.md), [PSR-1](https://github.com/php-fig/fig-standards/blob/master/accepted/PSR-1-basic-coding-standard.md) and [PSR-2](https://github.com/php-fig/fig-standards/blob/master/accepted/PSR-2-coding-style-guide.md) Coding Style Guides;\n- Prove that the code works with unit tests and that coverage remains 100%.\n\n\u003e Note: An easy way to check if your code conforms to PHP-Spider is by running the script `bin/static-analysis`, which is part of this repo. This will run the following tools, configured for PHP-Spider: PHP CodeSniffer, PHP Mess Detector and PHP Copy/Paste Detector.  \n\n\u003e Note: To run PHPUnit with coverage, and to check that coverage == 100%, you can run `bin/coverage-enforce`.\n\nSupport\n-------\nFor things like reporting bugs and requesting features, it is best to create an [issue](https://github.com/mvdbos/php-spider/issues) here on GitHub. It is even better to accompany it with a Pull Request. 
;-)\n\nLicense\n-------\nPHP-Spider is licensed under the MIT license.\n","funding_links":[],"categories":["All","爬虫 Scraping","Table of Contents","Spiders","PHP","目录","类库"],"sub_categories":["Scraping","爬虫 Scraping","网页抓取/代理"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmvdbos%2Fphp-spider","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmvdbos%2Fphp-spider","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmvdbos%2Fphp-spider/lists"}