https://github.com/mvdbos/php-spider
A configurable and extensible PHP web spider
- Host: GitHub
- URL: https://github.com/mvdbos/php-spider
- Owner: mvdbos
- License: MIT
- Created: 2013-02-25T19:27:52.000Z (almost 12 years ago)
- Default Branch: master
- Last Pushed: 2024-06-15T12:41:57.000Z (6 months ago)
- Last Synced: 2024-10-29T15:33:01.565Z (about 1 month ago)
- Language: PHP
- Size: 539 KB
- Stars: 1,333
- Watchers: 87
- Forks: 233
- Open Issues: 6
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- awesome-php - PHP Spider - A configurable and extensible PHP web spider. (Table of Contents / Scraping)
- awesome-projects - PHP Spider - A configurable and extensible PHP web spider. (PHP / Scraping)
- web-stuff - PHP-Spider - well-done library to spider the web. (PHP)
- php-awesome - PHP-Spider
- awesome-php-cn - PHP Spider - A configurable and extensible PHP web spider. (Table of Contents / Scraping)
README
![Build Status](https://github.com/mvdbos/php-spider/workflows/PHP-Spider/badge.svg?branch=master)
[![Latest Stable Version](https://poser.pugx.org/vdb/php-spider/v)](https://packagist.org/packages/vdb/php-spider)
[![Total Downloads](https://poser.pugx.org/vdb/php-spider/downloads)](https://packagist.org/packages/vdb/php-spider)
[![License](https://poser.pugx.org/vdb/php-spider/license)](https://packagist.org/packages/vdb/php-spider)

PHP-Spider Features
======
- supports two traversal algorithms: breadth-first and depth-first
- supports crawl depth limiting, queue size limiting and max downloads limiting
- supports adding custom URI discovery logic, based on XPath, CSS selectors, or plain old PHP
- comes with a useful set of URI filters, such as robots.txt and domain limiting
- supports custom URI filters, both prefetch (URI) and postfetch (Resource content)
- supports custom request handling logic
- supports Basic, Digest and NTLM HTTP authentication. See [example](example/example_basic_auth.php).
- comes with a useful set of persistence handlers (memory, file)
- supports custom persistence handlers
- collects statistics about the crawl for reporting
- dispatches useful events, allowing developers to add even more custom behavior
- supports a politeness policy

This Spider does not support JavaScript.
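
The two traversal orders and the crawl depth limit from the feature list can be illustrated without any network access. The sketch below is not PHP-Spider's implementation: the `traverse()` function and the in-memory `$links` graph are hypothetical names invented for this illustration. It only shows how a FIFO frontier yields breadth-first order, a LIFO frontier yields depth-first order, and a depth limit stops expansion.

```php
<?php
declare(strict_types=1);

/**
 * Visit every URI reachable from $seed at most once, up to $maxDepth hops.
 *
 * @param array<string, string[]> $links        hypothetical link graph: URI => outgoing URIs
 * @param string                  $seed         URI to start from (depth 0)
 * @param int                     $maxDepth     URIs at this depth are visited but not expanded
 * @param bool                    $breadthFirst true for breadth-first, false for depth-first
 * @return string[] URIs in the order they were visited
 */
function traverse(array $links, string $seed, int $maxDepth, bool $breadthFirst = true): array
{
    $frontier = [[$seed, 0]]; // pending [uri, depth] pairs
    $visited = [];            // uri => true, so each URI is processed once
    $order = [];

    while ($frontier !== []) {
        // Take from the front (FIFO) for breadth-first, from the back (LIFO) for depth-first.
        [$uri, $depth] = $breadthFirst ? array_shift($frontier) : array_pop($frontier);

        if (isset($visited[$uri])) {
            continue;
        }
        $visited[$uri] = true;
        $order[] = $uri;

        // Depth limit: do not follow links found on pages at the maximum depth.
        if ($depth >= $maxDepth) {
            continue;
        }

        $children = $links[$uri] ?? [];
        if (!$breadthFirst) {
            // Reverse so the first link on the page is popped (visited) first.
            $children = array_reverse($children);
        }
        foreach ($children as $child) {
            if (!isset($visited[$child])) {
                $frontier[] = [$child, $depth + 1];
            }
        }
    }

    return $order;
}

// Hypothetical site structure used only for this example.
$links = [
    '/'   => ['/a', '/b'],
    '/a'  => ['/a1'],
    '/b'  => ['/b1'],
    '/a1' => ['/a1x'], // depth 3, never reached with $maxDepth = 2
];

echo implode(' ', traverse($links, '/', 2, true)), PHP_EOL;  // prints "/ /a /b /a1 /b1"
echo implode(' ', traverse($links, '/', 2, false)), PHP_EOL; // prints "/ /a /a1 /b /b1"
```

Both runs visit the same five URIs; only the order differs. A real crawl would also apply the queue size and download limits mentioned above, which this sketch leaves out.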
Installation
------------
The easiest way to install PHP-Spider is with [Composer](https://getcomposer.org/). Find it on [Packagist](https://packagist.org/packages/vdb/php-spider).

```bash
$ composer require vdb/php-spider
```

Usage
-----
This is a very simple example. The code can be found in [example/example_simple.php](example/example_simple.php). For a more complete, real-world example with logging, caching and filters, see [example/example_complex.php](example/example_complex.php).

>> Note that by default, the spider stops processing when it encounters a 4XX or 5XX error response. To set the spider up to keep processing, see [the link checker example](https://github.com/mvdbos/php-spider/blob/master/example/example_link_check.php). It uses a custom request handler that configures the default Guzzle request handler not to fail on 4XX and 5XX responses.
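
As a configuration sketch of the idea in the note above: Guzzle itself has a request option, `http_errors`, that controls whether 4XX and 5XX responses raise exceptions. The fragment below shows only that Guzzle option. It is an assumption that the link checker example relies on this same option, and wiring the client into the spider's request handler is deliberately left out; see the linked example for the actual setup. The URL is a placeholder.

```php
<?php
// Assumes Guzzle is installed through Composer (it is pulled in as a dependency of vdb/php-spider).
require __DIR__ . '/vendor/autoload.php';

use GuzzleHttp\Client;

// With 'http_errors' => false, Guzzle returns 4XX/5XX responses
// instead of throwing an exception for them.
$client = new Client(['http_errors' => false]);

// Placeholder URL for illustration only.
$response = $client->request('GET', 'https://example.com/missing-page');

// The status code can now be inspected and recorded, and processing can continue.
echo $response->getStatusCode(), PHP_EOL;
```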