Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/mvdbos/php-spider
A configurable and extensible PHP web spider
- Host: GitHub
- URL: https://github.com/mvdbos/php-spider
- Owner: mvdbos
- License: mit
- Created: 2013-02-25T19:27:52.000Z (about 11 years ago)
- Default Branch: master
- Last Pushed: 2024-02-26T12:48:33.000Z (3 months ago)
- Last Synced: 2024-04-30T03:44:29.696Z (23 days ago)
- Language: PHP
- Homepage:
- Size: 548 KB
- Stars: 1,324
- Watchers: 86
- Forks: 237
- Open Issues: 5
Metadata Files:
- Readme: README.md
- License: LICENSE
Lists
- awesome-php - PHP Spider - A configurable and extensible PHP web spider. (Table of Contents / Scraping)
- awesome-projects - PHP Spider - A configurable and extensible PHP web spider. (PHP / Scraping)
- awesome-crawler - php-spider - A configurable and extensible PHP web spider. (PHP)
- web-stuff - PHP-Spider - well-done library to spider the web. (PHP)
- awesome-php-zh_CN - PHP Spider - A configurable and extensible PHP web crawler (Crawlers / Scraping)
- php-awesome - PHP-Spider
- awesome-php-new - PHP Spider - A configurable and extensible PHP web spider. (Table of Contents / Scraping)
- awesome-php - PHP Spider - A configurable and extensible PHP web spider. (Scraping)
- awesome-stripe - PHP Spider - A configurable and extensible PHP web spider. (Table of Contents / Scraping)
- PHP_awesome-directus-duh-REKT-iss- - PHP Spider - A configurable and extensible PHP web spider. (Table of Contents / Scraping)
- awesome-php-scrapers-and-crawlers - mvdbos/PHP-Spider - A configurable and extensible PHP web spider. Various [Examples](https://github.com/mvdbos/php-spider/tree/master/example) available. (Spiders)
- awesome-crawler-cn - php-spider - A highly extensible PHP-based web crawler. (PHP)
- awesome-php-cn - PHP Spider - A configurable and extensible PHP web spider. (Table of Contents / Crawlers / Scraping)
- awesome-crawlers - php-spider - 08-24 | A configurable and extensible PHP web spider. | (PHP)
README
![Build Status](https://github.com/mvdbos/php-spider/workflows/PHP-Spider/badge.svg?branch=master)
[![Latest Stable Version](https://poser.pugx.org/vdb/php-spider/v)](https://packagist.org/packages/vdb/php-spider)
[![Total Downloads](https://poser.pugx.org/vdb/php-spider/downloads)](https://packagist.org/packages/vdb/php-spider)
[![License](https://poser.pugx.org/vdb/php-spider/license)](https://packagist.org/packages/vdb/php-spider)

PHP-Spider Features
======
- supports two traversal algorithms: breadth-first and depth-first
- supports crawl depth limiting, queue size limiting and max downloads limiting
- supports adding custom URI discovery logic, based on XPath, CSS selectors, or plain old PHP
- comes with a useful set of URI filters, such as robots.txt and Domain limiting
- supports custom URI filters, both prefetch (URI) and postfetch (Resource content)
- supports custom request handling logic
- supports Basic, Digest and NTLM HTTP authentication. See [example](example/example_basic_auth.php).
- comes with a useful set of persistence handlers (memory, file)
- supports custom persistence handlers
- collects statistics about the crawl for reporting
- dispatches useful events, allowing developers to add even more custom behavior
- supports a politeness policy

This Spider does not support JavaScript.

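
The two traversal algorithms and the depth and download limits from the feature list can be illustrated with a minimal, library-independent sketch. It does not use the PHP-Spider API: the `traverse` function, its parameters, and the in-memory `$links` graph are hypothetical names chosen for illustration, and no pages are downloaded.

```php
<?php
declare(strict_types=1);

/**
 * Traverse a link graph from $seed (illustrative only, not the PHP-Spider API).
 *
 * @param array<string, string[]> $links hypothetical in-memory "web": URI => outgoing URIs
 * @return string[] URIs in the order they were visited
 */
function traverse(array $links, string $seed, string $algorithm, int $maxDepth, int $maxDownloads): array
{
    $frontier = [[$seed, 0]];   // pending [uri, depth] pairs
    $seen = [$seed => true];    // URIs already queued, to avoid revisits
    $visited = [];

    while ($frontier !== [] && count($visited) < $maxDownloads) {
        // Breadth-first takes the oldest entry (FIFO), depth-first the newest (LIFO).
        [$uri, $depth] = $algorithm === 'breadth-first'
            ? array_shift($frontier)
            : array_pop($frontier);

        $visited[] = $uri;      // a real spider would download and parse the page here

        if ($depth >= $maxDepth) {
            continue;           // depth limit reached: do not follow links from this page
        }
        foreach ($links[$uri] ?? [] as $next) {
            if (!isset($seen[$next])) {
                $seen[$next] = true;
                $frontier[] = [$next, $depth + 1];
            }
        }
    }

    return $visited;
}

$links = [
    '/'    => ['/a', '/b'],
    '/a'   => ['/a/1', '/a/2'],
    '/b'   => ['/b/1'],
    '/a/1' => ['/a/1/x'],
];

// prints "/ /a /b /a/1 /a/2 /b/1"
echo implode(' ', traverse($links, '/', 'breadth-first', 2, 10)), PHP_EOL;
// prints "/ /b /b/1 /a /a/2 /a/1"
echo implode(' ', traverse($links, '/', 'depth-first', 2, 10)), PHP_EOL;
```

In both runs `/a/1/x` is never visited, because it sits at depth 3 and the depth limit is 2. Lowering the last argument to 3 would stop either traversal after three visited URIs.
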
Installation
------------
The easiest way to install PHP-Spider is with [composer](https://getcomposer.org/). Find it on [Packagist](https://packagist.org/packages/vdb/php-spider).

```bash
$ composer require vdb/php-spider
```

Usage
-----
A very simple example can be found in [example/example_simple.php](example/example_simple.php). For a more complete, real-world example with logging, caching and filters, see [example/example_complex.php](example/example_complex.php).

> Note that by default, the spider stops processing when it encounters a 4XX or 5XX error response. To keep the spider processing, see [the link checker example](https://github.com/mvdbos/php-spider/blob/master/example/example_link_check.php). It uses a custom request handler that configures the default Guzzle request handler not to fail on 4XX and 5XX responses.
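
The difference between stopping at the first error response and continuing past it can be sketched without the library or network access. The `checkLinks` function, the `$failOnError` flag, and the simulated `$fake` responses below are hypothetical and only model the behavior described in the note; they are not the PHP-Spider request handler. In Guzzle itself, the request option that controls whether 4XX and 5XX responses raise exceptions is `http_errors`.

```php
<?php
declare(strict_types=1);

/**
 * Check a list of URIs with a fetcher that returns an HTTP status code.
 *
 * @param string[] $uris
 * @param callable(string): int $fetch hypothetical stand-in for a real request handler
 * @return array<string, int> status code per processed URI
 */
function checkLinks(array $uris, callable $fetch, bool $failOnError): array
{
    $statuses = [];
    foreach ($uris as $uri) {
        $status = $fetch($uri);
        $statuses[$uri] = $status;
        if ($failOnError && $status >= 400) {
            break;  // models the default: stop at the first 4XX/5XX response
        }
    }

    return $statuses;
}

// Simulated responses so the sketch runs offline.
$fake = ['/ok' => 200, '/missing' => 404, '/also-ok' => 200];
$fetch = fn (string $uri): int => $fake[$uri];

print_r(checkLinks(array_keys($fake), $fetch, true));   // stops after /missing
print_r(checkLinks(array_keys($fake), $fetch, false));  // processes all three URIs
```

A link checker needs the second mode, since recording the failing status codes is the whole purpose of the crawl.
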