PHP Spider
==========
_URL spider which crawls a page and all its subpages_

* [Installation](#installation)
* [Usage](#usage)
* [Processors](#processors)
* [URL Handlers](#url-handlers)
* [Alternatives](#alternatives)

Installation
------------

Make sure you have [Composer] installed. Then execute:

```sh
composer require baqend/spider
```

This package requires at least **PHP 5.5.9** and has **no package dependencies!**

Usage
-----

The entry point is the `Spider` class. It requires the following services:

* **Queue:** Collects the URLs to be processed. This package ships a breadth-first and a depth-first implementation.
* **URL Handler:** Decides whether a URL should be processed. If no URL handler is given, every URL is processed. [More about URL handlers](#url-handlers)
* **Downloader:** Takes URLs and downloads their contents. To avoid a dependency on an HTTP client library like [Guzzle], you implement this service yourself; see the sketch after this list.
* **Processor:** Receives downloaded assets and performs operations on them. [More about Processors](#processors)
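
A possible downloader is sketched below. The interface name, namespace, and `download` signature are assumptions for illustration; check the package's actual downloader contract before implementing it. Using PHP's built-in `file_get_contents` keeps the sketch free of any HTTP client dependency.

```php
use Baqend\Component\Spider\Downloader\DownloaderInterface; // name and namespace assumed

class MyDownloader implements DownloaderInterface
{
    /**
     * Downloads the given URL and returns its contents.
     * (Assumed signature; adapt it to the real interface.)
     */
    public function download($url)
    {
        return file_get_contents($url);
    }
}
```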

You initialize the spider in the following way:

```php
<?php
use Baqend\Component\Spider\Processor;   // namespaces assumed from the package layout
use Baqend\Component\Spider\Queue;
use Baqend\Component\Spider\Spider;
use Baqend\Component\Spider\UrlHandler;

// Collect URLs breadth-first (class name assumed); a depth-first queue also ships
$queue = new Queue\BreadthFirstQueue();

// Only handle URLs from the origin being crawled
$urlHandler = new UrlHandler\OriginUrlHandler('https://example.org');

// Your own downloader implementation (see the sketch above)
$downloader = new MyDownloader();

// Aggregate processor which runs the following processors in order
$processor = new Processor\Processor();
$processor->addProcessor(new Processor\UrlRewriteProcessor('https://example.org', 'https://example.com/archive'));
$processor->addProcessor($cssProcessor = new Processor\CssProcessor());
$processor->addProcessor(new Processor\HtmlProcessor($cssProcessor));
$processor->addProcessor(new Processor\ReplaceProcessor('https://example.org', 'https://example.com/archive'));
$processor->addProcessor(new Processor\StoreProcessor('https://example.com/archive', '/tmp/output'));

// Create the spider instance
$spider = new Spider($queue, $downloader, $urlHandler, $processor);

// Enqueue some URLs
$spider->queue('https://example.org/index.html');
$spider->queue('https://example.org/news/other-landingpage.html');

// Execute the crawling
$spider->crawl();
```

Processors
----------

This package comes with the following built-in processors.

### `Processor`

This is an aggregate processor: you can add and remove other processors, and it executes them one after another.

```php
$processor = new Processor\Processor();
$processor->addProcessor($firstProcessor);
$processor->addProcessor($secondProcessor);
$processor->addProcessor($thirdProcessor);

// This will call `process` on $firstProcessor, $secondProcessor, and finally on $thirdProcessor:
$processor->process($asset, $queue);
```

### `HtmlProcessor`

This processor processes HTML assets and enqueues the URLs they contain.
It also rewrites all relative URLs to make them absolute.
Additionally, if you provide a [CssProcessor](#cssprocessor), `style` attributes are detected and the URLs within their CSS are resolved.
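
For example, wired as in the usage section above so that inline `style` attributes are resolved as well:

```php
$cssProcessor = new Processor\CssProcessor();
$htmlProcessor = new Processor\HtmlProcessor($cssProcessor);
$htmlProcessor->process($asset, $queue);
```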

### `CssProcessor`

This processor processes CSS assets and enqueues the URLs contained in `@import` and `url(...)` statements.
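
A sketch of standalone use; the `process` call follows the aggregate example above:

```php
// Given a CSS asset containing `@import "fonts.css";` and `url("bg.png")`,
// processing enqueues both URLs on $queue.
$cssProcessor = new Processor\CssProcessor();
$cssProcessor->process($asset, $queue);
```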

### `ReplaceProcessor`

Performs simple `str_replace` operations on asset contents:

```php
// Replaces every occurrence of the first string with the second one
// (constructor arguments as in the usage example above):
$replaceProcessor = new Processor\ReplaceProcessor('https://example.org', 'https://example.com/archive');
$replaceProcessor->process($asset, $queue);
```

The `ReplaceProcessor` does not enqueue other URLs.

### `StoreProcessor`

Takes a URL _prefix_ and a _directory_ and stores every asset whose URL starts with the _prefix_ in the corresponding file structure inside _directory_.
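
For example, with the constructor arguments used in the usage section above (the URL-to-path mapping shown in the comment is illustrative):

```php
$storeProcessor = new Processor\StoreProcessor('https://example.com/archive', '/tmp/output');
// An asset with URL https://example.com/archive/news/index.html
// is written to /tmp/output/news/index.html.
$storeProcessor->process($asset, $queue);
```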

The `StoreProcessor` does not enqueue other URLs.

### `UrlRewriteProcessor`

Changes the URL of an asset to another prefix.
Use this to let [HtmlProcessor](#htmlprocessor) and [CssProcessor](#cssprocessor) resolve relative URLs from a different origin.

The `UrlRewriteProcessor` does not enqueue other URLs.
Also, it does not modify the asset's content – only its URL.
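
A sketch using the same arguments as in the usage section; the rewrite shown in the comment is illustrative:

```php
$rewriteProcessor = new Processor\UrlRewriteProcessor('https://example.org', 'https://example.com/archive');
// An asset downloaded from https://example.org/css/main.css is afterwards
// addressed as https://example.com/archive/css/main.css; its content is untouched.
$rewriteProcessor->process($asset, $queue);
```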

URL Handlers
------------

URL handlers tell the spider whether to download and process a URL.
The following URL handlers are built in:

### `OriginUrlHandler`

Handles only URLs coming from a given origin, e.g. "https://example.org".
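
A minimal sketch, assuming the handler takes the origin as its constructor argument:

```php
$urlHandler = new UrlHandler\OriginUrlHandler('https://example.org');
// https://example.org/news/index.html is handled,
// https://other-domain.example/logo.png is not.
```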

### `BlacklistUrlHandler`

Skips URLs that are part of a given blacklist.
The blacklist may contain glob patterns:

```php
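// The constructor argument is assumed to be a list of glob patterns to skip:
$urlHandler = new UrlHandler\BlacklistUrlHandler(['https://example.org/news/*']);
```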