PHP Spider
==========
_URL spider which crawls a page and all its subpages_

* [Installation](#installation)
* [Usage](#usage)
* [Processors](#processors)
* [URL Handlers](#url-handlers)
* [Alternatives](#alternatives)

Installation
------------

Make sure you have [Composer] installed. Then execute:

```sh
composer require baqend/spider
```

This package requires at least **PHP 5.5.9** and has **no package dependencies!**

Usage
-----

The entry point is the `Spider` class. It requires the following services:

* **Queue:** Collects the URLs to be processed. This package ships a breadth-first and a depth-first implementation.
* **URL Handler:** Decides whether a URL should be processed. If no URL handler is given, every URL is processed. [More about URL handlers](#url-handlers)
* **Downloader:** Takes URLs and downloads their contents. To avoid a dependency on an HTTP client library like [Guzzle], you implement this service yourself; see the sketch after this list.
* **Processor:** Receives downloaded assets and performs operations on them. [More about Processors](#processors)
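
A possible downloader is sketched below. The interface name, namespace, and `download` signature are assumptions for illustration; check the package's actual downloader contract before implementing it. Using PHP's built-in `file_get_contents` keeps the sketch free of any HTTP client dependency.

```php
use Baqend\Component\Spider\Downloader\DownloaderInterface; // name and namespace assumed

class MyDownloader implements DownloaderInterface
{
    /**
     * Downloads the given URL and returns its contents.
     * (Assumed signature; adapt it to the real interface.)
     */
    public function download($url)
    {
        return file_get_contents($url);
    }
}
```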

You initialize the spider in the following way:

```php
<?php
use Baqend\Component\Spider\Processor;   // namespaces assumed from the package layout
use Baqend\Component\Spider\Queue;
use Baqend\Component\Spider\Spider;
use Baqend\Component\Spider\UrlHandler;

// Collect URLs breadth-first (class name assumed); a depth-first queue also ships
$queue = new Queue\BreadthFirstQueue();

// Only handle URLs from the origin being crawled
$urlHandler = new UrlHandler\OriginUrlHandler('https://example.org');

// Your own downloader implementation (see the sketch above)
$downloader = new MyDownloader();

// Aggregate processor which runs the following processors in order
$processor = new Processor\Processor();
$processor->addProcessor(new Processor\UrlRewriteProcessor('https://example.org', 'https://example.com/archive'));
$processor->addProcessor($cssProcessor = new Processor\CssProcessor());
$processor->addProcessor(new Processor\HtmlProcessor($cssProcessor));
$processor->addProcessor(new Processor\ReplaceProcessor('https://example.org', 'https://example.com/archive'));
$processor->addProcessor(new Processor\StoreProcessor('https://example.com/archive', '/tmp/output'));

// Create the spider instance
$spider = new Spider($queue, $downloader, $urlHandler, $processor);

// Enqueue some URLs
$spider->queue('https://example.org/index.html');
$spider->queue('https://example.org/news/other-landingpage.html');

// Execute the crawling
$spider->crawl();
```

Processors
----------

This package comes with the following built-in processors.

### `Processor`

This is an aggregate processor: you can add and remove other processors, and it executes them one after another.

```php
$processor = new Processor\Processor();
$processor->addProcessor($firstProcessor);
$processor->addProcessor($secondProcessor);
$processor->addProcessor($thirdProcessor);

// This will call `process` on $firstProcessor, $secondProcessor, and finally on $thirdProcessor:
$processor->process($asset, $queue);
```

### `HtmlProcessor`

This processor processes HTML assets and enqueues the URLs they contain.
It also rewrites all relative URLs to make them absolute.
Additionally, if you provide a [CssProcessor](#cssprocessor), `style` attributes are detected and the URLs within their CSS are resolved.
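
For example, wired as in the usage section above so that inline `style` attributes are resolved as well:

```php
$cssProcessor = new Processor\CssProcessor();
$htmlProcessor = new Processor\HtmlProcessor($cssProcessor);
$htmlProcessor->process($asset, $queue);
```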

### `CssProcessor`

This processor processes CSS assets and enqueues the URLs contained in `@import` and `url(...)` statements.
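
A sketch of standalone use; the `process` call follows the aggregate example above:

```php
// Given a CSS asset containing `@import "fonts.css";` and `url("bg.png")`,
// processing enqueues both URLs on $queue.
$cssProcessor = new Processor\CssProcessor();
$cssProcessor->process($asset, $queue);
```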

### `ReplaceProcessor`

Performs simple `str_replace` operations on asset contents:

```php
// Replaces every occurrence of the first string with the second one
// (constructor arguments as in the usage example above):
$replaceProcessor = new Processor\ReplaceProcessor('https://example.org', 'https://example.com/archive');
$replaceProcessor->process($asset, $queue);
```

The `ReplaceProcessor` does not enqueue other URLs.

### `StoreProcessor`

Takes a URL _prefix_ and a _directory_ and stores every asset whose URL starts with the _prefix_ in the corresponding file structure inside _directory_.
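
For example, with the constructor arguments used in the usage section above (the URL-to-path mapping shown in the comment is illustrative):

```php
$storeProcessor = new Processor\StoreProcessor('https://example.com/archive', '/tmp/output');
// An asset with URL https://example.com/archive/news/index.html
// is written to /tmp/output/news/index.html.
$storeProcessor->process($asset, $queue);
```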

The `StoreProcessor` does not enqueue other URLs.

### `UrlRewriteProcessor`

Changes the URL of an asset to another prefix.
Use this to let [HtmlProcessor](#htmlprocessor) and [CssProcessor](#cssprocessor) resolve relative URLs from a different origin.

The `UrlRewriteProcessor` does not enqueue other URLs.
Also, it does not modify the asset's content – only its URL.
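
A sketch using the same arguments as in the usage section; the rewrite shown in the comment is illustrative:

```php
$rewriteProcessor = new Processor\UrlRewriteProcessor('https://example.org', 'https://example.com/archive');
// An asset downloaded from https://example.org/css/main.css is afterwards
// addressed as https://example.com/archive/css/main.css; its content is untouched.
$rewriteProcessor->process($asset, $queue);
```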

URL Handlers
------------

URL handlers tell the spider whether to download and process a URL.
The following URL handlers are built in:

### `OriginUrlHandler`

Handles only URLs coming from a given origin, e.g. "https://example.org".
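
A minimal sketch, assuming the handler takes the origin as its constructor argument:

```php
$urlHandler = new UrlHandler\OriginUrlHandler('https://example.org');
// https://example.org/news/index.html is handled,
// https://other-domain.example/logo.png is not.
```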

### `BlacklistUrlHandler`

Skips URLs that are part of a given blacklist.
The blacklist may contain glob patterns:

```php
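// The constructor argument is assumed to be a list of glob patterns to skip:
$urlHandler = new UrlHandler\BlacklistUrlHandler(['https://example.org/news/*']);
```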