Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/baqend/php-spider
URL spider which crawls a page and all its subpages
- Host: GitHub
- URL: https://github.com/baqend/php-spider
- Owner: Baqend
- License: mit
- Created: 2018-03-16T17:11:23.000Z (almost 7 years ago)
- Default Branch: master
- Last Pushed: 2018-03-16T18:08:24.000Z (almost 7 years ago)
- Last Synced: 2024-04-24T16:46:50.354Z (9 months ago)
- Topics: composer-package, crawler, spider
- Language: PHP
- Size: 37.1 KB
- Stars: 6
- Watchers: 13
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
PHP Spider
==========
_URL spider which crawls a page and all its subpages_

* [Installation](#installation)
* [Usage](#usage)
* [Processors](#processors)
* [URL Handlers](#url-handlers)
* [Alternatives](#alternatives)

Installation
------------

Make sure you have [Composer] installed. Then execute:

```sh
composer require baqend/spider
```

This package requires at least **PHP 5.5.9** and has **no package dependencies!**

Usage
-----

The entry point is the `Spider` class. For it to work, it requires the following services:
* **Queue:** Collects URLs to be processed. This package comes with a breadth-first and a depth-first implementation.
* **URL Handler:** Checks if a URL should be processed. If no URL handler is provided, every URL is processed. [More about URL handlers](#url-handlers)
* **Downloader:** Takes URLs and downloads them. To avoid a dependency on an HTTP client library like [Guzzle], you have to implement this service yourself.
* **Processor:** Retrieves downloaded assets and performs operations on them. [More about Processors](#processors)

You initialize the spider in the following way:
```php
<?php
// Note: the lines constructing the queue, downloader, and URL handler were
// lost in this copy of the README; the class names below are assumptions.
$queue = new Queue\BreadthFirstQueue();
$urlHandler = new UrlHandler\OriginUrlHandler('https://example.org');
$downloader = new MyDownloader(); // your own Downloader implementation

// Aggregate processor holding the built-in processors
$processor = new Processor\Processor();
$processor->addProcessor(new Processor\UrlRewriteProcessor('https://example.org', 'https://example.com/archive'));
$processor->addProcessor($cssProcessor = new Processor\CssProcessor());
$processor->addProcessor(new Processor\HtmlProcessor($cssProcessor));
$processor->addProcessor(new Processor\ReplaceProcessor('https://example.org', 'https://example.com/archive'));
$processor->addProcessor(new Processor\StoreProcessor('https://example.com/archive', '/tmp/output'));

// Create the spider instance and enqueue some URLs
$spider = new Spider($queue, $downloader, $urlHandler, $processor);
$spider->queue('https://example.org/index.html');
$spider->queue('https://example.org/news/other-landingpage.html');

// Execute the crawling
$spider->crawl();
```

Processors
----------

This package comes with the following built-in processors.
### `Processor`
This is an aggregate processor which allows adding and removing other processors which it will execute one after the other.
```php
<?php
$processor = new Processor\Processor();
$processor->addProcessor($firstProcessor);
$processor->addProcessor($secondProcessor);
$processor->addProcessor($thirdProcessor);

// This will call `process` on $firstProcessor, $secondProcessor,
// and finally on $thirdProcessor:
$processor->process($asset, $queue);
```

### `HtmlProcessor`
This processor handles HTML assets and enqueues the URLs they contain.
It also rewrites all relative URLs to absolute ones.
Additionally, if you provide a [CssProcessor](#cssprocessor), `style` attributes are detected and the URLs within their CSS are resolved as well.
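To illustrate what making relative URLs absolute involves, here is a minimal plain-PHP sketch. It is not the library's actual code: the function name and the simplifications (no collapsing of `..` segments, no protocol-relative URLs) are assumptions for the example.

```php
<?php
// Sketch (not the library's actual code): make a relative URL absolute
// against the URL of the page it was found on. Handles absolute,
// root-relative, and path-relative URLs; dot segments are not collapsed.
function resolveUrl(string $base, string $relative): string
{
    if (parse_url($relative, PHP_URL_SCHEME) !== null) {
        return $relative; // already absolute
    }
    $parts = parse_url($base);
    $origin = $parts['scheme'] . '://' . $parts['host'];
    if ($relative !== '' && $relative[0] === '/') {
        return $origin . $relative; // root-relative
    }
    // Path-relative: replace the last segment of the base path
    $path = $parts['path'] ?? '/';
    $dir = substr($path, 0, strrpos($path, '/') + 1);
    return $origin . $dir . $relative;
}

echo resolveUrl('https://example.org/news/index.html', 'other.html');
// → https://example.org/news/other.html
```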
### `CssProcessor`

This processor handles CSS assets and enqueues the URLs referenced in `@import` and `url(...)` statements.
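A rough sketch of this kind of extraction in plain PHP, using deliberately simplified regular expressions (illustrative only, not the library's implementation):

```php
<?php
// Sketch (simplified, not the library's implementation): collect candidate
// URLs from CSS text via url(...) references and @import statements.
function extractCssUrls(string $css): array
{
    $urls = [];
    // url(...) with optional single or double quotes
    if (preg_match_all('/url\(\s*[\'"]?([^\'")]+)[\'"]?\s*\)/i', $css, $m)) {
        $urls = array_merge($urls, $m[1]);
    }
    // @import "..." (the @import url(...) form is caught above)
    if (preg_match_all('/@import\s+[\'"]([^\'"]+)[\'"]/i', $css, $m)) {
        $urls = array_merge($urls, $m[1]);
    }
    return array_values(array_unique($urls));
}
```

Real CSS allows escapes and unquoted whitespace handling that these patterns ignore; a production crawler would need a proper CSS tokenizer.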
### `ReplaceProcessor`
Performs simple `str_replace` operations on asset contents:
```php
<?php
// Constructor arguments reconstructed from the usage example above:
// a search string and its replacement.
$processor = new Processor\ReplaceProcessor('https://example.org', 'https://example.com/archive');
$processor->process($asset, $queue);
```

The `ReplaceProcessor` does not enqueue other URLs.
### `StoreProcessor`
Takes a URL _prefix_ and a _directory_ and stores every asset whose URL starts with _prefix_ in the corresponding file structure under _directory_.
The `StoreProcessor` does not enqueue other URLs.
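The prefix-to-path mapping described above can be sketched in a few lines of plain PHP. The function name and null-on-mismatch behavior are assumptions for the example, not the library's actual code:

```php
<?php
// Sketch: map an asset URL to a file path the way a prefix-based store
// processor could: strip the prefix, append the remainder to the directory.
// (Illustrative only; not the library's actual code.)
function storagePath(string $prefix, string $directory, string $url): ?string
{
    if (strncmp($url, $prefix, strlen($prefix)) !== 0) {
        return null; // URL is not under the prefix
    }
    $relative = ltrim(substr($url, strlen($prefix)), '/');
    return rtrim($directory, '/') . '/' . $relative;
}

echo storagePath('https://example.com/archive', '/tmp/output',
                 'https://example.com/archive/news/index.html');
// → /tmp/output/news/index.html
```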
### `UrlRewriteProcessor`
Changes the URL of an asset to another prefix.
Use this to let [HtmlProcessor](#htmlprocessor) and [CssProcessor](#cssprocessor) resolve relative URLs from a different origin.

The `UrlRewriteProcessor` does not enqueue other URLs.
It also does not modify the asset's content, only its URL.

URL Handlers
------------

URL handlers tell the spider whether to download and process a URL.
There are the following built-in URL handlers:

### `OriginUrlHandler`
Handles only URLs coming from a given origin, e.g. `https://example.org`.
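The core of such an origin check can be sketched in plain PHP. This is illustrative, not the library's actual implementation; the function name is an assumption:

```php
<?php
// Sketch of an origin check like the one an OriginUrlHandler needs:
// a URL is handled only when its scheme and host both match the
// configured origin. (Illustrative; not the library's actual code.)
function matchesOrigin(string $origin, string $url): bool
{
    $a = parse_url($origin);
    $b = parse_url($url);
    return isset($a['scheme'], $a['host'], $b['scheme'], $b['host'])
        && $a['scheme'] === $b['scheme']
        && $a['host'] === $b['host'];
}
```

Note that this treats `http://` and `https://` on the same host as different origins, which is the usual definition.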
### `BlacklistUrlHandler`
Skips URLs that are part of a given blacklist.
You can use glob patterns to provide a blacklist:

```php