Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/zrashwani/arachnid
Crawl all unique internal links found on a given website, and extract SEO-related information - supports JavaScript-based sites
crawler php scraping seo
Last synced: about 1 month ago
JSON representation
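The listing below is also available as JSON through the ecosyste.ms API. A minimal PHP sketch of fetching it; the endpoint path shown is an assumption for illustration, not a documented URL:

```php
<?php
// Hypothetical endpoint: the exact ecosyste.ms awesome API path for a
// single-project lookup is an assumption here.
$endpoint = 'https://awesome.ecosyste.ms/api/v1/projects?url='
    . urlencode('https://github.com/zrashwani/arachnid');

// Decode the JSON representation of the project's metadata.
$json = file_get_contents($endpoint);
$project = json_decode($json, true);
print_r($project);
```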
- Host: GitHub
- URL: https://github.com/zrashwani/arachnid
- Owner: zrashwani
- License: MIT
- Created: 2014-01-06T17:32:41.000Z (almost 11 years ago)
- Default Branch: master
- Last Pushed: 2022-09-10T20:27:13.000Z (over 2 years ago)
- Last Synced: 2024-08-31T14:18:47.948Z (3 months ago)
- Topics: crawler, php, scraping, seo
- Language: PHP
- Homepage:
- Size: 216 KB
- Stars: 253
- Watchers: 22
- Forks: 60
- Open Issues: 5
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE.md
Awesome Lists containing this project
- awesome-jordan - Arachnid - Crawl all unique internal links found on a given website, and extract SEO-related information - supports JavaScript-based sites. (PHP / Gists)
README
# Arachnid Web Crawler
This library will crawl all unique internal links found on a given website
up to a specified maximum page depth.

It uses the [_symfony/panther_](https://github.com/symfony/panther) and [FriendsOfPHP/Goutte](https://github.com/FriendsOfPHP/Goutte) libraries to scrape site pages and extract the main SEO-related information, including:
`title`, `h1 elements`, `h2 elements`, `statusCode`, `contentType`, `meta description`, `meta keyword` and `canonicalLink`.

This library is based on an original blog post by Zeid Rashwani.
Josh Lockhart adapted the original blog post's code (with permission)
for Composer and Packagist and updated the syntax to conform with
the PSR-2 coding standard.

[![Build Status](https://travis-ci.com/zrashwani/arachnid.svg?branch=master)](https://travis-ci.com/zrashwani/arachnid)
[![codecov](https://codecov.io/gh/zrashwani/arachnid/branch/master/graph/badge.svg)](https://codecov.io/gh/zrashwani/arachnid)

## Sponsored By
[Oxylabs](https://oxylabs.go2cloud.org/aff_c?offer_id=7&aff_id=447&url_id=32)

## How to Install
You can install this library with [Composer][composer]. Drop this into your `composer.json`
manifest file:

```json
{
    "require": {
        "zrashwani/arachnid": "dev-master"
    }
}
```

Then run `composer install`.
## Getting Started
### Basic Usage:
Here's a quick demo to crawl a website:
```php
<?php
// Preamble reconstructed from context (the original snippet was truncated):
$crawler = new \Arachnid\Crawler('http://www.example.com', 3);
$crawler->traverse();

// Get link data
$links = $crawler->getLinksArray(); // to get links as objects use the getLinks() method
print_r($links);
```
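Each entry in the returned array describes one crawled page. Below is a minimal sketch of reading a couple of the SEO fields listed above; the exact array keys are assumptions inferred from the field names, so dump one entry first to confirm the real structure:

```php
<?php
// Assumes $links comes from $crawler->getLinksArray() above. The key names
// below mirror the SEO fields listed earlier and are assumptions.
foreach ($links as $uri => $info) {
    echo $uri, PHP_EOL;
    echo '  status: ', ($info['statusCode'] ?? 'n/a'), PHP_EOL;
    echo '  title:  ', ($info['title'] ?? 'n/a'), PHP_EOL;
}
```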
### Enabling Headless Browser mode:

Headless browser mode can be enabled so the crawler uses the Chrome engine in the background, which is useful for getting the contents of JavaScript-based sites.
The `enableHeadlessBrowserMode` method sets the scraping adapter to `PantherChromeAdapter`, which is based on the [Symfony Panther](https://github.com/symfony/panther) library:
```php
$crawler = new \Arachnid\Crawler($url, $linkDepth);
$crawler->enableHeadlessBrowserMode()
->traverse()
->getLinksArray();
```

To use this, you need [chromedriver](https://sites.google.com/a/chromium.org/chromedriver/) installed on your machine. You can use `dbrekelmans/browser-driver-installer` to install chromedriver locally:
```
composer require --dev dbrekelmans/bdi
./vendor/bin/bdi driver:chromedriver drivers
```
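If chromedriver ends up somewhere Panther cannot discover automatically, you can point Panther at the binary before crawling. A small sketch; `PANTHER_CHROME_DRIVER_BINARY` is Symfony Panther's environment variable, and whether Arachnid's adapter honors it is an assumption:

```php
<?php
// Point Symfony Panther at a locally installed chromedriver binary
// (whether Arachnid's PantherChromeAdapter reads this is an assumption).
putenv('PANTHER_CHROME_DRIVER_BINARY=' . __DIR__ . '/drivers/chromedriver');

$crawler = new \Arachnid\Crawler('http://www.example.com', 2);
$links = $crawler->enableHeadlessBrowserMode()
                 ->traverse()
                 ->getLinksArray();
```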
## Advanced Usage:

Set additional options for the underlying HTTP client by passing an array of options to the constructor, or by creating a scrape client with the desired options:

```php
<?php
// First line reconstructed from context (the original snippet was truncated);
// 'auth_basic' follows Symfony HttpClient's option naming:
$clientOptions = ['auth_basic' => array('username', 'password')];
$crawler = new \Arachnid\Crawler('http://github.com', 2, $clientOptions);
// or by creating and setting a scrape client
$options = array(
'verify_host' => false,
'verify_peer' => false,
'timeout' => 30,
);
$scrapperClient = CrawlingFactory::create(CrawlingFactory::TYPE_HTTP_CLIENT, $options);
$crawler->setScrapClient($scrapperClient);
```

You can inject a [PSR-3][psr3]-compliant logger object to monitor crawler activity (like [Monolog][monolog]):
```php
<?php
// Logger creation reconstructed from context (the original first lines were truncated):
$logger = new \Monolog\Logger('crawler logger');
$logger->pushHandler(new \Monolog\Handler\StreamHandler(sys_get_temp_dir().'/crawler.log'));
$crawler->setLogger($logger);
?>
```

You can set the crawler to visit only pages matching specific criteria by passing a callback closure to the `filterLinks` method:
```php
<?php
// Receiver reconstructed from context (the original snippet's first line was truncated):
$links = $crawler->filterLinks(function($link) {
                // crawling only links with /blog/ prefix
                return (bool)preg_match('/.*\/blog.*$/u', $link);
            })
            ->traverse()
            ->getLinks();
```
You can use the `LinksCollection` class to get simple statistics about the links, as follows:
```php
<?php
// Receiver reconstructed from context (the original snippet was truncated):
$links = $crawler->traverse()
                 ->getLinks();
$collection = new LinksCollection($links);

// getting broken links
$brokenLinks = $collection->getBrokenLinks();

// getting links for specific depth
$depth2Links = $collection->getByDepth(2);

// getting external links inside site
$externalLinks = $collection->getExternalLinks();
```
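If `LinksCollection` behaves like a standard Laravel-style collection (an assumption here), the usual helpers can summarize the results; a short sketch:

```php
<?php
// Continues the example above; assumes LinksCollection exposes the usual
// Laravel-style collection helpers (an assumption).
echo 'total links:  ', $collection->count(), PHP_EOL;
echo 'broken links: ', $brokenLinks->count(), PHP_EOL;
print_r($externalLinks->toArray());
```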
## How to Contribute

1. Fork this repository
2. Create a new branch for each feature or improvement
3. Apply your code changes along with corresponding unit test
4. Send a pull request from each feature branch

It is very important to separate new features or improvements into separate feature branches,
and to send a pull request for each branch. This allows me to review and pull in new features
or improvements individually.

All pull requests must adhere to the [PSR-2 standard][psr2].
## System Requirements
* PHP 7.2.0+
## Authors
* Josh Lockhart
* Zeid Rashwani

## License
MIT Public License
[composer]: http://getcomposer.org/
[psr2]: https://github.com/php-fig/fig-standards/blob/master/accepted/PSR-2-coding-style-guide.md
[psr3]: https://github.com/php-fig/fig-standards/blob/master/accepted/PSR-3-logger-interface.md
[monolog]: https://github.com/Seldaek/monolog