*(BRJ logo: BRJ organisation)*

Web crawler
===========

![Integrity check](https://github.com/baraja-core/webcrawler/workflows/Integrity%20check/badge.svg)

A simple library for crawling websites by following links, with minimal dependencies.

[Czech documentation](https://php.baraja.cz/stazeni-celeho-webu-po-odkazech)

📦 Installation
---------------

It's best to use [Composer](https://getcomposer.org) for installation, and you can also find the package on
[Packagist](https://packagist.org/packages/baraja-core/webcrawler) and
[GitHub](https://github.com/baraja-core/webcrawler).

To install, simply use the command:

```
$ composer require baraja-core/webcrawler
```

You can use the package manually by creating instances of the internal classes, or register a DIC extension to wire the services directly into the Nette Framework.

How to use
----------

The crawler can run without any dependencies.

With the default settings, simply create an instance and call the `crawl()` method:

```php
$crawler = new \Baraja\WebCrawler\Crawler;

$result = $crawler->crawl('https://example.com');
```

The `$result` variable will contain an entity of type `CrawledResult`.
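
A minimal sketch of inspecting the result follows. The accessor used below is an assumption for illustration and may not match the real `CrawledResult` API, so verify it against the class:

```php
$crawler = new \Baraja\WebCrawler\Crawler;

$result = $crawler->crawl('https://example.com');

// getAllUrls() is an assumed accessor name -- check the CrawledResult
// class in this package for its real methods.
foreach ($result->getAllUrls() as $url) {
    echo $url . "\n";
}
```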

Advanced checking of multiple URLs
----------------------------------

In a real-world case, you typically need to download multiple URLs within a single domain and check whether specific URLs work.

A simple example:

```php
$crawler = new \Baraja\WebCrawler\Crawler;

$result = $crawler->crawlList(
    'https://example.com', // Starting (main) URL
    [ // Additional URLs
        'https://example.com/error-404',
        '/robots.txt', // Relative links are also allowed
        '/web.config',
    ]
);
```

Note: The **robots.txt** file and the sitemap will be downloaded automatically if they exist.
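
To verify that the additional URLs actually work, inspect the crawl result afterwards. The error accessor below is a hypothetical name used for illustration; check `CrawledResult` for the real method:

```php
// getErrors() is a hypothetical accessor, used here for illustration only.
foreach ($result->getErrors() as $error) {
    echo 'Error: ' . $error . "\n";
}
```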

Settings
--------

In the constructor of the `Crawler` service you can define your project-specific configuration.

For example:

```php
$crawler = new \Baraja\WebCrawler\Crawler(
    new \Baraja\WebCrawler\Config([
        // key => value
    ])
);
```

No option is required; pass the configuration as a key-value array.

Configuration options:

| Option | Default value | Description |
|-------------------------|---------------|-----------------|
| `followExternalLinks` | `false` | `bool`: Follow links to external domains? If `false`, the crawler stays within the starting domain. |
| `sleepBetweenRequests` | `1000` | `int`: Pause between requests, in milliseconds. |
| `maxHttpRequests` | `1000000` | `int`: Budget limit on the number of HTTP requests. |
| `maxCrawlTimeInSeconds` | `30` | `int`: Stop crawling when this time limit is exceeded. |
| `allowedUrls` | `['.+']` | `string[]`: List of regular expressions describing allowed URL formats. |
| `forbiddenUrls` | `['']` | `string[]`: List of regular expressions describing banned URL formats. |
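
For example, a configuration for a polite, bounded crawl could look like this (the option keys come from the table above; the values are only illustrative):

```php
$crawler = new \Baraja\WebCrawler\Crawler(
    new \Baraja\WebCrawler\Config([
        'followExternalLinks' => false,   // stay within the starting domain
        'sleepBetweenRequests' => 2000,   // wait 2 seconds between requests
        'maxHttpRequests' => 500,         // hard budget of 500 requests
        'maxCrawlTimeInSeconds' => 60,    // stop after one minute
        'forbiddenUrls' => ['/admin'],    // skip URLs matching this regex
    ])
);

$result = $crawler->crawl('https://example.com');
```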

📄 License
-----------

`baraja-core/webcrawler` is licensed under the MIT license. See the [LICENSE](https://github.com/baraja-core/webcrawler/blob/master/LICENSE) file for more details.