Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/baraja-core/webcrawler
Simple crawling of websites by following links.
- Host: GitHub
- URL: https://github.com/baraja-core/webcrawler
- Owner: baraja-core
- License: mit
- Created: 2019-07-24T07:40:29.000Z (over 5 years ago)
- Default Branch: master
- Last Pushed: 2024-06-09T20:13:55.000Z (7 months ago)
- Last Synced: 2024-08-12T11:23:00.829Z (5 months ago)
- Topics: bot, crawler, crawling-websites, fast, php, robot, speed
- Language: PHP
- Homepage: https://en.php.brj.cz/downloading-the-whole-site-by-links-in-php
- Size: 87.9 KB
- Stars: 5
- Watchers: 2
- Forks: 1
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
README
Web crawler
===========

![Integrity check](https://github.com/baraja-core/webcrawler/workflows/Integrity%20check/badge.svg)
A simple library for crawling websites by following links, with minimal dependencies.
[Czech documentation](https://php.baraja.cz/stazeni-celeho-webu-po-odkazech)
📦 Installation
---------------

It's best to use [Composer](https://getcomposer.org) for installation. You can also find the package on
[Packagist](https://packagist.org/packages/baraja-core/webcrawler) and
[GitHub](https://github.com/baraja-core/webcrawler).

To install, simply run:
```
$ composer require baraja-core/webcrawler
```

You can use the package manually by creating instances of the internal classes, or register a DIC extension to wire the services directly into the Nette Framework.
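
If you use the package manually, a minimal bootstrap might look like the sketch below, assuming Composer's standard autoloader (the DIC registration itself depends on your framework setup):

```php
<?php

declare(strict_types=1);

// Composer's autoloader makes the Baraja\WebCrawler classes available.
require __DIR__ . '/vendor/autoload.php';

$crawler = new \Baraja\WebCrawler\Crawler;
```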
How to use
----------

The crawler can run without any dependencies.
With the default settings, create an instance and call the `crawl()` method:
```php
$crawler = new \Baraja\WebCrawler\Crawler;

$result = $crawler->crawl('https://example.com');
```

The `$result` variable will contain an entity of type `CrawledResult`.
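
As a quick sanity check you can dump the returned entity and inspect what the crawler collected; note that the `Entity` sub-namespace below is an assumption about where `CrawledResult` lives, not a confirmed path:

```php
$crawler = new \Baraja\WebCrawler\Crawler;
$result = $crawler->crawl('https://example.com');

// Assumption: CrawledResult lives in the Entity sub-namespace.
if ($result instanceof \Baraja\WebCrawler\Entity\CrawledResult) {
    // Dump the entity to explore the collected data.
    var_dump($result);
}
```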
Advanced checking of multiple URLs
----------------------------------

In a real-world case you may need to download multiple URLs within a single domain and check whether specific URLs work.
A simple example:
```php
$crawler = new \Baraja\WebCrawler\Crawler;

$result = $crawler->crawlList(
    'https://example.com', // Starting (main) URL
    [ // Additional URLs
        'https://example.com/error-404',
        '/robots.txt', // Relative links are also allowed
        '/web.config',
    ]
);
```

Note: the **robots.txt** file and the sitemap will be downloaded automatically if they exist.
Settings
--------

In the constructor of the `Crawler` service you can define your project-specific configuration.
For example:
```php
$crawler = new \Baraja\WebCrawler\Crawler(
    new \Baraja\WebCrawler\Config([
        // key => value
    ])
);
```

No option is required; pass the configuration as a key-value array.
Configuration options:
| Option                  | Default value | Description |
|-------------------------|---------------|-------------|
| `followExternalLinks`   | `false`       | `Bool`: Follow links to external domains, or stay within the starting domain only? |
| `sleepBetweenRequests`  | `1000`        | `Int`: Pause between requests, in milliseconds. |
| `maxHttpRequests`       | `1000000`     | `Int`: Crawl budget: the maximum number of HTTP requests. |
| `maxCrawlTimeInSeconds` | `30`          | `Int`: Stop crawling when this time limit is exceeded. |
| `allowedUrls`           | `['.+']`      | `String[]`: Regular expressions that a URL must match to be crawled. |
| `forbiddenUrls`         | `['']`        | `String[]`: Regular expressions for URLs that must not be crawled. |
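
Putting it together, a configuration using the documented options might look like the sketch below; the concrete values are illustrative only:

```php
$crawler = new \Baraja\WebCrawler\Crawler(
    new \Baraja\WebCrawler\Config([
        'followExternalLinks' => false,   // stay within the starting domain
        'sleepBetweenRequests' => 500,    // wait 500 ms between requests
        'maxHttpRequests' => 10000,       // hard crawl budget
        'maxCrawlTimeInSeconds' => 120,   // stop after two minutes
        'allowedUrls' => ['.+'],          // allow every URL format...
        'forbiddenUrls' => ['/admin'],    // ...except URLs matching this regex
    ])
);

$result = $crawler->crawl('https://example.com');
```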
📄 License
-----------

`baraja-core/webcrawler` is licensed under the MIT license. See the [LICENSE](https://github.com/baraja-core/webcrawler/blob/master/LICENSE) file for more details.