Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/bjoern-hempel/php-web-crawler
A php class that crawls a given url and collects recursively some data from it. The final representation will be a json object.
https://github.com/bjoern-hempel/php-web-crawler
crawler mit-license php recursive webcrawler webscraper xpath
Last synced: about 2 months ago
JSON representation
A php class that crawls a given url and collects recursively some data from it. The final representation will be a json object.
- Host: GitHub
- URL: https://github.com/bjoern-hempel/php-web-crawler
- Owner: bjoern-hempel
- License: mit
- Created: 2018-03-11T23:43:52.000Z (almost 7 years ago)
- Default Branch: master
- Last Pushed: 2024-02-24T14:20:37.000Z (10 months ago)
- Last Synced: 2024-02-24T15:28:32.817Z (10 months ago)
- Topics: crawler, mit-license, php, recursive, webcrawler, webscraper, xpath
- Language: PHP
- Homepage:
- Size: 223 KB
- Stars: 9
- Watchers: 2
- Forks: 1
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE.md
Awesome Lists containing this project
README
> **Attention**: This package is outdated and is no longer maintained. Use [PHP Web Crawler](https://github.com/ixnode/php-web-crawler) instead.
# WebCrawler
This php class allows you to crawl recursively a given webpage (or a given html file) and collect some data from it. Simply define the url (or a html file) and a set of xpath expressions which should map with the output data object. The final representation will be a php array which can be easily converted into the json format for further processing.
## 0. Introduction
### 0.1 Installation
```bash
user$ git clone [email protected]:bjoern-hempel/php-web-crawler.git .
```### 0.2 requirements
TODO...
## 1. Execute the examples
```bash
user$ php examples/simple.php
{
"version": "1.0.0",
"title": "Test Title",
"paragraph": "Test Paragraph"
}
```## 2. How to use
### 2.1 Basic usage [simple.php](examples/simple.php) (simple html page)
[basic.html](examples/html/basic.html)
```html
Test Page
Test Title
Test Paragraph
```
[simple.php](examples/simple.php)
```php5
parse(), JSON_PRETTY_PRINT);print_r($data);
echo "\n";
```It returns:
```json
{
"version": "1.0.0",
"title": "Test Title",
"paragraph": "Test Paragraph"
}
```#### 2.2 More examples
* [examples/simple-wiki-page.php](examples/simple-wiki-page.php)
* [examples/group.php](examples/group.php)
* [examples/section.php](examples/section.php)
* [examples/sections.php](examples/sections.php)
* [examples/url.php](examples/url.php)### 2.2 Complex examples
TODO...
### 2.3 Converter
TODO...
### 2.4 Filters
TODO...
## 3. Running the tests
```bash
user$ phpunit tests/Basic.php
PHPUnit 7.0.2 by Sebastian Bergmann and contributors... 2 / 2 (100%)
Time: 126 ms, Memory: 8.00MB
OK (2 tests, 16 assertions)
```## Using `composer`'s autoload (manual installation)
Using the autoloader function of the `composer` it is possible to use this classes without including the source files.
Make some changes to your `composer.json`:
```javascript
"autoload": {
"psr-0": {
...
"Ixno\\WebCrawler\\":"vendor/ixno/webcrawler/",
...
}
},
```Add this project to your `vendor` directory:
```bash
user$ cd /path/to/root/of/project
user$ mkdir vendor/ixno/webcrawler && cd vendor/ixno/webcrawler
user$ git clone [email protected]:bjoern-hempel/php-web-crawler.git . && cd ../../..
```Call the `composer` to create the composer autoloading mappings:
```bash
user$ composer.phar dumpautoload -o
```Check the result:
```bash
user$ grep -r Ixno vendor/composer/.
```You will something like the following lines:
```php
vendor/composer/./autoload_classmap.php: 'Ixno\\WebCrawler\\Converter\\Converter' => $vendorDir . '/ixno/webcrawler/lib/Ixno/WebCrawler/Converter/Converter.php',
vendor/composer/./autoload_classmap.php: 'Ixno\\WebCrawler\\Converter\\DateParser' => $vendorDir . '/ixno/webcrawler/lib/Ixno/WebCrawler/Converter/DateParser.php',
vendor/composer/./autoload_classmap.php: 'Ixno\\WebCrawler\\CrawlRule' => $vendorDir . '/ixno/webcrawler/lib/Ixno/WebCrawler/Crawler.php',
vendor/composer/./autoload_classmap.php: 'Ixno\\WebCrawler\\Crawler' => $vendorDir . '/ixno/webcrawler/lib/Ixno/WebCrawler/Crawler.php',
vendor/composer/./autoload_classmap.php: 'Ixno\\WebCrawler\\Page' => $vendorDir . '/ixno/webcrawler/lib/Ixno/WebCrawler/Crawler.php',
vendor/composer/./autoload_classmap.php: 'Ixno\\WebCrawler\\PageGroup' => $vendorDir . '/ixno/webcrawler/lib/Ixno/WebCrawler/Crawler.php',
vendor/composer/./autoload_classmap.php: 'Ixno\\WebCrawler\\PageList' => $vendorDir . '/ixno/webcrawler/lib/Ixno/WebCrawler/Crawler.php',
...
```Now you can simply use all classes without including the source files.
## A. Authors
* **Björn Hempel** - *Initial work* - [Björn Hempel](https://github.com/bjoern-hempel)
## B. License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details
## C. Closing words
Have fun! :)