https://github.com/hanson/crawler
An easy package to crawl a site's list and detail pages
- Host: GitHub
- URL: https://github.com/hanson/crawler
- Owner: Hanson
- License: mit
- Created: 2016-05-18T18:42:48.000Z (over 8 years ago)
- Default Branch: master
- Last Pushed: 2018-06-30T03:20:48.000Z (over 6 years ago)
- Last Synced: 2024-03-14T21:28:11.251Z (8 months ago)
- Language: PHP
- Size: 28.3 KB
- Stars: 5
- Watchers: 2
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# crawler
An easy package to crawl a site's list and detail pages.
## Installation
```
composer require hanccc/crawler
```
## Usage
This package requires [Goutte](https://github.com/FriendsOfPHP/Goutte). You can access the DOM with `$this->crawler();` in both the list crawler and the detail crawler.
### Example
```
//or $listCrawler = new ExampleListCrawler(storage_path('logs'));
$listCrawler = new ExampleListCrawler('http://example.com', storage_path('logs'));
$listCrawler->setDetailCrawler(new ExampleDetailCrawler());
$listCrawler->start();
```
#### ListCrawler
```
class ExampleListCrawler extends ListCrawler
{
    public $url = 'http://example.com';

    // Return the URL of the given list page
    public function getEachPageUrl($page)
    {
        return 'http://example.com/list&page=' . $page;
    }

    // Set the maximum number of list pages to crawl
    public function setMaxPage()
    {
        $num = 10; // placeholder value: compute the real page count here
        $this->maxPage = $num;
    }
}
```
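To make the paging idea concrete, here is a minimal standalone sketch of how the list-page URLs could be enumerated. It does not use the package itself: `eachPageUrl` and `allPageUrls` are hypothetical helpers, and starting the page count at 1 is an assumption, since the README does not say how the package numbers pages.

```php
<?php
// Hypothetical helper (not part of hanson/crawler): builds the URL of one
// list page, mirroring getEachPageUrl() from the example class.
function eachPageUrl(int $page): string
{
    return 'http://example.com/list&page=' . $page;
}

// Hypothetical helper: collects the URLs for pages 1..$maxPage.
// Assumption: pages are numbered from 1.
function allPageUrls(int $maxPage): array
{
    $urls = [];
    for ($page = 1; $page <= $maxPage; $page++) {
        $urls[] = eachPageUrl($page);
    }
    return $urls;
}

// Print the URLs for a three-page list, one per line.
foreach (allPageUrls(3) as $url) {
    echo $url, PHP_EOL;
}
```

In the package, the equivalent loop presumably runs inside `ListCrawler` once `setMaxPage()` has set `$this->maxPage`; the sketch only shows the shape of the URLs that `getEachPageUrl()` is expected to return.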
#### DetailCrawler
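A detail crawler decides which links to handle by matching each URL against a pattern, which is the job of `isDetailUrl()` in the class that follows. The check can be tried on its own first. The sketch below is standalone and only illustrates the pattern matching: the `detailId` helper is hypothetical and not part of the package.

```php
<?php
// Standalone version of the isDetailUrl() check, for experimentation only.
function isDetailUrl(string $url): bool
{
    // preg_match() returns 1 on a match, 0 on no match, and false on error,
    // so compare against 1 to get a strict boolean.
    return preg_match('/example\.com\/id(\d+)/', $url) === 1;
}

// Hypothetical helper: returns the numeric id captured from a detail URL,
// or null when the URL is not a detail page.
function detailId(string $url): ?int
{
    if (preg_match('/example\.com\/id(\d+)/', $url, $matches) === 1) {
        return (int) $matches[1];
    }
    return null;
}

var_dump(isDetailUrl('http://example.com/id42'));  // bool(true)
var_dump(isDetailUrl('http://example.com/about')); // bool(false)
var_dump(detailId('http://example.com/id42'));     // int(42)
```

Escaping the dot (`\.`) keeps the pattern from matching hosts such as `exampleXcom`, and the single-quoted string leaves the backslashes intact for the regex engine.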
```
class ExampleDetailCrawler extends DetailCrawler
{
    // Return true if $url points to a detail page, false otherwise
    public function isDetailUrl($url)
    {
        return preg_match('/example\.com\/id(\d+)/', $url) === 1;
    }

    // Do whatever you need with the detail page
    public function handle()
    {
        echo $this->crawler->filter('title')->text();
    }
}
```
## License
Crawler is open-sourced software licensed under the [MIT license](http://opensource.org/licenses/MIT).