https://github.com/anshu-krishna/html-scraper


# HTML Scraper
A set of PHP classes to simplify data extraction from HTML.

### Installation
```
composer require anshu-krishna/html-scraper
```

---

>Base code for the *CSS_to_Xpath* method in *HTMLScraper* was cloned from [https://github.com/zendframework/zend-dom](https://github.com/zendframework/zend-dom).
>Zend Framework
>: [http://framework.zend.com/](http://framework.zend.com/)
>Repository
>: [http://github.com/zendframework/zf2](http://github.com/zendframework/zf2)
>Copyright (c) 2005-2015 Zend Technologies USA Inc. [http://www.zend.com](http://www.zend.com)
>License
>: [https://framework.zend.com/license](https://framework.zend.com/license) New BSD License
---
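
The attribution above concerns translating CSS selectors into XPath. As a rough illustration of that idea, the sketch below runs a hand-written XPath equivalent of the selector `div.fic-title h1[property="name"]` through PHP's built-in DOM extension. It does not call *HTMLScraper*, the sample HTML is invented, and the XPath string is only one common translation; the library's own output may differ.

```php
<?php
// Sketch only: uses PHP's built-in DOM extension, not HTMLScraper itself.
// The sample markup and the XPath string are illustrative assumptions.
libxml_use_internal_errors(true); // keep parser warnings out of the output

$html = <<<'HTML'
<html><body>
  <div class="header fic-title"><h1 property="name">The Wandering Inn</h1></div>
  <div class="fic-title-extra"><h1 property="name">Not a match</h1></div>
</body></html>
HTML;

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// CSS:   div.fic-title h1[property="name"]
// XPath: the class test pads both sides with spaces so that "fic-title"
//        matches as a whole token, not as a substring of "fic-title-extra".
$query = '//div[contains(concat(" ", normalize-space(@class), " "), " fic-title ")]'
       . '//h1[@property="name"]';

$titles = [];
foreach ($xpath->query($query) as $node) {
    $titles[] = trim($node->textContent);
}

var_dump($titles);
```

The space-padded `contains()` test is the usual way to express a CSS class match in XPath 1.0, which has no native notion of whitespace-separated tokens.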

For *basic* documentation, see the DOC file.

### Example
```php
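<?php
// Assumption: the lines before `load_HTML_file(` were lost from this example.
// The setup below is reconstructed; add `use` statements for HTMLScraper,
// DOMNodeHelper and TrimmedText that match the package's namespace.
require_once 'vendor/autoload.php';

$doc = new HTMLScraper(); // assumed no-argument constructor
if (!$doc->load_HTML_file('https://www.royalroad.com/fiction/10073/the-wandering-inn')) {
    echo 'Unable to load data';
    exit(1);
}

$data = [];

// CSS selector, extracted as trimmed text (index 0 presumably selects the first match)
$data['title'] = $doc->querySelector_extract(TrimmedText, 'div.fic-title h1[property="name"]', 0);

// XPath query, with a callback that reads the `content` attribute
$data['url'] = $doc->xpath_extract(function($meta) {
    return $meta->getAttribute('content');
}, '//meta[@property="og:url"]', 0);

// Inner HTML of the description block, escaped for safe display
$data['description'] = htmlspecialchars($doc->querySelector_extract(function(&$div) {
    return trim(DOMNodeHelper::innerHTML($div));
}, 'div.description div[property="description"]', 0));

// No index argument here, unlike the calls above
$data['tags'] = $doc->querySelector_extract(TrimmedText, 'span.tags span[property="genre"]');

var_dump($data);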
```
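
The example above follows a pattern of "run a query, then map a callback over the matches". The sketch below shows one way that pattern could be written with PHP's built-in `DOMDocument` and `DOMXPath` alone, so it can be tried offline. It assumes PHP 7.4 or later. `extract_nodes()` is a hypothetical helper written for this illustration, not the library's implementation, and the treatment of the index argument (one result when given, all results when omitted) is an assumption inferred from the example. The inline HTML and URL are invented.

```php
<?php
// Hypothetical helper, not part of HTMLScraper: applies $fn to every node
// matched by $query, or only to the node at $index when an index is given.
function extract_nodes(DOMXPath $xp, callable $fn, string $query, ?int $index = null) {
    $nodes = $xp->query($query);
    if ($nodes === false) {
        return null; // malformed XPath expression
    }
    if ($index !== null) {
        $node = $nodes->item($index);
        return $node === null ? null : $fn($node);
    }
    $out = [];
    foreach ($nodes as $node) {
        $out[] = $fn($node);
    }
    return $out;
}

libxml_use_internal_errors(true); // keep parser warnings out of the output
$dom = new DOMDocument();
$dom->loadHTML(
    '<html><head><meta property="og:url" content="https://example.com/fiction/1"></head>'
    . '<body><span class="tags"><span property="genre"> Fantasy </span>'
    . '<span property="genre">Adventure</span></span></body></html>'
);
$xp = new DOMXPath($dom);

// Single result: attribute value of the first matching <meta>
$url = extract_nodes($xp, fn($meta) => $meta->getAttribute('content'), '//meta[@property="og:url"]', 0);

// All results: trimmed text of every matching <span>
$tags = extract_nodes($xp, fn($n) => trim($n->textContent), '//span[@property="genre"]');

var_dump($url, $tags);
```

Returning `null` for a missing node keeps the single-result case distinguishable from an empty list in the all-results case.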