Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/anshu-krishna/html-scraper
A PHP class to simplify data extraction from HTML.
Last synced: about 2 months ago
- Host: GitHub
- URL: https://github.com/anshu-krishna/html-scraper
- Owner: anshu-krishna
- License: mit
- Created: 2018-11-16T11:23:40.000Z (about 6 years ago)
- Default Branch: master
- Last Pushed: 2023-01-23T16:28:47.000Z (almost 2 years ago)
- Last Synced: 2024-09-17T11:28:46.197Z (4 months ago)
- Topics: html-scraper, html-scraping, php, php-queryselector, scraper, web-scraper, web-scraping
- Language: HTML
- Homepage:
- Size: 54.7 KB
- Stars: 3
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# HTML Scraper
A set of PHP classes to simplify data extraction from HTML.

### Installation
```
composer require anshu-krishna/html-scraper
```

---
>Base code for the *CSS_to_Xpath* method in *HTMLScraper* was cloned from [https://github.com/zendframework/zend-dom](https://github.com/zendframework/zend-dom).
>Zend Framework
>: [http://framework.zend.com/](http://framework.zend.com/)
>Repository
>: [http://github.com/zendframework/zf2](http://github.com/zendframework/zf2)
>Copyright (c) 2005-2015 Zend Technologies USA Inc. [http://www.zend.com](http://www.zend.com)
>License
>: [https://framework.zend.com/license](https://framework.zend.com/license) New BSD License
---

For *basic* documentation, see the DOC file.
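
To give a rough idea of what a CSS-to-XPath translation such as *CSS_to_Xpath* has to produce, the sketch below evaluates a hand-written XPath equivalent of one CSS selector using only PHP's built-in DOM extension. The XPath string and the sample HTML are illustrative assumptions; they are not the library's actual output or code.

```php
<?php
// Illustrative only: a hand-written XPath equivalent of a CSS selector,
// evaluated with PHP's built-in DOM extension (not the library's own code).
$html = '<div class="fic-title big"><h1 property="name"> The Wandering Inn </h1></div>';

$dom = new DOMDocument();
// Collect libxml warnings internally instead of printing them.
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// CSS:   div.fic-title h1[property="name"]
// XPath: a class selector matches one whitespace-separated token, so @class
//        is padded with spaces before testing for " fic-title ".
$query = '//div[contains(concat(" ", normalize-space(@class), " "), " fic-title ")]'
       . '//h1[@property="name"]';

$nodes = $xpath->query($query);
$title = $nodes->length > 0 ? trim($nodes->item(0)->textContent) : null;

echo $title, PHP_EOL;
```

The padded `concat(...)` test is the usual way to express a CSS class match in XPath 1.0, because a plain `contains(@class, "fic-title")` would also match class names such as `fic-title-small`.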
### Example
```php
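<?php
// Assumes a Composer install; import HTMLScraper, DOMNodeHelper and TrimmedText
// from the package's namespace (not shown here, see the DOC file).
require_once 'vendor/autoload.php';
$doc = new HTMLScraper(); // assumed: no-argument constructor
if (!$doc->load_HTML_file('https://www.royalroad.com/fiction/10073/the-wandering-inn')) {
	echo 'Unable to load data';
	exit(1);
}

$data = [];
$data['title'] = $doc->querySelector_extract(TrimmedText, 'div.fic-title h1[property="name"]', 0);
// Read the canonical URL from the og:url meta tag.
$data['url'] = $doc->xpath_extract(function($meta) {
	return $meta->getAttribute('content');
}, '//meta[@property="og:url"]', 0);
// Keep the description's inner HTML, escaped for safe display.
$data['description'] = htmlspecialchars($doc->querySelector_extract(function(&$div) {
	return trim(DOMNodeHelper::innerHTML($div));
}, 'div.description div[property="description"]', 0));
$data['tags'] = $doc->querySelector_extract(TrimmedText, 'span.tags span[property="genre"]');
var_dump($data);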
```
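
The calls in the example share one shape: an extractor callable, a selector, and an optional index. The sketch below imitates that pattern with PHP's built-in `DOMXPath` (PHP 7.4 or later for the arrow functions). `extract_nodes` is a hypothetical helper, not part of this library, and its index behaviour (an integer returns one value, omitting it returns an array of all matches) is an assumption inferred from the example, not taken from the DOC file.

```php
<?php
/**
 * Hypothetical helper imitating the extractor/selector/index call shape.
 * Returns one extracted value when $index is given, otherwise an array.
 */
function extract_nodes(DOMXPath $xpath, callable $extractor, string $query, ?int $index = null)
{
    $nodes = $xpath->query($query);
    if ($nodes === false) {
        return null; // the XPath expression could not be evaluated
    }
    if ($index !== null) {
        $node = $nodes->item($index); // null when the index is out of range
        return $node === null ? null : $extractor($node);
    }
    $values = [];
    foreach ($nodes as $node) {
        $values[] = $extractor($node);
    }
    return $values;
}

// Sample markup invented for this sketch.
$html = '<html><head><meta property="og:url" content="https://example.com/fiction"></head>'
      . '<body><span class="tags"><span property="genre"> Fantasy </span>'
      . '<span property="genre"> Adventure </span></span></body></html>';

$dom = new DOMDocument();
libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$dom->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($dom);

// Stand-in for a trimmed-text extractor.
$trimmedText = fn(DOMNode $n): string => trim($n->textContent);

$url  = extract_nodes($xpath, fn(DOMElement $m) => $m->getAttribute('content'), '//meta[@property="og:url"]', 0);
$tags = extract_nodes($xpath, $trimmedText, '//span[@property="genre"]');

var_dump($url, $tags);
```

Passing the extractor as a callable keeps node selection separate from value extraction, so the same query logic can return text, attributes, or inner HTML.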