https://github.com/sebastiansulinski/path-extractor
Parse html document and extract paths from the images, anchors and other tags.
domdocument html php
- Host: GitHub
- URL: https://github.com/sebastiansulinski/path-extractor
- Owner: sebastiansulinski
- License: mit
- Created: 2019-07-18T21:24:36.000Z (almost 6 years ago)
- Default Branch: main
- Last Pushed: 2023-01-04T12:49:05.000Z (over 2 years ago)
- Last Synced: 2024-09-21T14:13:39.975Z (9 months ago)
- Topics: domdocument, html, php
- Language: PHP
- Size: 19.5 KB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# Path extractor
A package that extracts paths and attributes from the image, anchor and other tags of the provided HTML.
### Installation
```bash
composer require sebastiansulinski/path-extractor
```
### Basic usage
#### Instantiating
You can instantiate `Extractor` either with the `new` keyword or with the static `make` method.
The constructor takes an optional argument: the HTML string to be parsed.
```php
use SSD\PathExtractor\Extractor;

$extractor = new Extractor;
$extractor = new Extractor($html);
$extractor = Extractor::make();
$extractor = Extractor::make($html);
```
#### Specifying input html
Apart from passing your string via the constructor, you can also use the `Extractor::for` method to set it on the instance.
```php
$extractor = new Extractor;
$extractor->for($html);
```
#### Extracting images
To extract all images use the `Extractor::extract(Image::class)` method.
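```php
use SSD\PathExtractor\Extractor;
use SSD\PathExtractor\Tags\Image;

// Example markup; the paths and alt text are placeholders.
$html = '<img src="/media/image1.jpg" alt="Image 1">';
$html .= '<img src="/media/image2.png" alt="Image 2">';

$images = Extractor::make($html)->extract(Image::class);
```
The above returns an array of `\SSD\PathExtractor\Tags\Image` instances, each with the `src` and `alt` properties available.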
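For readers curious how this kind of extraction can work, here is a minimal sketch built on PHP's `DOMDocument`. It is not the package's implementation: `extractAttributes` and the sample markup are hypothetical and exist only for illustration.

```php
// Hypothetical helper, not part of the package: collects the requested
// attributes from every element with the given tag name.
function extractAttributes(string $html, string $tagName, array $attributes): array
{
    // Collect libxml warnings internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    $document = new DOMDocument();
    $document->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $results = [];
    foreach ($document->getElementsByTagName($tagName) as $element) {
        $row = [];
        foreach ($attributes as $attribute) {
            // getAttribute() returns an empty string when the attribute is absent.
            $row[$attribute] = $element->getAttribute($attribute);
        }
        $results[] = $row;
    }

    return $results;
}

// Placeholder markup for illustration.
$html = '<img src="/media/image1.jpg" alt="Image 1">';
$html .= '<img src="/media/image2.png" alt="Image 2">';

$images = extractAttributes($html, 'img', ['src', 'alt']);
```

Each entry in `$images` is a plain array keyed by attribute name, whereas the package returns `Image` instances.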
#### Extracting anchors
To extract all anchors use the `Extractor::extract(Anchor::class)` method.
```php
use SSD\PathExtractor\Tags\Anchor;

// Example markup; the href values are placeholders.
$html = '<a href="/media/files/one.pdf">Document one</a>';
$html .= '<a href="/media/files/two.docx">Word document</a>';

$anchors = Extractor::make($html)->extract(Anchor::class);
```
The above returns an array of `\SSD\PathExtractor\Tags\Anchor` instances, each with the `href`, `target`, `title` and `nodeValue` properties available.
#### Extracting scripts
To extract all scripts use the `Extractor::extract(Script::class)` method.
```php
use SSD\PathExtractor\Tags\Script;

// Example markup; the src values are placeholders.
$html = '<script src="/js/app.js"></script>';
$html .= '<script src="/js/async.js" async></script>';
$html .= '<script src="/js/defer.js" defer></script>';

$scripts = Extractor::make($html)->extract(Script::class);
```
The above returns an array of `\SSD\PathExtractor\Tags\Script` instances, each with the `src`, `async` and `defer` properties available. The last two are booleans, set to `true` or `false` depending on whether the attribute is present on the tag.
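The presence-based booleans can be illustrated with `DOMElement::hasAttribute`. This is only a sketch, assuming that a bare attribute such as `async` counts as present; `scriptFlags` and the sample paths are hypothetical and not part of the package.

```php
// Hypothetical helper, not part of the package: reads src plus the
// boolean async/defer flags from every script tag.
function scriptFlags(string $html): array
{
    $previous = libxml_use_internal_errors(true);
    $document = new DOMDocument();
    $document->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $flags = [];
    foreach ($document->getElementsByTagName('script') as $script) {
        $flags[] = [
            'src' => $script->getAttribute('src'),
            // A boolean attribute is true when present, whatever its value.
            'async' => $script->hasAttribute('async'),
            'defer' => $script->hasAttribute('defer'),
        ];
    }

    return $flags;
}

// Placeholder markup for illustration.
$html = '<script src="/js/app.js"></script>';
$html .= '<script src="/js/async.js" async></script>';
$html .= '<script src="/js/defer.js" defer></script>';

$scripts = scriptFlags($html);
```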
#### Limiting extensions
Sometimes you might want to extract only images or anchors with certain extensions.
To do this, use the `Extractor::withExtensions()` method and pass the required extensions as a single string, an array, or multiple arguments.
```php
$images = Extractor::make($html)->withExtensions('jpg')->extract(Image::class);
$anchors = Extractor::make($html)->withExtensions(['pdf', 'docx'])->extract(Anchor::class);
$anchors = Extractor::make($html)->withExtensions('pdf', 'docx')->extract(Anchor::class);
```
#### Prepending a URL
Sometimes you might wish to prepend the protocol, domain name and even a port to the relative paths extracted from your html.
To do this, use the `Extractor::withUrl()` method.
```php
$html = '<img src="/media/image.jpg">';
$html .= '<img src="https://ssdtutorials.com/media/image2.jpg">';
$images = Extractor::make($html)->withUrl('https://mywebsite.com')->extract(Image::class);
```
The above returns an array containing two instances of `\SSD\PathExtractor\Tags\Image`: one with `src` set to `https://mywebsite.com/media/image.jpg` and the other to `https://ssdtutorials.com/media/image2.jpg`. **Please note** - it will not replace paths that already contain a protocol and domain.
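The rule described above (prefix relative paths, leave absolute URLs alone) can be sketched with `parse_url`. This is an illustration under stated assumptions rather than the package's code: `prependUrl` is a hypothetical helper, and treating a path as absolute only when it has both a scheme and a host is an assumption.

```php
// Hypothetical helper, not part of the package.
function prependUrl(string $path, string $baseUrl): string
{
    $parts = parse_url($path);

    // Paths that already carry a protocol and a domain are returned untouched.
    if (is_array($parts) && isset($parts['scheme'], $parts['host'])) {
        return $path;
    }

    // Join with exactly one slash between the base URL and the path.
    return rtrim($baseUrl, '/') . '/' . ltrim($path, '/');
}

$relative = prependUrl('/media/image.jpg', 'https://mywebsite.com');
$absolute = prependUrl('https://ssdtutorials.com/media/image2.jpg', 'https://mywebsite.com');
```

Here `$relative` becomes `https://mywebsite.com/media/image.jpg`, while `$absolute` keeps its original value.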
#### Tidying / purifying input
If you'd like your input to be purified first, use the `Extractor::withTidy()` method.
This method takes 2 optional arguments: `array $config = []`, which lets you override the default `tidy` extension configuration, and `string $encoding = 'utf8'`, should you need to change the encoding.
By default the config is set to:
```php
[
    'clean' => 'yes',
    'output-html' => 'yes',
    'wrap' => 0,
]
```
More on config options at [HTML Tidy Configuration Options](http://tidy.sourceforge.net/docs/quickref.html).
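One plausible way to apply overrides is to merge them over the defaults, as sketched below. Whether the package merges the overrides or replaces the defaults wholesale is an assumption here, and `resolveTidyConfig` is a hypothetical name used only for illustration.

```php
// Hypothetical helper, not part of the package: overlays user-supplied
// tidy options on top of the documented defaults.
function resolveTidyConfig(array $overrides = []): array
{
    $defaults = [
        'clean' => 'yes',
        'output-html' => 'yes',
        'wrap' => 0,
    ];

    // With string keys, later values in array_merge() win over earlier ones.
    return array_merge($defaults, $overrides);
}

$config = resolveTidyConfig(['wrap' => 80]);
```

With this approach, `$config` keeps `clean` and `output-html` from the defaults and only `wrap` changes.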
#### Invalid input exception
If you decide NOT to use `tidy` to purify your input (for instance, because you purify it yourself before passing the html to the constructor or the `for` method) and the provided html contains invalid syntax, a `\SSD\PathExtractor\InvalidHtmlException` will be thrown, so make sure you catch it and act accordingly.
#### Accessing attributes of the `\SSD\PathExtractor\Tags\Tag` class instance.
Each implementation of `\SSD\PathExtractor\Tags\Tag` has its own set of properties available:
```php
\SSD\PathExtractor\Tags\Anchor
- href
- target
- title
- rel
- nodeValue (represents text in between opening and closing a tag)

\SSD\PathExtractor\Tags\Image
- src
- alt
- width
- height

\SSD\PathExtractor\Tags\Script
- src
- type
- charset
- async
- defer

\SSD\PathExtractor\Tags\Link
- href
- type
- rel
```
#### Rendering tag for `\SSD\PathExtractor\Tags\Tag` class instance.
Once you have extracted the collection of resources, you can render an html tag for each one by casting it to a string or by calling the `tag()` method on it.
```php
// Example markup; the paths and alt text are placeholders.
$html = '<img src="/media/image1.jpg" alt="Image 1">';
$html .= '<img src="/media/image2.png" alt="Image 2">';
$tag1 = (string) Extractor::make($html)->withExtensions('jpg')->extract(Image::class)[0];
$tag2 = Extractor::make($html)->withExtensions('jpg')->extract(Image::class)[0]->tag();
```
Both of the above return the rendered `img` tag of the first matching image, along the lines of:
```php
<img src="/media/image1.jpg" alt="Image 1">
```
You can also obtain an array representation of each instance by calling the `Tag::toArray()` method on it:
```php
Extractor::make($html)->withExtensions('jpg')->extract(Image::class)[0]->toArray();
```
#### Adding more tag types
If you need more tag types, e.g. `link`, simply add a new class that extends `\SSD\PathExtractor\Tags\Tag` and implement the abstract methods it requires.
```php
use SSD\PathExtractor\Tags\Tag;
use SSD\PathExtractor\Tags\Type;

class Link extends Tag
{
    /**
     * Get tag name.
     *
     * @return string
     */
    static public function tagName(): string
    {
        return 'link';
    }

    /**
     * Get path attribute.
     *
     * @return string
     */
    static public function pathAttribute(): string
    {
        return 'href';
    }

    /**
     * Get available attributes.
     *
     * @return array
     */
    static public function availableAttributes(): array
    {
        return [
            'href' => Type::STRING,
            'type' => Type::STRING,
            'rel' => Type::STRING,
        ];
    }

    /**
     * Get formatted tag.
     *
     * @return string
     */
    public function tag(): string
    {
        return '<link '.$this->tagAttributes('href', 'type', 'rel').'>';
    }
}
```
#### Example of extracting only paths
```php
// Minimal markup matching the paths asserted below.
$string = '<img src="/media/image/one.jpg">';
$string .= '<img src="https://mysite.com/media/image/two.jpg">';
$string .= '<a href="/media/files/two.pdf">Document</a>';
$string .= '<script src="/media/script/three.js"></script>';
$string .= '<link href="/media/link/three.css">';

$extractor = Extractor::make($string);

$images = array_map(function (Tag $tag) {
    return $tag->path();
}, $extractor->extract(Image::class));

$anchors = array_map(function (Tag $tag) {
    return $tag->path();
}, $extractor->extract(Anchor::class));

$scripts = array_map(function (Tag $tag) {
    return $tag->path();
}, $extractor->extract(Script::class));

$links = array_map(function (Tag $tag) {
    return $tag->path();
}, $extractor->extract(Link::class));

$this->assertEquals([
    '/media/image/one.jpg',
    'https://mysite.com/media/image/two.jpg',
    '/media/files/two.pdf',
    '/media/script/three.js',
    '/media/link/three.css',
], array_merge($images, $anchors, $scripts, $links));
```