https://github.com/sebastiansulinski/path-extractor

Parse html document and extract paths from the images, anchors and other tags.

Host: GitHub
URL: https://github.com/sebastiansulinski/path-extractor
Owner: sebastiansulinski
License: mit
Created: 2019-07-18T21:24:36.000Z (almost 6 years ago)
Default Branch: main
Last Pushed: 2023-01-04T12:49:05.000Z (over 2 years ago)
Last Synced: 2024-09-21T14:13:39.975Z (9 months ago)
Topics: domdocument, html, php
Language: PHP
Size: 19.5 KB
Stars: 1
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE


# Path extractor

A package that extracts paths and attributes from the image, anchor and other tags of the provided html.

### Installation

```bash
composer require sebastiansulinski/path-extractor
```

### Basic usage

#### Instantiating

You can instantiate `Extractor` either by using the `new` keyword or the static `make` method.

The constructor takes an optional argument, which is the string to be parsed.

```php
use SSD\PathExtractor\Extractor;

$extractor = new Extractor;
$extractor = new Extractor($html);

$extractor = Extractor::make();
$extractor = Extractor::make($html);
```

#### Specifying input html

Apart from passing your string via the constructor, you can also use the `Extractor::for` method to set it on the instance.

```php
$extractor = new Extractor;

$extractor->for($html);
```

#### Extracting images

To extract all images use the `Extractor::extract(Image::class)` method.

```php
use SSD\PathExtractor\Tags\Image;

$html = '';
$html .= '';

$images = Extractor::make($html)->extract(Image::class);
```

The above will return an array of `\SSD\PathExtractor\Tags\Image` instances with the `src` and `alt` properties available.

#### Extracting anchors

To extract all anchors use the `Extractor::extract(Anchor::class)` method.

```php
use SSD\PathExtractor\Tags\Anchor;

$html = 'Document one';
$html .= 'Word document';

$anchors = Extractor::make($html)->extract(Anchor::class);
```

The above will return an array of `\SSD\PathExtractor\Tags\Anchor` instances with the `href`, `target`, `title` and `nodeValue` properties available.

#### Extracting scripts

To extract all scripts use the `Extractor::extract(Script::class)` method.

```php
use SSD\PathExtractor\Tags\Script;

$html = '';
$html .= '';
$html .= '';

$scripts = Extractor::make($html)->extract(Script::class);
```

The above will return an array of `\SSD\PathExtractor\Tags\Script` instances with the `src`, `async` and `defer` properties available. The last two are booleans, set to `true` or `false` depending on whether the attribute is present on the tag.
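For intuition, the presence-based booleans described above can be reproduced with PHP's built-in `DOMDocument`. This is only a sketch and not the package's implementation: `scriptAttributes` is a hypothetical helper and the markup is made up for illustration.

```php
<?php

/**
 * Hypothetical helper: collect src/async/defer for every <script> in $html.
 */
function scriptAttributes(string $html): array
{
    $document = new DOMDocument();

    // Keep libxml quiet about imperfect markup for the purpose of this sketch.
    $previous = libxml_use_internal_errors(true);
    $document->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $scripts = [];

    foreach ($document->getElementsByTagName('script') as $node) {
        $scripts[] = [
            'src'   => $node->getAttribute('src'),
            // Boolean attributes count as true whenever they are present.
            'async' => $node->hasAttribute('async'),
            'defer' => $node->hasAttribute('defer'),
        ];
    }

    return $scripts;
}

// Made-up markup: one plain script and one marked async.
$scripts = scriptAttributes(
    '<script src="/js/one.js"></script><script src="/js/two.js" async></script>'
);
```

Here the second entry has `async` set to `true` while `defer` stays `false` for both, because neither tag carries that attribute.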

#### Limiting extensions

Sometimes you might want to only extract images or anchors with certain extensions.

To do this, use the `Extractor::withExtensions()` method and pass the required extensions as arguments.

```php
// A single extension.
$images = Extractor::make($html)->withExtensions('jpg')->extract(Image::class);

// Several extensions, passed as an array or as separate arguments.
$anchors = Extractor::make($html)->withExtensions(['pdf', 'docx'])->extract(Anchor::class);
$anchors = Extractor::make($html)->withExtensions('pdf', 'docx')->extract(Anchor::class);
```
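To illustrate what filtering by extension amounts to, here is a minimal sketch in plain PHP. It is not taken from the package: `filterByExtension` is a hypothetical helper, and ignoring the query string and letter case are assumptions made for this example.

```php
<?php

/**
 * Hypothetical helper: keep only the paths whose extension is in $extensions.
 */
function filterByExtension(array $paths, string ...$extensions): array
{
    $allowed = array_map('strtolower', $extensions);

    return array_values(array_filter($paths, function (string $path) use ($allowed) {
        // Drop any query string or fragment before reading the extension.
        $file = (string) parse_url($path, PHP_URL_PATH);
        $extension = strtolower(pathinfo($file, PATHINFO_EXTENSION));

        return in_array($extension, $allowed, true);
    }));
}

// Made-up paths for illustration.
$paths = ['/media/one.pdf', '/media/two.docx', '/media/three.jpg?v=2'];

$documents = filterByExtension($paths, 'pdf', 'docx');
```

With the sample paths, `$documents` holds the `.pdf` and `.docx` entries only.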

#### Prepending a URL

Sometimes you might wish to prepend the protocol, domain name and even a port to the relative paths extracted from your html.

To do this, use the `Extractor::withUrl()` method.

```php
$html = '<img src="/media/image.jpg">';
$html .= '<img src="https://ssdtutorials.com/media/image2.jpg">';

$images = Extractor::make($html)->withUrl('https://mywebsite.com')->extract(Image::class);
```

The above will return an array containing two instances of `\SSD\PathExtractor\Tags\Image`: one with `src` set to `https://mywebsite.com/media/image.jpg` and the other with `src` set to `https://ssdtutorials.com/media/image2.jpg`. **Please note** that paths which already contain a protocol and domain are not replaced.
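The rule described above (prefix relative paths, leave absolute ones alone) can be sketched in a few lines of plain PHP. This is an illustration only, not the package's code: `prependUrl` is a hypothetical helper, and treating protocol-relative paths (`//host/...`) as already absolute is an assumption.

```php
<?php

/**
 * Hypothetical helper: prefix $path with $url unless it is already absolute.
 */
function prependUrl(string $path, string $url): string
{
    // Paths with a scheme (https://...) or protocol-relative ones (//...) stay untouched.
    if (preg_match('#^([a-z][a-z0-9+.-]*:)?//#i', $path) === 1) {
        return $path;
    }

    // Join with exactly one slash between the base url and the path.
    return rtrim($url, '/') . '/' . ltrim($path, '/');
}

$first = prependUrl('/media/image.jpg', 'https://mywebsite.com');
$second = prependUrl('https://ssdtutorials.com/media/image2.jpg', 'https://mywebsite.com');
```

`$first` becomes `https://mywebsite.com/media/image.jpg`, while `$second` is returned unchanged.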

#### Tidying / purifying input

If you'd like your input to be purified first, you can use the `Extractor::withTidy()` method.

This method takes two optional arguments: `array $config = []`, which lets you override the default `tidy` extension configuration, and `string $encoding = 'utf8'`, should you need to change the encoding.

By default the config is set to:

```php
[
    'clean' => 'yes',
    'output-html' => 'yes',
    'wrap' => 0,
]
```

More on config options at [HTML Tidy Configuration Options](http://tidy.sourceforge.net/docs/quickref.html).

#### Invalid input exception

If you decide NOT to use `tidy` to purify your input (for instance because you clean it yourself before passing the html to the constructor or the `for` method) and the provided html contains invalid syntax, a `\SSD\PathExtractor\InvalidHtmlException` will be thrown, so make sure you catch it and act accordingly.
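If you want to check your markup yourself before handing it over, PHP's libxml functions can report parse problems. The sketch below is an assumption about how such a check could look, not a description of how the package detects invalid html. `htmlParseErrors` is a hypothetical helper, and which problems get reported depends on the installed libxml version.

```php
<?php

/**
 * Hypothetical helper: return the messages libxml reports while parsing $html.
 */
function htmlParseErrors(string $html): array
{
    // Collect parse errors instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    libxml_clear_errors();

    $document = new DOMDocument();
    $document->loadHTML($html);

    $messages = array_map(function (LibXMLError $error) {
        return trim($error->message);
    }, libxml_get_errors());

    // Restore the previous libxml state.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $messages;
}

// Made-up, well-formed document for illustration.
$errors = htmlParseErrors(
    '<!DOCTYPE html><html><head><title>Demo</title></head><body><p>Fine</p></body></html>'
);
```

A non-empty result is a hint that the input may be worth running through `withTidy()` first.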

#### Accessing attributes of the `\SSD\PathExtractor\Tags\Tag` class instance

Each implementation of `\SSD\PathExtractor\Tags\Tag` has its own set of properties available:

```php
\SSD\PathExtractor\Tags\Anchor
- href
- target
- title
- rel
- nodeValue (represents the text between the opening and closing tag)

\SSD\PathExtractor\Tags\Image
- src
- alt
- width
- height

\SSD\PathExtractor\Tags\Script
- src
- type
- charset
- async
- defer

\SSD\PathExtractor\Tags\Link
- href
- type
- rel
```

#### Rendering tag for `\SSD\PathExtractor\Tags\Tag` class instance.

Once you have extracted the collection of resources, you can then return an html tag for each one by simply casting it to string or by calling the `tag()` method on it.

```php
$html = '';
$html .= '';

$tag1 = (string) Extractor::make($html)->withExtensions('jpg')->extract(Image::class)[0];
$tag2 = Extractor::make($html)->withExtensions('jpg')->extract(Image::class)[0]->tag();
```

Both of the above will return

```php



```

You can also obtain an array representation of each instance by calling the `Tag::toArray()` method on it:

```php
$array = Extractor::make($html)->withExtensions('jpg')->extract(Image::class)[0]->toArray();
```

#### Adding more tag types

If you need more tag types, e.g. `link`, simply add a new class that extends `\SSD\PathExtractor\Tags\Tag` and implements the abstract methods required by it.

```php
use SSD\PathExtractor\Tags\Tag;
use SSD\PathExtractor\Tags\Type;

class Link extends Tag
{
    /**
     * Get tag name.
     *
     * @return string
     */
    public static function tagName(): string
    {
        return 'link';
    }

    /**
     * Get path attribute.
     *
     * @return string
     */
    public static function pathAttribute(): string
    {
        return 'href';
    }

    /**
     * Get available attributes.
     *
     * @return array
     */
    public static function availableAttributes(): array
    {
        return [
            'href' => Type::STRING,
            'type' => Type::STRING,
            'rel' => Type::STRING,
        ];
    }

    /**
     * Get formatted tag.
     *
     * @return string
     */
    public function tag(): string
    {
        return '<link'.$this->tagAttributes('href', 'type', 'rel').'>';
    }
}
```

#### Example of extracting only paths

```php
$string = '<img src="/media/image/one.jpg">';
$string .= '<img src="https://mysite.com/media/image/two.jpg">';
$string .= '<a href="/media/files/two.pdf">Document</a>';
$string .= '<script src="/media/script/three.js"></script>';
$string .= '<link href="/media/link/three.css">';

$extractor = Extractor::make($string);

// Map each collection of tags to its paths only.
$images = array_map(function (Tag $tag) {
    return $tag->path();
}, $extractor->extract(Image::class));

$anchors = array_map(function (Tag $tag) {
    return $tag->path();
}, $extractor->extract(Anchor::class));

$scripts = array_map(function (Tag $tag) {
    return $tag->path();
}, $extractor->extract(Script::class));

$links = array_map(function (Tag $tag) {
    return $tag->path();
}, $extractor->extract(Link::class));

$this->assertEquals([
    '/media/image/one.jpg',
    'https://mysite.com/media/image/two.jpg',
    '/media/files/two.pdf',
    '/media/script/three.js',
    '/media/link/three.css',
], array_merge($images, $anchors, $scripts, $links));
```
