{"id":18870406,"url":"https://github.com/sebastiansulinski/path-extractor","last_synced_at":"2026-02-14T18:30:15.276Z","repository":{"id":62541643,"uuid":"197659006","full_name":"sebastiansulinski/path-extractor","owner":"sebastiansulinski","description":"Parse html document and extract paths from the images, anchors and other tags.","archived":false,"fork":false,"pushed_at":"2023-01-04T12:49:05.000Z","size":20,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2024-09-21T14:13:39.975Z","etag":null,"topics":["domdocument","html","php"],"latest_commit_sha":null,"homepage":null,"language":"PHP","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/sebastiansulinski.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2019-07-18T21:24:36.000Z","updated_at":"2023-03-19T17:44:56.000Z","dependencies_parsed_at":"2023-02-02T13:00:22.284Z","dependency_job_id":null,"html_url":"https://github.com/sebastiansulinski/path-extractor","commit_stats":null,"previous_names":[],"tags_count":6,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sebastiansulinski%2Fpath-extractor","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sebastiansulinski%2Fpath-extractor/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sebastiansulinski%2Fpath-extractor/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sebastiansulinski%2Fpath-extractor/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/sebastiansulinski","downl
oad_url":"https://codeload.github.com/sebastiansulinski/path-extractor/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":239816507,"owners_count":19701755,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["domdocument","html","php"],"created_at":"2024-11-08T05:20:03.487Z","updated_at":"2026-02-14T18:30:15.214Z","avatar_url":"https://github.com/sebastiansulinski.png","language":"PHP","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Path extractor\n\nA package that extracts paths and attributes from the image, anchor and other tags of the provided html.\n\n### Installation\n\n```bash\ncomposer require sebastiansulinski/path-extractor\n```\n\n### Basic usage\n\n#### Instantiating\n\nYou can instantiate `Extractor` either with the `new` keyword or with the static `make` method.\nThe constructor takes an optional argument, which represents the string to be parsed.\n\n```php\nuse SSD\\PathExtractor\\Extractor;\n\n\n$extractor = new Extractor;\n\n$extractor = new Extractor($html);\n\n$extractor = Extractor::make();\n\n$extractor = Extractor::make($html);\n```\n\n#### Specifying input html\n\nApart from passing your string via the constructor, you can also use the `Extractor::for` method to set it on the instance.\n\n```php\n$extractor = new Extractor;\n$extractor-\u003efor($html);\n```\n\n#### Extracting images\n\nTo extract all images, use the `Extractor::extract(Image::class)` method.\n\n```php\nuse \\SSD\\PathExtractor\\Tags\\Image;\n\n$html = '\u003cimg src=\"/media/image.jpg\" alt=\"My 
image\"\u003e';\n$html .= '\u003cimg src=\"/media/image2.png\" alt=\"My image 2\"\u003e';\n\n$images = Extractor::make($html)-\u003eextract(Image::class);\n```\n\nThe above will return an array containing a collection of `\\SSD\\PathExtractor\\Tags\\Image` class instances with properties `src` and `alt` available.\n\n#### Extracting anchors\n\nTo extract all anchors, use the `Extractor::extract(Anchor::class)` method.\n\n```php\nuse \\SSD\\PathExtractor\\Tags\\Anchor;\n\n$html = '\u003ca href=\"/media/files/one.pdf\" target=\"_blank\"\u003eDocument one\u003c/a\u003e';\n$html .= '\u003ca href=\"/media/files/two.docx\" title=\"Word document\"\u003eWord document\u003c/a\u003e';\n\n$anchors = Extractor::make($html)-\u003eextract(Anchor::class);\n```\n\nThe above will return an array containing a collection of `\\SSD\\PathExtractor\\Tags\\Anchor` class instances with properties `href`, `target`, `title` and `nodeValue` available.\n\n#### Extracting scripts\n\nTo extract all scripts, use the `Extractor::extract(Script::class)` method.\n\n```php\nuse \\SSD\\PathExtractor\\Tags\\Script;\n\n$html = '\u003cscript src=\"/media/script/one.js\" async\u003e\u003c/script\u003e';\n$html .= '\u003cscript src=\"/media/script/two.js\" async defer\u003e\u003c/script\u003e';\n$html .= '\u003cscript src=\"/media/script/three.js\"\u003e\u003c/script\u003e';\n\n$scripts = Extractor::make($html)-\u003eextract(Script::class);\n```\n\nThe above will return an array containing a collection of `\\SSD\\PathExtractor\\Tags\\Script` class instances with properties `src`, `async`, and `defer` available - the last two are booleans (`true` / `false`) set based on whether the attribute is present or not.\n\n#### Limiting extensions\n\nSometimes you might want to extract only images or anchors with certain extensions.\nTo do this, use the `Extractor::withExtensions()` method and pass the required extensions as arguments.\n\n```php\n$images = 
Extractor::make($html)-\u003ewithExtensions('jpg')-\u003eextract(Image::class);\n$anchors = Extractor::make($html)-\u003ewithExtensions(['pdf', 'docx'])-\u003eextract(Anchor::class);\n$anchors = Extractor::make($html)-\u003ewithExtensions('pdf', 'docx')-\u003eextract(Anchor::class);\n```\n\n#### Pre-pending url\n\nSometimes you might wish to prepend the protocol, domain name and even a port to the relative paths extracted from your html.\nTo do this, use the `Extractor::withUrl()` method.\n\n```php\n$html = '\u003cimg src=\"/media/image.jpg\" alt=\"My image\"\u003e';\n$html .= '\u003cimg src=\"https://ssdtutorials.com/media/image2.jpg\" alt=\"My image 2\"\u003e';\n\n$images = Extractor::make($html)-\u003ewithUrl('https://mywebsite.com')-\u003eextract(Image::class);\n```\n\nThe above will return an array containing two instances of `\\SSD\\PathExtractor\\Tags\\Image` - one with `src` set to `https://mywebsite.com/media/image.jpg` and the other to `https://ssdtutorials.com/media/image2.jpg`. 
**Please note** - it will not replace paths that already contain a protocol and domain.\n\n#### Tidying / purifying input\n\nIf you'd like your input to undergo purification first, you can use the `Extractor::withTidy()` method.\nThis method takes 2 optional arguments: `array $config = []`, which allows you to overwrite the default `tidy` extension configuration, and `string $encoding = 'utf8'`, should you need to change the encoding.\n\nBy default the config is set to\n\n```php\n[\n    'clean' =\u003e 'yes',\n    'output-html' =\u003e 'yes',\n    'wrap' =\u003e 0,\n]\n```\n\nMore on config options at [HTML Tidy Configuration Options](http://tidy.sourceforge.net/docs/quickref.html).\n\n#### Invalid input exception\n\nIf you decide NOT to use `tidy` to purify your input (for instance, because you purify it yourself before passing the html to the constructor or `for` method) and the provided html contains invalid syntax, the `\\SSD\\PathExtractor\\InvalidHtmlException` will be thrown - so make sure you catch it and act accordingly.\n\n#### Accessing attributes of the `\\SSD\\PathExtractor\\Tags\\Tag` class instance.\n\nEach implementation of `\\SSD\\PathExtractor\\Tags\\Tag` has its own unique set of properties available\n\n```php\n\\SSD\\PathExtractor\\Tags\\Anchor\n\n- href\n- target\n- title\n- rel\n- nodeValue (represents text in between opening and closing a tag)\n\n\\SSD\\PathExtractor\\Tags\\Image\n\n- src\n- alt\n- width\n- height\n\n\\SSD\\PathExtractor\\Tags\\Script\n\n- src\n- type\n- charset\n- async\n- defer\n\n\\SSD\\PathExtractor\\Tags\\Link\n\n- href\n- type\n- rel\n```\n\n#### Rendering tag for `\\SSD\\PathExtractor\\Tags\\Tag` class instance.\n\nOnce you have extracted the collection of resources, you can then return an html tag for each one by simply casting it to string or by calling the `tag()` method on it.\n\n```php\n$html = '\u003cimg src=\"/media/image.jpg\" alt=\"My image\"\u003e';\n$html .= '\u003cimg 
src=\"/media/image2.png\" alt=\"My image 2\"\u003e';\n\n$tag1 = (string)Extractor::make($html)-\u003ewithExtensions('jpg')-\u003eextract(Image::class)[0];\n$tag2 = Extractor::make($html)-\u003ewithExtensions('jpg')-\u003eextract(Image::class)[0]-\u003etag();\n```\n\nBoth of the above will return\n\n```php\n\u003cimg src=\"/media/image.jpg\" alt=\"My image\"\u003e\n```\n\nYou can also obtain an array representation of each instance by calling the `Tag::toArray()` method on it\n\n```php\nExtractor::make($html)-\u003ewithExtensions('jpg')-\u003eextract(Image::class)[0]-\u003etoArray()\n```\n\n#### Adding more tag types\n\nIf you need more tag types, e.g. `link` - simply add a new class that extends `\\SSD\\PathExtractor\\Tags\\Tag` and implements the abstract methods required by it.\n\n```php\n\nuse SSD\\PathExtractor\\Tags\\Tag;\nuse SSD\\PathExtractor\\Tags\\Type;\n\nclass Link extends Tag\n{\n    /**\n     * Get tag name.\n     *\n     * @return string\n     */\n    static public function tagName(): string\n    {\n        return 'link';\n    }\n\n    /**\n     * Get path attribute.\n     *\n     * @return string\n     */\n    static public function pathAttribute(): string\n    {\n        return 'href';\n    }\n\n    /**\n     * Get available attributes.\n     *\n     * @return array\n     */\n    static public function availableAttributes(): array\n    {\n        return [\n            'href' =\u003e Type::STRING,\n            'type' =\u003e Type::STRING,\n            'rel' =\u003e Type::STRING,\n        ];\n    }\n\n    /**\n     * Get formatted tag.\n     *\n     * @return string\n     */\n    public function tag(): string\n    {\n        return '\u003clink'.$this-\u003etagAttributes('href', 'type', 'rel').'\u003e';\n    }\n}\n```\n\n#### Example of extracting only paths\n\n```php\n$string = '\u003cimg src=\"/media/image/one.jpg\" alt=\"Image one\"\u003e';\n$string .= '\u003cimg src=\"https://mysite.com/media/image/two.jpg\" alt=\"Image two\"\u003e';\n$string .= '\u003ca 
href=\"/media/files/two.pdf\" '.\n    'target=\"_blank\" title=\"Document\"\u003eDocument\u003c/a\u003e';\n$string .= '\u003cscript src=\"/media/script/three.js\" async\u003e\u003c/script\u003e';\n$string .= '\u003clink href=\"/media/link/three.css\" rel=\"stylesheet\"\u003e';\n\n$extractor = Extractor::make($string);\n\n\n$images = array_map(function (Tag $tag) {\n    return $tag-\u003epath();\n}, $extractor-\u003eextract(Image::class));\n\n$anchors = array_map(function (Tag $tag) {\n    return $tag-\u003epath();\n}, $extractor-\u003eextract(Anchor::class));\n\n\n$scripts = array_map(function (Tag $tag) {\n    return $tag-\u003epath();\n}, $extractor-\u003eextract(Script::class));\n\n$links = array_map(function (Tag $tag) {\n    return $tag-\u003epath();\n}, $extractor-\u003eextract(Link::class));\n\n\n$this-\u003eassertEquals([\n    '/media/image/one.jpg',\n    'https://mysite.com/media/image/two.jpg',\n    '/media/files/two.pdf',\n    '/media/script/three.js',\n    '/media/link/three.css',\n], array_merge($images, $anchors, $scripts, $links));\n```","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsebastiansulinski%2Fpath-extractor","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsebastiansulinski%2Fpath-extractor","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsebastiansulinski%2Fpath-extractor/lists"}