{"id":13545670,"url":"https://github.com/VIPnytt/SitemapParser","last_synced_at":"2025-04-02T15:31:46.177Z","repository":{"id":6348037,"uuid":"55315447","full_name":"VIPnytt/SitemapParser","owner":"VIPnytt","description":"XML Sitemap parser class compliant with the Sitemaps.org protocol.","archived":false,"fork":false,"pushed_at":"2023-11-27T15:30:30.000Z","size":65,"stargazers_count":72,"open_issues_count":0,"forks_count":28,"subscribers_count":2,"default_branch":"master","last_synced_at":"2024-05-17T04:43:59.805Z","etag":null,"topics":["parser","sitemap","sitemaps-org","xml","xml-sitemap-parser"],"latest_commit_sha":null,"homepage":"","language":"PHP","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/VIPnytt.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2016-04-02T20:47:41.000Z","updated_at":"2024-05-01T16:57:45.000Z","dependencies_parsed_at":"2024-01-16T17:02:22.081Z","dependency_job_id":"81e16450-9ce3-4ab9-98be-f88f07eef81d","html_url":"https://github.com/VIPnytt/SitemapParser","commit_stats":{"total_commits":47,"total_committers":11,"mean_commits":"4.2727272727272725","dds":"0.23404255319148937","last_synced_commit":"c263e1f0fdcc8be541e805d52acab801a43be358"},"previous_names":[],"tags_count":15,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/VIPnytt%2FSitemapParser","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/VIPnytt%2FSitemapParser/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/VIPnytt%2FSitemapParser/releases","manifests_url":"https://rep
os.ecosyste.ms/api/v1/hosts/GitHub/repositories/VIPnytt%2FSitemapParser/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/VIPnytt","download_url":"https://codeload.github.com/VIPnytt/SitemapParser/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":242478375,"owners_count":20134966,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["parser","sitemap","sitemaps-org","xml","xml-sitemap-parser"],"created_at":"2024-08-01T11:01:09.005Z","updated_at":"2025-04-02T15:31:41.168Z","avatar_url":"https://github.com/VIPnytt.png","language":"PHP","funding_links":[],"categories":["PHP"],"sub_categories":[],"readme":"[![Build Status](https://travis-ci.org/VIPnytt/SitemapParser.svg?branch=master)](https://travis-ci.org/VIPnytt/SitemapParser)\n[![Scrutinizer Code Quality](https://scrutinizer-ci.com/g/VIPnytt/SitemapParser/badges/quality-score.png?b=master)](https://scrutinizer-ci.com/g/VIPnytt/SitemapParser/?branch=master)\n[![Code Climate](https://codeclimate.com/github/VIPnytt/SitemapParser/badges/gpa.svg)](https://codeclimate.com/github/VIPnytt/SitemapParser)\n[![Test Coverage](https://codeclimate.com/github/VIPnytt/SitemapParser/badges/coverage.svg)](https://codeclimate.com/github/VIPnytt/SitemapParser/coverage)\n[![License](https://poser.pugx.org/VIPnytt/SitemapParser/license)](https://github.com/VIPnytt/SitemapParser/blob/master/LICENSE)\n[![Packagist](https://img.shields.io/packagist/v/VIPnytt/SitemapParser.svg)](https://packagist.org/packages/VIPnytt/SitemapParser)\n[![Join the chat at 
https://gitter.im/VIPnytt/SitemapParser](https://badges.gitter.im/VIPnytt/SitemapParser.svg)](https://gitter.im/VIPnytt/SitemapParser)\n\n# XML Sitemap parser\nAn easy-to-use PHP library to parse XML Sitemaps compliant with the [Sitemaps.org protocol](http://www.sitemaps.org/protocol.html).\n\nThe [Sitemaps.org](http://www.sitemaps.org/) protocol is the leading standard and is supported by Google, Bing, Yahoo, Ask and many others.\n\n[![SensioLabsInsight](https://insight.sensiolabs.com/projects/2d3fbd49-66c4-4ab9-9007-aaeec6956d30/big.png)](https://insight.sensiolabs.com/projects/2d3fbd49-66c4-4ab9-9007-aaeec6956d30)\n\n## Features\n- Basic parsing\n- Recursive parsing\n- String parsing\n- Custom User-Agent string\n- Proxy support\n- URL blacklist\n- Request throttling (using https://github.com/hamburgscleanest/guzzle-advanced-throttle)\n- Retry (using https://github.com/caseyamcl/guzzle_retry_middleware)\n- Advanced logging (using https://github.com/gmponos/guzzle_logger)\n\n## Formats supported\n- XML `.xml`\n- Compressed XML `.xml.gz`\n- Robots.txt rule sheet `robots.txt`\n- Line separated text _(disabled by default)_\n\n## Requirements\n- PHP [5.6 or 7.0+](http://php.net/supported-versions.php), alternatively [HHVM](http://hhvm.com)\n- PHP extensions:\n  - [mbstring](http://php.net/manual/en/book.mbstring.php)\n  - [libxml](http://php.net/manual/en/book.libxml.php) _(enabled by default)_\n  - [SimpleXML](http://php.net/manual/en/book.simplexml.php) _(enabled by default)_\n- Optional:\n  - https://github.com/caseyamcl/guzzle_retry_middleware\n  - https://github.com/hamburgscleanest/guzzle-advanced-throttle\n\n## Installation\nThe library is available for installation via [Composer](https://getcomposer.org). 
Just add this to your `composer.json` file:\n```json\n{\n    \"require\": {\n        \"vipnytt/sitemapparser\": \"^1.0\"\n    }\n}\n```\nThen run `composer update`.\n\n## Getting Started\n\n### Basic example\nReturns a list of URLs only.\n```php\nuse vipnytt\\SitemapParser;\nuse vipnytt\\SitemapParser\\Exceptions\\SitemapParserException;\n\ntry {\n    $parser = new SitemapParser();\n    $parser-\u003eparse('http://php.net/sitemap.xml');\n    foreach ($parser-\u003egetURLs() as $url =\u003e $tags) {\n        echo $url . '\u003cbr\u003e';\n    }\n} catch (SitemapParserException $e) {\n    echo $e-\u003egetMessage();\n}\n```\n\n### Advanced\nReturns all available tags, for both Sitemaps and URLs.\n```php\nuse vipnytt\\SitemapParser;\nuse vipnytt\\SitemapParser\\Exceptions\\SitemapParserException;\n\ntry {\n    $parser = new SitemapParser('MyCustomUserAgent');\n    $parser-\u003eparse('http://php.net/sitemap.xml');\n    foreach ($parser-\u003egetSitemaps() as $url =\u003e $tags) {\n        echo 'Sitemap\u003cbr\u003e';\n        echo 'URL: ' . $url . '\u003cbr\u003e';\n        echo 'LastMod: ' . $tags['lastmod'] . '\u003cbr\u003e';\n        echo '\u003chr\u003e';\n    }\n    foreach ($parser-\u003egetURLs() as $url =\u003e $tags) {\n        echo 'URL: ' . $url . '\u003cbr\u003e';\n        echo 'LastMod: ' . $tags['lastmod'] . '\u003cbr\u003e';\n        echo 'ChangeFreq: ' . $tags['changefreq'] . '\u003cbr\u003e';\n        echo 'Priority: ' . $tags['priority'] . '\u003cbr\u003e';\n        echo '\u003chr\u003e';\n    }\n} catch (SitemapParserException $e) {\n    echo $e-\u003egetMessage();\n}\n```\n\n### Recursive\nParses any sitemap detected while parsing, to get a complete list of URLs.\n\nUse `url_black_list` to skip sitemaps that are part of a parent sitemap. 
Exact match only.\n```php\nuse vipnytt\\SitemapParser;\nuse vipnytt\\SitemapParser\\Exceptions\\SitemapParserException;\n\ntry {\n    $parser = new SitemapParser('MyCustomUserAgent');\n    $parser-\u003eparseRecursive('http://www.google.com/robots.txt');\n    echo '\u003ch2\u003eSitemaps\u003c/h2\u003e';\n    foreach ($parser-\u003egetSitemaps() as $url =\u003e $tags) {\n        echo 'URL: ' . $url . '\u003cbr\u003e';\n        echo 'LastMod: ' . $tags['lastmod'] . '\u003cbr\u003e';\n        echo '\u003chr\u003e';\n    }\n    echo '\u003ch2\u003eURLs\u003c/h2\u003e';\n    foreach ($parser-\u003egetURLs() as $url =\u003e $tags) {\n        echo 'URL: ' . $url . '\u003cbr\u003e';\n        echo 'LastMod: ' . $tags['lastmod'] . '\u003cbr\u003e';\n        echo 'ChangeFreq: ' . $tags['changefreq'] . '\u003cbr\u003e';\n        echo 'Priority: ' . $tags['priority'] . '\u003cbr\u003e';\n        echo '\u003chr\u003e';\n    }\n} catch (SitemapParserException $e) {\n    echo $e-\u003egetMessage();\n}\n```\n\n### Parsing of line separated text strings\n__Note:__ This is __disabled by default__ to avoid false positives when XML is expected but plain text is fetched instead.\n\nTo disable `strict` mode, pass this configuration as the second constructor argument: ````['strict' =\u003e false]````.\n```php\nuse vipnytt\\SitemapParser;\nuse vipnytt\\SitemapParser\\Exceptions\\SitemapParserException;\n\ntry {\n    $parser = new SitemapParser('MyCustomUserAgent', ['strict' =\u003e false]);\n    $parser-\u003eparse('https://www.xml-sitemaps.com/urllist.txt');\n    foreach ($parser-\u003egetSitemaps() as $url =\u003e $tags) {\n        echo $url . '\u003cbr\u003e';\n    }\n    foreach ($parser-\u003egetURLs() as $url =\u003e $tags) {\n        echo $url . '\u003cbr\u003e';\n    }\n} catch (SitemapParserException $e) {\n    echo $e-\u003egetMessage();\n}\n```\n\n### Throttling\n\n1. Install middleware:\n```bash\ncomposer require hamburgscleanest/guzzle-advanced-throttle\n```\n2. 
Define host rules:\n\n```php\n$rules = new RequestLimitRuleset([\n    'https://www.google.com' =\u003e [\n        [\n            'max_requests'     =\u003e 20,\n            'request_interval' =\u003e 1\n        ],\n        [\n            'max_requests'     =\u003e 100,\n            'request_interval' =\u003e 120\n        ]\n    ]\n]);\n```\n3. Create handler stack:\n\n```php\n$stack = new HandlerStack();\n$stack-\u003esetHandler(new CurlHandler());\n```\n4. Create middleware:\n```php\n$throttle = new ThrottleMiddleware($rules);\n\n// Invoke the middleware\n$stack-\u003epush($throttle());\n\n// OR: alternatively call the handle method directly\n$stack-\u003epush($throttle-\u003ehandle());\n```\n5. Create client manually:\n```php\n$client = new \\GuzzleHttp\\Client(['handler' =\u003e $stack]);\n```\n6. Pass client as an argument or use the `setClient` method:\n```php\n$parser = new SitemapParser();\n$parser-\u003esetClient($client);\n```\nMore details about this middleware are available [here](https://github.com/hamburgscleanest/guzzle-advanced-throttle)\n\n### Automatic retry\n\n1. Install middleware:\n```bash\ncomposer require caseyamcl/guzzle_retry_middleware\n```\n\n2. Create stack:\n```php\n$stack = new HandlerStack();\n$stack-\u003esetHandler(new CurlHandler());\n```\n\n3. Add middleware to the stack:\n```php\n$stack-\u003epush(GuzzleRetryMiddleware::factory());\n```\n\n4. Create client manually:\n```php\n$client = new \\GuzzleHttp\\Client(['handler' =\u003e $stack]);\n```\n\n5. Pass client as an argument or use the `setClient` method:\n```php\n$parser = new SitemapParser();\n$parser-\u003esetClient($client);\n```\nMore details about this middleware are available [here](https://github.com/caseyamcl/guzzle_retry_middleware)\n\n### Advanced logging\n\n1. Install middleware:\n```bash\ncomposer require gmponos/guzzle_logger\n```\n\n2. Create PSR-3 style logger\n```php\n$logger = new Logger();\n```\n\n3. 
Create handler stack:\n\n```php\n$stack = new HandlerStack();\n$stack-\u003esetHandler(new CurlHandler());\n```\n\n4. Push logger middleware to stack\n```php\n$stack-\u003epush(new LogMiddleware($logger));\n```\n\n5. Create client manually:\n```php\n$client = new \\GuzzleHttp\\Client(['handler' =\u003e $stack]);\n```\n6. Pass client as an argument or use the `setClient` method:\n```php\n$parser = new SitemapParser();\n$parser-\u003esetClient($client);\n```\nMore details about this middleware's configuration (like log levels, when to log and what to log) are available [here](https://github.com/gmponos/guzzle_logger)\n\n### Additional examples\nMore examples are available in the [examples](https://github.com/VIPnytt/SitemapParser/tree/master/examples) directory.\n\n## Configuration\nAvailable configuration options, with their default values:\n```php\n$config = [\n    'strict' =\u003e true, // (bool) Disallow parsing of line-separated plain text\n    'guzzle' =\u003e [\n        // GuzzleHttp request options\n        // http://docs.guzzlephp.org/en/latest/request-options.html\n    ],\n    // URLs to ignore when parsing sitemaps that contain other sitemaps. Exact match only.\n    'url_black_list' =\u003e []\n];\n$parser = new SitemapParser('MyCustomUserAgent', $config);\n```\n_If a User-Agent is also set using the GuzzleHttp request options, it takes priority and replaces the one passed to the constructor._\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FVIPnytt%2FSitemapParser","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FVIPnytt%2FSitemapParser","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FVIPnytt%2FSitemapParser/lists"}