{"id":27024094,"url":"https://github.com/qiubits2007/xml-sitemap","last_synced_at":"2026-02-25T19:34:37.274Z","repository":{"id":284710758,"uuid":"955799922","full_name":"qiubits2007/XML-Sitemap","owner":"qiubits2007","description":"Multi-domain XML sitemap generator with support for robots.txt, meta tags, email logging \u0026 search engine pinging","archived":false,"fork":false,"pushed_at":"2025-04-25T05:20:47.000Z","size":171,"stargazers_count":2,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-04-25T06:27:13.354Z","etag":null,"topics":["crawler","generator","gzip","multi-domain","php8","robots-txt","seo","seotools","sitemap-builder","sitemap-generator","sitemap-xml"],"latest_commit_sha":null,"homepage":"","language":"PHP","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/qiubits2007.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":"SECURITY.md","support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-03-27T08:08:30.000Z","updated_at":"2025-04-25T05:18:02.000Z","dependencies_parsed_at":"2025-04-03T15:30:39.750Z","dependency_job_id":"d91ad736-e711-421c-aaed-0950564ebab4","html_url":"https://github.com/qiubits2007/XML-Sitemap","commit_stats":null,"previous_names":["qiubits2007/xml-sitemap"],"tags_count":7,"template":false,"template_full_name":null,"purl":"pkg:github/qiubits2007/XML-Sitemap","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/qiubits2007%2FXML-Sitemap","tags_url":"https://
repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/qiubits2007%2FXML-Sitemap/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/qiubits2007%2FXML-Sitemap/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/qiubits2007%2FXML-Sitemap/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/qiubits2007","download_url":"https://codeload.github.com/qiubits2007/XML-Sitemap/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/qiubits2007%2FXML-Sitemap/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":273867535,"owners_count":25182423,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-09-06T02:00:13.247Z","response_time":2576,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["crawler","generator","gzip","multi-domain","php8","robots-txt","seo","seotools","sitemap-builder","sitemap-generator","sitemap-xml"],"created_at":"2025-04-04T21:17:30.671Z","updated_at":"2026-02-25T19:34:37.246Z","avatar_url":"https://github.com/qiubits2007.png","language":"PHP","funding_links":[],"categories":[],"sub_categories":[],"readme":"# 🗺️ XML Sitemap Generator (PHP)\n\nA powerful and customizable sitemap generator written in PHP (PHP 8+).  
\nIt crawls one or multiple domains, respects `robots.txt`, follows meta directives, supports resumable sessions, sends logs by email, and can even notify search engines when the sitemap is ready.\nIt is optimized for large websites and offers advanced crawl controls, meta/robots filtering, JSON/HTML export, and more.\n\n---\n\n## ✅ Features\n\n- 🔗 Multi-domain support (comma-separated URLs)\n- 📑 Combined sitemap for all domains\n- 📑 Automatically creates multiple sitemap files if more than 50,000 URLs are found \n- 🧭 Crawling depth control\n- 🔍 `robots.txt` and `\u003cmeta name=\"robots\"\u003e` handling\n- 🔁 Resumable crawl via cache (optional)\n- 💣 `--resetcache` to force full crawl (new!)\n- 💣 `--resetlog` to delete old log files (new!)\n- 🧠 Dynamic priority \u0026 changefreq rules (via config or patterns)\n- 🧹 Pretty or single-line XML output\n- 📦 GZIP compression (optional)\n- 📧 Log by email\n- 🛠 Health check report\n- 📡 Ping Google/Bing/Yandex\n- 🧪 Debug mode with detailed logs\n- 📑 Structured logging with timestamps and levels (`info`, `error`, `debug`, etc.)\n- 📑 Log export: JSON format + HTML report (`logs/crawl_log.json`, `logs/crawl_log.html`)\n- 📑 Visual crawl map generation (`crawl_graph.json`, `crawl_map.html`)\n- 📧 Flattened email reports with attached crawl logs\n- 📧 Customizable sender email via `--from`\n- 📑 Public base URL for sitemap/map references via `--publicbase`\n\n---\n\n## 🚀 Requirements\n\n- PHP 8.0 or newer\n- `curl` and `dom` extensions enabled\n- Write permissions to the script folder (for logs/cache/sitemaps)\n\n---\n\n## ⚙️ Usage (CLI)\n\n```bash\nphp sitemap.php \\\n  --url=https://yourdomain.com,https://blog.yourdomain.com \\\n  --key=YOUR_SECRET_KEY \\\n  [options]\n```\n\n## 🌐 Usage (Browser)\n\n```url\nsitemap.php?url=https://yourdomain.com\u0026key=YOUR_SECRET_KEY\u0026gzip\u0026prettyxml\n```\n\n---\n\n## 🧩 Options\n\n| Option              | Description                                                              
|\n|---------------------|--------------------------------------------------------------------------|\n| `--url=`            | Comma-separated domain list to crawl (required)                          |\n| `--key=`            | Secret key to authorize script execution (required)                      |\n| `--output=`         | Output path for the sitemap file                                         |\n| `--depth=`          | Max crawl depth (default: 3)                                             |\n| `--gzip`            | Export sitemap as `.gz`                                                  |\n| `--prettyxml`       | Human-readable XML output                                                |\n| `--resume`          | Resume from last crawl using `cache/visited.json`                        |\n| `--resetcache`      | Force fresh crawl by deleting the cache (NEW)                            |\n| `--resetlog`        | Clear previous crawl logs before start (NEW)                             |\n| `--filters`         | Enable external filtering from `filter_config.json`                      |\n| `--graph`           | Export visual crawl map (JSON + interactive HTML)                        |\n| `--priorityrules`   | Enable dynamic `\u003cpriority\u003e` based on URL patterns                        |\n| `--changefreqrules` | Enable dynamic `\u003cchangefreq\u003e` based on URL patterns                      |\n| `--ignoremeta`      | Ignore `\u003cmeta name=\"robots\"\u003e` directives                                 |\n| `--respectrobots`   | Obey rules in `robots.txt`                                               |\n| `--email=`          | Send crawl log to this email                                             |\n| `--ping`            | Notify search engines after sitemap generation (⚠️ Google/Bing ping deprecated)                          |\n| `--threads=`        | Number of concurrent crawl threads (default: 10)                         |\n| `--agent=`          | Set a 
custom User-Agent                                                  |\n| `--splitbysite`     | Generate one sitemap per domain and build sitemap_index.xml to link them |\n| `--graphmap`        | Generate crawl map as JSON and interactive HTML                          |\n| `--publicbase=`     | Public base URL for HTML links (e.g., https://example.com/sitemaps)      |\n| `--from=`           | Sender address for email reports                                         |\n| `--debug`           | Output detailed log info for debugging                                   |\n\n\n---\n\n## 📁 Output Files\n\n- `sitemap.xml` or `sitemap-*.xml`\n- `sitemap.xml.gz` (optional)\n- `sitemap_index.xml` (if split)\n- `cache/visited.json` → stores crawl progress (used with `--resume`)\n- `logs/crawl_log.txt` → full crawl log\n- `logs/crawl_log.json` → structured log as JSON\n- `logs/crawl_log.html` → visual HTML report of the crawl log\n- `logs/crawl_report_*.txt` → emailed attachment\n- `logs/health_report.txt` → summary of crawl (errors, speed, blocks)\n- `crawl_graph.json` → graph structure for visualization\n- `crawl_map.html` → interactive crawl map\n\n---\n\n## ⚙️ External Filter Config\n\nCreate a `config/filter.json` to define your own include/exclude patterns and dynamic rules:\n\n```json\n{\n  \"excludeExtensions\": [\"jpg\", \"png\", \"zip\", \"docx\"],\n  \"excludePatterns\": [\"*/private/*\", \"debug\"],\n  \"includeOnlyPatterns\": [\"blog\", \"news\"],\n  \"priorityPatterns\": {\n    \"high\": [\"blog\", \"news\"],\n    \"low\": [\"impressum\", \"privacy\"]\n  },\n  \"changefreqPatterns\": {\n    \"daily\": [\"blog\", \"news\"],\n    \"monthly\": [\"impressum\", \"agb\"]\n  }\n}\n```\n\nActivate with:\n```bash\n--filters --priorityrules --changefreqrules\n```\n\n---\n\n## 📬 Ping Support\n\nWith `--ping` enabled, the script will notify:\n\n- Yandex: `https://webmaster.yandex.com/ping`\n\nAs of 2023/2024:\n- ❌ **Google** and **Bing** ping endpoints are deprecated (410 Gone)\n- ✅ Use `robots.txt` with a `Sitemap:` entry\n- ✅ Optionally submit in Webmaster Tools\n\n---\n\n## 🔐 Security\n\nThe script **requires a secret key** (`--key=` or `key=`) to run.  \nSet it inside the script:\n\n```php\n$authorized_hash = 'YOUR_SECRET_KEY';\n```\n\n---\n\n## 📤 Email Log\n\nSend crawl reports to your inbox with:\n\n```bash\n--email=you@yourdomain.com\n```\n\nYour server must support the `mail()` function.\n\n---\n\n## 🧪 Debugging\n\nEnable `--debug` to log everything:\n- Pattern matches\n- Skipped URLs\n- Meta robots blocking\n- Robots.txt interpretation\n- Response times\n- Log file resets\n\n---\n\n## Sitemap Splitting\n\nIf more than **50,000 URLs** are crawled (the limit of a single sitemap file per [sitemaps.org spec](https://www.sitemaps.org/protocol.html)),  \nthe script will automatically create multiple sitemap files:\n\n- `sitemap-1.xml`, `sitemap-2.xml`, ...\n- Or `domain-a-1.xml`, `domain-a-2.xml`, ... if `--splitbysite` is active\n- These are automatically referenced from a `sitemap_index.xml`\n\nNo configuration is needed – the split is automatic.\n\n---\n\n### How Split-by-Site Works\n\nWhen using `--splitbysite`, the crawler will:\n\n1. 
Create a separate sitemap file for each domain (e.g., `/sitemaps/domain1.xml`, `/sitemaps/domain2.xml`)\n2. Automatically generate a `sitemap_index.xml` file in the root directory\n3. Ping search engines with the `sitemap_index.xml` URL instead of individual sitemap files (only Yandex is still notified; the Google and Bing ping endpoints are deprecated)\n\nThis is useful when crawling multiple domains in a single run.\n\n---\n\n## Crawl Map Visualization\n\nIf you enable `--graph`, the crawler will export:\n\n- `crawl_graph.json` – link structure as raw data\n- `crawl_map.html` – interactive map powered by D3.js\n\nYou can explore your site structure visually, zoom in/out, drag nodes, and inspect links.\nUseful for spotting crawl traps, dead ends, and structure gaps.\n\n📍 Tip: For large sites, open the HTML file in Chrome or Firefox.\n\n---\n\n## 🔐 Example robots.txt\n\n```\nUser-agent: *\nDisallow:\n\nSitemap: https://yourdomain.com/sitemap.xml\n```\n\n---\n\n## 📄 License\n\nMIT License  \nFeel free to fork, modify, or contribute!\n\n---\n\n## 👤 Author\n\nBuilt by Gilles Dumont (Qiubits SARL)  \nContributions and feedback welcome.","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fqiubits2007%2Fxml-sitemap","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fqiubits2007%2Fxml-sitemap","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fqiubits2007%2Fxml-sitemap/lists"}