{"id":45226024,"url":"https://github.com/michaelblaess/sitemap-generator","last_synced_at":"2026-04-05T15:00:53.819Z","repository":{"id":338506470,"uuid":"1157936304","full_name":"michaelblaess/sitemap-generator","owner":"michaelblaess","description":"Crawls websites and generates standards-compliant sitemap.xml files. Supports Playwright for JS rendering and httpx for fast HTTP crawling","archived":false,"fork":false,"pushed_at":"2026-04-05T12:58:41.000Z","size":498,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-04-05T14:32:23.518Z","etag":null,"topics":["httpx","playwright","python","seo","sitemap","sitemap-generator","textual","tui","web-crawler","xml"],"latest_commit_sha":null,"homepage":"https://michaelblaess.github.io/sitemap-generator/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/michaelblaess.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-02-14T14:39:32.000Z","updated_at":"2026-04-05T12:58:44.000Z","dependencies_parsed_at":"2026-04-05T15:00:34.409Z","dependency_job_id":null,"html_url":"https://github.com/michaelblaess/sitemap-generator","commit_stats":null,"previous_names":["michaelblaess/playwright-sitemap-generator","michaelblaess/sitemap-generator"],"tags_count":10,"template":false,"template_full_name":null,"purl":"pkg:github/michaelblaess/sitemap-generator","repository_url":"https://repos.ecosyste.ms/api/v
1/hosts/GitHub/repositories/michaelblaess%2Fsitemap-generator","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/michaelblaess%2Fsitemap-generator/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/michaelblaess%2Fsitemap-generator/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/michaelblaess%2Fsitemap-generator/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/michaelblaess","download_url":"https://codeload.github.com/michaelblaess/sitemap-generator/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/michaelblaess%2Fsitemap-generator/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31439442,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-05T13:13:19.330Z","status":"ssl_error","status_checked_at":"2026-04-05T13:13:17.778Z","response_time":75,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["httpx","playwright","python","seo","sitemap","sitemap-generator","textual","tui","web-crawler","xml"],"created_at":"2026-02-20T19:15:49.891Z","updated_at":"2026-04-05T15:00:53.811Z","avatar_url":"https://github.com/michaelblaess.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Sitemap Generator\n\nCrawls websites and generates standards-compliant 
`sitemap.xml` files. Uses [Playwright](https://playwright.dev/) for JavaScript rendering or [httpx](https://www.python-httpx.org/) for fast HTTP crawling.\n\n## Screenshots\n\n### Main view\n![Main view](docs/screenshots/01-main.png)\n\n### Page tree\n![Page tree](docs/screenshots/02-sitemap.png)\n\n### Crawl history\n![Crawl history](docs/screenshots/03-history.png)\n\n## Installation\n\n### One-liner (standalone, no Python required)\n\n**Linux / macOS:**\n```bash\ncurl -fsSL https://raw.githubusercontent.com/michaelblaess/sitemap-generator/main/install.sh | bash\n```\n\n**Windows (PowerShell):**\n```powershell\nirm https://raw.githubusercontent.com/michaelblaess/sitemap-generator/main/install.ps1 | iex\n```\n\n## Usage\n\n```bash\n# Simple crawl (httpx mode, fast)\nsitemap-generator https://example.com\n\n# With JavaScript rendering (Playwright)\nsitemap-generator https://example.com --render\n\n# Save the sitemap directly\nsitemap-generator https://example.com --output sitemap.xml\n\n# Limit the crawl depth\nsitemap-generator https://example.com --max-depth 5\n\n# Higher concurrency\nsitemap-generator https://example.com --concurrency 16\n\n# Ignore robots.txt\nsitemap-generator https://example.com --ignore-robots\n\n# With cookies (e.g. 
for login)\nsitemap-generator https://example.com --cookie session=abc123\n```\n\n## CLI parameters\n\n| Parameter | Description | Default |\n|---|---|---|\n| `URL` | Start URL of the website | - |\n| `--output`, `-o` | Output path for sitemap.xml | `sitemap_\u003chost\u003e_\u003ctimestamp\u003e.xml` |\n| `--max-depth`, `-d` | Maximum crawl depth | 10 |\n| `--concurrency`, `-c` | Parallel requests | 8 |\n| `--timeout`, `-t` | Timeout per page (seconds) | 30 |\n| `--render` | Render JavaScript with Playwright | off |\n| `--no-headless` | Show the browser window (debugging) | off |\n| `--ignore-robots` | Ignore robots.txt | off |\n| `--user-agent` | Custom User-Agent | Chrome 131 |\n| `--cookie` | Set a cookie (NAME=VALUE, repeatable) | - |\n\n## Keyboard shortcuts (TUI)\n\n| Key | Function |\n|---|---|\n| `s` | Start crawl |\n| `x` | Cancel crawl / JSON error report |\n| `m` | Save sitemap |\n| `g` | Export form report (JSON) |\n| `j` | Copy JIRA table to clipboard |\n| `e` | Show errors only |\n| `b` | Page tree |\n| `f` | Sitemap diff |\n| `d` | Copy URL details |\n| `c` | Copy log |\n| `l` | Toggle log |\n| `+` / `-` | Enlarge/shrink log |\n| `h` | History |\n| `o` | Toggle robots.txt |\n| `p` | Toggle Playwright |\n| `i` | Info dialog |\n| `q` | Quit |\n\n## Features\n\n- **Dual mode**: httpx (fast, HTML only) or Playwright (JavaScript rendering)\n- **robots.txt**: Respected by default; use `--ignore-robots` to disable\n- **Auto-split**: For \u003e50,000 URLs, automatically generates a sitemap index with partial sitemaps\n- **Priority**: Set automatically based on crawl depth (start page = 1.0)\n- **lastmod**: Taken from the HTTP Last-Modified header\n- **URL normalization**: Duplicates are avoided through normalization\n- **Form detection**: `\u003cform\u003e` tags are detected, marked in the table, and exportable as JSON\n- **Live TUI**: Progress, statistics, and URL details in real time\n\n## 
Browser strategy\n\n1. **System Chrome** preferred (faster startup, less memory)\n2. **Bundled Chromium** as fallback (included with the standalone installation)\n\n## Privacy\n\n**Important**: Depending on scope and frequency, crawling a website may be perceived as unusual traffic by the operator. Please note:\n\n- Inform the website operator **before** crawling, especially for large websites\n- Respect `robots.txt` (enabled by default)\n- Use reasonable concurrency and timeout values\n- This tool is intended for **your own websites** and **authorized analyses**\n\n## Development\n\n### Setup\n\n```bash\ngit clone https://github.com/michaelblaess/sitemap-generator.git\ncd sitemap-generator\n\n# Windows\nsetup-dev-environment.bat\n\n# Linux/macOS\n./setup-dev-environment.sh\n```\n\n### Running locally\n\n```bash\n# Windows\nrun.bat https://example.com\n\n# Linux/macOS\n./run.sh https://example.com\n```\n\n### Creating a release\n\n```bash\ngit tag v1.4.0\ngit push origin v1.4.0\n```\n\nGitHub Actions automatically builds executables for Windows, Linux, and macOS.\n\n## License\n\nApache License 2.0 - see [LICENSE](LICENSE)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmichaelblaess%2Fsitemap-generator","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmichaelblaess%2Fsitemap-generator","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmichaelblaess%2Fsitemap-generator/lists"}