Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
Verify the status of each URL in a (hosted) sitemap XML file.
https://github.com/dylancl/sitemap-crawler
Topics: crawler, parser, scraper, sitemap, xml
Last synced: 29 days ago
- Host: GitHub
- URL: https://github.com/dylancl/sitemap-crawler
- Owner: dylancl
- Created: 2024-05-23T06:35:32.000Z (8 months ago)
- Default Branch: main
- Last Pushed: 2024-05-26T20:06:18.000Z (8 months ago)
- Last Synced: 2024-11-07T03:25:33.085Z (3 months ago)
- Topics: crawler, parser, scraper, sitemap, xml
- Language: TypeScript
- Homepage:
- Size: 35.2 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Metadata Files:
  - Readme: README.md
README
# sitemap-crawler
Verify the status of each URL in a (hosted) sitemap XML file by crawling through the XML and fetching each URL to see if it returns a 200 OK. A free alternative to Screaming Frog SEO Spider's paid sitemap crawler feature.
https://github.com/dylancl/sitemap-scraper/assets/14956708/d15b02a0-351a-43fd-a91e-90c042603075
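Conceptually, the check is simple: pull the sitemap, collect the `<loc>` entries, and request each URL to see whether it responds with a success status. The TypeScript sketch below is only an illustration of that idea, not this repository's code; it assumes Node 18+ for the global `fetch` and uses a naive regex where a real implementation would use an XML parser.

```typescript
// Illustrative sketch only -- not the repository's actual implementation.
// Assumes Node 18+ (global fetch).
async function checkSitemap(sitemapUrl: string): Promise<void> {
  const response = await fetch(sitemapUrl);
  const xml = await response.text();

  // Naive <loc> extraction; fine for a sketch, fragile for real-world sitemaps.
  const urls = [...xml.matchAll(/<loc>(.*?)<\/loc>/g)].map((m) => m[1].trim());

  for (const url of urls) {
    try {
      const res = await fetch(url, { method: 'HEAD' });
      console.log(`${res.ok ? 'OK  ' : 'FAIL'} ${res.status} ${url}`);
    } catch {
      console.log(`FAIL (network error) ${url}`);
    }
  }
}

void checkSitemap('https://example.com/sitemap.xml');
```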
# Installation
1. Clone the repository
```bash
git clone https://github.com/dylancl/sitemap-scraper.git
```
2. Install the dependencies
```bash
pnpm install
```
3. Run the script
```bash
pnpm start
```
# Usage
1. Enter the URL of the sitemap XML file you want to check.
2. The script will ask you for configuration options (a sketch of how these could drive the crawl follows this list):
- **Concurrency limit**: The maximum number of requests that can be made at the same time. Default is 5. Must be a number between 1 and 15.
- **Request delay**: The delay between each request. Default is 1000. Must be a number of 250 or higher.
- **Traversal order**: The order in which the URLs will be checked. Default is `sequential`. Options are `sequential` and `random`.
3. The script will start checking the URLs and display the progress in the console.
4. When the script is done, it will ask whether you want to save the results (OK and not-OK URLs) to a file.
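As a rough illustration of how those options could drive the crawl, the sketch below uses a small worker pool. The interface name, field names, and pooling strategy are assumptions; only the option semantics (concurrency between 1 and 15, delay of at least 250, sequential or random order) come from the list above.

```typescript
// Sketch of how the configuration options could map onto the crawl loop.
// Names and structure here are illustrative, not the project's actual API.
interface CrawlerConfig {
  concurrencyLimit: number;               // 1-15, default 5
  requestDelay: number;                   // >= 250, default 1000
  traversalOrder: 'sequential' | 'random';
}

const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

async function crawl(urls: string[], config: CrawlerConfig): Promise<Map<string, number>> {
  // Respect the traversal order: shuffle for 'random', keep sitemap order otherwise.
  const queue = config.traversalOrder === 'random'
    ? [...urls].sort(() => Math.random() - 0.5)
    : [...urls];

  const results = new Map<string, number>();

  // Worker pool: `concurrencyLimit` workers pull from the shared queue,
  // each waiting `requestDelay` between its own requests.
  const worker = async () => {
    while (queue.length > 0) {
      const url = queue.shift();
      if (!url) break;
      try {
        const res = await fetch(url);
        results.set(url, res.status);
      } catch {
        results.set(url, 0); // 0 marks a network error
      }
      await sleep(config.requestDelay);
    }
  };

  await Promise.all(Array.from({ length: config.concurrencyLimit }, worker));
  return results;
}
```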