Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
Verify the status of each URL in a (hosted) sitemap XML file.
https://github.com/dylancl/sitemap-crawler
Topics: crawler, parser, scraper, sitemap, xml
Last synced: 29 days ago
- Host: GitHub
- URL: https://github.com/dylancl/sitemap-crawler
- Owner: dylancl
- Created: 2024-05-23T06:35:32.000Z (8 months ago)
- Default Branch: main
- Last Pushed: 2024-05-26T20:06:18.000Z (8 months ago)
- Last Synced: 2024-11-07T03:25:33.085Z (3 months ago)
- Topics: crawler, parser, scraper, sitemap, xml
- Language: TypeScript
- Homepage:
- Size: 35.2 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Metadata Files:
  - Readme: README.md
README
# sitemap-crawler
Verify the status of each URL in a (hosted) sitemap XML file by crawling through the XML and fetching each URL to see if it returns a 200 OK. A free alternative to Screaming Frog SEO Spider's paid sitemap crawler feature.
https://github.com/dylancl/sitemap-scraper/assets/14956708/d15b02a0-351a-43fd-a91e-90c042603075
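Conceptually, the check is simple: pull the sitemap, collect the `<loc>` entries, and request each URL to see whether it responds with a success status. The TypeScript sketch below is only an illustration of that idea, not this repository's code; it assumes Node 18+ for the global `fetch` and uses a naive regex where a real implementation would use an XML parser.

```typescript
// Illustrative sketch only -- not the repository's actual implementation.
// Assumes Node 18+ (global fetch).
async function checkSitemap(sitemapUrl: string): Promise<void> {
  const response = await fetch(sitemapUrl);
  const xml = await response.text();

  // Naive <loc> extraction; fine for a sketch, fragile for real-world sitemaps.
  const urls = [...xml.matchAll(/<loc>(.*?)<\/loc>/g)].map((m) => m[1].trim());

  for (const url of urls) {
    try {
      const res = await fetch(url, { method: 'HEAD' });
      console.log(`${res.ok ? 'OK  ' : 'FAIL'} ${res.status} ${url}`);
    } catch {
      console.log(`FAIL (network error) ${url}`);
    }
  }
}

void checkSitemap('https://example.com/sitemap.xml');
```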
# Installation
1. Clone the repository
```bash
git clone https://github.com/dylancl/sitemap-scraper.git
```
2. Install the dependencies
```bash
pnpm install
```
3. Run the script
```bash
pnpm start
```
# Usage
1. Enter the URL of the sitemap XML file you want to check.
2. The script will ask you for configuration options (a sketch of how these could drive the crawl follows this list):
- **Concurrency limit**: The maximum number of requests that can be made at the same time. Default is 5. Must be a number between 1 and 15.
- **Request delay**: The delay between each request. Default is 1000. Must be a number of 250 or higher.
- **Traversal order**: The order in which the URLs will be checked. Default is `sequential`. Options are `sequential` and `random`.
3. The script will start checking the URLs and display the progress in the console.
4. When the script is done, it will ask whether you want to save the results (OK and not-OK URLs) to a file.
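As a rough illustration of how those options could drive the crawl, the sketch below uses a small worker pool. The interface name, field names, and pooling strategy are assumptions; only the option semantics (concurrency between 1 and 15, delay of at least 250, sequential or random order) come from the list above.

```typescript
// Sketch of how the configuration options could map onto the crawl loop.
// Names and structure here are illustrative, not the project's actual API.
interface CrawlerConfig {
  concurrencyLimit: number;               // 1-15, default 5
  requestDelay: number;                   // >= 250, default 1000
  traversalOrder: 'sequential' | 'random';
}

const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

async function crawl(urls: string[], config: CrawlerConfig): Promise<Map<string, number>> {
  // Respect the traversal order: shuffle for 'random', keep sitemap order otherwise.
  const queue = config.traversalOrder === 'random'
    ? [...urls].sort(() => Math.random() - 0.5)
    : [...urls];

  const results = new Map<string, number>();

  // Worker pool: `concurrencyLimit` workers pull from the shared queue,
  // each waiting `requestDelay` between its own requests.
  const worker = async () => {
    while (queue.length > 0) {
      const url = queue.shift();
      if (!url) break;
      try {
        const res = await fetch(url);
        results.set(url, res.status);
      } catch {
        results.set(url, 0); // 0 marks a network error
      }
      await sleep(config.requestDelay);
    }
  };

  await Promise.all(Array.from({ length: config.concurrencyLimit }, worker));
  return results;
}
```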