https://github.com/barttc/siteprobe
Sitemap Validation and Performance Analyzer
- Host: GitHub
- URL: https://github.com/barttc/siteprobe
- Owner: bartTC
- License: mit
- Created: 2025-03-11T07:00:15.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2026-01-01T17:32:47.000Z (5 months ago)
- Last Synced: 2026-01-07T00:07:36.050Z (5 months ago)
- Language: Rust
- Homepage: https://barttc.github.io/siteprobe/
- Size: 1.62 MB
- Stars: 5
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE
# Siteprobe
Siteprobe is a Rust-based CLI tool that fetches all URLs listed in a given
`sitemap.xml`, checks that each one exists, and generates a performance report. It
supports authentication, concurrency control, cache bypassing, and more.

## Features
- Fetch and parse sitemap.xml to extract URLs, including nested Sitemap Index files
recursively.
- Check the existence and response times of each URL.
- Generate a detailed performance CSV report.
- Support for Basic Authentication.
- Adjustable concurrency limits for request handling.
- Configurable request timeout settings.
- Support for configuring rate limits, such as 300 requests per 5-minute interval.
- Redirect handling with security precautions.
- Filtering and reporting slow URLs based on a threshold.
- Custom User-Agent header support.
- Option to append random timestamps to URLs to bypass caching mechanisms.
- Save downloaded documents for further inspection or use as a static site mirror.
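
The cache-bypass feature listed above can be pictured as adding a throwaway query parameter to every URL before it is requested. The following is a minimal sketch of that idea, not Siteprobe's actual implementation: the function names and the parameter name `ts` are hypothetical, and the current Unix time stands in for whatever value Siteprobe really appends.

```rust
use std::time::{SystemTime, UNIX_EPOCH};

/// Appends `key=value` to the query string of `url`, keeping any `#fragment` last.
/// Hypothetical helper for illustration; no percent-encoding is performed.
fn append_query_param(url: &str, key: &str, value: &str) -> String {
    // Split off the fragment, which must stay at the very end of the URL.
    let (base, fragment) = match url.split_once('#') {
        Some((base, frag)) => (base, Some(frag)),
        None => (url, None),
    };
    // Start a query string with `?`, or extend an existing one with `&`.
    let separator = if !base.contains('?') {
        "?"
    } else if base.ends_with('?') || base.ends_with('&') {
        ""
    } else {
        "&"
    };
    let mut result = format!("{base}{separator}{key}={value}");
    if let Some(frag) = fragment {
        result.push('#');
        result.push_str(frag);
    }
    result
}

/// Builds a cache-busting URL from the current Unix time in milliseconds.
/// The parameter name `ts` is an assumption made for this sketch.
fn with_cache_buster(url: &str) -> String {
    let millis = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .map(|d| d.as_millis())
        .unwrap_or(0);
    append_query_param(url, "ts", &millis.to_string())
}
```

Because each request carries a different query string, caches that key on the full URL are likely to treat it as a new resource and pass it through to the origin server.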
## Installation
You can install Siteprobe using Cargo:
```sh
cargo install siteprobe
```
Alternatively, build from source:
```sh
git clone https://github.com/bartTC/siteprobe.git
cd siteprobe
cargo build --release
```
## Usage
```sh
siteprobe [OPTIONS] <SITEMAP_URL>
```
### Arguments
- `<SITEMAP_URL>` - The URL of the sitemap to be fetched and processed.
### Options
```
Usage: siteprobe [OPTIONS] <SITEMAP_URL>

Arguments:
  <SITEMAP_URL>
          The URL of the sitemap to be fetched and processed.

Options:
      --basic-auth
          Basic authentication credentials in the format `username:password`
  -c, --concurrency-limit
          Maximum number of concurrent requests allowed [default: 4]
  -l, --rate-limit
          The rate limit for all requests in the format 'requests/time[unit]',
          where unit can be seconds (`s`), minutes (`m`), or hours (`h`). E.g.
          '-l 300/5m' for 300 requests per 5 minutes, or '-l 100/1h' for 100
          requests per hour.
  -o, --output-dir
          Directory where all downloaded documents will be saved
  -a, --append-timestamp
          Append a random timestamp to each URL to bypass caching mechanisms
  -r, --report-path
          File path for storing the generated `report.csv`
  -j, --report-path-json
          File path for storing the generated `report.json`
  -t, --request-timeout
          Default timeout (in seconds) for each request [default: 10]
      --user-agent
          Custom User-Agent header to be used in requests [default: "Mozilla/5.0
          (compatible; Siteprobe/0.5.0)"]
      --slow-num
          Limit the number of slow documents displayed in the report. [default:
          100]
  -s, --slow-threshold
          Show slow responses. The value is the threshold (in seconds) for
          considering a document as 'slow'. E.g. '-s 3' for 3 seconds or '-s
          0.05' for 50ms.
  -f, --follow-redirects
          Controls automatic redirects. When enabled, the client will follow
          HTTP redirects (up to 10 by default). Note that for security, Basic
          Authentication credentials are intentionally not forwarded during
          redirects to prevent unintended credential exposure.
  -h, --help
          Print help
```
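
To make the `--rate-limit` format concrete, here is a small sketch of how a `requests/time[unit]` string could be parsed. It is an illustration under stated assumptions rather than Siteprobe's own parser: the `RateLimit` type, `parse_rate_limit`, and `interval` are hypothetical names, and rejecting zero values is a choice made for this example.

```rust
use std::time::Duration;

/// A parsed rate limit: at most `requests` per `window`.
/// Hypothetical type for illustration only.
#[derive(Debug, PartialEq)]
struct RateLimit {
    requests: u32,
    window: Duration,
}

impl RateLimit {
    /// Average spacing between requests if they were spread evenly over the window.
    /// Assumes `requests` is non-zero, which `parse_rate_limit` guarantees.
    fn interval(&self) -> Duration {
        self.window / self.requests
    }
}

/// Parses strings such as "300/5m" or "100/1h".
fn parse_rate_limit(spec: &str) -> Result<RateLimit, String> {
    let (count, window) = spec
        .split_once('/')
        .ok_or_else(|| format!("expected 'requests/time[unit]', got '{spec}'"))?;
    let requests = count
        .trim()
        .parse::<u32>()
        .map_err(|e| format!("invalid request count '{count}': {e}"))?;

    // The unit is the final character; everything before it is the amount.
    let window = window.trim();
    let unit = window
        .chars()
        .last()
        .ok_or_else(|| "missing time window".to_string())?;
    let seconds_per_unit: u64 = match unit {
        's' => 1,
        'm' => 60,
        'h' => 3600,
        other => return Err(format!("unknown time unit '{other}', expected s, m, or h")),
    };
    let digits = &window[..window.len() - unit.len_utf8()];
    let amount = digits
        .parse::<u64>()
        .map_err(|e| format!("invalid time amount '{digits}': {e}"))?;

    if requests == 0 || amount == 0 {
        return Err("request count and time amount must be greater than zero".to_string());
    }
    let seconds = amount
        .checked_mul(seconds_per_unit)
        .ok_or_else(|| "time window is too large".to_string())?;
    Ok(RateLimit { requests, window: Duration::from_secs(seconds) })
}
```

Under this reading, `300/5m` means 300 requests per 300-second window, which averages out to one request per second if the requests are spread evenly.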
### Example Usage
```sh
# Fetch and analyze a sitemap with default settings
siteprobe https://example.com/sitemap.xml
# Save the report to a specific file and the downloaded documents to a directory
siteprobe https://example.com/sitemap.xml --report-path ./results/report.csv --output-dir ./example.com
# Set concurrency limit to 10 and timeout to 5 seconds
siteprobe https://example.com/sitemap.xml --concurrency-limit 10 --request-timeout 5
```
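
The `--slow-threshold` and `--slow-num` options combine into a simple rule: keep the responses at or above the threshold, sort them slowest first, and cap the list. The sketch below shows one way to express that rule. The `Timing` struct and `slowest` function are hypothetical, and whether Siteprobe treats the threshold as inclusive is an assumption.

```rust
use std::time::Duration;

/// One measured response; the field names are illustrative only.
#[derive(Debug, Clone, PartialEq)]
struct Timing {
    url: String,
    elapsed: Duration,
}

/// Returns the responses that took at least `threshold_secs`, slowest first,
/// truncated to at most `limit` entries.
/// `threshold_secs` must be finite and non-negative, otherwise
/// `Duration::from_secs_f64` panics.
fn slowest(timings: &[Timing], threshold_secs: f64, limit: usize) -> Vec<Timing> {
    let threshold = Duration::from_secs_f64(threshold_secs);
    let mut slow: Vec<Timing> = timings
        .iter()
        .filter(|t| t.elapsed >= threshold)
        .cloned()
        .collect();
    // Sort in descending order of response time.
    slow.sort_by(|a, b| b.elapsed.cmp(&a.elapsed));
    slow.truncate(limit);
    slow
}
```

In this sketch, calling `slowest(&timings, 0.05, 100)` corresponds to running with `-s 0.05` and the default `--slow-num` of 100.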