https://github.com/lukehsiao/sitemap2urllist
Read a sitemap and output a list of URLs.
- Host: GitHub
- URL: https://github.com/lukehsiao/sitemap2urllist
- Owner: lukehsiao
- License: other
- Created: 2025-01-01T22:37:17.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2026-03-21T14:07:52.000Z (2 months ago)
- Last Synced: 2026-03-21T18:23:36.979Z (2 months ago)
- Topics: cli, linkcheck, rust, sitemap, urllist
- Language: Rust
- Homepage:
- Size: 146 KB
- Stars: 2
- Watchers: 1
- Forks: 0
- Open Issues: 1
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE.md
README
🌐
sitemap2urllist
Read a sitemap and output a list of URLs.
`sitemap2urllist` is a CLI tool for parsing a sitemap and outputting a simple list of URLs, which can easily be piped into other tools (e.g., [lychee](https://github.com/lycheeverse/lychee)).
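To make the idea concrete, here is a minimal sketch of turning sitemap XML into a plain list of URLs. It is not the tool's actual implementation: `extract_locs` is a hypothetical helper that does a naive string scan for `<loc>` elements, whereas a real tool would use a proper XML parser, decode entities such as `&amp;`, and follow sitemap index files.

```rust
/// Hypothetical helper: collect the text of every <loc> element.
/// A naive string scan for illustration only; it does not decode XML
/// entities or distinguish a sitemap index from a regular sitemap.
fn extract_locs(xml: &str) -> Vec<String> {
    let mut urls = Vec::new();
    let mut rest = xml;
    while let Some(start) = rest.find("<loc>") {
        let after = &rest[start + "<loc>".len()..];
        match after.find("</loc>") {
            Some(end) => {
                urls.push(after[..end].trim().to_string());
                rest = &after[end + "</loc>".len()..];
            }
            // Unterminated element: stop scanning.
            None => break,
        }
    }
    urls
}

fn main() {
    // Example input made up for this sketch.
    let xml = r#"<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/about/</loc></url>
</urlset>"#;
    // One URL per line, ready to pipe into another tool.
    for url in extract_locs(xml) {
        println!("{url}");
    }
}
```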
## Install
```
cargo install --locked sitemap2urllist
```
Or, if you use [`cargo-binstall`](https://github.com/cargo-bins/cargo-binstall):
```
cargo binstall sitemap2urllist
```
## Usage
```
Read a sitemap and output a list of URLs.

Usage: sitemap2urllist [OPTIONS]

Arguments:
  The URL to a sitemap

Options:
      --no-cache       Do NOT use request cache stored on disk
      --max-cache-age  Discard all cached requests older than this duration [default: 30d]
  -v, --verbose...     Increase logging verbosity
  -q, --quiet...       Decrease logging verbosity
  -h, --help           Print help (see more with '--help')
  -V, --version        Print version
```
### Example Usage with Lychee
At some point, link checkers like lychee will likely obviate the need for this tool by implementing [recursive link checking](https://github.com/lycheeverse/lychee/issues/78).
In the meantime, it is easy to run a link check from your local machine on an entire website as defined by its sitemap by doing something like the following.
```
sitemap2urllist https://alumni.cottonwoodhigh.school/sitemap-index.xml | xargs lychee --cache
```
Note that you can combine this with [lychee's configuration](https://lychee.cli.rs/usage/config/) to do things like cache results or ignore certain errors.
## Caching
We use OS-standard locations for caching.
- **Linux**: `$XDG_CACHE_HOME/sitemap2urllist/cache.json` or `$HOME/.cache/sitemap2urllist/cache.json`
- **macOS**: `$HOME/Library/Caches/dev.hsiao.sitemap2urllist/cache.json`
- **Windows**: `{FOLDERID_LocalAppData}\hsiao\sitemap2urllist\cache\cache.json`
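As an illustration of the Linux rule above, the lookup order can be sketched with the standard library alone. `linux_cache_path` is a hypothetical helper written for this example, not the tool's code; treating an empty `$XDG_CACHE_HOME` as unset is an assumption of the sketch.

```rust
use std::path::PathBuf;

/// Hypothetical helper: resolve the Linux cache file location.
/// Takes the two environment values as arguments so the logic is easy to test.
fn linux_cache_path(xdg_cache_home: Option<&str>, home: Option<&str>) -> Option<PathBuf> {
    // Prefer a non-empty $XDG_CACHE_HOME, otherwise fall back to $HOME/.cache.
    let base = match xdg_cache_home.filter(|s| !s.is_empty()) {
        Some(dir) => PathBuf::from(dir),
        None => PathBuf::from(home?).join(".cache"),
    };
    Some(base.join("sitemap2urllist").join("cache.json"))
}

fn main() {
    let xdg = std::env::var("XDG_CACHE_HOME").ok();
    let home = std::env::var("HOME").ok();
    match linux_cache_path(xdg.as_deref(), home.as_deref()) {
        Some(path) => println!("{}", path.display()),
        None => eprintln!("neither XDG_CACHE_HOME nor HOME is set"),
    }
}
```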
The cache file is simple JSON.
The cache only prevents refetching a sitemap if its server responds with a 429 (Too Many Requests).
In this case, we respect `Retry-After`, or default to 4 hours.
Otherwise, we use the cache to send conditional requests by respecting the `ETag` and `Last-Modified` headers.
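The behavior described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the tool's implementation: `CacheEntry`, `Plan`, `backoff_until`, and `plan_request` are hypothetical names, the entry fields do not reflect the real `cache.json` schema, and only the delay-in-seconds form of `Retry-After` is handled.

```rust
use std::time::{Duration, SystemTime};

/// Hypothetical cache entry; field names are illustrative, not the real schema.
#[derive(Default)]
struct CacheEntry {
    /// Set after a 429 response: do not refetch before this time.
    retry_after: Option<SystemTime>,
    etag: Option<String>,
    last_modified: Option<String>,
}

/// What to do for the next fetch of a URL.
#[derive(Debug, PartialEq)]
enum Plan {
    /// Still inside the back-off window from a 429.
    Skip,
    /// Send the request with these extra headers (possibly none).
    Fetch(Vec<(&'static str, String)>),
}

/// Default back-off when a 429 carries no usable Retry-After: 4 hours.
const DEFAULT_BACKOFF: Duration = Duration::from_secs(4 * 60 * 60);

/// Compute when refetching is allowed after a 429 received at `now`.
fn backoff_until(now: SystemTime, retry_after_header: Option<&str>) -> SystemTime {
    let delay = retry_after_header
        .and_then(|v| v.trim().parse::<u64>().ok())
        .map(Duration::from_secs)
        .unwrap_or(DEFAULT_BACKOFF);
    // Guard against absurdly large delays overflowing the clock.
    now.checked_add(delay).unwrap_or(now + DEFAULT_BACKOFF)
}

/// Decide whether to skip the request or send a conditional one.
fn plan_request(entry: &CacheEntry, now: SystemTime) -> Plan {
    if let Some(until) = entry.retry_after {
        if now < until {
            return Plan::Skip;
        }
    }
    let mut headers = Vec::new();
    if let Some(etag) = &entry.etag {
        headers.push(("If-None-Match", etag.clone()));
    }
    if let Some(lm) = &entry.last_modified {
        headers.push(("If-Modified-Since", lm.clone()));
    }
    Plan::Fetch(headers)
}

fn main() {
    let now = SystemTime::now();
    let entry = CacheEntry {
        retry_after: None,
        etag: Some("\"abc123\"".to_string()),
        last_modified: None,
    };
    println!("{:?}", plan_request(&entry, now));
}
```

With conditional headers like these, a server that supports them can answer `304 Not Modified` and the cached copy can be reused instead of downloading the body again.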
## Related Tools
- [Sitemap-to-Urllist](https://github.com/matejkosiarcik/sitemap2urllist) (rust/shell/typescript): Simple sitemap.xml to urllist.txt converter (**abandoned**)