sitemap2urllist is a CLI tool for parsing a sitemap and outputting a simple list of URLs, which can easily be piped into other tools (e.g., lychee).
cargo install --locked sitemap2urllist
Or, if you use cargo-binstall:
cargo binstall sitemap2urllist
Read a sitemap and output a list of URLs.
Usage: sitemap2urllist [OPTIONS] <URL>
Arguments:
<URL> The URL to a sitemap
Options:
--no-cache Do NOT use request cache stored on disk
--max-cache-age <MAX_CACHE_AGE> Discard all cached requests older than this duration [default: 30d]
-v, --verbose... Increase logging verbosity
-q, --quiet... Decrease logging verbosity
-h, --help Print help (see more with '--help')
-V, --version Print version
At some point, link checkers like lychee will likely obviate the need for this tool by implementing recursive link checking.
In the meantime, you can check every page listed in a website's sitemap from your local machine by running something like the following.
sitemap2urllist https://alumni.cottonwoodhigh.school/sitemap-index.xml | xargs lychee --cache
Note that you can combine this with lychee's configuration to, for example, cache results or ignore certain errors.
We use OS-standard locations for caching.
- Linux: $XDG_CACHE_HOME/sitemap2urllist/cache.json or $HOME/.cache/sitemap2urllist/cache.json
- macOS: $HOME/Library/Caches/dev.hsiao.sitemap2urllist/cache.json
- Windows: {FOLDERID_LocalAppData}\hsiao\sitemap2urllist\cache\cache.json
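As a rough illustration of the Linux rule above, the lookup could be sketched as follows. This is not taken from the tool's source: `linux_cache_path` is a hypothetical helper, and it takes the values of `$XDG_CACHE_HOME` and `$HOME` as arguments rather than reading the environment, purely so the fallback is easy to follow. It also assumes that an empty variable counts as unset.

```rust
use std::path::PathBuf;

/// Hypothetical helper (not from the tool's source): resolve the Linux
/// cache file from the values of $XDG_CACHE_HOME and $HOME.
fn linux_cache_path(xdg_cache_home: Option<&str>, home: Option<&str>) -> Option<PathBuf> {
    // Prefer $XDG_CACHE_HOME when it is set and non-empty (assumption:
    // an empty value is treated the same as an unset one).
    let base = match xdg_cache_home.filter(|s| !s.is_empty()) {
        Some(dir) => PathBuf::from(dir),
        // Otherwise fall back to $HOME/.cache; give up if $HOME is missing too.
        None => PathBuf::from(home.filter(|s| !s.is_empty())?).join(".cache"),
    };
    Some(base.join("sitemap2urllist").join("cache.json"))
}
```

A caller would typically pass `std::env::var("XDG_CACHE_HOME").ok().as_deref()` and the equivalent for `HOME`.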
The cache file is simple JSON.
The cache only prevents refetching a sitemap if the server responded with a 429 (Too Many Requests).
In this case, we respect Retry-After, or default to waiting 4 hours.
Otherwise, we use the cache to send conditional requests by respecting the ETag and Last-Modified headers.
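The decision described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the tool's actual implementation: `CacheEntry`, `Plan`, and `plan_request` are hypothetical names, the real cache.json layout may differ, and `--max-cache-age` handling is left out.

```rust
use std::time::{Duration, SystemTime};

/// Hypothetical shape of one cache entry; the real cache.json may differ.
struct CacheEntry {
    fetched_at: SystemTime,
    /// True when the last response for this URL was a 429.
    rate_limited: bool,
    /// Parsed Retry-After value from that 429, if the server sent one.
    retry_after: Option<Duration>,
    etag: Option<String>,
    last_modified: Option<String>,
}

#[derive(Debug, PartialEq)]
enum Plan {
    /// Reuse the cached body without contacting the server.
    UseCache,
    /// Send a request carrying these conditional headers (possibly none).
    Fetch(Vec<(&'static str, String)>),
}

/// Wait applied after a 429 that carried no Retry-After header.
const DEFAULT_BACKOFF: Duration = Duration::from_secs(4 * 60 * 60);

fn plan_request(entry: &CacheEntry, now: SystemTime) -> Plan {
    if entry.rate_limited {
        let wait = entry.retry_after.unwrap_or(DEFAULT_BACKOFF);
        // If the clock went backwards, treat it as zero time elapsed.
        let elapsed = now
            .duration_since(entry.fetched_at)
            .unwrap_or(Duration::ZERO);
        if elapsed < wait {
            return Plan::UseCache;
        }
    }
    // Otherwise revalidate: the server can answer 304 Not Modified
    // if the cached copy is still current.
    let mut headers = Vec::new();
    if let Some(etag) = &entry.etag {
        headers.push(("If-None-Match", etag.clone()));
    }
    if let Some(last_modified) = &entry.last_modified {
        headers.push(("If-Modified-Since", last_modified.clone()));
    }
    Plan::Fetch(headers)
}
```

Passing `now` in explicitly, rather than calling `SystemTime::now()` inside the function, keeps the back-off logic deterministic and easy to check.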
- Sitemap-to-Urllist (rust/shell/typescript): Simple sitemap.xml to urllist.txt converter (abandoned)