Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/joelkoen/wls
Easily crawl multiple sitemaps and list URLs
- Host: GitHub
- URL: https://github.com/joelkoen/wls
- Owner: joelkoen
- License: mit
- Created: 2024-02-04T13:05:44.000Z (9 months ago)
- Default Branch: main
- Last Pushed: 2024-05-13T05:38:44.000Z (6 months ago)
- Last Synced: 2024-10-31T11:57:52.590Z (7 days ago)
- Topics: crawler, sitemap, url
- Language: Rust
- Homepage:
- Size: 45.9 KB
- Stars: 2
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# wls
wls (web ls) makes it easy to crawl multiple sitemaps and list URLs. It can even automatically find sitemaps for a domain using robots.txt.
## Usage
wls accepts multiple domains/sitemaps as arguments, and will print all found URLs to stdout:
```sh
$ wls docs.rs > urls.txt
$ head -n 6 urls.txt
https://docs.rs/A-1/latest/A_1/
https://docs.rs/A-1/latest/A_1/all.html
https://docs.rs/A5/latest/A5/
https://docs.rs/A5/latest/A5/all.html
https://docs.rs/AAAA/latest/AAAA/
https://docs.rs/AAAA/latest/AAAA/all.html
$ grep /all.html urls.txt | wc -l
113191
# that's a lot of crates!
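# multiple targets can also be passed in one run (example.com is a placeholder):
$ wls docs.rs https://example.com/sitemap.xml > urls.txt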
```

If an argument does not contain a slash, it is treated as a domain, and wls will automatically attempt to find sitemaps using robots.txt. For example, [docs.rs](https://docs.rs/) uses the `Sitemap:` directive in [its robots.txt file](https://docs.rs/robots.txt), so the following commands are equivalent:
```sh
$ wls docs.rs
$ wls https://docs.rs/robots.txt
$ wls https://docs.rs/sitemap.xml
```

wls will print logs to stderr when `-v/--verbose` is enabled:
```sh
$ wls -v docs.rs
Found 1 sitemaps
  in robotstxt with url: https://docs.rs/robots.txt
Found 26 sitemaps
  in sitemap with url: https://docs.rs/sitemap.xml
  in robotstxt with url: https://docs.rs/robots.txt
Found 15934 URLs
  in sitemap with url: https://docs.rs/-/sitemap/a/sitemap.xml
  in sitemap with url: https://docs.rs/sitemap.xml
  in robotstxt with url: https://docs.rs/robots.txt
Found 11170 URLs
  in sitemap with url: https://docs.rs/-/sitemap/b/sitemap.xml
  in sitemap with url: https://docs.rs/sitemap.xml
  in robotstxt with url: https://docs.rs/robots.txt
...
```

More options are available too:
```
Usage: wls [OPTIONS] ...

Arguments:
  ...  Domains/sitemaps to crawl

Options:
  -c, --cookies      Enable cookies while crawling
  -k, --insecure     Disable certificate verification
  -U, --user-agent   Browser to identify as [default: wls/0.2.0]
  -T, --timeout      Maximum response time [default: 30]
  -w, --wait         Delay between requests [default: 0]
  -v, --verbose      Enable logs
  -h, --help         Print help
  -V, --version      Print version
```
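
As a rough sketch using only the flags listed above (https://example.com/sitemap.xml is a placeholder target, not one from the original README), several options and targets can be combined in a single run:

```sh
# crawl two targets with a custom user-agent, a longer timeout, a delay
# between requests, and verbose logging on stderr; URLs still go to stdout
$ wls -v -U "my-crawler/1.0" -T 60 -w 1 docs.rs https://example.com/sitemap.xml > urls.txt
```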