# wls

wls (web ls) makes it easy to crawl multiple sitemaps and list URLs. It can even automatically find sitemaps for a domain using robots.txt.

## Usage

wls accepts multiple domains/sitemaps as arguments and prints all found URLs to stdout:

```sh
$ wls docs.rs > urls.txt

$ head -n 6 urls.txt
https://docs.rs/A-1/latest/A_1/
https://docs.rs/A-1/latest/A_1/all.html
https://docs.rs/A5/latest/A5/
https://docs.rs/A5/latest/A5/all.html
https://docs.rs/AAAA/latest/AAAA/
https://docs.rs/AAAA/latest/AAAA/all.html

$ grep /all.html urls.txt | wc -l
113191
# that's a lot of crates!
```
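Since wls accepts more than one target, several sites can be crawled in a single run. A minimal sketch (the second domain here is just a placeholder):

```sh
# crawl two sites at once and de-duplicate the combined output
$ wls docs.rs example.org | sort -u > urls.txt
```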

If an argument does not contain a slash, it is treated as a domain, and wls will automatically attempt to find sitemaps using robots.txt. For example, [docs.rs](https://docs.rs/) uses the `Sitemap:` directive in [its robots.txt file](https://docs.rs/robots.txt), so the following commands are equivalent:

```sh
$ wls docs.rs
$ wls https://docs.rs/robots.txt
$ wls https://docs.rs/sitemap.xml
```
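The `Sitemap:` directive itself is just a plain line in robots.txt pointing at a sitemap URL, so it is easy to check what wls will discover. A quick way to inspect it with standard tools (output shown for illustration):

```sh
# list the Sitemap: directives a site advertises
$ curl -s https://docs.rs/robots.txt | grep -i '^sitemap:'
Sitemap: https://docs.rs/sitemap.xml
```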

wls will print logs to stderr when `-v/--verbose` is enabled:

```sh
$ wls -v docs.rs
Found 1 sitemaps
  in robotstxt with url: https://docs.rs/robots.txt

Found 26 sitemaps
  in sitemap with url: https://docs.rs/sitemap.xml
  in robotstxt with url: https://docs.rs/robots.txt

Found 15934 URLs
  in sitemap with url: https://docs.rs/-/sitemap/a/sitemap.xml
  in sitemap with url: https://docs.rs/sitemap.xml
  in robotstxt with url: https://docs.rs/robots.txt

Found 11170 URLs
  in sitemap with url: https://docs.rs/-/sitemap/b/sitemap.xml
  in sitemap with url: https://docs.rs/sitemap.xml
  in robotstxt with url: https://docs.rs/robots.txt

...
```
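Because URLs go to stdout and logs go to stderr, the two streams can be captured separately with plain shell redirection:

```sh
# keep the URL list and the crawl log in separate files
$ wls -v docs.rs > urls.txt 2> wls.log
```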

More options are available:

```
Usage: wls [OPTIONS] ...

Arguments:
  ...  Domains/sitemaps to crawl

Options:
  -c, --cookies     Enable cookies while crawling
  -k, --insecure    Disable certificate verification
  -U, --user-agent  Browser to identify as [default: wls/0.2.0]
  -T, --timeout     Maximum response time [default: 30]
  -w, --wait        Delay between requests [default: 0]
  -v, --verbose     Enable logs
  -h, --help        Print help
  -V, --version     Print version
```