{"id":18748954,"url":"https://github.com/joelkoen/wls","last_synced_at":"2025-04-12T23:31:09.945Z","repository":{"id":220804466,"uuid":"752660038","full_name":"joelkoen/wls","owner":"joelkoen","description":"Easily crawl multiple sitemaps and list URLs","archived":false,"fork":false,"pushed_at":"2024-05-13T05:38:44.000Z","size":47,"stargazers_count":2,"open_issues_count":0,"forks_count":1,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-03-26T17:43:24.583Z","etag":null,"topics":["crawler","sitemap","url"],"latest_commit_sha":null,"homepage":"","language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/joelkoen.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-02-04T13:05:44.000Z","updated_at":"2024-05-13T05:38:28.000Z","dependencies_parsed_at":"2024-02-04T13:19:22.383Z","dependency_job_id":"1f81d6f4-35f9-482c-b597-205c2151c2a1","html_url":"https://github.com/joelkoen/wls","commit_stats":null,"previous_names":["joelkoen/wls"],"tags_count":2,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/joelkoen%2Fwls","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/joelkoen%2Fwls/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/joelkoen%2Fwls/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/joelkoen%2Fwls/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/joelkoen","download_url":"https://codeload.github.com/joelkoen/wls/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248647257,"owners_count":21139081,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["crawler","sitemap","url"],"created_at":"2024-11-07T17:05:37.427Z","updated_at":"2025-04-12T23:31:09.605Z","avatar_url":"https://github.com/joelkoen.png","language":"Rust","funding_links":[],"categories":[],"sub_categories":[],"readme":"# wls\n\nwls (web ls) makes it easy to crawl multiple sitemaps and list URLs. It can even automatically find sitemaps for a domain using robots.txt.\n\n## Usage\n\nwls accepts multiple domains/sitemaps as arguments, and will print all found URLs to stdout:\n\n```sh\n$ wls docs.rs \u003e urls.txt\n\n$ head -n 6 urls.txt \nhttps://docs.rs/A-1/latest/A_1/\nhttps://docs.rs/A-1/latest/A_1/all.html\nhttps://docs.rs/A5/latest/A5/\nhttps://docs.rs/A5/latest/A5/all.html\nhttps://docs.rs/AAAA/latest/AAAA/\nhttps://docs.rs/AAAA/latest/AAAA/all.html\n\n$ grep /all.html urls.txt | wc -l\n113191\n# that's a lot of crates!\n```\n\nIf an argument does not contain a slash, it is treated as a domain, and wls will automatically attempt to find sitemaps using robots.txt. For example, [docs.rs](https://docs.rs/) uses the `Sitemap:` directive in [its robots.txt file](https://docs.rs/robots.txt), so the following commands are equivalent:\n\n```sh\n$ wls docs.rs\n$ wls https://docs.rs/robots.txt\n$ wls https://docs.rs/sitemap.xml\n```\n\nwls will print logs to stderr when `-v/--verbose` is enabled:\n\n```sh\n$ wls -v docs.rs\n   Found 1 sitemaps\n    in robotstxt with url: https://docs.rs/robots.txt\n\n   Found 26 sitemaps\n    in sitemap with url: https://docs.rs/sitemap.xml\n    in robotstxt with url: https://docs.rs/robots.txt\n\n   Found 15934 URLs\n    in sitemap with url: https://docs.rs/-/sitemap/a/sitemap.xml\n    in sitemap with url: https://docs.rs/sitemap.xml\n    in robotstxt with url: https://docs.rs/robots.txt\n\n   Found 11170 URLs\n    in sitemap with url: https://docs.rs/-/sitemap/b/sitemap.xml\n    in sitemap with url: https://docs.rs/sitemap.xml\n    in robotstxt with url: https://docs.rs/robots.txt\n\n  ...\n```\n\nMore options are available too:\n\n```\nUsage: wls [OPTIONS] \u003cURLS\u003e...\n\nArguments:\n  \u003cURLS\u003e...  Domains/sitemaps to crawl\n\nOptions:\n  -c, --cookies                  Enable cookies while crawling\n  -k, --insecure                 Disable certificate verification\n  -U, --user-agent \u003cUSER_AGENT\u003e  Browser to identify as [default: wls/0.2.0]\n  -T, --timeout \u003cSECONDS\u003e        Maximum response time [default: 30]\n  -w, --wait \u003cSECONDS\u003e           Delay between requests [default: 0]\n  -v, --verbose                  Enable logs\n  -h, --help                     Print help\n  -V, --version                  Print version\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjoelkoen%2Fwls","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjoelkoen%2Fwls","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjoelkoen%2Fwls/lists"}