
The unix-way web crawler

# crawley

Crawls web pages and prints any link it can find.

# features

- fast html SAX-parser (powered by x/net/html)
- js/css lexical parsers (powered by tdewolff/parse) - extract api endpoints from js code and `url()` properties
- small (below 1500 SLOC), idiomatic, 100% test covered codebase
- grabs most useful resource urls (pics, videos, audios, forms, etc.)
- found urls are streamed to stdout and guaranteed to be unique (with fragments omitted)
- configurable scan depth (limited by starting host and path; 0 by default)
- can be polite - uses crawl rules and sitemaps from `robots.txt`
- `brute` mode - scans html comments for urls (this can lead to bogus results)
- makes use of `HTTP_PROXY` / `HTTPS_PROXY` environment variables and handles proxy auth (use `HTTP_PROXY="socks5://" crawley` for socks5)
- directory-only scan mode (aka `fast-scan`)
- user-defined cookies, in curl-compatible format (e.g. `-cookie "ONE=1; TWO=2" -cookie "ITS=ME" -cookie @cookie-file`)
- user-defined headers, same as curl: `-header "ONE: 1" -header "TWO: 2" -header @headers-file`
- tag filter - lets you specify which tags to crawl (single: `-tag a -tag form`, multiple: `-tag a,form`, or mixed)
- url ignore - skips urls that contain any of the given substrings (e.g. `-ignore logout`)
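
The uniqueness rule in the list above (fragments dropped, each url printed once) can be pictured as a small shell filter. This is only an illustration of the documented behaviour, not crawley's actual implementation, and the function name `uniq_no_fragment` is hypothetical.

```shell
# Hypothetical helper: mimics the documented output rule on a list of urls.
# Reads one url per line from stdin.
uniq_no_fragment() {
  # 1. drop everything from the first '#' onwards (the fragment)
  # 2. print a line only the first time it is seen, keeping input order
  sed 's/#.*$//' | awk '!seen[$0]++'
}
```

With this rule, `http://a.test/x#top` and `http://a.test/x` count as the same url and only one line is printed for them.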

# examples

Replace `url` with the address of the target site.

    # print all links from first page:
    crawley url

    # print all js files and api endpoints:
    crawley -depth -1 -tag script -js url

    # print all endpoints from js:
    crawley -js url

    # download all png images from site:
    crawley -depth -1 -tag img url | grep '\.png$' | wget -i -

    # fast directory traversal:
    crawley -headless -delay 0 -depth -1 -dirs only url
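
Because found urls are streamed to stdout, the output can be fed to ordinary text tools. The sketch below counts urls per host. It assumes the input is a list of absolute urls, one per line, and the function name `urls_per_host` is hypothetical.

```shell
# Hypothetical helper: counts urls per host from a list on stdin.
urls_per_host() {
  # With '/' as the field separator, "scheme://host/path" puts the host in field 3.
  awk -F/ 'NF >= 3 { count[$3]++ } END { for (h in count) print count[h], h }' |
    # Highest count first; ties are ordered by host name.
    sort -k1,1nr -k2,2
}
```

A possible use is `crawley -depth -1 url | urls_per_host`, with `url` replaced by the target site.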

# installation

- binaries / deb / rpm packages for Linux, FreeBSD, macOS and Windows.
- archlinux: you can use your favourite AUR helper to install it, e.g. `paru -S crawley-bin`.

# usage

    crawley [flags] url

possible flags with default values:

    -all
        scan all known sources (js/css/...)
    -brute
        scan html comments
    -cookie value
        extra cookies for request, can be used multiple times, accept files with '@'-prefix
    -css
        scan css for urls
    -delay duration
        per-request delay (0 - disable) (default 150ms)
    -depth int
        scan depth (set -1 for unlimited)
    -dirs string
        policy for non-resource urls: show / hide / only (default "show")
    -header value
        extra headers for request, can be used multiple times, accept files with '@'-prefix
    -headless
        disable pre-flight HEAD requests
    -ignore value
        patterns (in urls) to be ignored in crawl process
    -js
        scan js code for endpoints
    -proxy-auth string
        credentials for proxy: user:password
    -robots string
        policy for robots.txt: ignore / crawl / respect (default "ignore")
    -silent
        suppress info and error messages in stderr
    -skip-ssl
        skip ssl verification
    -tag value
        tags filter, single or comma-separated tag names
    -timeout duration
        request timeout (min: 1 second, max: 10 minutes) (default 5s)
    -user-agent string
        user-agent string
    -version
        show version
    -workers int
        number of workers (default - number of CPU cores)
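
Several of the flags above are typically combined in one call. The wrapper below is a sketch of a slow, limited crawl. The function name `polite_crawl` and the chosen values are assumptions for illustration only, not recommended settings from the project.

```shell
# Hypothetical wrapper around flags listed above; tune the values to the target site.
polite_crawl() {
  # -robots respect : apply the "respect" policy for robots.txt
  # -delay 1s       : per-request delay of one second
  # -depth 2        : limit scan depth to 2
  # -ignore logout  : ignore urls that contain "logout"
  crawley -robots respect -delay 1s -depth 2 -ignore logout "$1"
}
```

It takes the start url as its only argument, e.g. `polite_crawl url`.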

# flags autocompletion

Crawley can handle flags autocompletion in bash and zsh via `complete`:

    complete -C "/full-path-to/bin/crawley" crawley
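
To make this permanent, the command can go into a shell startup file. The snippet below is a sketch under a few assumptions: the function name `setup_crawley_completion` is hypothetical, the binary is expected to be on `PATH`, and in zsh the `complete` builtin only becomes available after loading `bashcompinit`.

```shell
# Hypothetical startup snippet, e.g. for ~/.bashrc or ~/.zshrc.
setup_crawley_completion() {
  # Do nothing when crawley is not installed.
  command -v crawley >/dev/null 2>&1 || return 0
  # zsh provides `complete` only after bashcompinit has been loaded.
  if [ -n "${ZSH_VERSION:-}" ]; then
    autoload -U +X bashcompinit && bashcompinit
  fi
  # Register the binary (by its resolved path) as its own completer.
  complete -C "$(command -v crawley)" crawley
}
setup_crawley_completion
```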
