[![License](https://img.shields.io/badge/license-MIT%20License-blue.svg)](https://github.com/s0rg/crawley/blob/main/LICENSE)
[![FOSSA Status](https://app.fossa.com/api/projects/git%2Bgithub.com%2Fs0rg%2Fcrawley.svg?type=shield)](https://app.fossa.com/projects/git%2Bgithub.com%2Fs0rg%2Fcrawley?ref=badge_shield)
[![Go Version](https://img.shields.io/github/go-mod/go-version/s0rg/crawley)](go.mod)
[![Release](https://img.shields.io/github/v/release/s0rg/crawley)](https://github.com/s0rg/crawley/releases/latest)
[![Mentioned in Awesome Go](https://awesome.re/mentioned-badge.svg)](https://github.com/avelino/awesome-go)
![Downloads](https://img.shields.io/github/downloads/s0rg/crawley/total.svg)

[![CI](https://github.com/s0rg/crawley/workflows/ci/badge.svg)](https://github.com/s0rg/crawley/actions?query=workflow%3Aci)
[![Go Report Card](https://goreportcard.com/badge/github.com/s0rg/crawley)](https://goreportcard.com/report/github.com/s0rg/crawley)
[![Maintainability](https://api.codeclimate.com/v1/badges/6542cd90a6c665e4202e/maintainability)](https://codeclimate.com/github/s0rg/crawley/maintainability)
[![Test Coverage](https://api.codeclimate.com/v1/badges/e1c002df2b4571e01537/test_coverage)](https://codeclimate.com/github/s0rg/crawley/test_coverage)
[![libraries.io](https://img.shields.io/librariesio/github/s0rg/crawley)](https://libraries.io/github/s0rg/crawley)
![Issues](https://img.shields.io/github/issues/s0rg/crawley)

# crawley

Crawls web pages and prints any link it can find.

# features

- fast HTML SAX-parser (powered by [x/net/html](https://golang.org/x/net/html))
- JS/CSS lexical parsers (powered by [tdewolff/parse](https://github.com/tdewolff/parse)) - extract API endpoints from JS code and URLs from CSS `url()` properties
- small (below 1500 SLOC), idiomatic, 100% test-covered codebase
- grabs most useful resource URLs (images, videos, audio, forms, etc.)
- found URLs are streamed to stdout and guaranteed to be unique (with fragments omitted)
- configurable scan depth (limited to the starting host and path; 0 by default)
- can be polite - respects crawl rules and sitemaps from `robots.txt`
- `brute` mode - scans HTML comments for URLs (this can lead to bogus results)
- makes use of the `HTTP_PROXY` / `HTTPS_PROXY` environment variables and handles proxy auth (use `HTTP_PROXY="socks5://127.0.0.1:1080/" crawley` for SOCKS5; see the proxy example below)
- directory-only scan mode (aka `fast-scan`)
- user-defined cookies, in curl-compatible format (e.g. `-cookie "ONE=1; TWO=2" -cookie "ITS=ME" -cookie @cookie-file`)
- user-defined headers, same as curl: `-header "ONE: 1" -header "TWO: 2" -header @headers-file`
- tag filter - allows limiting the crawl to specific tags (single: `-tag a -tag form`, comma-separated: `-tag a,form`, or mixed)
- URL ignore - skips URLs containing the given substrings (e.g. `-ignore logout`); a combined example of these filters is shown right after this list
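A sketch combining the cookie, header, tag, and ignore options from the list above (the target URL, header value, and `cookies.txt` file are placeholders, not part of crawley itself):

```sh
# crawl one level deep, sending cookies from a file and a custom header,
# restricting the crawl to <a> and <form> tags and skipping logout links;
# cookies.txt is a hypothetical file in curl-compatible format
crawley -depth 1 \
  -cookie @cookies.txt \
  -header "X-Scanner: crawley" \
  -tag a,form \
  -ignore logout \
  http://some-test.site
```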

# examples

```sh
# print all links from first page:
crawley http://some-test.site

# print all js files and api endpoints:
crawley -depth -1 -tag script -js http://some-test.site

# print all endpoints from js:
crawley -js http://some-test.site/app.js

# download all png images from site:
crawley -depth -1 -tag img http://some-test.site | grep '\.png$' | wget -i -

# fast directory traversal:
crawley -headless -delay 0 -depth -1 -dirs only http://some-test.site
```
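The examples above can be combined with the proxy and politeness options; a sketch with placeholder proxy address and credentials:

```sh
# route the crawl through a local SOCKS5 proxy and respect robots.txt;
# 127.0.0.1:1080 and user:pass are placeholder values
HTTP_PROXY="socks5://127.0.0.1:1080/" crawley \
  -proxy-auth "user:pass" \
  -robots respect \
  -delay 500ms \
  http://some-test.site
```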

# installation

- [binaries / deb / rpm](https://github.com/s0rg/crawley/releases) for Linux, FreeBSD, macOS and Windows.
- [archlinux](https://aur.archlinux.org/packages/crawley-bin/) - use your favourite AUR helper to install it, e.g. `paru -S crawley-bin`.
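If you have a Go toolchain installed, building from source should also work; a sketch assuming the `main` package lives under `cmd/crawley` (check the repository layout before relying on this path):

```sh
# install the latest tagged release into $GOBIN (path assumed, see note above)
go install github.com/s0rg/crawley/cmd/crawley@latest
```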

# usage

```
crawley [flags] url

possible flags with default values:

-all
    scan all known sources (js/css/...)
-brute
    scan html comments
-cookie value
    extra cookies for request, can be used multiple times, accept files with '@'-prefix
-css
    scan css for urls
-delay duration
    per-request delay (0 - disable) (default 150ms)
-depth int
    scan depth (set -1 for unlimited)
-dirs string
    policy for non-resource urls: show / hide / only (default "show")
-header value
    extra headers for request, can be used multiple times, accept files with '@'-prefix
-headless
    disable pre-flight HEAD requests
-ignore value
    patterns (in urls) to be ignored in crawl process
-js
    scan js code for endpoints
-proxy-auth string
    credentials for proxy: user:password
-robots string
    policy for robots.txt: ignore / crawl / respect (default "ignore")
-silent
    suppress info and error messages in stderr
-skip-ssl
    skip ssl verification
-tag value
    tags filter, single or comma-separated tag names
-timeout duration
    request timeout (min: 1 second, max: 10 minutes) (default 5s)
-user-agent string
    user-agent string
-version
    show version
-workers int
    number of workers (default - number of CPU cores)
```
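For instance, a slower, more conservative scan using the flags above might look like this (the values are illustrative, not recommendations):

```sh
# unlimited depth, gentler pacing, longer timeouts, fewer workers
crawley -depth -1 -delay 1s -timeout 30s -workers 2 http://some-test.site
```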

# flags autocompletion

Crawley supports flag autocompletion in bash and zsh via `complete`:

```bash
complete -C "/full-path-to/bin/crawley" crawley
```
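In zsh, the bash-style `complete` builtin is only available after loading the compatibility layer, so something like the following (a common zsh idiom, not taken from the crawley docs) may be needed first:

```bash
# enable bash-style completion registration in zsh, then register crawley
autoload -U +X bashcompinit && bashcompinit
complete -C "/full-path-to/bin/crawley" crawley
```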

# license

[![FOSSA Status](https://app.fossa.com/api/projects/git%2Bgithub.com%2Fs0rg%2Fcrawley.svg?type=large)](https://app.fossa.com/projects/git%2Bgithub.com%2Fs0rg%2Fcrawley?ref=badge_large)