https://github.com/icio/gergle
Golang website crawler
- Host: GitHub
- URL: https://github.com/icio/gergle
- Owner: icio
- Created: 2016-02-07T10:51:22.000Z (over 10 years ago)
- Default Branch: master
- Last Pushed: 2016-03-05T13:36:48.000Z (over 10 years ago)
- Last Synced: 2025-01-28T00:43:24.653Z (over 1 year ago)
- Language: Go
- Homepage:
- Size: 48.8 KB
- Stars: 1
- Watchers: 4
- Forks: 1
- Open Issues: 2
Metadata Files:
- Readme: README.md
README
# :dizzy: gergle
`gergle` is a silly little website-scraping tool written in Go. Not by coincidence, it is very similar to [`crul`](http://github.com/icio/crul). It attempts to abide by robots.txt unless you tell it otherwise, and spawns a new goroutine for every request it makes.
## Installation
```
go get github.com/icio/gergle/cmd/gergle
```
## Usage
```
$ gergle -h
Website crawler.

Usage:
  gergle URL [flags]

Flags:
  -c, --connections int   Maximum number of open connections to the server. (default 5)
  -t, --delay float       The number of seconds between requests to the server. (default -1)
  -d, --depth value       Maximum crawl depth. (default 100)
  -i, --disallow value    Disallowed paths. (default [])
      --long              List all of the links and assets from a page.
  -q, --quiet             No logging to stderr.
  -v, --verbose           Verbose output logging.
      --zero              The number of bothers to give about robots.txt.
```
## Examples
``` bash
# Crawl paul-scott.com with one second between each page request,
# listing all links and assets.
$ gergle http://www.paul-scott.com/ -t 1 --long
# Crawl kirupa.com, excluding /forum*, up to three levels deep (first page is
# depth 0), ignoring robots.txt and using up to 30 simultaneous connections.
# 640 pages in 9 seconds on my local machine.
$ gergle -q https://www.kirupa.com/ --zero -c 30 -d 3 -iforum
```
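The depth and disallow options in the second example can be thought of as a filter applied to each discovered link. The sketch below shows one way such a filter might look. It is an assumption-laden illustration rather than gergle's implementation: `shouldVisit` is a hypothetical function, and treating each disallowed value as a path prefix (so `forum` excludes `/forum*`) is inferred from the example comment above.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// shouldVisit is a hypothetical filter, not gergle's actual code.
// It reports whether a link found at the given depth should be crawled.
// The start page is depth 0, so depth == maxDepth is still allowed.
// Assumption: each disallow entry is matched as a prefix of the URL path.
func shouldVisit(rawURL string, depth, maxDepth int, disallow []string) bool {
	if depth > maxDepth {
		return false
	}
	u, err := url.Parse(rawURL)
	if err != nil {
		return false
	}
	path := u.Path
	if path == "" {
		path = "/"
	}
	for _, d := range disallow {
		if d == "" {
			continue // an empty entry would otherwise block every path
		}
		prefix := "/" + strings.TrimPrefix(d, "/")
		if strings.HasPrefix(path, prefix) {
			return false
		}
	}
	return true
}

func main() {
	disallow := []string{"forum"}
	fmt.Println(shouldVisit("https://www.kirupa.com/forum/index.php", 1, 3, disallow)) // false
	fmt.Println(shouldVisit("https://www.kirupa.com/html5/index.htm", 3, 3, disallow)) // true
	fmt.Println(shouldVisit("https://www.kirupa.com/html5/index.htm", 4, 3, disallow)) // false
}
```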
## Todo
- [ ] Actual tests -- something beyond [manual testing](https://github.com/icio/crawler-target) :disappointed:
- [ ] First-class tracking of redirects and canonical URLs
- [ ] Vendoring of dependencies