https://github.com/icio/gergle

Golang website crawler


Host: GitHub
URL: https://github.com/icio/gergle
Owner: icio
Created: 2016-02-07T10:51:22.000Z (over 10 years ago)
Default Branch: master
Last Pushed: 2016-03-05T13:36:48.000Z (over 10 years ago)
Last Synced: 2025-01-28T00:43:24.653Z (over 1 year ago)
Language: Go
Homepage:
Size: 48.8 KB
Stars: 1
Watchers: 4
Forks: 1
Open Issues: 2
Metadata Files:
- Readme: README.md

README

# :dizzy: gergle

`gergle` is a silly little website-scraping tool written in Go. Not coincidentally, it is very similar to [`crul`](http://github.com/icio/crul). It spawns a new goroutine for every request it makes, and it will attempt to abide by robots.txt unless you tell it otherwise.
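
For a feel of what "a goroutine per request" can look like when the number of open connections is capped, here is a minimal sketch. It is not gergle's actual code: `fetchAll`, its `fetch` callback and the `maxConns` parameter are hypothetical names, and `fetch` stands in for a real HTTP request so the example runs offline.

```go
package main

import (
	"fmt"
	"sync"
)

// fetchAll starts one goroutine per URL but lets at most maxConns of them
// call fetch at the same time. fetch is a stand-in for an HTTP request.
// (Hypothetical helper for illustration; not taken from gergle.)
func fetchAll(urls []string, maxConns int, fetch func(string) string) map[string]string {
	sem := make(chan struct{}, maxConns) // buffered channel used as a counting semaphore
	results := make(map[string]string, len(urls))
	var mu sync.Mutex // guards results
	var wg sync.WaitGroup

	for _, u := range urls {
		wg.Add(1)
		go func(u string) {
			defer wg.Done()
			sem <- struct{}{}        // take a connection slot
			defer func() { <-sem }() // give it back when done
			body := fetch(u)
			mu.Lock()
			results[u] = body
			mu.Unlock()
		}(u)
	}
	wg.Wait()
	return results
}

func main() {
	urls := []string{"/", "/about", "/contact"}
	pages := fetchAll(urls, 2, func(u string) string { return "body of " + u })
	fmt.Println(len(pages), "pages fetched")
}
```

The goroutines are cheap to start, so the cap is applied where each one waits for a slot rather than by limiting how many goroutines exist.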

## Installation

```
go get github.com/icio/gergle/cmd/gergle
```

## Usage

```
$ gergle -h
Website crawler.

Usage:
  gergle URL [flags]

Flags:
  -c, --connections int   Maximum number of open connections to the server. (default 5)
  -t, --delay float       The number of seconds between requests to the server. (default -1)
  -d, --depth value       Maximum crawl depth. (default 100)
  -i, --disallow value    Disallowed paths. (default [])
      --long              List all of the links and assets from a page.
  -q, --quiet             No logging to stderr.
  -v, --verbose           Verbose output logging.
      --zero              The number of bothers to give about robots.txt.
```
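
The `--disallow` flag takes path prefixes to skip. How gergle matches them internally is not documented here, so the following is only a sketch under the assumption of plain prefix matching with an optional leading slash. `isDisallowed` is a hypothetical helper, not part of gergle.

```go
package main

import (
	"fmt"
	"strings"
)

// isDisallowed reports whether path starts with any of the given prefixes.
// A prefix without a leading slash gets one added, so "forum" is treated
// as "/forum". (Assumed behaviour for illustration; not taken from gergle.)
func isDisallowed(path string, prefixes []string) bool {
	for _, p := range prefixes {
		if !strings.HasPrefix(p, "/") {
			p = "/" + p
		}
		if strings.HasPrefix(path, p) {
			return true
		}
	}
	return false
}

func main() {
	disallow := []string{"forum"}
	for _, p := range []string{"/forum/thread/1", "/tutorials/", "/forums"} {
		fmt.Println(p, isDisallowed(p, disallow))
	}
}
```

Note that plain prefix matching also catches `/forums`, because it starts with `/forum`.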

## Examples

``` bash
# Crawl paul-scott.com with one second between each page request,
# listing all links and assets.
$ gergle http://www.paul-scott.com/ -t 1 --long

# Crawl kirupa.com, excluding /forum*, up to three levels deep (first page is
# depth 0), ignoring robots.txt and using up to 30 simultaneous connections.
# 640 pages in 9 seconds on my local machine.
$ gergle -q https://www.kirupa.com/ --zero -c 30 -d 3 -iforum
```
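
To make the depth counting concrete (the first page is depth 0), here is a small breadth-first sketch over an in-memory link graph. It is an illustration rather than gergle's implementation: `crawl` and the `links` map are hypothetical, and the map replaces fetching a page and extracting its links.

```go
package main

import "fmt"

// crawl walks links breadth-first from start and returns the depth at which
// each page was first seen. The start page is depth 0, and links found on
// pages at maxDepth are not followed.
// (Hypothetical helper for illustration; not taken from gergle.)
func crawl(start string, maxDepth int, links map[string][]string) map[string]int {
	depths := map[string]int{start: 0}
	queue := []string{start}
	for len(queue) > 0 {
		page := queue[0]
		queue = queue[1:]
		if depths[page] >= maxDepth {
			continue // deep enough: record the page but do not follow its links
		}
		for _, next := range links[page] {
			if _, seen := depths[next]; !seen {
				depths[next] = depths[page] + 1
				queue = append(queue, next)
			}
		}
	}
	return depths
}

func main() {
	site := map[string][]string{
		"/":    {"/a", "/b"},
		"/a":   {"/a/1"},
		"/a/1": {"/a/1/x"},
	}
	fmt.Println(crawl("/", 2, site))
}
```

With a maximum depth of 2, `/a/1` is still visited (depth 2), but `/a/1/x` would be depth 3 and is never reached.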

## Todo

- [ ] Actual tests -- something beyond [manual testing](https://github.com/icio/crawler-target) :disappointed:
- [ ] First-class tracking of redirects and canonical URLs
- [ ] Vendoring of dependencies
