https://github.com/rtlong/web-spider
Simple Web Crawler/Spider written in Go
- Host: GitHub
- URL: https://github.com/rtlong/web-spider
- Owner: rtlong
- Created: 2014-07-20T10:34:56.000Z (almost 11 years ago)
- Default Branch: master
- Last Pushed: 2014-07-22T18:20:53.000Z (almost 11 years ago)
- Last Synced: 2025-01-12T12:37:32.226Z (5 months ago)
- Language: Go
- Size: 152 KB
- Stars: 2
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# web-spider
This is a web spider. It's very much a work in progress, but it's working well
enough to perhaps be useful to someone. I'm new to Go, so please excuse any
glaring/horrid mistakes.

## Goals
The intention of this project was to create a spider much like a search
engine's, except that I'm not interested in saving or indexing the fetched
pages. This spider is meant for scanning a site, verifying that there are no
broken links or dead pages, and collecting response times and other stats
about each response, but not necessarily saving the response itself.
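To make that concrete, here is a minimal sketch of the fetch-and-record step the spider is built around: request a page, note the status code and response time, and throw the body away. This is illustrative only, not this repository's code; the `check` function and its output format are invented for the example.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"time"
)

// check fetches one URL, recording status and response time but
// discarding the body, since the goal is verification, not indexing.
func check(rawURL string) {
	start := time.Now()
	resp, err := http.Get(rawURL)
	if err != nil {
		fmt.Printf("%s ERROR %v\n", rawURL, err)
		return
	}
	defer resp.Body.Close()
	// Drain the body so the connection can be reused, but throw it away.
	io.Copy(io.Discard, resp.Body)
	fmt.Printf("%s %d %v\n", rawURL, resp.StatusCode, time.Since(start))
}

func main() {
	check("http://example.com")
}
```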
## Usage

```shell
% go get github.com/rtlong/web-spider
% web-spider http://example.com
```
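Note that on current Go toolchains (1.17 and later), `go get` no longer installs binaries; the equivalent install command is:

```shell
% go install github.com/rtlong/web-spider@latest
```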
## TODO

- check `<img>`, `<link>`, `<script>`, and `<iframe>` tag URLs in addition to `<a>` hrefs
- ensure `href="//blah.com/foo"` URLs are not ignored due to the `URL.Scheme` assertion (see the sketch after this list)
- add tests!
- improve output
- add more configurability:
- ability to add extra headers during requests
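As a rough illustration of the first two TODO items, the sketch below scans several tag types for URL attributes and resolves scheme-relative references like `//blah.com/foo` against the page's base URL instead of dropping them. It uses `golang.org/x/net/html`; the `extractLinks` function and the tag/attribute table are assumptions for the example, not this project's actual code.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"

	"golang.org/x/net/html"
)

// urlAttr maps each tag of interest to the attribute that carries its URL.
var urlAttr = map[string]string{
	"a": "href", "link": "href",
	"img": "src", "script": "src", "iframe": "src",
}

// extractLinks tokenizes an HTML document and returns every URL found in
// the tags above, resolved against base so relative and scheme-relative
// references become absolute.
func extractLinks(base *url.URL, body string) []string {
	var out []string
	z := html.NewTokenizer(strings.NewReader(body))
	for {
		tt := z.Next()
		if tt == html.ErrorToken {
			return out // io.EOF (or a parse error) ends the scan
		}
		if tt != html.StartTagToken && tt != html.SelfClosingTagToken {
			continue
		}
		tok := z.Token()
		attrKey, ok := urlAttr[tok.Data]
		if !ok {
			continue
		}
		for _, a := range tok.Attr {
			if a.Key != attrKey {
				continue
			}
			// ResolveReference handles "/foo", "foo", and
			// scheme-relative "//blah.com/foo" forms alike.
			if ref, err := url.Parse(a.Val); err == nil {
				out = append(out, base.ResolveReference(ref).String())
			}
		}
	}
}

func main() {
	base, _ := url.Parse("http://example.com/")
	page := `<a href="//blah.com/foo">x</a><img src="/logo.png">`
	fmt.Println(extractLinks(base, page))
	// Output: [http://blah.com/foo http://example.com/logo.png]
}
```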