https://github.com/rtlong/web-spider
Simple Web Crawler/Spider written in Go
- Host: GitHub
- URL: https://github.com/rtlong/web-spider
- Owner: rtlong
- Created: 2014-07-20T10:34:56.000Z (almost 11 years ago)
- Default Branch: master
- Last Pushed: 2014-07-22T18:20:53.000Z (almost 11 years ago)
- Last Synced: 2025-01-12T12:37:32.226Z (5 months ago)
- Language: Go
- Size: 152 KB
- Stars: 2
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# web-spider
This is a web spider. It's very much a work in progress, but it's working well
enough to perhaps be useful to someone. I'm new to Go, so please excuse any
glaring/horrid mistakes.

## Goals
The intention of this project was to create a spider much like a search
engine's, except that I'm not interested in saving or indexing the fetched
pages. This spider is meant for scanning a site, verifying that there are no
broken links or dead pages, and collecting response times and other stats
about each response, but not necessarily saving the response itself.
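To make that concrete, here is a minimal sketch of the fetch-and-record step the spider is built around: request a page, note the status code and response time, and throw the body away. This is illustrative only, not this repository's code; the `check` function and its output format are invented for the example.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"time"
)

// check fetches one URL, recording status and response time but
// discarding the body, since the goal is verification, not indexing.
func check(rawURL string) {
	start := time.Now()
	resp, err := http.Get(rawURL)
	if err != nil {
		fmt.Printf("%s ERROR %v\n", rawURL, err)
		return
	}
	defer resp.Body.Close()
	// Drain the body so the connection can be reused, but throw it away.
	io.Copy(io.Discard, resp.Body)
	fmt.Printf("%s %d %v\n", rawURL, resp.StatusCode, time.Since(start))
}

func main() {
	check("http://example.com")
}
```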
## Usage

```shell
% go get github.com/rtlong/web-spider
% web-spider http://example.com
```
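Note that on current Go toolchains (1.17 and later), `go get` no longer installs binaries; the equivalent install command is:

```shell
% go install github.com/rtlong/web-spider@latest
```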
## TODO

- check `<img>`, `<link>`, `<script>`, and `<iframe>` tag URLs in addition to `<a>` hrefs
- ensure `href="//blah.com/foo"` URLs are not ignored due to the `URL.Scheme` assertion (see the sketch after this list)
- add tests!
- improve output
- add more configurability:
- ability to add extra headers during requests
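As a rough illustration of the first two TODO items, the sketch below scans several tag types for URL attributes and resolves scheme-relative references like `//blah.com/foo` against the page's base URL instead of dropping them. It uses `golang.org/x/net/html`; the `extractLinks` function and the tag/attribute table are assumptions for the example, not this project's actual code.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"

	"golang.org/x/net/html"
)

// urlAttr maps each tag of interest to the attribute that carries its URL.
var urlAttr = map[string]string{
	"a": "href", "link": "href",
	"img": "src", "script": "src", "iframe": "src",
}

// extractLinks tokenizes an HTML document and returns every URL found in
// the tags above, resolved against base so relative and scheme-relative
// references become absolute.
func extractLinks(base *url.URL, body string) []string {
	var out []string
	z := html.NewTokenizer(strings.NewReader(body))
	for {
		tt := z.Next()
		if tt == html.ErrorToken {
			return out // io.EOF (or a parse error) ends the scan
		}
		if tt != html.StartTagToken && tt != html.SelfClosingTagToken {
			continue
		}
		tok := z.Token()
		attrKey, ok := urlAttr[tok.Data]
		if !ok {
			continue
		}
		for _, a := range tok.Attr {
			if a.Key != attrKey {
				continue
			}
			// ResolveReference handles "/foo", "foo", and
			// scheme-relative "//blah.com/foo" forms alike.
			if ref, err := url.Parse(a.Val); err == nil {
				out = append(out, base.ResolveReference(ref).String())
			}
		}
	}
}

func main() {
	base, _ := url.Parse("http://example.com/")
	page := `<a href="//blah.com/foo">x</a><img src="/logo.png">`
	fmt.Println(extractLinks(base, page))
	// Output: [http://blah.com/foo http://example.com/logo.png]
}
```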