https://github.com/dmuth/cat-crawler
A webcrawler I wrote in Golang that I can use to find and download cat pictures.
- Host: GitHub
- URL: https://github.com/dmuth/cat-crawler
- Owner: dmuth
- License: other
- Created: 2013-06-10T02:19:24.000Z (about 13 years ago)
- Default Branch: main
- Last Pushed: 2020-09-03T22:21:05.000Z (almost 6 years ago)
- Last Synced: 2024-06-20T12:25:27.681Z (almost 2 years ago)
- Language: Go
- Homepage:
- Size: 114 KB
- Stars: 28
- Watchers: 6
- Forks: 5
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE.md
README
## Cat Crawler
A webcrawler I'm writing in Golang that I can use to find and download cat pictures.
### Installation
- Make sure your GOPATH environment variable is set up properly:
  `export GOPATH=$HOME/golib`
- Make sure the bin directory is in your path:
  `PATH=$PATH:$GOPATH/bin`
- Now install the package:
  `go get -v github.com/dmuth/cat-crawler`
### Running the crawler
    cat-crawler [--seed-url url[,url[,url[...]]]] [ --num-connections n ] [--allow-urls [url,[url,[...]]]] [--search-string cat]

        --seed-url          What URL to start at? More than one URL may be
                            specified in comma-delimited format.
        --num-connections   How many concurrent connections?
        --search-string     A string we want to search for in ALT and TITLE attributes on images
        --allow-urls        If specified, only URLs starting with the URLs listed here are crawled
        --stats             Print out stats once a second using my stats package
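The `--seed-url` and `--num-connections` options suggest a familiar pattern: split a comma-delimited list, then fetch with a bounded number of goroutines. The following is a minimal sketch of that pattern, not the crawler's actual code. The names `splitSeedURLs`, `crawlAll`, and `fetchFunc` are hypothetical, and the fetch step is a stand-in function rather than a real HTTP request.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"sync"
)

// fetchFunc is a hypothetical stand-in for an HTTP fetch.
type fetchFunc func(url string) string

// splitSeedURLs turns a comma-delimited value such as "a.com,b.com"
// into a slice, trimming whitespace and dropping empty entries.
func splitSeedURLs(value string) []string {
	var urls []string
	for _, part := range strings.Split(value, ",") {
		if u := strings.TrimSpace(part); u != "" {
			urls = append(urls, u)
		}
	}
	return urls
}

// crawlAll fetches every URL using at most numConnections goroutines
// and returns the results sorted so the order is stable.
func crawlAll(urls []string, numConnections int, fetch fetchFunc) []string {
	if numConnections < 1 {
		numConnections = 1
	}
	jobs := make(chan string)
	results := make(chan string)

	// Start a fixed pool of workers; each one handles one URL at a time.
	var wg sync.WaitGroup
	for i := 0; i < numConnections; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for u := range jobs {
				results <- fetch(u)
			}
		}()
	}

	// Feed the jobs, then close the channel so the workers exit.
	go func() {
		for _, u := range urls {
			jobs <- u
		}
		close(jobs)
	}()

	// Close results once every worker has finished.
	go func() {
		wg.Wait()
		close(results)
	}()

	var out []string
	for r := range results {
		out = append(out, r)
	}
	sort.Strings(out)
	return out
}

func main() {
	urls := splitSeedURLs("cnn.com, example.com,,example.org")
	fake := func(u string) string { return "fetched " + u }
	for _, line := range crawlAll(urls, 2, fake) {
		fmt.Println(line)
	}
}
```

Because the pool size is fixed, raising `numConnections` raises the number of simultaneous fetches directly, which is why a very large value can saturate a connection.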
### Examples

    cat-crawler --seed-url cnn.com --num-connections 1

Get top stories. :-)

    cat-crawler --seed-url (any URL) --num-connections 1000

This will saturate your download bandwidth. Seriously, don't do it.

    cat-crawler --seed-url cnn.com --num-connections 1 --allow-urls cnn.com

Don't leave CNN's website.

    cat-crawler --seed-url cnn.com --num-connections 1 --allow-urls foobar

After crawling the first page, nothing will happen. Oops.
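The `--allow-urls` behaviour in the last two examples can be pictured as a prefix check. Here is a rough sketch, assuming the check is a plain string-prefix comparison made after dropping the scheme. The function names `stripScheme` and `isAllowed` are hypothetical, and the crawler's own normalization may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// stripScheme removes a leading "http://" or "https://" so that
// "cnn.com" and "https://cnn.com" compare equal. This normalization
// is an assumption made for the sketch.
func stripScheme(u string) string {
	u = strings.TrimPrefix(u, "https://")
	u = strings.TrimPrefix(u, "http://")
	return u
}

// isAllowed reports whether url starts with any entry in allowed.
// An empty allow list permits every URL.
func isAllowed(url string, allowed []string) bool {
	if len(allowed) == 0 {
		return true
	}
	target := stripScheme(url)
	for _, prefix := range allowed {
		if strings.HasPrefix(target, stripScheme(prefix)) {
			return true
		}
	}
	return false
}

func main() {
	allowed := []string{"cnn.com"}
	fmt.Println(isAllowed("http://cnn.com/world", allowed))
	fmt.Println(isAllowed("http://example.com/", allowed))
	fmt.Println(isAllowed("http://cnn.com/", []string{"foobar"}))
}
```

With an allow list of `foobar`, no discovered link passes the check, which matches the "nothing will happen" outcome above. A plain prefix test is deliberately simple: in this sketch `www.cnn.com` would not match `cnn.com`.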
### Sequence diagram

### Development
    go get -v github.com/dmuth/cat-crawler && cat-crawler [options]
### Running the tests
    go get -v -a github.com/dmuth/procedural-webserver # Dependency
    go test -v github.com/dmuth/cat-crawler

You should see results like this:

    === RUN TestSplitHostnames
    --- PASS: TestSplitHostnames (0.00 seconds)
    === RUN TestHtmlNew
    --- PASS: TestHtmlNew (0.00 seconds)
    === RUN TestHtmlBadImg
    --- PASS: TestHtmlBadImg (0.00 seconds)
    === RUN TestHtmlLinksAndImages
    --- PASS: TestHtmlLinksAndImages (0.00 seconds)
    === RUN TestHtmlNoLinks
    --- PASS: TestHtmlNoLinks (0.00 seconds)
    === RUN TestHtmlNoImages
    --- PASS: TestHtmlNoImages (0.00 seconds)
    === RUN TestHtmlNoLinksNorImages
    --- PASS: TestHtmlNoLinksNorImages (0.00 seconds)
    === RUN TestHtmlPortNumberInBaseUrl
    --- PASS: TestHtmlPortNumberInBaseUrl (0.00 seconds)
    === RUN TestGetFilenameFromUrl
    --- PASS: TestGetFilenameFromUrl (0.00 seconds)
    === RUN Test
    --- PASS: Test (0.00 seconds)
    === RUN TestFilterUrl
    --- PASS: TestFilterUrl (0.00 seconds)
    === RUN TestIsUrlAllowed
    --- PASS: TestIsUrlAllowed (0.00 seconds)
    PASS
    ok github.com/dmuth/cat-crawler 0.037s
### Dependencies
This repo uses other packages I wrote:
- [log4go](https://github.com/dmuth/google-go-log4go)
- [golang-stats](https://github.com/dmuth/golang-stats)
### Bugs
- I am not accessing the maps inside of an array.
    - Fix: One possibility is a separate source file with a single goroutine that services requests through a channel.
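
The proposed fix, a single goroutine that owns a map and answers requests over a channel, might look roughly like the sketch below. This is an illustration only. The README does not say what the maps hold, so tracking already-seen URLs is an assumption, and the names `request`, `startSeenTracker`, and `markSeen` are hypothetical.

```go
package main

import "fmt"

// request asks the owning goroutine whether a URL has been seen before,
// marking it as seen in the same step.
type request struct {
	url   string
	reply chan bool
}

// startSeenTracker launches the single goroutine that owns the map.
// No other goroutine touches the map, so no mutex is needed.
func startSeenTracker() chan<- request {
	requests := make(chan request)
	go func() {
		seen := make(map[string]bool)
		for req := range requests {
			already := seen[req.url]
			seen[req.url] = true
			req.reply <- already
		}
	}()
	return requests
}

// markSeen records url and returns true if it was already recorded.
func markSeen(requests chan<- request, url string) bool {
	reply := make(chan bool)
	requests <- request{url: url, reply: reply}
	return <-reply
}

func main() {
	tracker := startSeenTracker()
	fmt.Println(markSeen(tracker, "http://example.com/"))
	fmt.Println(markSeen(tracker, "http://example.com/"))
}
```

Any number of crawler goroutines can call `markSeen` concurrently, since each call is serialized through the channel and the check-and-set happens in one place.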
### TODO
- Rate limiting by domain in URL crawler
    - I could have an array of key=domain, value=count and a goroutine that decrements count regularly
    - Could get a bit crazy on the memory, though!
- Write instrumentation to detect how many goroutines are active/idle
    - GoStatStart(key)
    - GoStatStop(key)
    - go GoStatDump(interval)
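
The per-domain rate limiting idea from the first TODO item could be sketched as follows, assuming a Go map keyed by domain and a background goroutine that lowers the counts on a timer. Nothing here exists in cat-crawler: `domainLimiter` and its methods are hypothetical names, and the sketch guards the map with a mutex for brevity rather than a channel. Deleting entries that reach zero is one way to address the memory concern.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// domainLimiter tracks recent requests per domain.
type domainLimiter struct {
	mu     sync.Mutex
	counts map[string]int
	limit  int
}

func newDomainLimiter(limit int) *domainLimiter {
	return &domainLimiter{counts: make(map[string]int), limit: limit}
}

// allow reports whether domain is under the limit and, if it is,
// records the request.
func (d *domainLimiter) allow(domain string) bool {
	d.mu.Lock()
	defer d.mu.Unlock()
	if d.counts[domain] >= d.limit {
		return false
	}
	d.counts[domain]++
	return true
}

// decay lowers every count by one and deletes entries that reach zero,
// which keeps the map from growing without bound.
func (d *domainLimiter) decay() {
	d.mu.Lock()
	defer d.mu.Unlock()
	for domain, n := range d.counts {
		if n <= 1 {
			delete(d.counts, domain)
		} else {
			d.counts[domain] = n - 1
		}
	}
}

// run calls decay on every tick until stop is closed.
func (d *domainLimiter) run(interval time.Duration, stop <-chan struct{}) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			d.decay()
		case <-stop:
			return
		}
	}
}

func main() {
	limiter := newDomainLimiter(2)
	stop := make(chan struct{})
	go limiter.run(time.Second, stop)
	defer close(stop)

	// With a limit of 2, the third request in quick succession is refused.
	for i := 0; i < 3; i++ {
		fmt.Println(limiter.allow("cnn.com"))
	}
}
```

A crawler worker would call `allow` before fetching and requeue or delay the URL when it returns false. The `limit` and decay interval together set the sustained request rate per domain.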
### Contact
Questions? Complaints? Here's my contact info: http://www.dmuth.org/contact