{"id":27334852,"url":"https://github.com/dmuth/cat-crawler","last_synced_at":"2025-04-12T14:46:29.809Z","repository":{"id":57610973,"uuid":"10592253","full_name":"dmuth/cat-crawler","owner":"dmuth","description":"A webcrawler I wrote in Golang that I can use to find and download cat pictures.","archived":false,"fork":false,"pushed_at":"2020-09-03T22:21:05.000Z","size":117,"stargazers_count":28,"open_issues_count":0,"forks_count":5,"subscribers_count":6,"default_branch":"main","last_synced_at":"2024-06-20T12:25:27.681Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/dmuth.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.md","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2013-06-10T02:19:24.000Z","updated_at":"2023-05-19T10:39:41.000Z","dependencies_parsed_at":"2022-09-11T02:51:44.469Z","dependency_job_id":null,"html_url":"https://github.com/dmuth/cat-crawler","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dmuth%2Fcat-crawler","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dmuth%2Fcat-crawler/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dmuth%2Fcat-crawler/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dmuth%2Fcat-crawler/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/dmuth","download_url":"https://codeload.github.com/dmuth/cat-crawler/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_c
ount":248585235,"owners_count":21128969,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-04-12T14:46:29.006Z","updated_at":"2025-04-12T14:46:29.798Z","avatar_url":"https://github.com/dmuth.png","language":"Go","funding_links":[],"categories":[],"sub_categories":[],"readme":"## Cat Crawler\n\nA webcrawler I'm writing in Golang that I can use to find and download cat pictures.\n\n### Installation\n\n- Make sure your GOPATH environment variable is set up properly:\n   `export GOPATH=$HOME/golib`\n- Make sure the bin directory is in your path:\n   `PATH=$PATH:$GOPATH/bin`\n- Now install the package\n   `go get -v github.com/dmuth/cat-crawler`\n\n### Running the crawler\n    cat-crawler [--seed-url url[,url[,url[...]]]] [ --num-connections n ] [--allow-urls [url,[url,[...]]]] [--search-string cat]\n        --seed-url What URL to start at? More than one URL may be \n            specified in comma-delimited format.\n        --num-connections How many concurrent connections?\n        --search-string A string we want to search for in ALT and TITLE attributes on images\n        --allow-urls If specified, only URLs starting with the URLs listed here are crawled\n        --stats Print out stats once a second using my stats package\n\n### Examples\n    cat-crawler --seed-url cnn.com --num-connections 1\nGet top stories. :-)\n\n    cat-crawler --seed-url (any URL) --num-connections 1000\nThis will saturate your download bandwidth. 
Seriously, don't do it.\n\n    cat-crawler --seed-url cnn.com  --num-connections 1 --allow-urls cnn.com\nDon't leave CNN's website\n\n    cat-crawler --seed-url cnn.com  --num-connections 1 --allow-urls foobar\nAfter crawling the first page, nothing will happen.  Oops.\n\n### Sequence diagram\n\n![Sequence Diagram](https://raw.github.com/dmuth/cat-crawler/master/docs/sequence-diagram.png \"Sequence Diagram\")\n\n\n### Development\n\n    go get -v github.com/dmuth/cat-crawler \u0026\u0026 cat-crawler [options]\n\n\n### Running the tests\n\n    go get -v -a github.com/dmuth/procedural-webserver # Dependency\n    go test -v github.com/dmuth/cat-crawler\n\nYou should see results like this:\n\n    === RUN TestSplitHostnames\n    --- PASS: TestSplitHostnames (0.00 seconds)\n    === RUN TestHtmlNew\n    --- PASS: TestHtmlNew (0.00 seconds)\n    === RUN TestHtmlBadImg\n    --- PASS: TestHtmlBadImg (0.00 seconds)\n    === RUN TestHtmlLinksAndImages\n    --- PASS: TestHtmlLinksAndImages (0.00 seconds)\n    === RUN TestHtmlNoLinks\n    --- PASS: TestHtmlNoLinks (0.00 seconds)\n    === RUN TestHtmlNoImages\n    --- PASS: TestHtmlNoImages (0.00 seconds)\n    === RUN TestHtmlNoLinksNorImages\n    --- PASS: TestHtmlNoLinksNorImages (0.00 seconds)\n    === RUN TestHtmlPortNumberInBaseUrl\n    --- PASS: TestHtmlPortNumberInBaseUrl (0.00 seconds)\n    === RUN TestGetFilenameFromUrl\n    --- PASS: TestGetFilenameFromUrl (0.00 seconds)\n    === RUN Test\n    --- PASS: Test (0.00 seconds)\n    === RUN TestFilterUrl\n    --- PASS: TestFilterUrl (0.00 seconds)\n    === RUN TestIsUrlAllowed\n    --- PASS: TestIsUrlAllowed (0.00 seconds)\n    PASS\n    ok      github.com/dmuth/cat-crawler    0.037s\n\n\n### Dependencies\n\nThis repo uses other packages I wrote:\n- [log4go](https://github.com/dmuth/google-go-log4go)\n- [golang-stats](https://github.com/dmuth/golang-stats)\n\n\n### Bugs\n\n- I am not accessing the maps inside of an array.\n    - Fix: A separate source file, with a single 
goroutine which services requests through a channel is a possibility\n\n\n### TODO\n\n- Rate limiting by domain in URL crawler\n\t- I could have an array of key=domain, value=count and a goroutine \n\t\tthat decrements count regularly\n\t\t- Could get a bit crazy on the memory, though!\n- Write instrumentation to detect how many goroutines are active/idle\n\t- GoStatStart(key)\n\t- GoStatStop(key)\n\t- go GoStatDump(interval)\n\n\n### Contact\n\nQuestions? Complaints? Here's my contact info: http://www.dmuth.org/contact\n\n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdmuth%2Fcat-crawler","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdmuth%2Fcat-crawler","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdmuth%2Fcat-crawler/lists"}