Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
Toy domain crawler
https://github.com/aybabtme/crawler
- Host: GitHub
- URL: https://github.com/aybabtme/crawler
- Owner: aybabtme
- License: MIT
- Created: 2014-05-11T06:23:19.000Z (over 10 years ago)
- Default Branch: master
- Last Pushed: 2014-05-12T14:27:31.000Z (over 10 years ago)
- Last Synced: 2024-06-20T22:35:11.105Z (5 months ago)
- Language: Go
- Homepage: http://godoc.org/github.com/aybabtme/crawler
- Size: 1.26 MB
- Stars: 2
- Watchers: 3
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Crawler
[![Build Status](https://drone.io/github.com/aybabtme/crawler/status.png)](https://drone.io/github.com/aybabtme/crawler/latest)
[![Coverage Status](https://coveralls.io/repos/aybabtme/crawler/badge.png?branch=master)](https://coveralls.io/r/aybabtme/crawler?branch=master)

A simple domain crawler.
* Respects `robots.txt`.
* Doesn't leave the domain it's given.
* Doesn't visit sub-domains.

# Crawl things!
Install the crawler:
```
go get github.com/aybabtme/crawler/cmd/crawl
```

Use it, pointing `-h` at the root URL to crawl and `-f` at the file where the sitemap should be written:
```
crawl -h http://antoine.im -f antoineim_map.json
```

It should print something like:
```
2014/05/11 02:28:04 starting crawl on http://antoine.im
2014/05/11 02:28:05 [crawler] root has 1 elements
2014/05/11 02:28:05 [crawler] fringe=10 found=12 (new=10, rejected=2) source="http://antoine.im"
2014/05/11 02:28:06 [crawler] fringe=10 found=27 (new=1, rejected=20) source="http://antoine.im/posts/someone_was_right_on_the_internet"
...
2014/05/11 02:28:07 [crawler] fringe=0 found=0 (new=0, rejected=0) source="http://antoine.im/assets/data/to_buffer_or_not_to_buffer/t1_micro_bench_1.0MB.svg"
2014/05/11 02:28:07 [crawler] done crawling, 15 resources, 45 links
2014/05/11 02:28:07 preparing sitemap
2014/05/11 02:28:07 saving to "antoineim_map.json"
2014/05/11 02:28:07 done in 3.006155429s
```
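If you'd rather kick off crawls from a Go program than from the shell, one simple option is to exec the same CLI. A minimal sketch, assuming `crawl` is already installed on your `PATH` (the URL and output file are just the ones from the example above):

```go
package main

import (
	"log"
	"os"
	"os/exec"
)

func main() {
	// Invoke the crawl CLI with the flags shown above:
	// -h for the root URL, -f for the sitemap output file.
	cmd := exec.Command("crawl", "-h", "http://antoine.im", "-f", "antoineim_map.json")
	cmd.Stdout = os.Stdout // progress logs like the ones above end up here
	cmd.Stderr = os.Stderr
	if err := cmd.Run(); err != nil {
		log.Fatalf("crawl failed: %v", err)
	}
}
```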
You can then use the output file, for instance to count how many links point to a 404 (needs [jq](https://stedolan.github.io/jq/)):

```
jq < mysite.com.json '.resources | map(select(.status_code == 404)) | length'
```
Or find out which pages led to those 404s:
```
jq < mysite.com.json '.resources | map(select(.status_code == 404)) | [.[].refered_by[]] | unique'
```

# Use the lib!

If you want to use the library:
```
go get github.com/aybabtme/crawler
```

The godocs are on [godoc](http://godoc.org/github.com/aybabtme/crawler) (lol).
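The README doesn't spell out the package API, so treat the following as a purely hypothetical sketch of what embedding the crawler could look like. The `crawler.Crawl` function and the `Resources` field below are assumptions, not the package's confirmed interface; check the godoc link above for the real entry points.

```go
package main

import (
	"log"

	"github.com/aybabtme/crawler"
)

func main() {
	// Hypothetical entry point: see the godoc for the actual API.
	sitemap, err := crawler.Crawl("http://antoine.im")
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("crawled %d resources", len(sitemap.Resources))
}
```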
# Test it!
```
go get -t github.com/aybabtme/crawler
make test
```

To view the coverage report:
```
make cover
```

# The output
The output of a crawl is a list of resources, along with:
* What they refer to (this resource points to something).
* What they are referred from (something points to this resource).
* The status code returned when fetching the resource.

The status code is interesting: it might show that you have dead links (404), for instance.

Here's a snippet from crawling my [blog](http://antoine.im). The full map can be found [here](sample_map.json) if you want to see it.

```json
{
  "resource_count": 15,
  "link_count": 45,
  "resources": [
    {
      "url": "http://antoine.im/posts/dynamic_programming_for_the_lazy",
      "refered_by": [
        "http://antoine.im"
      ],
      "refers_to": [
        "http://antoine.im",
        "http://antoine.im/assets/css/brog.css",
        "http://antoine.im/assets/css/font-awesome.min.css",
        "http://antoine.im/assets/css/styles/github.css",
        "http://antoine.im/assets/js/algo_convenience_hacks.js",
        "http://antoine.im/assets/js/brog.js"
      ],
      "status_code": 200
    },
    {
      "url": "http://antoine.im/assets/css/brog.css",
      "refered_by": [
        "http://antoine.im",
        "http://antoine.im/posts/someone_was_right_on_the_internet",
        "http://antoine.im/posts/someone_is_wrong_on_the_internet",
        "http://antoine.im/posts/the_story_of_select_and_the_goroutines",
        "http://antoine.im/posts/dynamic_programming_for_the_lazy",
        "http://antoine.im/posts/to_buffer_or_not_to_buffer",
        "http://antoine.im/posts/correction_hacks"
      ],
      "refers_to": [],
      "status_code": 200
    }
    // ...
  ]
}
```
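Since the output is plain JSON, the earlier jq queries can also be done in Go. A small sketch that decodes a map file and replicates both one-liners, counting the 404s and listing the unique pages that refer to them. The struct fields are taken from the sample above (including the `refered_by` spelling), and `mysite.com.json` is just the file name used in the jq examples:

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"os"
)

// Resource mirrors one entry of the "resources" array in the crawl output.
type Resource struct {
	URL        string   `json:"url"`
	ReferedBy  []string `json:"refered_by"` // spelling matches the crawler's output
	RefersTo   []string `json:"refers_to"`
	StatusCode int      `json:"status_code"`
}

// Sitemap mirrors the top-level crawl output.
type Sitemap struct {
	ResourceCount int        `json:"resource_count"`
	LinkCount     int        `json:"link_count"`
	Resources     []Resource `json:"resources"`
}

func main() {
	f, err := os.Open("mysite.com.json")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	var sm Sitemap
	if err := json.NewDecoder(f).Decode(&sm); err != nil {
		log.Fatal(err)
	}

	// Count the 404s and collect the unique pages linking to them,
	// like the two jq one-liners above.
	referrers := make(map[string]struct{})
	count := 0
	for _, res := range sm.Resources {
		if res.StatusCode != 404 {
			continue
		}
		count++
		for _, by := range res.ReferedBy {
			referrers[by] = struct{}{}
		}
	}
	fmt.Printf("%d resources returned 404\n", count)
	for page := range referrers {
		fmt.Println(page)
	}
}
```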