{"id":17151055,"url":"https://github.com/simonrichardson/crwlr","last_synced_at":"2025-03-24T12:22:43.873Z","repository":{"id":66364250,"uuid":"91888797","full_name":"SimonRichardson/crwlr","owner":"SimonRichardson","description":"Crawl all the things!","archived":false,"fork":false,"pushed_at":"2017-06-12T09:20:52.000Z","size":60,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-01-29T17:31:01.523Z","etag":null,"topics":["crawler","meshuggah"],"latest_commit_sha":null,"homepage":"","language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/SimonRichardson.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2017-05-20T12:49:15.000Z","updated_at":"2017-06-12T16:15:03.000Z","dependencies_parsed_at":"2023-02-20T16:01:12.360Z","dependency_job_id":null,"html_url":"https://github.com/SimonRichardson/crwlr","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SimonRichardson%2Fcrwlr","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SimonRichardson%2Fcrwlr/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SimonRichardson%2Fcrwlr/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SimonRichardson%2Fcrwlr/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/SimonRichardson","download_url":"https://codeload.github.com/SimonRichardson/
crwlr/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":245267699,"owners_count":20587487,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["crawler","meshuggah"],"created_at":"2024-10-14T21:37:10.197Z","updated_at":"2025-03-24T12:22:43.865Z","avatar_url":"https://github.com/SimonRichardson.png","language":"Go","funding_links":[],"categories":[],"sub_categories":[],"readme":"# crwlr\n\n## Command Crawler\n\n - [Getting started](#getting-started)\n - [Introduction](#introduction)\n - [Static](#static)\n - [Crawl](#crawl)\n - [Reports](#reports)\n - [Tests](#tests)\n - [Improvements](#improvements)\n\n### Getting started\n\nThe crwlr command expects to have some things pre-installed via `go get` if you\nwould like to build the project.\n\n - go get github.com/Masterminds/glide\n - go get github.com/mjibson/esc\n\n-----\n\nQuick guide to getting started, this assumes you've got the `$GOPATH` setup\ncorrectly and the gopath bin folder is in your `$PATH`:\n\n```\nglide install\nmake clean all\ncd dist\n\n./crwlr crawl -addr=\"http://google.com\"\n```\n\n### Introduction\n\nThe crwlr CLI is split up into two distinctive commands, `static` and `crawl`.\n`static` command is only an aid to help manually test the `crawl` command along\nwith various benchmarking/integration tests.\n\n### Static\n\nThe `static` command creates a series of pages that allow the `crawl` command to\nwalk, without hitting an external host. 
To help integration with `crawl`, the\n`static` command can be used in combination with a pipe to send the current\naddress, which allows fast iterative testing.\n\nThe following command launches the CLI:\n\n```\ncrwlr static\n```\n\nIn combination with the `crawl` command, an extra argument is required.\n\n```\ncrwlr static -output.addr=true | crwlr crawl\n```\n\nAlso available is a descriptive `-help` section to better understand what\nthe `static` command can do:\n\n```\ncrwlr static -help\nUSAGE\n  static [flags]\n\nFLAGS\n  -api tcp://0.0.0.0:7650  listen address for static APIs\n  -debug false             debug logging\n  -output.addr false       Output address writes the address to stdout\n  -output.prefix -addr=    Output prefix prefixes the flag to the output.addr\n  -ui.local true           Use local files straight from the file system\n```\n\n### Crawl\n\nThe `crawl` command walks a host for potential new URLs that it can in turn\ntraverse. The command can be configured (on by default) to check the `robots.txt`\nof the host and follow its rules for crawling.\n\nThe command uses aggressive caching to improve performance and to\nbe more efficient when crawling a host.\n\nAs part of the command it's also possible to output a report (on by default)\nof what was crawled and expose some metrics about what went on. 
These include\nmetrics such as requested vs. received, filtered, and errored.\n\n - Requested is when a request is sent to the host; it's not yet known if that request\n was actually successful.\n - Received is the acknowledgement of the request succeeding.\n - Filtered describes if the host was cached already.\n - Errorred states if the request failed for some reason.\n\nThe following command launches the CLI:\n\n```\ncrwlr crawl -addr=\"http://yourhosthere.com\"\n```\n\nAlso available is a comprehensive `-help` section:\n\n```\ncrwlr crawl -help\nUSAGE\n  crawl [flags]\n\nFLAGS\n  -addr 0.0.0.0:0                                                         addr to start crawling\n  -debug false                                                            debug logging\n  -filter.same-domain true                                                filter other domains that aren't the same\n  -follow-redirects true                                                  should the crawler follow redirects\n  -report.metrics false                                                   report the metric outcomes of the crawl\n  -report.sitemap true                                                    report the sitemap of the crawl\n  -robots.crawl-delay false                                               use the robots.txt crawl delay when crawling\n  -robots.request true                                                    request the robots.txt when crawling\n  -useragent.full Mozilla/5.0 (compatible; crwlr/0.1; +http://crwlr.com)  full user agent the crawler should use\n  -useragent.robot Googlebot (crwlr/0.1)                                  robot user agent the crawler should use\n\n```\n\n### Reports\n\nThe reporting part of the command outputs two different types of information:\nsitemap reporting and metric reporting. 
Both reports can be turned off behind\na series of flags.\n\n#### Sitemap Reports\n\nWhen the command is done the sitemap report can be outputted (on by default),\nwhich explains what was linked to what and also includes a list of static assets\nthat was also linked in the file.\n\nA possible output is as follows:\n\n```\ndist/crwlr crawl\n URL                              | Ref Links                   | Ref Assets                        |\n http://0.0.0.0:7650/robots.txt   |                             |                                   |\n http://0.0.0.0:7650              |                             |                                   |\n                                  | http://0.0.0.0:7650/index   | http://0.0.0.0:7650/index.css     |\n                                  | http://0.0.0.0:7650/page1   | http://google.com/bootstrap.css   |\n                                  | http://0.0.0.0:7650/bad     | http://0.0.0.0:7650/image.jpg     |\n                                  |                             | http://google.com/image.jpg       |\n http://0.0.0.0:7650/index        |                             |                                   |\n                                  |                             | http://0.0.0.0:7650/index.css     |\n                                  |                             | http://google.com/bootstrap.css   |\n                                  |                             | http://0.0.0.0:7650/image.jpg     |\n                                  |                             | http://google.com/image.jpg       |\n http://0.0.0.0:7650/page1        |                             |                                   |\n                                  | http://0.0.0.0:7650/page2   | http://0.0.0.0:7650/index1.css    |\n                                  |                             | http://google.com/bootstrap.css   |\n                                  |                             | http://0.0.0.0:7650/image2.jpg    |\n          
                        |                             | http://google.com/image.jpg       |\n http://0.0.0.0:7650/bad          |                             |                                   |\n http://0.0.0.0:7650/page2        |                             |                                   |\n                                  | http://0.0.0.0:7650/page    |                                   |\n                                  | http://0.0.0.0:7650/page3   |                                   |\n http://0.0.0.0:7650/page         |                             |                                   |\n http://0.0.0.0:7650/page3        |                             |                                   |\n```\n\n#### Metric Reports\n\nWhen the command is done a report can be outputted (off by default), which can\nhelp explain what the crawl actually requested vs what it filtered for example.\n\nExample report using the `static` command is as follows:\n\n```\ndist/crwlr crawl -report.metrics=true\n URL                              | Avg Duration (ms)   | Requested   | Received   | Filtered   | Errorred   |\n http://0.0.0.0:7650/page         | 0                   | 1           | 0          | 0          | 1          |\n http://0.0.0.0:7650/page3        | 0                   | 1           | 0          | 1          | 0          |\n http://0.0.0.0:7650/robots.txt   | 5                   | 1           | 1          | 0          | 0          |\n http://0.0.0.0:7650              | 1                   | 1           | 1          | 0          | 0          |\n http://0.0.0.0:7650/index        | 0                   | 1           | 1          | 3          | 0          |\n http://0.0.0.0:7650/page1        | 1                   | 1           | 1          | 2          | 0          |\n http://0.0.0.0:7650/bad          | 0                   | 1           | 0          | 1          | 1          |\n http://0.0.0.0:7650/page2        | 0                   | 1           | 1          | 0          
| 0          |\n\n Totals   | Duration (ms)   |\n          | 9560            |\n```\n\n### Tests\n\nTests can be run using the following command, which also runs a series of\nbenchmarks:\n\n```\n go test -v -bench=. $(glide nv)\n```\n\n### Improvements\n\nPossible improvements:\n\n - Store the URLs in a key-value store so that a crawler can truly work in a distributed fashion, especially if\n the host is large or if it's allowed to crawl beyond the host.\n - Potentially better strategies to walk assets at a later date to backfill the\n metrics.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsimonrichardson%2Fcrwlr","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsimonrichardson%2Fcrwlr","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsimonrichardson%2Fcrwlr/lists"}