{"id":36755692,"url":"https://github.com/bp72/crwl","last_synced_at":"2026-01-12T12:49:27.728Z","repository":{"id":205888160,"uuid":"683824695","full_name":"bp72/crwl","owner":"bp72","description":null,"archived":false,"fork":false,"pushed_at":"2024-03-23T11:48:20.000Z","size":4789,"stargazers_count":32,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2024-06-21T16:55:56.549Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/bp72.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-08-27T20:13:44.000Z","updated_at":"2024-03-04T17:01:01.000Z","dependencies_parsed_at":"2024-01-16T20:05:48.879Z","dependency_job_id":"4bed441f-8d97-4d2a-b366-b140d8e99517","html_url":"https://github.com/bp72/crwl","commit_stats":null,"previous_names":["bp72/crwl"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/bp72/crwl","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bp72%2Fcrwl","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bp72%2Fcrwl/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bp72%2Fcrwl/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bp72%2Fcrwl/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/bp72","download_url":"https://codeload.github.com/bp72/crwl/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/r
epositories/bp72%2Fcrwl/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28338983,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-12T12:22:26.515Z","status":"ssl_error","status_checked_at":"2026-01-12T12:22:10.856Z","response_time":98,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2026-01-12T12:49:27.665Z","updated_at":"2026-01-12T12:49:27.717Z","avatar_url":"https://github.com/bp72.png","language":"Go","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003ch1\u003eCrawler\u003c/h1\u003e\n\n**crwl** is an open source web crawler written in Go that lets you traverse an entire site. 
Using it, you can scan, benchmark, and validate your site, for example by evaluating its [connected components](https://en.wikipedia.org/wiki/Component_(graph_theory)) or [internal PageRank](https://en.wikipedia.org/wiki/PageRank).\n\n### Motivation\nI needed to crawl a site as-is for several reasons: to build the site structure as a graph, to validate it, and to benchmark it.\n\n# Get Started\n#### Clone repo\n```\ngit clone git@github.com:bp72/crwl.git\n```\n\n#### Build\n```\nmake build\n```\n\n#### Run\n```\nbin/crwl -domain example.com -use-internal-cache -max-depth 3 -max-workers 5\n```\n\n\n# Crawler arch\n![alt text](https://github.com/bp72/crwl/blob/feature/update-readme-to-provide-more-context/crawler-arc.png?raw=true)\n\n\n# Web Crawler Features\n- Start from the root domain and crawl web pages up to a specified depth.\n- Save the pages\n- Support logging and statsd metrics\n\n# TODO Features\n- Add WebUI to control and manage crawler\n- Add Crawl delay support per domain\n- Add Data storage interface to support FS, ClickHouse, RDB\n- Add logic to respect robots.txt\n- Add Grafana dashboard to repo\n- Add docker-compose to set up and run crawler with external service dependencies\n- Add condition to save page content to storage, for example keyword or URL pattern\n\n\n# Options\n\n#### Benchmark/Test mode\nSometimes you just need to traverse your site without storing the content, to check that everything works or to see how far you can go. 
In this case you can use the **-do-not-store** option, which disables content storage:\n```\nbin/crwl -do-not-store\n```\n\n#### Setting up limits\n\nMaximum crawl limit\nThis option caps the number of pages crawled. The default is 100k pages.\n```\nbin/crwl -max-crawl 1234\n```\n\nMaximum depth limits how deep the crawler can go. The default is 7.\n```\nbin/crwl -max-depth 1\n```\n\nMaximum number of workers limits how many crawlers run concurrently. The default is 20.\n```\nbin/crwl -max-workers 2\n```\n\n#### Run without any external service dependency\nThe crawler can run standalone (without other services). This configuration is memory-bound, however, because it keeps the URL queue and the set of visited URLs in memory.\n```\nbin/crwl -use-internal-cache\n```\n\n# Metrics and logging\nThe crawler supports publishing statsd metrics. To enable it:\n```\nbin/crwl -statsd-addr hostname:port\n```\n\n### Roadmap\n- [x] Define crawler arch\n- [x] Implement initial crawler version\n- [ ] Add WebUI to control and manage crawler\n- [ ] Add Crawl delay support per domain\n- [ ] Add Data storage interface to support FS, ClickHouse, RDB\n- [ ] Respect robots.txt\n- [ ] Add Grafana dashboard to repo\n- [ ] Add docker-compose to set up and run crawler with external service dependencies\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbp72%2Fcrwl","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbp72%2Fcrwl","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbp72%2Fcrwl/lists"}