{"id":34705746,"url":"https://github.com/oparaskos/simple-web-crawler","last_synced_at":"2026-05-22T12:05:43.002Z","repository":{"id":185397484,"uuid":"658464403","full_name":"oparaskos/simple-web-crawler","owner":"oparaskos","description":"Job Interview Tech Test","archived":false,"fork":false,"pushed_at":"2023-06-25T20:21:50.000Z","size":70,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2023-08-01T20:54:03.747Z","etag":null,"topics":["interview","interview-test","kotlin","monzo","web-crawler"],"latest_commit_sha":null,"homepage":"","language":"Kotlin","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/oparaskos.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2023-06-25T20:21:28.000Z","updated_at":"2023-08-01T20:54:08.323Z","dependencies_parsed_at":null,"dependency_job_id":"bc67a4e1-64b5-4f81-9478-7f00094bfbd9","html_url":"https://github.com/oparaskos/simple-web-crawler","commit_stats":null,"previous_names":["oparaskos/simple-web-crawler"],"tags_count":null,"template":null,"template_full_name":null,"purl":"pkg:github/oparaskos/simple-web-crawler","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/oparaskos%2Fsimple-web-crawler","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/oparaskos%2Fsimple-web-crawler/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/oparaskos%2Fsimple-web-crawler/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/oparaskos%2Fsimple-web-crawler/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/host
s/GitHub/owners/oparaskos","download_url":"https://codeload.github.com/oparaskos/simple-web-crawler/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/oparaskos%2Fsimple-web-crawler/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28012114,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-12-24T02:00:07.193Z","response_time":83,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["interview","interview-test","kotlin","monzo","web-crawler"],"created_at":"2025-12-24T23:19:58.712Z","updated_at":"2025-12-24T23:20:02.155Z","avatar_url":"https://github.com/oparaskos.png","language":"Kotlin","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Web Crawler 🕷️\n 🕸️A simple web crawler.🕸️\n\nGiven a starting URL, the crawler visits each URL it finds on the same domain and prints each URL visited with a list of links found on that page.\n\nThe crawler is limited to a single subdomain - so when you start with https://example.com/, it will crawl all pages on the example.com website, but not follow external links, for example to othersite.com or links to other subdomains e.g. 
forum.example.com or www.example.com.\n\nrobots.txt 🤖 is partially respected, and the crawler won't crawl pages with X-Robots-Tag: noindex, or a `\u003cmeta name=robots value=noindex /\u003e` in the HTML body.\n\nThe crawler outputs each page it visits on a new line along with the links out from that page (including external links) in the format `\"https://page.url/\" -\u003e {\"https://page.url/another/path\"; \"https://linked.site/\"}`\n\n[![asciicast](https://asciinema.org/a/pl1WhaWwcumPX4sF1SSa8c1Uj.svg)](https://asciinema.org/a/pl1WhaWwcumPX4sF1SSa8c1Uj)\n\n## 🧾 Prerequisites\n\nAssuming you have a working Java ☕️ runtime installed, the Gradle wrapper (`./gradlew`) should handle the rest (see [Troubleshooting](#Troubleshooting)).\nOn macOS you can use brew to install `openjdk`; on Windows, visit https://openjdk.org/; and on Linux you probably won't need to recompile the kernel. \n\n## ▶️ Running the Application\n\n```\n./gradlew run --args='[OPTIONS] https://example.com'\n```\n\n## ✅ Running Tests\n\n```\n./gradlew test\n```\n\n## 👆 Caveats\n\n* ✋ This isn't rate-limited; if you run it against something protected by a CDN (e.g. Cloudflare) you'll likely get banned.\n* Pages with dynamically loaded content (e.g. those which require JavaScript to be enabled) won't work properly unless they are also rendered server side.\n* 🤖 The crawler only has a very simple interpretation of robots.txt.\n* Error pages aren't crawled. This is by design, but it does mean any links the error pages take you to are also not crawled unless linked elsewhere.\n* 🖇️ Non-HTML pages (e.g. PDF downloads) aren't crawled but leave litter in the error logs. \n\n## 📝 Other Notes\n\n* java.net.URL is marked deprecated, but I'm still using it here because it handles \"malformed\" URLs better than the suggested replacement java.net.URI. 
For instance, URLs containing spaces are considered invalid by URI, but behave more as expected using URL.\n\n----\n\n## 🫥 Outline\n\nIllustrative pseudocode outline\n\n```pascal\nprogram Crawler;\n// push to a queue of urls still to check, and to a set of already checked urls\nfunction queue_push();\n\n// return true if url in already queued list\nfunction already_queued();\n\n// return true if url host part matches\nfunction same_host();\n\n// return true if url in robots exclusion list (à la RFC-9309)\nfunction matches_robots();\n\n// Write something to the console for each page we find.\nfunction emit(url: string, document: HTMLDocument, anchors: List):\n    write(url)\n    write(' : ')\n    write(anchors)\n    write('\\n')\n\n\nfunction crawl(url: string):\n    html := get_page(url);\n    document := parse_html(html);\n    anchors := find_elements_by_tag_name(document, 'a')\n\n    emit(url, document, anchors)\n    \n    for i := 0 to len(anchors) - 1 do\n    begin\n        if (!has_href(anchors[i])) then continue;\n        if (already_queued(anchors[i])) then continue;\n        if (!same_host(anchors[i])) then continue;\n        if (matches_robots(anchors[i])) then continue;\n        queue_push(anchors[i])\n    end;\n\nbegin\n    crawl(first_url)\n    while !queue_empty() do\n        crawl(queue_pop())\n    end;\nend.\n```\n\n\n## 🆘 Troubleshooting\n\n### 🍎 macOS\n*Note: if you are using a Mac and installing openjdk via brew, you must also symlink it.*\n\n```zsh\nbrew install openjdk\n  ...\n    ==\u003e Pouring openjdk--20.0.1.ventura.bottle.tar.gz\n    ==\u003e Caveats\n    For the system Java wrappers to find this JDK, symlink it with\n      sudo ln -sfn /usr/local/opt/openjdk/libexec/openjdk.jdk /Library/Java/JavaVirtualMachines/openjdk.jdk\n  ...\n    ==\u003e Summary\n    🍺  /usr/local/Cellar/openjdk/20.0.1: 636 files, 322.4MB \n\nsudo ln -sfn /usr/local/opt/openjdk/libexec/openjdk.jdk /Library/Java/JavaVirtualMachines/openjdk.jdk\n```\n\nOtherwise you may 
encounter this error if you're running `./gradlew` on a Mac:\n\n    The operation couldn’t be completed. Unable to locate a Java Runtime that supports javaws.\n    Please visit http://www.java.com for information on installing Java.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Foparaskos%2Fsimple-web-crawler","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Foparaskos%2Fsimple-web-crawler","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Foparaskos%2Fsimple-web-crawler/lists"}