{"id":15140095,"url":"https://github.com/kraken-devs/websight","last_synced_at":"2026-01-20T14:33:15.348Z","repository":{"id":173076721,"uuid":"563436691","full_name":"Kraken-Devs/websight","owner":"Kraken-Devs","description":null,"archived":false,"fork":false,"pushed_at":"2023-03-02T00:02:50.000Z","size":1134,"stargazers_count":0,"open_issues_count":11,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-02-12T16:21:39.988Z","etag":null,"topics":["github","krakens","validation-error"],"latest_commit_sha":null,"homepage":null,"language":"TypeScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"wtfpl","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Kraken-Devs.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-11-08T15:57:43.000Z","updated_at":"2023-11-15T22:03:05.000Z","dependencies_parsed_at":null,"dependency_job_id":"7ade7f7a-dc3f-44c1-b5ed-0221dd6783b9","html_url":"https://github.com/Kraken-Devs/websight","commit_stats":{"total_commits":5,"total_committers":2,"mean_commits":2.5,"dds":"0.19999999999999996","last_synced_commit":"469890f492e0b47758c8106fa8c572127bde0ae3"},"previous_names":["kraken-devs/websight"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Kraken-Devs%2Fwebsight","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Kraken-Devs%2Fwebsight/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Kraken-Devs%2Fwebsight/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories
/Kraken-Devs%2Fwebsight/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Kraken-Devs","download_url":"https://codeload.github.com/Kraken-Devs/websight/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247471297,"owners_count":20944153,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["github","krakens","validation-error"],"created_at":"2024-09-26T08:01:26.067Z","updated_at":"2026-01-20T14:33:15.306Z","avatar_url":"https://github.com/Kraken-Devs.png","language":"TypeScript","readme":"# websight\n\n![CI Status](https://github.com/paambaati/websight/workflows/build/badge.svg) [![Test Coverage](https://api.codeclimate.com/v1/badges/170cf9f21bdb38fd63a1/test_coverage)](https://codeclimate.com/github/paambaati/websight/test_coverage) [![Maintainability](https://api.codeclimate.com/v1/badges/170cf9f21bdb38fd63a1/maintainability)](https://codeclimate.com/github/paambaati/websight/maintainability) [![WTFPL License](https://img.shields.io/badge/License-WTFPL-blue.svg)](LICENSE)\n\n\nA simple crawler that fetches all pages in a given website and prints the links between them.\n\n![Screenshot](SCREENSHOT.png)\n\n\u003csmall\u003e 📣 Note that this project was purpose-built for a coding challenge (see [problem statement](PROBLEM-STATEMENT.md)) and is not meant for production use (unless you aren't [web scale](http://www.mongodb-is-web-scale.com/) yet).\u003c/small\u003e\n\n### 🛠️ Setup\n\nBefore you run this app, make sure you have [Node.js](https://nodejs.org/en/) installed. 
[`yarn`](https://yarnpkg.com/lang/en/docs/install) is recommended, but can be used interchangeably with `npm`. If you'd prefer running everything inside a Docker container, see the [Docker setup](#docker-setup) section.\n\n```bash\ngit clone https://github.com/paambaati/websight\ncd websight\nyarn install \u0026\u0026 yarn build\n```\n\n#### 👩🏻‍💻 Usage\n```bash\nyarn start \u003cwebsite\u003e\n```\n\n#### 🧪 Tests \u0026 Coverage\n```bash\nyarn run coverage\n```\n\n### 🐳 Docker Setup\n\n```bash\ndocker build -t websight .\ndocker run -ti websight \u003cwebsite\u003e\n```\n### 📦 Executable Binary\n\n```bash\nyarn bundle \u0026\u0026 yarn binary\n```\n\nThis produces standalone executable binaries for both Linux and macOS.\n\n## 🧩 Design\n\n```\n                                            +---------------------+                        \n                                            |   Link Extractor    |                        \n                                            | +-----------------+ |                        \n                                            | |                 | |                        \n                                            | |   URL Resolver  | |                        \n                                            | |                 | |                        \n                                            | +-----------------+ |                        \n                    +-----------------+     | +-----------------+ |     +-----------------+\n                    |                 |     | |                 | |     |                 |\n                    |     Crawler     +----\u003e+ |     Fetcher     | +----\u003e+     Sitemap     |\n                    |                 |     | |                 | |     |                 |\n                    +-----------------+     | +-----------------+ |     +-----------------+\n                                            | +-----------------+ |                        \n                            
                 | |                 | |                        \n                                            | |     Parser      | |                        \n                                            | |                 | |                        \n                                            | +-----------------+ |                        \n                                            +---------------------+                        \n```\n\nThe `Crawler` class runs a fast non-deterministic fetch of all pages (via `LinkExtractor`) \u0026 the URLs in them recursively and saves them in `Sitemap`. When crawling is complete\u003csup id=\"a1\"\u003e[[1]](#f1)\u003c/sup\u003e, the sitemap is printed as an ASCII tree.\n\nThe `LinkExtractor` class is a thin orchestrating wrapper around 3 core components —\n\n1. `URLResolver` includes logic for resolving relative URLs and normalizing them. It also includes utility methods for filtering out external URLs.\n2. `Fetcher` takes a URL, fetches it and returns the response as a [`Stream`](https://nodejs.org/api/stream.html#stream_stream). Streams can be read in small buffered chunks, which avoids holding very large HTML documents in memory.\n3. `Parser` parses the HTML stream (returned by `Fetcher`) in chunks and emits a `link` event for each page URL and an `asset` event for each static asset found in the HTML.\n\n\u003chr/\u003e\n\n\u003cb id=\"f1\"\u003e\u003csup\u003e1\u003c/sup\u003e\u003c/b\u003e `Crawler.crawl()` is an `async` function that _never resolves_ because it is technically impossible to detect when we've finished crawling. In most runtimes, we'd have to implement some kind of idle polling to detect completion; however, in Node.js, as soon as the event loop has no more tasks to execute, the main process will run to completion. This is why we finally print the sitemap in the [`Process.beforeExit`](https://nodejs.org/api/process.html#process_event_beforeexit) event. [↩](#a1)\n\n## 🏎 Optimizations\n\n1. 
Streams all the way down.\n\n    The key workloads in this system are HTTP fetches (I/O-bound) and HTML parsing (CPU-bound), and either can be time-consuming and/or memory-intensive. To better parallelize crawls and keep memory usage low, the [`got` library's streaming API](https://www.npmjs.com/package/got#streams) and the _very_ fast [`htmlparser2`](https://github.com/fb55/htmlparser2#performance) are used.\n\n2. Keep-Alive connections.\n\n    The `Fetcher` class uses a global [`keepAlive`](https://nodejs.org/api/http.html#http_new_agent_options) agent to reuse sockets, as we're only crawling a single domain. This avoids re-establishing a TCP connection for each request.\n\n## ⚡️ Limitations\n\nWhen ramping up for scale, this design exposes a few of its limitations —\n\n1. No rate-limiting.\n\n    Most modern and large websites have some sort of throttling set up to block bots. A production-grade crawler should implement _some_ [politeness policy](https://en.wikipedia.org/wiki/Web_crawler#Politeness_policy) to make sure it doesn't inadvertently bring down a website, and doesn't run into permanent bans \u0026 `429` error responses.\n\n2. In-memory state management.\n\n    `Sitemap().sitemap` is an unbounded `Map` that can grow quickly and crash the runtime with an out-of-memory error when crawling very large websites. In a production-grade crawler, an external scheduler should hold the URLs to crawl next.\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkraken-devs%2Fwebsight","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fkraken-devs%2Fwebsight","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkraken-devs%2Fwebsight/lists"}