Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/paambaati/websight
๐ทA simple but *really* fast crawler built with Node.js & TypeScript
https://github.com/paambaati/websight
coding-challenge crawler interview-questions javascript monzo nodejs typescript
Last synced: about 2 months ago
JSON representation
๐ทA simple but *really* fast crawler built with Node.js & TypeScript
- Host: GitHub
- URL: https://github.com/paambaati/websight
- Owner: paambaati
- License: wtfpl
- Created: 2019-07-14T02:35:13.000Z (over 5 years ago)
- Default Branch: master
- Last Pushed: 2023-10-01T23:17:10.000Z (over 1 year ago)
- Last Synced: 2024-11-27T14:11:47.768Z (2 months ago)
- Topics: coding-challenge, crawler, interview-questions, javascript, monzo, nodejs, typescript
- Language: TypeScript
- Homepage:
- Size: 1.34 MB
- Stars: 18
- Watchers: 2
- Forks: 13
- Open Issues: 10
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# websight
![CI Status](https://github.com/paambaati/websight/workflows/build/badge.svg) [![Test Coverage](https://api.codeclimate.com/v1/badges/170cf9f21bdb38fd63a1/test_coverage)](https://codeclimate.com/github/paambaati/websight/test_coverage) [![Maintainability](https://api.codeclimate.com/v1/badges/170cf9f21bdb38fd63a1/maintainability)](https://codeclimate.com/github/paambaati/websight/maintainability) [![WTFPL License](https://img.shields.io/badge/License-WTFPL-blue.svg)](LICENSE)
A simple crawler that fetches all pages in a given website and prints the links between them.
![Screenshot](SCREENSHOT.png)
๐ฃ Note that this project was purpose-built for a coding challenge (see [problem statement](PROBLEM-STATEMENT.md)) and is not meant for production use (unless you aren't [web scale](http://www.mongodb-is-web-scale.com/) yet).
### ๐ ๏ธ Setup
Before you run this app, make sure you have [Node.js](https://nodejs.org/en/) installed. [`yarn`](https://yarnpkg.com/lang/en/docs/install) is recommended, but can be used interchangeably with `npm`. If you'd prefer running everything inside a Docker container, see the [Docker setup](#docker-setup) section.
```bash
git clone https://github.com/paambaati/websight
cd websight
yarn install && yarn build
```#### ๐ฉ๐ปโ๐ป Usage
```bash
yarn start
```#### ๐งช Tests & Coverage
```bash
yarn run coverage
```### ๐ณ Docker Setup
```bash
docker build -t websight .
docker run -ti websight
```
### ๐ฆ Executable Binary```bash
yarn bundle && yarn binary
```This produces standalone executable binaries for both Linux and macOS.
## ๐งฉ Design
```
+---------------------+
| Link Extractor |
| +-----------------+ |
| | | |
| | URL Resolver | |
| | | |
| +-----------------+ |
+-----------------+ | +-----------------+ | +-----------------+
| | | | | | | |
| Crawler +---->+ | Fetcher | +---->+ Sitemap |
| | | | | | | |
+-----------------+ | +-----------------+ | +-----------------+
| +-----------------+ |
| | | |
| | Parser | |
| | | |
| +-----------------+ |
+---------------------+
```The `Crawler` class runs a fast non-deterministic fetch of all pages (via `LinkExtractor`) & the URLs in them recursively and saves them in `Sitemap`. When crawling is complete[[1]](#f1), the sitemap is printed as a ASCII tree.
The `LinkExtractor` class is a thin orchestrating wrapper around 3 core components โ
1. `URLResolver` includes logic for resolving relative URLs and normalizing them. It also includes utility methods for filtering out external URLs.
2. `Fetcher` takes a URL, fetches it and returns the response as a [`Stream`](https://nodejs.org/api/stream.html#stream_stream). This is better because streams can be read in small buffered chunks, avoiding holding very large HTMLs in memory.
3. `Parser` parses the HTML stream (returned by `Fetcher`) in chunks and emits the `link` event on each page URL and the `asset` event on each static asset found in the HTML.
1 `Crawler.crawl()` is an `async` function that _never resolves_ because it is technically impossible to detect when we've finished crawling. In most runtimes, we'd have to implement some kind of idle polling to detect completion; however, in Node.js, as soon as the event loop has no more tasks to execute, the main process will run to completion. This is why we finally print the sitemap in the [`Process.beforeExit`](https://nodejs.org/api/process.html#process_event_beforeexit) event. [โฉ](#a1)
## ๐ Optimizations
1. Streams all the way down.
The key workloads in this system are HTTP fetches (I/O-bound) and HTML parses (CPU-bound), and either can be time-consuming and/or high on memory usage. To better parallelize the crawls and use as little memory as possible, [`got` library's streaming API](https://www.npmjs.com/package/got#streams) and the _very_ fast [`htmlparser2`](https://github.com/fb55/htmlparser2#performance) have been used.
2. Keep-Alive connections.
The `Fetcher` class uses a global [`keepAlive`](https://nodejs.org/api/http.html#http_new_agent_options) agent to reuse sockets as we're only crawling a single domain. This helps avoid re-establishing TCP connections for each request.
## โก๏ธ Limitations
When ramping up for scale, this design exposes a few of its limitations โ
1. No rate-limiting.
Most modern and large websites have some sort of throttling set up to block bots. A production-grade crawler should implement _some_ [politeness policy](https://en.wikipedia.org/wiki/Web_crawler#Politeness_policy) to make sure it doesn't inadverdently bring down a website, and so it doesn't run into permanent bans & `429` error responses.
2. In-memory state management.
`Sitemap().sitemap` is an unbound `Map`, and can quickly grow and possibly cause the runtime to run out of memory & crash when crawling very large websites. In a production-grade crawler, there should an external scheduler that holds URLs to crawl next.