Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/yusuftaufiq/cli-website-crawler
Non-blocking CLI based application to recursively crawl data from whole pages on websites in parallel and save the results to HTML output. Built with Node.js, TypeScript, NestJs, and Playwright.
cli crawling nestjs nodejs playwright scraping typescript
Last synced: 4 days ago
JSON representation
- Host: GitHub
- URL: https://github.com/yusuftaufiq/cli-website-crawler
- Owner: yusuftaufiq
- Created: 2023-07-20T23:04:57.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2023-07-21T08:07:36.000Z (over 1 year ago)
- Last Synced: 2024-10-12T11:35:38.362Z (3 months ago)
- Topics: cli, crawling, nestjs, nodejs, playwright, scraping, typescript
- Language: HTML
- Homepage:
- Size: 2.63 MB
- Stars: 1
- Watchers: 1
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
## Table of Contents
- [Description](#description)
- [How it works](#how-it-works)
- [Technical details](#technical-details)
- [Installation](#installation)
- [Usage](#usage)
- [TODO](#todo)

## Description
Non-blocking CLI based application to recursively crawl data from whole pages on websites in parallel and save the results to HTML output.
![Overview](./assets/overview.png)
## How it works
- Open pages using [Playwright](https://playwright.dev/).
- On each page, find new links by collecting the HTML `a` elements.
- Keep only links that point to the same domain and are allowed by `robots.txt`.
- Add those links to the request queue.
- Skip duplicate URLs.
- Visit recently queued links.
- Repeat the process (see the sketch below).
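
Taken together, the steps above form a breadth-first crawl loop. Below is a minimal TypeScript sketch of that loop using Playwright; the queue, the `seen` set, and the `isAllowedByRobots` stub are illustrative assumptions, not the project's actual code, and for clarity the sketch visits pages sequentially rather than in parallel.

```typescript
import { chromium, Browser } from 'playwright';

// Illustrative state; the real project keeps state in NestJS services.
const queue: string[] = ['https://books.toscrape.com/'];
const seen = new Set<string>(queue); // O(1) duplicate checks

// Stub: a real implementation fetches and parses /robots.txt per host.
const isAllowedByRobots = (_url: string): boolean => true;

async function crawl(browser: Browser): Promise<void> {
  while (queue.length > 0) {
    const url = queue.shift()!;
    const page = await browser.newPage();
    await page.goto(url);

    // Find all <a href> links on the page.
    const links = await page.$$eval('a[href]', (anchors) =>
      anchors.map((a) => (a as HTMLAnchorElement).href),
    );

    for (const link of links) {
      // Same-domain, robots-allowed, and not yet seen.
      const sameDomain = new URL(link).origin === new URL(url).origin;
      if (sameDomain && isAllowedByRobots(link) && !seen.has(link)) {
        seen.add(link);   // skip duplicates
        queue.push(link); // visit recently queued links next
      }
    }
    await page.close();
  }
}

(async () => {
  const browser = await chromium.launch();
  await crawl(browser);
  await browser.close();
})();
```

## Technical details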
- Tech stack: [Node.js](https://nodejs.org/en), [TypeScript](https://www.typescriptlang.org/), [NestJs](https://nestjs.com/), [Playwright](https://playwright.dev/)
- Data structures: [Hash Map](./src/robots/robots.service.ts#L29) and [Hash Set](./src/crawl/handlers/default.handler.ts#L40) for O(1) insertion and lookup (illustrated after this list).
- Architecture: modules, services, and commands separated by feature.
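
As a rough illustration of that data-structure choice (hypothetical names, not the project's API): per-host `robots.txt` rules can be cached in a `Map` and enqueued URLs tracked in a `Set`, both offering O(1) average insertion and lookup.

```typescript
// Illustrative only; these names are hypothetical.
const robotsRulesByHost = new Map<string, string[]>(); // Hash Map
const enqueuedUrls = new Set<string>();                // Hash Set

// Cache parsed robots.txt disallow rules per host: O(1) insert.
function cacheRules(host: string, disallowedPaths: string[]): void {
  robotsRulesByHost.set(host, disallowedPaths);
}

// Enqueue a URL only once: O(1) membership test and insert.
function shouldEnqueue(url: string): boolean {
  if (enqueuedUrls.has(url)) return false;
  enqueuedUrls.add(url);
  return true;
}
```

## Installation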
- Requirements
- Node.js >= 18
- Clone this repository
```bash
$ git clone https://github.com/yusuftaufiq/cli-website-crawler.git
```
- Change to the cloned directory and install all required dependencies (may take a while)
```bash
$ npm install
```
- Build the application
```bash
$ npm run build
```
- Start the CLI application; all available features are described in the [Usage](#usage) section
```bash
$ npm run start:prod -- crawl
```
If the command runs successfully, all results will be saved to [./storage/key_value_stores](./storage/key_value_stores).

## Usage
- Show all available commands
```bash
$ npm run start:prod -- --help
$ npm run start:prod -- crawl --help
```
- Customize the targets to crawl (default: https://cmlabs.co/ https://www.sequence.day/ https://yusuftaufiq.com)
```bash
$ npm run start:prod -- crawl https://books.toscrape.com/ https://quotes.toscrape.com/
```
- Control the verbosity of log messages (choices: "off", "error", "soft_fail", "warning", "info", "debug", "perf", default: "info")
```bash
$ npm run start:prod -- crawl --log-level warning
```
- Set the maximum concurrency (parallelism) for the crawl (default: 15)
```bash
$ npm run start:prod -- crawl --max-concurrency 100
```
- Set the maximum number of pages the crawler will open; the crawl stops when this limit is reached (default: 50)
```bash
$ npm run start:prod -- crawl --max-requests 1000
```
- Set the timeout, in seconds, within which each request handler must complete (default: 30)
```bash
$ npm run start:prod -- crawl --timeout 10
```
- Run the browser in headful (visible) mode (default: false)
```bash
$ npm run start:prod -- crawl --headful
```

## TODO
- Prioritize sitemap.xml
- Add proxies
- Watch out for honeypots
- Adopt a CAPTCHA-solving service