https://github.com/kodemartin/webcrawler

# `webcrawler`

A library for enabling breadth-first crawling starting from a root URL.

## Features

* Dynamically set maximum number of concurrent tasks
* Dynamically set maximum number of pages to visit
* Skips duplicate pages
* Stores visited pages in the `webpages` directory
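
The features above can be illustrated with a minimal, standard-library-only sketch of a breadth-first traversal. It is not this library's implementation: the `crawl` function and the in-memory `links` map (a stand-in for fetching a page and extracting its links) are assumptions made for illustration. The sketch shows how a queue yields breadth-first order, how a set of seen URLs skips duplicates, and how a page limit ends the crawl.

```rust
use std::collections::{HashMap, HashSet, VecDeque};

/// Hypothetical helper: visit pages breadth-first starting from `root`,
/// skipping duplicates and stopping after `max_pages` visits.
/// `links` maps each URL to the URLs found on that page; a real crawler
/// would obtain them by fetching and parsing the page.
fn crawl(root: &str, links: &HashMap<&str, Vec<&str>>, max_pages: usize) -> Vec<String> {
    let mut visited = Vec::new();
    let mut seen: HashSet<&str> = HashSet::new();
    let mut queue: VecDeque<&str> = VecDeque::new();

    seen.insert(root);
    queue.push_back(root);

    // A FIFO queue guarantees that pages closer to the root are visited first.
    while let Some(url) = queue.pop_front() {
        if visited.len() >= max_pages {
            break;
        }
        visited.push(url.to_string());
        if let Some(children) = links.get(url) {
            for &child in children {
                // `insert` returns false for URLs that were already seen.
                if seen.insert(child) {
                    queue.push_back(child);
                }
            }
        }
    }
    visited
}

fn main() {
    let mut links = HashMap::new();
    links.insert("https://example.com/", vec!["https://example.com/a", "https://example.com/b"]);
    links.insert("https://example.com/a", vec!["https://example.com/b", "https://example.com/c"]);
    links.insert("https://example.com/b", vec!["https://example.com/"]);

    for url in crawl("https://example.com/", &links, 3) {
        println!("{url}");
    }
}
```

Marking a URL as seen when it is enqueued, rather than when it is visited, keeps the same page from entering the queue twice.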

## Command-line application

The library ships with a command-line program (`crawler-cli`) that
exposes its functionality through the following interface:

```
$ cargo run -- --help

A command-line application that launches a crawler starting from a root url, and
descending to nested urls in a breadth-first manner

Usage: crawler-cli [OPTIONS]

Arguments:
  The root url to start the crawling from

Options:
      --max-tasks  Max number of concurrent tasks to trigger [default: 5]
      --max-pages  Max number of pages to visit [default: 100]
      --n-workers  Number of workers. By default this equals the number of
                   available cores
  -h, --help       Print help information
  -V, --version    Print version information
```
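
To give a rough picture of what `--max-tasks` and `--n-workers` control, here is a standard-library-only sketch. It is an assumption-laden illustration, not the crate's code: `default_workers` and `run_bounded` are hypothetical names, and the crawler may well bound concurrency differently (for example with an async runtime). The sketch caps concurrency by running tasks in batches of at most `max_tasks` threads, and derives a default worker count from the number of available cores.

```rust
use std::thread;

/// Hypothetical default for the worker count: the number of available
/// cores, falling back to 1 if it cannot be determined.
fn default_workers() -> usize {
    thread::available_parallelism().map(|n| n.get()).unwrap_or(1)
}

/// Hypothetical helper: apply `task` to every URL, running at most
/// `max_tasks` of them at the same time. Results keep the input order.
fn run_bounded<F>(urls: &[&str], max_tasks: usize, task: F) -> Vec<usize>
where
    F: Fn(&str) -> usize + Sync,
{
    let task = &task;
    let mut results = Vec::with_capacity(urls.len());
    // Each chunk holds at most `max_tasks` URLs, and a chunk must finish
    // before the next one starts, which bounds the concurrency.
    for chunk in urls.chunks(max_tasks.max(1)) {
        let chunk_results: Vec<usize> = thread::scope(|s| {
            let handles: Vec<_> = chunk
                .iter()
                .map(|&url| s.spawn(move || task(url)))
                .collect();
            handles
                .into_iter()
                .map(|h| h.join().expect("task panicked"))
                .collect()
        });
        results.extend(chunk_results);
    }
    results
}

fn main() {
    let urls = ["https://example.com/", "https://example.com/a", "https://example.com/b"];
    // Stand-in task: a real crawler would fetch the page here.
    let lengths = run_bounded(&urls, 2, |url| url.len());
    println!("workers: {}, lengths: {:?}", default_workers(), lengths);
}
```

Batching is the simplest way to enforce the cap, at the cost of waiting for the slowest task in each batch before starting the next one.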

## Limitations

* Links with relative URLs are not handled
* `robots.txt` is not handled