https://github.com/kodemartin/webcrawler

A simple webcrawler

Last synced: 11 months ago

Host: GitHub
URL: https://github.com/kodemartin/webcrawler
Owner: kodemartin
License: mit
Created: 2022-10-21T18:29:38.000Z (over 3 years ago)
Default Branch: main
Last Pushed: 2022-11-27T19:04:17.000Z (over 3 years ago)
Last Synced: 2025-03-21T00:44:49.546Z (over 1 year ago)
Topics: crawler, rust
Language: Rust
Homepage:
Size: 14.6 KB
Stars: 0
Watchers: 2
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

README

# `webcrawler`

A library for enabling breadth-first crawling starting from a root URL.

## Features

* Dynamically set maximum concurrent tasks
* Dynamically set maximum number of pages to visit
* Skips duplicate pages
* Stores visited pages in the `webpages` directory
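
The breadth-first traversal, the page limit, and the duplicate check listed above can be sketched as follows. This is a minimal illustration and not the library's actual implementation: the function `crawl_order` and the in-memory `links` map are hypothetical stand-ins for fetching a page and extracting its links, and concurrency and on-disk storage are left out.

```rust
use std::collections::{HashMap, HashSet, VecDeque};

/// Returns the order in which pages would be visited, breadth-first,
/// starting from `root` and stopping after `max_pages` pages.
///
/// `links` is a hypothetical in-memory link graph (page -> outgoing links)
/// used here instead of real HTTP requests.
fn crawl_order(root: &str, links: &HashMap<&str, Vec<&str>>, max_pages: usize) -> Vec<String> {
    let mut visited: Vec<String> = Vec::new();
    // Every URL that was ever enqueued; used to skip duplicates.
    let mut seen: HashSet<String> = HashSet::new();
    // FIFO queue: pages closer to the root are visited first.
    let mut queue: VecDeque<String> = VecDeque::new();

    seen.insert(root.to_string());
    queue.push_back(root.to_string());

    while let Some(url) = queue.pop_front() {
        // Enforce the page limit before "visiting" another page.
        if visited.len() >= max_pages {
            break;
        }
        if let Some(children) = links.get(url.as_str()) {
            for child in children {
                // `insert` returns false when the URL was already seen.
                if seen.insert(child.to_string()) {
                    queue.push_back(child.to_string());
                }
            }
        }
        visited.push(url);
    }
    visited
}
```

With a graph where `a` links to `b` and `c`, and both of those link to `d`, the function visits `a`, `b`, `c`, `d` in that order, and only `a`, `b` when `max_pages` is 2.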

## Command-line application

The library backs a command-line program (`crawler-cli`) with the
following interface:

```
$ cargo run -- --help

A command-line application that launches a crawler starting from a root url, and
descending to nested urls in a breadth-first manner

Usage: crawler-cli [OPTIONS]

Arguments:
  The root url to start the crawling from

Options:
      --max-tasks  Max number of concurrent tasks to trigger [default: 5]
      --max-pages  Max number of pages to visit [default: 100]
      --n-workers  Number of workers. By default this equals the number of
                   available cores
  -h, --help       Print help information
  -V, --version    Print version information
```

## Limitations

* Links with relative URLs are not handled
* `robots.txt` is not handled
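
To make the first limitation concrete, the sketch below separates links that carry their own `http`/`https` scheme from relative ones. It is an assumption about what "not handled" means in practice, not code from this repository; the helper names `is_absolute_http_url` and `followable_links` are hypothetical.

```rust
/// Returns true when `href` carries its own `http` or `https` scheme.
/// Relative links such as `/about` or `../index.html` return false.
fn is_absolute_http_url(href: &str) -> bool {
    // Scheme names are case-insensitive, so compare in lowercase.
    let lower = href.trim().to_ascii_lowercase();
    lower.starts_with("http://") || lower.starts_with("https://")
}

/// Keeps only the links a crawler without relative-URL resolution could follow.
fn followable_links<'a>(hrefs: &[&'a str]) -> Vec<&'a str> {
    hrefs
        .iter()
        .copied()
        .filter(|h| is_absolute_http_url(h))
        .collect()
}
```

Supporting relative links would additionally require resolving each one against the URL of the page it was found on.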
