https://github.com/kodemartin/webcrawler
A simple webcrawler
- Host: GitHub
- URL: https://github.com/kodemartin/webcrawler
- Owner: kodemartin
- License: MIT
- Created: 2022-10-21T18:29:38.000Z
- Default Branch: main
- Last Pushed: 2022-11-27T19:04:17.000Z
- Last Synced: 2025-03-21T00:44:49.546Z
- Topics: crawler, rust
- Language: Rust
- Homepage:
- Size: 14.6 KB
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# `webcrawler`
A library for enabling breadth-first crawling starting from a root URL.
## Features
* Dynamically set maximum concurrent tasks
* Dynamically set maximum number of pages to visit
* Skips duplicate pages
* Stores visited pages in the `webpages` directory
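
To illustrate how breadth-first traversal, duplicate skipping, and a page limit fit together, here is a minimal sketch. It is not the crate's implementation: `crawl_bfs` is a hypothetical function, the in-memory `links` map stands in for fetching a page and extracting its URLs, and concurrency and writing to `webpages` are left out.

```rust
use std::collections::{HashMap, HashSet, VecDeque};

/// Visit pages breadth-first from `root`, skipping duplicates and
/// stopping once `max_pages` pages have been visited.
/// `links` is a stand-in for "fetch the page and extract its links".
fn crawl_bfs(root: &str, links: &HashMap<&str, Vec<&str>>, max_pages: usize) -> Vec<String> {
    let mut seen: HashSet<String> = HashSet::new();
    let mut queue: VecDeque<String> = VecDeque::new();
    let mut visited = Vec::new();

    seen.insert(root.to_string());
    queue.push_back(root.to_string());

    // A FIFO queue yields breadth-first order: every page at depth n
    // is dequeued before any page at depth n + 1.
    while let Some(url) = queue.pop_front() {
        if visited.len() >= max_pages {
            break;
        }
        visited.push(url.clone());

        if let Some(children) = links.get(url.as_str()) {
            for child in children {
                // `insert` returns false when the URL was already seen,
                // so each page is enqueued at most once.
                if seen.insert(child.to_string()) {
                    queue.push_back(child.to_string());
                }
            }
        }
    }
    visited
}

fn main() {
    // Hypothetical link graph used only for this example.
    let mut links = HashMap::new();
    links.insert("https://example.com/", vec!["https://example.com/a", "https://example.com/b"]);
    links.insert("https://example.com/a", vec!["https://example.com/b", "https://example.com/c"]);
    links.insert("https://example.com/b", vec!["https://example.com/"]);

    for url in crawl_bfs("https://example.com/", &links, 3) {
        println!("{url}");
    }
}
```

With a limit of 3, the sketch visits the root, then `/a`, then `/b`, and stops before reaching `/c`.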
## Command-line application
The library ships with a command-line program (`crawler-cli`) that
exposes its functionality through the following interface:
```
$ cargo run -- --help
A command-line application that launches a crawler starting from a root url, and
descending to nested urls in a breadth-first manner

Usage: crawler-cli [OPTIONS]

Arguments:
  The root url to start the crawling from

Options:
      --max-tasks  Max number of concurrent tasks to trigger [default: 5]
      --max-pages  Max number of pages to visit [default: 100]
      --n-workers  Number of workers. By default this equals the number of
                   available cores
  -h, --help       Print help information
  -V, --version    Print version information
```
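
The options and defaults in the help text above can be modelled with a small, standard-library-only sketch. This is an illustration under stated assumptions, not the program's actual argument handling: `CrawlOptions`, `parse_args`, and `parse_value` are hypothetical names, and only the space-separated `--flag value` form is handled.

```rust
use std::thread::available_parallelism;

/// Hypothetical container for the options listed in the help text.
#[derive(Debug, PartialEq)]
struct CrawlOptions {
    root_url: String,
    max_tasks: usize,
    max_pages: usize,
    n_workers: usize,
}

/// Parse the value that follows a flag as an unsigned integer.
fn parse_value(flag: &str, value: Option<String>) -> Result<usize, String> {
    let raw = value.ok_or_else(|| format!("{flag} expects a value"))?;
    raw.parse().map_err(|_| format!("invalid value for {flag}: {raw}"))
}

/// Parse arguments (without the program name) into `CrawlOptions`.
fn parse_args<I: IntoIterator<Item = String>>(args: I) -> Result<CrawlOptions, String> {
    let mut root_url = None;
    // Defaults taken from the help text: 5 tasks, 100 pages.
    let mut max_tasks = 5;
    let mut max_pages = 100;
    // Default worker count: number of available cores, falling back to 1.
    let mut n_workers = available_parallelism().map(|n| n.get()).unwrap_or(1);

    let mut it = args.into_iter();
    while let Some(arg) = it.next() {
        match arg.as_str() {
            "--max-tasks" => max_tasks = parse_value(&arg, it.next())?,
            "--max-pages" => max_pages = parse_value(&arg, it.next())?,
            "--n-workers" => n_workers = parse_value(&arg, it.next())?,
            other if other.starts_with("--") => {
                return Err(format!("unknown option {other}"));
            }
            // Anything else is treated as the positional root url.
            _ => root_url = Some(arg.clone()),
        }
    }

    let root_url = root_url.ok_or_else(|| "missing root url".to_string())?;
    Ok(CrawlOptions { root_url, max_tasks, max_pages, n_workers })
}

fn main() {
    match parse_args(std::env::args().skip(1)) {
        Ok(opts) => println!("{opts:?}"),
        Err(e) => eprintln!("error: {e}"),
    }
}
```

In this sketch, supplying only a root url leaves `max_tasks` at 5 and `max_pages` at 100, matching the defaults shown in the help output.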
## Limitations
* Links with relative URLs are not handled
* `robots.txt` is not handled