Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/wcygan/crawler
web crawler
- Host: GitHub
- URL: https://github.com/wcygan/crawler
- Owner: wcygan
- Created: 2023-02-13T01:57:52.000Z (almost 2 years ago)
- Default Branch: master
- Last Pushed: 2023-09-15T04:44:43.000Z (over 1 year ago)
- Last Synced: 2024-11-13T04:51:06.759Z (3 months ago)
- Topics: crawler, crawling, tokio, tokio-rs, web-crawler
- Language: Rust
- Homepage:
- Size: 124 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: readme.md
README
# Crawler
A web crawler written in Rust.
This crawler creates a web graph by exploring all URLs that it finds.
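As a rough illustration of what "exploring all URLs" can mean, here is a minimal breadth-first sketch using only the standard library. The function name `crawl` and the `links_of` callback (a stand-in for "fetch the page and extract its links") are assumptions made for this example and are not taken from the repository.

```rust
use std::collections::{HashMap, HashSet, VecDeque};

/// Builds a web graph (URL -> outgoing links) by breadth-first exploration.
/// `links_of` is a hypothetical callback that returns the links found on a page.
fn crawl<F>(seed: &str, mut links_of: F) -> HashMap<String, Vec<String>>
where
    F: FnMut(&str) -> Vec<String>,
{
    let mut graph = HashMap::new();
    let mut seen: HashSet<String> = HashSet::new();
    let mut frontier = VecDeque::new();

    seen.insert(seed.to_string());
    frontier.push_back(seed.to_string());

    while let Some(url) = frontier.pop_front() {
        let links = links_of(&url);
        for link in &links {
            // `insert` returns true only for URLs not seen before,
            // so each page is visited at most once.
            if seen.insert(link.clone()) {
                frontier.push_back(link.clone());
            }
        }
        graph.insert(url, links);
    }
    graph
}
```

The `seen` set is what keeps the exploration finite when pages link back to each other.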
## Design
The crawler is split into two parts:
1. The connection pool
2. The parser pool

The crawler will spin up as many connections and parsers as you specify.
The connection pool will handle all HTTP requests, while the parser pool will handle all HTML parsing.
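The split between the two pools can be sketched as a pipeline of workers connected by channels. The sketch below is a simplified, std-only approximation that uses OS threads and `std::sync::mpsc` in place of Tokio tasks, an in-memory map in place of real HTTP requests, and a naive `href` scan in place of a real HTML parser. All names here (`run_pipeline`, `fetch`, `extract_links`) are hypothetical, and the sketch makes a single pass without feeding discovered links back into the queue.

```rust
use std::collections::HashMap;
use std::sync::mpsc;
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical stand-in for an HTTP request: looks the page up in memory.
fn fetch(pages: &HashMap<String, String>, url: &str) -> String {
    pages.get(url).cloned().unwrap_or_default()
}

/// Naive link extraction: collects every `href="..."` value in the body.
fn extract_links(body: &str) -> Vec<String> {
    let mut links = Vec::new();
    let mut rest = body;
    while let Some(start) = rest.find("href=\"") {
        rest = &rest[start + 6..];
        match rest.find('"') {
            Some(end) => {
                links.push(rest[..end].to_string());
                rest = &rest[end + 1..];
            }
            None => break,
        }
    }
    links
}

/// Runs `connections` fetch workers and `parsers` parse workers over `urls`.
fn run_pipeline(
    pages: HashMap<String, String>,
    urls: Vec<String>,
    connections: usize,
    parsers: usize,
) -> HashMap<String, Vec<String>> {
    let pages = Arc::new(pages);
    let (url_tx, url_rx) = mpsc::channel::<String>();
    let (body_tx, body_rx) = mpsc::channel::<(String, String)>();
    let (out_tx, out_rx) = mpsc::channel::<(String, Vec<String>)>();
    // Receivers are shared between the workers of a pool behind a mutex.
    let url_rx = Arc::new(Mutex::new(url_rx));
    let body_rx = Arc::new(Mutex::new(body_rx));
    let mut handles = Vec::new();

    // Connection pool: URL in, (URL, body) out.
    for _ in 0..connections {
        let (pages, url_rx, body_tx) =
            (Arc::clone(&pages), Arc::clone(&url_rx), body_tx.clone());
        handles.push(thread::spawn(move || loop {
            let next = url_rx.lock().unwrap().recv();
            match next {
                Ok(url) => {
                    let body = fetch(&pages, &url);
                    body_tx.send((url, body)).unwrap();
                }
                Err(_) => break, // URL channel closed: no more work
            }
        }));
    }
    drop(body_tx);

    // Parser pool: (URL, body) in, (URL, links) out.
    for _ in 0..parsers {
        let (body_rx, out_tx) = (Arc::clone(&body_rx), out_tx.clone());
        handles.push(thread::spawn(move || loop {
            let next = body_rx.lock().unwrap().recv();
            match next {
                Ok((url, body)) => out_tx.send((url, extract_links(&body))).unwrap(),
                Err(_) => break, // all fetch workers have finished
            }
        }));
    }
    drop(out_tx);

    for url in urls {
        url_tx.send(url).unwrap();
    }
    drop(url_tx); // closing the channel lets the workers drain and exit

    let graph: HashMap<String, Vec<String>> = out_rx.iter().collect();
    for handle in handles {
        handle.join().unwrap();
    }
    graph
}
```

Keeping fetching and parsing in separate pools means the two stages can be sized independently: network-bound work does not have to wait on CPU-bound parsing, and vice versa.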
Requests to the same domain are rate limited to avoid being blocked by the server.
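One way to picture per-domain rate limiting is a table that records, for each host, the earliest time the next request may be sent. The following is a std-only sketch under that assumption. It is not the API of the rate limiter the project uses, and the names `DomainLimiter`, `reserve`, and `host_of` are made up for illustration.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Tracks, per domain, the earliest instant at which the next request may start.
struct DomainLimiter {
    interval: Duration,
    next_allowed: HashMap<String, Instant>,
}

impl DomainLimiter {
    fn new(interval: Duration) -> Self {
        DomainLimiter { interval, next_allowed: HashMap::new() }
    }

    /// Reserves the next free slot for `domain` and returns how long the
    /// caller should wait (relative to `now`) before sending the request.
    fn reserve(&mut self, domain: &str, now: Instant) -> Duration {
        let slot = match self.next_allowed.get(domain) {
            Some(&t) if t > now => t, // domain is still cooling down
            _ => now,                 // unknown domain or slot already passed
        };
        self.next_allowed.insert(domain.to_string(), slot + self.interval);
        slot.duration_since(now)
    }
}

/// Naive host extraction: the text between "://" and the next '/', '?' or '#'.
/// Ports and user info are deliberately not handled in this sketch.
fn host_of(url: &str) -> Option<&str> {
    let rest = url.split_once("://")?.1;
    let host = rest.split(|c: char| c == '/' || c == '?' || c == '#').next()?;
    if host.is_empty() { None } else { Some(host) }
}
```

A worker would call `reserve` with the host of the URL it is about to fetch and sleep for the returned duration. Requests to different domains do not delay one another.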
The URL mapping is stored in an index, which can be written to disk during shutdown.
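A minimal sketch of such an index, assuming a plain tab-separated `source<TAB>target` text format (the actual on-disk format is not described here). A `Mutex<HashMap>` is used as a std-only stand-in for a concurrent map, and the names `Index`, `add_edge`, and `write_to` are hypothetical.

```rust
use std::collections::{BTreeSet, HashMap};
use std::io::{self, Write};
use std::sync::Mutex;

/// Shared URL index: each source URL maps to the set of URLs it links to.
struct Index {
    edges: Mutex<HashMap<String, BTreeSet<String>>>,
}

impl Index {
    fn new() -> Self {
        Index { edges: Mutex::new(HashMap::new()) }
    }

    /// Records a link from `from` to `to`; duplicates are ignored.
    fn add_edge(&self, from: &str, to: &str) {
        self.edges
            .lock()
            .unwrap()
            .entry(from.to_string())
            .or_default()
            .insert(to.to_string());
    }

    /// Writes one "source<TAB>target" line per edge, sorted for stable output.
    fn write_to<W: Write>(&self, mut out: W) -> io::Result<()> {
        let edges = self.edges.lock().unwrap();
        let mut entries: Vec<_> = edges.iter().collect();
        entries.sort_by(|a, b| a.0.cmp(b.0));
        for (from, targets) in entries {
            for to in targets {
                writeln!(out, "{}\t{}", from, to)?;
            }
        }
        out.flush()
    }
}
```

At shutdown, the same method could be handed a `std::io::BufWriter` wrapping a `std::fs::File`. Because `write_to` accepts any `Write` implementation, it can also be tested against an in-memory buffer.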
## Resources
- [Tokio](https://crates.io/crates/tokio) - asynchronous runtime
- [Tokio-utils](https://crates.io/crates/tokio-utils) - rate limiter, graceful shutdown
- [Reqwest](https://crates.io/crates/reqwest/) - HTTP client
- [Dashmap](https://crates.io/crates/dashmap/) - concurrent hash map