Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/mattsse/voyager
crawl and scrape web pages in rust
- Host: GitHub
- URL: https://github.com/mattsse/voyager
- Owner: mattsse
- License: apache-2.0
- Created: 2020-12-22T10:26:48.000Z (almost 4 years ago)
- Default Branch: main
- Last Pushed: 2023-06-20T07:18:00.000Z (over 1 year ago)
- Last Synced: 2024-04-25T01:42:30.113Z (7 months ago)
- Language: Rust
- Homepage:
- Size: 74.2 KB
- Stars: 703
- Watchers: 9
- Forks: 32
- Open Issues: 8
Metadata Files:
- Readme: README.md
- License: LICENSE-APACHE
Awesome Lists containing this project
README
voyager
=========================

[GitHub](https://github.com/mattsse/voyager) | [crates.io](https://crates.io/crates/voyager) | [docs.rs](https://docs.rs/voyager) | [CI](https://github.com/mattsse/voyager/actions?query=branch%3Amain)

With voyager you can easily extract structured data from websites.
Write your own crawler/scraper with voyager following a state machine model.
## Example
The examples use [tokio](https://tokio.rs/) as their runtime, so your `Cargo.toml` could look like this:
```toml
[dependencies]
voyager = { version = "0.1" }
tokio = { version = "1", features = ["full"] }
# `futures` provides the `StreamExt` used to drive the `Collector` stream in the examples
futures = "0.3"
```

### Declare your own Scraper and model
```rust
use voyager::scraper::Selector;
use reqwest::Url; // `url::Url`, re-exported by reqwest

// Declare your scraper, with all the selectors etc.
struct HackernewsScraper {
    post_selector: Selector,
    author_selector: Selector,
    title_selector: Selector,
    comment_selector: Selector,
    max_page: usize,
}

/// The state model
#[derive(Debug)]
enum HackernewsState {
    Page(usize),
    Post,
}

/// The output the scraper should eventually produce
#[derive(Debug)]
struct Entry {
    author: String,
    url: Url,
    link: Option<String>,
    title: String,
}
```

### Implement the `voyager::Scraper` trait
A `Scraper` consists of two associated types:
* `Output`, the type the scraper eventually produces
* `State`, the type the scraper can drag along across several requests that eventually lead to an `Output`

and the `scrape` callback, which is invoked after each received response.
Based on the state attached to the `response`, you can supply the crawler with new urls to visit, with or without a state attached to them.
Scraping is done with [causal-agent/scraper](https://github.com/causal-agent/scraper).
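The setup example further below constructs the scraper with `HackernewsScraper::default()`. One way to provide that is a `Default` impl that builds the selectors up front with `Selector::parse` from the `scraper` crate. This is only a sketch; the CSS selector strings are illustrative assumptions, not the exact Hacker News markup:

```rust
use voyager::scraper::Selector;

impl Default for HackernewsScraper {
    fn default() -> Self {
        // NOTE: the selector strings below are illustrative placeholders.
        Self {
            post_selector: Selector::parse("tr.athing").unwrap(),
            author_selector: Selector::parse("a.hnuser").unwrap(),
            title_selector: Selector::parse("td.title a").unwrap(),
            comment_selector: Selector::parse(".comment").unwrap(),
            max_page: 1,
        }
    }
}
```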
```rust
impl Scraper for HackernewsScraper {
    type Output = Entry;
    type State = HackernewsState;

    /// do your scraping
    fn scrape(
        &mut self,
        response: Response<Self::State>,
        crawler: &mut Crawler<Self>,
    ) -> Result<Option<Self::Output>> {
        let html = response.html();

        if let Some(state) = response.state {
            match state {
                HackernewsState::Page(page) => {
                    // find all entries
                    for id in html
                        .select(&self.post_selector)
                        .filter_map(|el| el.value().attr("id"))
                    {
                        // submit an url to a post
                        crawler.visit_with_state(
                            &format!("https://news.ycombinator.com/item?id={}", id),
                            HackernewsState::Post,
                        );
                    }
                    if page < self.max_page {
                        // queue in next page
                        crawler.visit_with_state(
                            &format!("https://news.ycombinator.com/news?p={}", page + 1),
                            HackernewsState::Page(page + 1),
                        );
                    }
                }
                HackernewsState::Post => {
                    // scrape the entry
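                    // Illustrative sketch only: a field could be pulled out with the
                    // prepared selectors, e.g. (the selector/markup details are assumptions):
                    //
                    //   let title = html
                    //       .select(&self.title_selector)
                    //       .next()
                    //       .map(|el| el.text().collect::<String>());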
                    let entry = Entry {
                        // ...
                    };
                    return Ok(Some(entry));
                }
            }
        }

        Ok(None)
    }
}
```

### Setup and collect all the output
Configure the crawler via `CrawlerConfig`:
* Allow/Block list of Domains
* Delays between requests
* Whether to respect the `robots.txt` rules

Feed your config and an instance of your scraper to the `Collector` that drives the `Crawler` and forwards the responses to your `Scraper`.
```rust
use voyager::scraper::Selector;
use voyager::*;
// `Collector` implements `Stream`, so bring `next()` into scope via `futures`
use futures::StreamExt;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // only fulfill requests to `news.ycombinator.com`
    let config = CrawlerConfig::default().allow_domain_with_delay(
        "news.ycombinator.com",
        // add a delay between requests
        RequestDelay::Fixed(std::time::Duration::from_millis(2_000)),
    );

    let mut collector = Collector::new(HackernewsScraper::default(), config);

    collector.crawler_mut().visit_with_state(
        "https://news.ycombinator.com/news",
        HackernewsState::Page(1),
    );

    while let Some(output) = collector.next().await {
        let post = output?;
        dbg!(post);
    }

    Ok(())
}
```

See [examples](./examples) for more.
### Inject async calls
Sometimes it might be helpful to execute some other calls first, e.g. to get a token.
You can submit `async` closures to the crawler to manually get a response and inject a state, or to drive a state to completion.

```rust
fn scrape(
    &mut self,
    response: Response<Self::State>,
    crawler: &mut Crawler<Self>,
) -> Result<Option<Self::Output>> {
    // inject your custom crawl function that produces a `reqwest::Response` and
    // `Self::State` which will get passed to `scrape` when resolved.
    crawler.crawl(move |client| async move {
        let state = response.state;
        let auth = client.post("some auth end point").send().await?.json().await?;
        // do other async tasks etc..
        let new_resp = client.get("the next html page").send().await?;
        Ok((new_resp, state))
    });

    // submit a crawling job that completes to `Self::Output` directly
    crawler.complete(move |client| async move {
        // do other async tasks to create a `Self::Output` instance
        let output = Self::Output { /* .. */ };
        Ok(Some(output))
    });

    Ok(None)
}
```

### Recover a state that got lost
If the crawler encounters an error, e.g. due to a failed or disallowed HTTP request, it is reported as a `CrawlError`, which carries the last valid state. The error can then be downcast.
```rust
let mut collector = Collector::new(HackernewsScraper::default(), config);

while let Some(output) = collector.next().await {
    match output {
        Ok(post) => { /* .. */ }
        Err(err) => {
            // recover the state by downcasting the error
            if let Ok(err) = err.downcast::<CrawlError<<HackernewsScraper as Scraper>::State>>() {
                let last_state = err.state();
            }
        }
    }
}
```

Licensed under either of these:
* Apache License, Version 2.0, ([LICENSE-APACHE](LICENSE-APACHE) or
https://www.apache.org/licenses/LICENSE-2.0)
* MIT license ([LICENSE-MIT](LICENSE-MIT) or
https://opensource.org/licenses/MIT)