voyager
=========================

[github](https://github.com/mattsse/voyager)
[crates.io](https://crates.io/crates/voyager)
[docs.rs](https://docs.rs/voyager)
[build status](https://github.com/mattsse/voyager/actions?query=branch%3Amain)

With voyager you can easily extract structured data from websites.

Write your own crawler/scraper with voyager, following a state machine model.

## Example

The examples use [tokio](https://tokio.rs/) as their runtime, so your `Cargo.toml` could look like this:

```toml
[dependencies]
voyager = { version = "0.1" }
tokio = { version = "1", features = ["full"] }
# provides `StreamExt` for polling the `Collector` stream in the examples below
futures = "0.3"
```

### Declare your own Scraper and model

```rust
use voyager::scraper::Selector;
// `Url` is the `url` crate's type; `reqwest` re-exports it
use reqwest::Url;

// Declare your scraper, with all the selectors etc.
struct HackernewsScraper {
    post_selector: Selector,
    author_selector: Selector,
    title_selector: Selector,
    comment_selector: Selector,
    max_page: usize,
}

/// The state model
#[derive(Debug)]
enum HackernewsState {
    Page(usize),
    Post,
}

/// The output the scraper should eventually produce
#[derive(Debug)]
struct Entry {
    author: String,
    url: Url,
    link: Option<String>,
    title: String,
}
```
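
The `Selector` fields are built with `Selector::parse` from the re-exported [scraper](https://github.com/causal-agent/scraper) crate. The example further below relies on `HackernewsScraper::default()`, so here is a minimal sketch of a `Default` impl; the CSS selector strings are illustrative assumptions, not taken from the upstream example:

```rust
impl Default for HackernewsScraper {
    fn default() -> Self {
        Self {
            // CSS selectors are assumptions; adjust them to the actual markup
            post_selector: Selector::parse("tr.athing").unwrap(),
            author_selector: Selector::parse("a.hnuser").unwrap(),
            title_selector: Selector::parse("td.title a").unwrap(),
            comment_selector: Selector::parse(".comment").unwrap(),
            max_page: 1,
        }
    }
}
```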

### Implement the `voyager::Scraper` trait

A `Scraper` consists of two associated types:

* `Output`, the type the scraper eventually produces
* `State`, a type the scraper can attach to requests and carry along across the several requests that eventually lead to an `Output`

and the `scrape` callback, which is invoked after each received response.

Based on the state attached to the `response`, you can supply the crawler with new URLs to visit, with or without a state attached to them.
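
For example, inside `scrape` the crawler can be fed either way (a small sketch; `visit` is assumed here as the stateless counterpart of the `visit_with_state` call used below):

```rust
// enqueue a URL without any state attached (method name assumed)
crawler.visit("https://news.ycombinator.com/news");

// enqueue a URL with a state that is handed back together with its response
crawler.visit_with_state(
    "https://news.ycombinator.com/newest",
    HackernewsState::Page(1),
);
```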

Scraping is done with [causal-agent/scraper](https://github.com/causal-agent/scraper).

```rust
impl Scraper for HackernewsScraper {
    type Output = Entry;
    type State = HackernewsState;

    /// do your scraping
    fn scrape(
        &mut self,
        response: Response<Self::State>,
        crawler: &mut Crawler<Self>,
    ) -> Result<Option<Self::Output>> {
        let html = response.html();

        if let Some(state) = response.state {
            match state {
                HackernewsState::Page(page) => {
                    // find all entries
                    for id in html
                        .select(&self.post_selector)
                        .filter_map(|el| el.value().attr("id"))
                    {
                        // submit a url to a post
                        crawler.visit_with_state(
                            &format!("https://news.ycombinator.com/item?id={}", id),
                            HackernewsState::Post,
                        );
                    }
                    if page < self.max_page {
                        // queue in next page
                        crawler.visit_with_state(
                            &format!("https://news.ycombinator.com/news?p={}", page + 1),
                            HackernewsState::Page(page + 1),
                        );
                    }
                }

                HackernewsState::Post => {
                    // scrape the entry
                    let entry = Entry {
                        // ...
                    };
                    return Ok(Some(entry));
                }
            }
        }

        Ok(None)
    }
}
```
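
The `Entry` construction in the `Post` branch is elided above. A hedged sketch of how those fields might be extracted with the `scraper` element API (`select`, `text`, `attr`); the selector usage and the `response_url` field are assumptions, not the upstream implementation:

```rust
HackernewsState::Post => {
    // pick the first match for each selector and collect its text
    let author = html
        .select(&self.author_selector)
        .next()
        .map(|el| el.text().collect::<String>())
        .unwrap_or_default();
    let title = html
        .select(&self.title_selector)
        .next()
        .map(|el| el.text().collect::<String>())
        .unwrap_or_default();
    // the external link, if the post has one
    let link = html
        .select(&self.title_selector)
        .next()
        .and_then(|el| el.value().attr("href"))
        .map(str::to_string);

    return Ok(Some(Entry {
        author,
        // assumption: the fetched page's final URL is exposed on the response;
        // check the `Response` docs for the exact field name
        url: response.response_url.clone(),
        link,
        title,
    }));
}
```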

### Setup and collect all the output

Configure the crawler via `CrawlerConfig` (see the sketch after this list):

* Allow/block list of domains
* Delays between requests
* Whether to respect the `robots.txt` rules
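
A minimal configuration sketch. `allow_domain_with_delay` and `RequestDelay::Fixed` are taken from the example below; the `respect_robots_txt` builder name is an assumption, so check the `CrawlerConfig` docs on [docs.rs](https://docs.rs/voyager) for the exact API:

```rust
use std::time::Duration;
use voyager::{CrawlerConfig, RequestDelay};

let config = CrawlerConfig::default()
    // only crawl this domain and wait 500ms between requests to it
    .allow_domain_with_delay(
        "news.ycombinator.com",
        RequestDelay::Fixed(Duration::from_millis(500)),
    )
    // honor robots.txt rules (method name assumed; verify against the docs)
    .respect_robots_txt();
```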

Feed your config and an instance of your scraper to the `Collector` that drives the `Crawler` and forwards the responses to your `Scraper`.

```rust
use voyager::scraper::Selector;
use voyager::*;
// `Collector` is a `Stream`; with tokio 1.x the stream utilities live in
// the `futures` (or `tokio-stream`) crate, not in `tokio::stream`
use futures::StreamExt;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // only fulfill requests to `news.ycombinator.com`
    let config = CrawlerConfig::default().allow_domain_with_delay(
        "news.ycombinator.com",
        // add a delay between requests
        RequestDelay::Fixed(std::time::Duration::from_millis(2_000)),
    );

    let mut collector = Collector::new(HackernewsScraper::default(), config);

    collector.crawler_mut().visit_with_state(
        "https://news.ycombinator.com/news",
        HackernewsState::Page(1),
    );

    while let Some(output) = collector.next().await {
        let post = output?;
        dbg!(post);
    }

    Ok(())
}
```

See [examples](./examples) for more.

### Inject async calls

Sometimes it might be helpful to execute other calls first, e.g. to fetch an auth token.
You can submit `async` closures to the crawler to manually get a response and inject a state, or to drive a state to completion:

```rust
fn scrape(
    &mut self,
    response: Response<Self::State>,
    crawler: &mut Crawler<Self>,
) -> Result<Option<Self::Output>> {
    // inject your custom crawl function that produces a `reqwest::Response` and
    // a `Self::State`, which will get passed to `scrape` when resolved
    crawler.crawl(move |client| async move {
        let state = response.state;
        let auth = client.post("some auth end point").send().await?.json().await?;
        // do other async tasks etc..
        let new_resp = client.get("the next html page").send().await?;
        Ok((new_resp, state))
    });

    // submit a crawling job that completes to `Self::Output` directly
    crawler.complete(move |client| async move {
        // do other async tasks to create a `Self::Output` instance
        let output = Self::Output { /* .. */ };
        Ok(Some(output))
    });

    Ok(None)
}
```

### Recover a state that got lost

If the crawler encounters an error, e.g. due to a failed or disallowed HTTP request, it is reported as a `CrawlError`, which carries the last valid state. The error can then be downcast:

```rust
let mut collector = Collector::new(HackernewsScraper::default(), config);

while let Some(output) = collector.next().await {
    match output {
        Ok(post) => { /* .. */ }
        Err(err) => {
            // recover the state by downcasting the error
            if let Ok(err) = err.downcast::<CrawlError<<HackernewsScraper as Scraper>::State>>() {
                let last_state = err.state();
            }
        }
    }
}
```

Licensed under either of these:

* Apache License, Version 2.0, ([LICENSE-APACHE](LICENSE-APACHE) or
https://www.apache.org/licenses/LICENSE-2.0)
* MIT license ([LICENSE-MIT](LICENSE-MIT) or
https://opensource.org/licenses/MIT)