Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/mattsse/voyager
crawl and scrape web pages in rust
- Host: GitHub
- URL: https://github.com/mattsse/voyager
- Owner: mattsse
- License: apache-2.0
- Created: 2020-12-22T10:26:48.000Z (almost 4 years ago)
- Default Branch: main
- Last Pushed: 2023-06-20T07:18:00.000Z (over 1 year ago)
- Last Synced: 2024-04-25T01:42:30.113Z (7 months ago)
- Language: Rust
- Homepage:
- Size: 74.2 KB
- Stars: 703
- Watchers: 9
- Forks: 32
- Open Issues: 8
Metadata Files:
- Readme: README.md
- License: LICENSE-APACHE
Awesome Lists containing this project
README
voyager
=========================

[GitHub](https://github.com/mattsse/voyager) | [crates.io](https://crates.io/crates/voyager) | [docs.rs](https://docs.rs/voyager) | [CI](https://github.com/mattsse/voyager/actions?query=branch%3Amain)

With voyager you can easily extract structured data from websites.
Write your own crawler/scraper with voyager following a state machine model.
## Example
The examples use [tokio](https://tokio.rs/) as their runtime, so your `Cargo.toml` could look like this:
```toml
[dependencies]
voyager = { version = "0.1" }
tokio = { version = "1", features = ["full"] }
# `futures` provides the `StreamExt` used to drive the `Collector` stream in the examples
futures = "0.3"
```

### Declare your own Scraper and model
```rust
use voyager::scraper::Selector;
use reqwest::Url; // `url::Url`, re-exported by reqwest

// Declare your scraper, with all the selectors etc.
struct HackernewsScraper {
    post_selector: Selector,
    author_selector: Selector,
    title_selector: Selector,
    comment_selector: Selector,
    max_page: usize,
}

/// The state model
#[derive(Debug)]
enum HackernewsState {
    Page(usize),
    Post,
}

/// The output the scraper should eventually produce
#[derive(Debug)]
struct Entry {
    author: String,
    url: Url,
    link: Option<String>,
    title: String,
}
```

### Implement the `voyager::Scraper` trait
A `Scraper` consists of two associated types:
* `Output`, the type the scraper eventually produces
* `State`, the type the scraper can drag along across several requests that eventually lead to an `Output`

and the `scrape` callback, which is invoked after each received response.
Based on the state attached to the `response`, you can supply the crawler with new urls to visit, with or without a state attached to them.
Scraping is done with [causal-agent/scraper](https://github.com/causal-agent/scraper).
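The setup example further below constructs the scraper with `HackernewsScraper::default()`. One way to provide that is a `Default` impl that builds the selectors up front with `Selector::parse` from the `scraper` crate. This is only a sketch; the CSS selector strings are illustrative assumptions, not the exact Hacker News markup:

```rust
use voyager::scraper::Selector;

impl Default for HackernewsScraper {
    fn default() -> Self {
        // NOTE: the selector strings below are illustrative placeholders.
        Self {
            post_selector: Selector::parse("tr.athing").unwrap(),
            author_selector: Selector::parse("a.hnuser").unwrap(),
            title_selector: Selector::parse("td.title a").unwrap(),
            comment_selector: Selector::parse(".comment").unwrap(),
            max_page: 1,
        }
    }
}
```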
```rust
impl Scraper for HackernewsScraper {
    type Output = Entry;
    type State = HackernewsState;

    /// do your scraping
    fn scrape(
        &mut self,
        response: Response<Self::State>,
        crawler: &mut Crawler<Self>,
    ) -> Result<Option<Self::Output>> {
        let html = response.html();

        if let Some(state) = response.state {
            match state {
                HackernewsState::Page(page) => {
                    // find all entries
                    for id in html
                        .select(&self.post_selector)
                        .filter_map(|el| el.value().attr("id"))
                    {
                        // submit an url to a post
                        crawler.visit_with_state(
                            &format!("https://news.ycombinator.com/item?id={}", id),
                            HackernewsState::Post,
                        );
                    }
                    if page < self.max_page {
                        // queue in next page
                        crawler.visit_with_state(
                            &format!("https://news.ycombinator.com/news?p={}", page + 1),
                            HackernewsState::Page(page + 1),
                        );
                    }
                }
                HackernewsState::Post => {
                    // scrape the entry
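                    // Illustrative sketch only: a field could be pulled out with the
                    // prepared selectors, e.g. (the selector/markup details are assumptions):
                    //
                    //   let title = html
                    //       .select(&self.title_selector)
                    //       .next()
                    //       .map(|el| el.text().collect::<String>());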
                    let entry = Entry {
                        // ...
                    };
                    return Ok(Some(entry));
                }
            }
        }

        Ok(None)
    }
}
```

### Setup and collect all the output
Configure the crawler via `CrawlerConfig`:
* Allow/Block list of Domains
* Delays between requests
* Whether to respect the `robots.txt` rules

Feed your config and an instance of your scraper to the `Collector` that drives the `Crawler` and forwards the responses to your `Scraper`.
```rust
use voyager::scraper::Selector;
use voyager::*;
// `Collector` implements `Stream`, so bring `next()` into scope via `futures`
use futures::StreamExt;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // only fulfill requests to `news.ycombinator.com`
    let config = CrawlerConfig::default().allow_domain_with_delay(
        "news.ycombinator.com",
        // add a delay between requests
        RequestDelay::Fixed(std::time::Duration::from_millis(2_000)),
    );

    let mut collector = Collector::new(HackernewsScraper::default(), config);

    collector.crawler_mut().visit_with_state(
        "https://news.ycombinator.com/news",
        HackernewsState::Page(1),
    );

    while let Some(output) = collector.next().await {
        let post = output?;
        dbg!(post);
    }

    Ok(())
}
```

See [examples](./examples) for more.
### Inject async calls
Sometimes it might be helpful to execute some other calls first, e.g. to get a token.
You can submit `async` closures to the crawler to manually get a response and inject a state, or to drive a state to completion.

```rust
fn scrape(
    &mut self,
    response: Response<Self::State>,
    crawler: &mut Crawler<Self>,
) -> Result<Option<Self::Output>> {
    // inject your custom crawl function that produces a `reqwest::Response` and
    // `Self::State` which will get passed to `scrape` when resolved.
    crawler.crawl(move |client| async move {
        let state = response.state;
        let auth = client.post("some auth end point").send().await?.json().await?;
        // do other async tasks etc..
        let new_resp = client.get("the next html page").send().await?;
        Ok((new_resp, state))
    });

    // submit a crawling job that completes to `Self::Output` directly
    crawler.complete(move |client| async move {
        // do other async tasks to create a `Self::Output` instance
        let output = Self::Output { /* .. */ };
        Ok(Some(output))
    });

    Ok(None)
}
```

### Recover a state that got lost
If the crawler encounters an error, e.g. due to a failed or disallowed HTTP request, it is reported as a `CrawlError`, which carries the last valid state. The error can then be downcast.
```rust
let mut collector = Collector::new(HackernewsScraper::default(), config);

while let Some(output) = collector.next().await {
    match output {
        Ok(post) => { /* .. */ }
        Err(err) => {
            // recover the state by downcasting the error
            if let Ok(err) = err.downcast::<CrawlError<<HackernewsScraper as Scraper>::State>>() {
                let last_state = err.state();
            }
        }
    }
}
```

Licensed under either of these:
* Apache License, Version 2.0, ([LICENSE-APACHE](LICENSE-APACHE) or
https://www.apache.org/licenses/LICENSE-2.0)
* MIT license ([LICENSE-MIT](LICENSE-MIT) or
https://opensource.org/licenses/MIT)