https://github.com/holmofy/reqwest-scraper
web scraping integration with reqwest
proc-macro proc-macro-derive reqwest rust scraper
- Host: GitHub
- URL: https://github.com/holmofy/reqwest-scraper
- Owner: holmofy
- License: mit
- Created: 2024-07-08T10:48:28.000Z (about 1 year ago)
- Default Branch: master
- Last Pushed: 2024-10-27T14:21:19.000Z (11 months ago)
- Last Synced: 2025-02-28T00:51:17.739Z (7 months ago)
- Topics: proc-macro, proc-macro-derive, reqwest, rust, scraper
- Language: Rust
- Homepage: https://docs.rs/reqwest-scraper/
- Size: 135 KB
- Stars: 4
- Watchers: 2
- Forks: 3
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
## reqwest-scraper - Web scraping integration with reqwest
[crates.io](https://crates.io/crates/reqwest-scraper) | [docs.rs](https://docs.rs/reqwest-scraper) | [Publish workflow](https://github.com/holmofy/reqwest-scraper/actions?query=workflow%3APublish)

Extends [reqwest](https://github.com/seanmonstar/reqwest) to support multiple web scraping methods.
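
The examples in this README call methods such as `.jsonpath()`, `.css_selector()` and `.xpath()` on a reqwest response after importing `ScraperResponse`, which suggests the crate follows Rust's extension-trait pattern. The following is a minimal, std-only sketch of that pattern, not the crate's actual code: `Response` and `ScraperLike` are hypothetical stand-ins, and the method is synchronous for brevity, whereas the real methods are awaited.

```rust
// Hypothetical stand-in for an HTTP response; the real crate works on
// `reqwest::Response`.
struct Response {
    body: String,
}

// Hypothetical extension trait, loosely modelled on how `ScraperResponse`
// is used in this README. It is not part of reqwest or reqwest-scraper.
trait ScraperLike {
    /// Returns the trimmed text between the first `<title>` and `</title>`.
    fn title(&self) -> Option<&str>;
}

impl ScraperLike for Response {
    fn title(&self) -> Option<&str> {
        let start = self.body.find("<title>")? + "<title>".len();
        let end = self.body[start..].find("</title>")? + start;
        Some(self.body[start..end].trim())
    }
}

fn main() {
    let resp = Response {
        body: "<html><title> Hello </title></html>".to_string(),
    };
    // `title()` is callable only because the `ScraperLike` trait is in scope,
    // just as `.jsonpath()` needs `use reqwest_scraper::ScraperResponse;`.
    assert_eq!(resp.title(), Some("Hello"));
    println!("{:?}", resp.title());
}
```

This is why every example starts with the `use reqwest_scraper::ScraperResponse;` line: without the trait in scope, the compiler does not see the added methods.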
### Features
* [x] Use [JsonPath](#jsonpath) to select fields in a JSON response
* [x] Select elements in an HTML response using a [CSS selector](#css-selector)
* [x] Evaluate a value in an HTML response using an [XPath expression](#xpath)
* [x] [Derive macro extract](#macros)

### Start Guide
* add dependency
```toml
reqwest = { version = "0.12", features = ["json"] }
reqwest-scraper="0.3.2"
```
* use ScraperResponse
```rust
use reqwest_scraper::ScraperResponse;
```

### JsonPath

* `Json::select<T: DeserializeOwned>(path: &str) -> Result<Vec<T>>`
* `Json::select_one<T: DeserializeOwned>(path: &str) -> Result<T>`
* `Json::select_as_str(path: &str) -> Result<String>`

[**example**](./examples/json.rs):
```rust
use reqwest_scraper::ScraperResponse;

pub async fn request() -> Result<()> {
    let json = reqwest::Client::builder()
        .build()?
        .get("https://api.github.com/search/repositories?q=rust")
        .header("User-Agent", "Rust Reqwest")
        .send()
        .await?
        .jsonpath()
        .await?;

    let total_count = json.select_as_str("$.total_count")?;
    let names: Vec<String> = json.select("$.items[*].full_name")?;

    println!("{}", total_count);
    println!("{}", names.join("\t"));

    Ok(())
}
```

### CSS selector

* `Html::select(selector: &str) -> Result<Selectable>`
* `Selectable::iter() -> impl Iterator<SelectItem>`
* `Selectable::first() -> Option<SelectItem>`
* `SelectItem::name() -> &str`
* `SelectItem::id() -> Option<&str>`
* `SelectItem::has_class(class: &str, case_sensitive: CaseSensitivity) -> bool`
* `SelectItem::classes() -> Classes`
* `SelectItem::attrs() -> Attrs`
* `SelectItem::attr(attr: &str) -> Option<&str>`
* `SelectItem::text() -> String`
* `SelectItem::html() -> String`
* `SelectItem::inner_html() -> String`
* `SelectItem::children() -> impl Iterator<SelectItem>`
* `SelectItem::find(selector: &str) -> Result<Selectable>`

[**example**](./examples/html.rs):
```rust
use reqwest_scraper::ScraperResponse;

async fn request() -> Result<()> {
    let html = reqwest::get("https://github.com/holmofy")
        .await?
        .css_selector()
        .await?;

    assert_eq!(
        html.select(".p-name")?.iter().nth(0).unwrap().text().trim(),
        "holmofy"
    );

    let select_result = html.select(".vcard-details > li.vcard-detail")?;
    for detail_item in select_result.iter() {
        println!("{}", detail_item.attr("aria-label").unwrap())
    }

    Ok(())
}
```

### XPath

* `XHtml::select(xpath: &str) -> Result<XPathResult>`
* `XPathResult::as_nodes() -> Vec<Node>`
* `XPathResult::as_strs() -> Vec<String>`
* `XPathResult::as_node() -> Option<Node>`
* `XPathResult::as_str() -> Option<String>`
* `Node::name() -> String`
* `Node::id() -> Option<String>`
* `Node::classes() -> HashSet<String>`
* `Node::attr(attr: &str) -> Option<String>`
* `Node::has_attr(attr: &str) -> bool`
* `Node::text() -> String`
* TODO: `Node::html() -> String`
* TODO: `Node::inner_html() -> String`
* `Node::children() -> Vec<Node>`
* `Node::findnodes(relative_xpath: &str) -> Result<Vec<Node>>`
* `Node::findvalues(relative_xpath: &str) -> Result<Vec<String>>`
* `Node::findnode(relative_xpath: &str) -> Result<Option<Node>>`
* `Node::findvalue(relative_xpath: &str) -> Result<Option<String>>`

[**example**](./examples/xpath.rs):
```rust
use reqwest_scraper::ScraperResponse;

async fn request() -> Result<()> {
    let html = reqwest::get("https://github.com/holmofy")
        .await?
        .xpath()
        .await?;

    // simple extract element
    let name = html
        .select("//span[contains(@class,'p-name')]")?
        .as_node()
        .unwrap()
        .text();
    println!("{}", name);
    assert_eq!(name.trim(), "holmofy");

    // iterate elements
    let select_result = html
        .select("//ul[contains(@class,'vcard-details')]/li[contains(@class,'vcard-detail')]")?
        .as_nodes();

    println!("{}", select_result.len());
    for item in select_result.into_iter() {
        let attr = item.attr("aria-label").unwrap_or_else(|| "".into());
        println!("{}", attr);
        println!("{}", item.text());
    }

    // attribute extract
    let select_result = html
        .select("//ul[contains(@class,'vcard-details')]/li[contains(@class,'vcard-detail')]/@aria-label")?
        .as_strs();

    println!("{}", select_result.len());
    select_result.into_iter().for_each(|s| println!("{}", s));

    Ok(())
}
```

### <a id="macros"></a>Derive macro extract

**Use `FromCssSelector` and `selector` to extract HTML elements into a struct**
```rust
// define struct and derive the FromCssSelector trait
#[derive(Debug, FromCssSelector)]
#[selector(path = "#user-repositories-list > ul > li")]
struct Repo {
    #[selector(path = "a[itemprop~='name']", default = "", text)]
    name: String,

    #[selector(path = "span[itemprop~='programmingLanguage']", text)]
    program_lang: Option<String>,

    #[selector(path = "div.topics-row-container>a", text)]
    topics: Vec<String>,
}

// request
let html = reqwest::get("https://github.com/holmofy?tab=repositories")
    .await?
    .css_selector()
    .await?;

// Use the generated `from_html` method to extract data into the struct
let items = Repo::from_html(html)?;
items.iter().for_each(|item| println!("{:?}", item));
```

**Use `FromXPath` and `xpath` to extract HTML elements into a struct**
```rust
// define struct and derive the FromXPath trait
#[derive(Debug, FromXPath)]
#[xpath(path = "//div[@id='user-repositories-list']/ul/li")]
struct Repo {
    #[xpath(path = ".//a[contains(@itemprop,'name')]/text()", default = "")]
    name: String,

    #[xpath(path = ".//span[contains(@itemprop,'programmingLanguage')]/text()")]
    program_lang: Option<String>,

    #[xpath(path = ".//div[contains(@class,'topics-row-container')]/a/text()")]
    topics: Vec<String>,
}

// request
let html = reqwest::get("https://github.com/holmofy?tab=repositories")
    .await?
    .xpath()
    .await?;

// Use the generated `from_xhtml` method to extract data into the struct
let items = Repo::from_xhtml(html)?;
items.iter().for_each(|item| println!("{:?}", item));
```

## Related Projects
* [reqwest](https://github.com/seanmonstar/reqwest)
* [scraper](https://github.com/causal-agent/scraper)
* [nipper](https://github.com/importcjj/nipper)
* [jsonpath_lib](https://github.com/freestrings/jsonpath)
* [unhtml.rs](https://github.com/Hexilee/unhtml.rs)
* [xpath-scraper](https://github.com/Its-its/xpath-scraper)