https://github.com/orottier/webpage-rs
Small Rust library to fetch info about a web page: title, description, language, HTTP info, RSS feeds, Opengraph, Schema.org, and more
html html-parser json-ld opengraph rust
- Host: GitHub
- URL: https://github.com/orottier/webpage-rs
- Owner: orottier
- Created: 2018-06-24T16:24:59.000Z (over 6 years ago)
- Default Branch: master
- Last Pushed: 2024-09-16T20:37:27.000Z (about 2 months ago)
- Last Synced: 2024-10-05T17:34:54.325Z (about 1 month ago)
- Topics: html, html-parser, json-ld, opengraph, rust
- Language: Rust
- Homepage: https://docs.rs/webpage
- Size: 62.5 KB
- Stars: 53
- Watchers: 4
- Forks: 11
- Open Issues: 4
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
# Webpage.rs
[![crates.io](https://img.shields.io/crates/v/webpage.svg)](https://crates.io/crates/webpage)
[![docs.rs](https://img.shields.io/docsrs/webpage)](https://docs.rs/webpage)

_Small library to fetch info about a web page: title, description, language,
HTTP info, links, RSS feeds, Opengraph, Schema.org, and more_

## Usage
```rust
use webpage::{Webpage, WebpageOptions};

let info = Webpage::from_url("http://www.rust-lang.org/en-US/", WebpageOptions::default())
    .expect("Could not read from URL");

// the HTTP transfer info
let http = info.http;

assert_eq!(http.ip, "54.192.129.71".to_string());
assert!(http.headers[0].starts_with("HTTP"));
assert!(http.body.starts_with("<!DOCTYPE html>"));
assert_eq!(http.url, "https://www.rust-lang.org/en-US/".to_string()); // followed redirects (HTTPS)
assert_eq!(http.content_type, "text/html".to_string());

// the parsed HTML info
let html = info.html;

assert_eq!(html.title, Some("The Rust Programming Language".to_string()));
assert_eq!(html.description, Some("A systems programming language that runs blazingly fast, prevents segfaults, and guarantees thread safety.".to_string()));
assert_eq!(html.opengraph.og_type, "website".to_string());
```

You can also get HTML info about local data:
```rust
use webpage::HTML;
let html = HTML::from_file("index.html", None);
// or let html = HTML::from_string(input, None);
```

## Features
### Serialization
If you need to serialize the data provided by the library using
[serde](https://serde.rs/), specify the `serde` *feature* when
declaring your dependencies in `Cargo.toml`:

```toml
webpage = { version = "2.0", features = ["serde"] }
```

### No curl dependency
The `curl` feature is enabled by default but is optional. Disabling it (by
setting `default-features = false` on the dependency in `Cargo.toml`) is useful
if you do not need an HTTP client and already have the HTML data at hand.

## All fields
```rust
pub struct Webpage {
    pub http: HTTP, // info about the HTTP transfer
    pub html: HTML, // info from the parsed HTML doc
}

pub struct HTTP {
    pub ip: String,
    pub transfer_time: Duration,
    pub redirect_count: u32,
    pub content_type: String,
    pub response_code: u32,
    pub headers: Vec<String>, // raw headers from final request
    pub url: String, // effective url
    pub body: String,
}

pub struct HTML {
    pub title: Option<String>,
    pub description: Option<String>,

    pub url: Option<String>, // canonical url
    pub feed: Option<String>, // RSS feed typically

    pub language: Option<String>, // as specified, not detected
    pub text_content: String, // all tags stripped from body
    pub links: Vec<Link>, // all links in the document

    pub meta: HashMap<String, String>, // flattened down list of meta properties
    pub opengraph: Opengraph,
    pub schema_org: Vec<SchemaOrg>,
}

pub struct Link {
    pub url: String, // resolved url of the link
    pub text: String, // anchor text
}

pub struct Opengraph {
    pub og_type: String,
    pub properties: HashMap<String, String>,

    pub images: Vec<OpengraphObject>,
    pub videos: Vec<OpengraphObject>,
    pub audios: Vec<OpengraphObject>,
}

// Facebook's Opengraph structured data
pub struct OpengraphObject {
    pub url: String,
    pub properties: HashMap<String, String>,
}

// Google's schema.org structured data
pub struct SchemaOrg {
    pub schema_type: String,
    pub value: serde_json::Value,
}
```

## Options
The following HTTP configurations are available:
```rust
pub struct WebpageOptions {
    allow_insecure: false,
    follow_location: true,
    max_redirections: 5,
    timeout: Duration::from_secs(10),
    useragent: "Webpage - Rust crate - https://crates.io/crates/webpage".to_string(),
    headers: vec!["X-My-Header: 1234".to_string()],
}

// usage
let mut options = WebpageOptions::default();
options.allow_insecure = true;
let info = Webpage::from_url(&url, options).expect("Halp, could not fetch");
```
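
Both the `headers` option above and the `headers` field of `HTTP` carry raw header strings. The sketch below shows one way to build and read such strings using only the standard library. It is not part of the `webpage` API: `header_lines` and `parse_headers` are hypothetical helper names, and the code assumes each header has the form `Name: value`, with a response list possibly starting with a status line such as `HTTP/1.1 200 OK`.

```rust
use std::collections::HashMap;

// Hypothetical helper: true if the string contains CR or LF.
fn has_line_break(s: &str) -> bool {
    s.contains(|c: char| c == '\r' || c == '\n')
}

/// Hypothetical helper: build "Name: value" strings from (name, value) pairs.
/// Rejects empty names, names containing ':', and any CR/LF, since a line
/// break inside a header could smuggle in an extra header line.
pub fn header_lines(pairs: &[(&str, &str)]) -> Result<Vec<String>, String> {
    pairs
        .iter()
        .map(|&(name, value)| {
            if name.is_empty() || name.contains(':') {
                return Err(format!("invalid header name: {:?}", name));
            }
            if has_line_break(name) || has_line_break(value) {
                return Err(format!("line break in header {:?}", name));
            }
            Ok(format!("{}: {}", name, value))
        })
        .collect()
}

/// Hypothetical helper: split raw header lines into a name -> value map.
/// Assumption: lines without a colon (such as a status line) are skipped.
/// Names are lowercased because HTTP header names are case-insensitive;
/// if a name repeats, the last value wins.
pub fn parse_headers(raw: &[String]) -> HashMap<String, String> {
    let mut map = HashMap::new();
    for line in raw {
        if let Some((name, value)) = line.split_once(':') {
            map.insert(name.trim().to_ascii_lowercase(), value.trim().to_string());
        }
    }
    map
}
```

With these helpers, the result of `header_lines(&[("X-My-Header", "1234")])` could be assigned to `options.headers`, and `parse_headers(&info.http.headers)` would allow lookups such as `get("content-type")`. Keeping only the last value for a repeated name is a simplification; headers like `Set-Cookie` can legitimately appear more than once.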