Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/Hexilee/unhtml.rs
A magic html parser
https://github.com/Hexilee/unhtml.rs
html-parser rust rust-crate scraper
Last synced: 3 months ago
JSON representation
A magic html parser
- Host: GitHub
- URL: https://github.com/Hexilee/unhtml.rs
- Owner: Hexilee
- License: mit
- Created: 2018-11-26T18:19:48.000Z (almost 6 years ago)
- Default Branch: master
- Last Pushed: 2022-07-15T03:33:11.000Z (over 2 years ago)
- Last Synced: 2024-07-18T10:02:11.520Z (4 months ago)
- Topics: html-parser, rust, rust-crate, scraper
- Language: Rust
- Homepage:
- Size: 163 KB
- Stars: 194
- Watchers: 5
- Forks: 6
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
A magic html parser.
[![Stable Test](https://github.com/Hexilee/unhtml.rs/workflows/Stable%20Test/badge.svg)](https://github.com/Hexilee/unhtml.rs/actions)
[![Rust Docs](https://docs.rs/unhtml/badge.svg)](https://docs.rs/unhtml)
[![Crate version](https://img.shields.io/crates/v/unhtml.svg)](https://crates.io/crates/unhtml)
[![Download](https://img.shields.io/crates/d/unhtml.svg)](https://crates.io/crates/unhtml)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://github.com/Hexilee/unhtml.rs/blob/master/LICENSE)
----------------Table of Contents
=================* [Derive Target](#derive-target)
* [Basic Usage](#basic-usage)
* [Attributes](#attributes)
* [html](#html)
* [target](#target)
* [specification](#specification)
* [selector](#selector)
* [target](#target-1)
* [literal type](#literal-type)
* [specification](#specification-1)
* [default behavior](#default-behavior)
* [attr](#attr)
* [target](#target-2)
* [literal type](#literal-type-1)
* [specification](#specification-2)
* [default behavior](#default-behavior-1)
* [default](#default)
* [target](#target-3)
* [literal type](#literal-type-2)
* [specification](#specification-3)
* [default behavior](#default-behavior-2)
* [Field Type](#field-type)
* [any type implemented FromHtml, without generics](#any-type-implemented-fromhtml-without-generics)
* [Vec](#vec)
* [Source HTML](#source-html)
* [with top selector](#with-top-selector)
* [without top selector](#without-top-selector)### Derive Target
`struct`
### Basic Usage
```rust
#[macro_use]
extern crate unhtml_derive;
extern crate unhtml;
use unhtml::{self, FromHtml};#[derive(FromHtml)]
#[html(selector = "#test")]
struct SingleUser {
#[html(selector = "p:nth-child(1)", attr = "inner")]
name: String,#[html(selector = "p:nth-child(2)", attr = "inner")]
age: u8,#[html(selector = "p:nth-child(3)", attr = "inner")]
like_lemon: bool,
}let user = SingleUser::from_html(r#"
Title
Hexilee
20
true
"#).unwrap();
assert_eq!("Hexilee", &user.name);
assert_eq!(20, user.age);
assert!(user.like_lemon);
```### Attributes
#### html##### target
`derive target` or `field`
##### specification
`#[html(selector = "...", attr = "...", default = ...)]`
`selector`, `attr`, `default` or `html` itself can be unnecessary.
This is valid
```rust
#[macro_use]
extern crate unhtml_derive;
extern crate unhtml;
use unhtml::{self, FromHtml};#[derive(FromHtml)]
struct SingleString {
value: String,
}
```#### selector
##### target
`derive target` or `field`
##### literal type
`string`
##### specification
selector must be a valid css-selector, invalid selector will cause a compile-time panic
```rust,should_panic
// panic
#[macro_use]
extern crate unhtml_derive;
extern crate unhtml;
use unhtml::{self, FromHtml};#[derive(FromHtml)]
#[html(selector = "<>")]
struct SingleUser {}
``````rust,should_panic
// panic
#[macro_use]
extern crate unhtml_derive;
extern crate unhtml;
use unhtml::*;#[derive(FromHtml)]
struct SingleUser {
#[html(selector = "<>", attr = "inner")]
name: String,
}
```if multi element is selected and field type is not `Vec`, the first will be chosen
```rust
#[macro_use]
extern crate unhtml_derive;
extern crate unhtml;
use unhtml::{self, FromHtml};#[derive(FromHtml)]
#[html(selector = "a")]
struct Link {
#[html(attr = "href")]
href: String,#[html(attr = "inner")]
value: String,
}let link = Link::from_html(r#"
Github
"#).unwrap();
assert_eq!("https://github.com", &link.href);
assert_eq!("Github", &link.value);
```##### default behavior
html of its root element
```rust
#[macro_use]
extern crate unhtml_derive;
extern crate unhtml;
use unhtml::{self, FromHtml};#[derive(FromHtml)]
struct Link {
#[html(attr = "href")]
href: String,#[html(attr = "inner")]
value: String,
}let link = Link::from_html(r#"Github"#).unwrap();
assert_eq!("https://github.com", &link.href);
assert_eq!("Github", &link.value);
```#### attr
##### target
`field`
##### literal type
`string`
##### specification
- `inner` refer to `innerHtml`
- any other `attr` refer to `html element attribute````rust
#[macro_use]
extern crate unhtml_derive;
extern crate unhtml;
use unhtml::{self, FromHtml};#[derive(FromHtml)]
struct Link {
#[html(attr = "href")]
href: String,#[html(attr = "inner")]
value: String,
}let link = Link::from_html(r#"Github"#).unwrap();
assert_eq!("https://github.com", &link.href);
assert_eq!("Github", &link.value);
```##### default behavior
html of the whole element (not `innerHtml`!)
```rust
#[macro_use]
extern crate unhtml_derive;
extern crate unhtml;
use unhtml::{self, FromHtml};#[derive(FromHtml)]
struct Link {
#[html(attr = "href")]
href: String,#[html(attr = "inner")]
value: String,
source: String,
}let link = Link::from_html(r#"Github"#).unwrap();
assert_eq!("https://github.com", &link.href);
assert_eq!("Github", &link.value);
assert_eq!(r#"Github"#, &link.source);
```#### default
##### target
`field`
##### literal type
any `literal type`
##### specification
- the same type with `field`
```rust
#[macro_use]
extern crate unhtml_derive;
extern crate unhtml;
use unhtml::{self, FromHtml};#[derive(FromHtml)]
struct DefaultUser {
// invoke String::from_html
#[html(selector = "#non-exist", default = "Hexilee")]
name: String,// invoke u8::from
#[html(default = 20)]
age: u8,#[html(default = true)]
like_lemon: bool,
}let user = DefaultUser::from_html("
").unwrap();
assert_eq!("Hexilee", &user.name);
assert_eq!(20, user.age);
assert_eq!(-1000, user.assets);
assert!(user.like_lemon);
```- `string`
```rust
#[macro_use]
extern crate unhtml_derive;
extern crate unhtml;
use unhtml::{self, FromHtml};#[derive(FromHtml)]
struct Link {
#[html(attr = "href")]
href: String,#[html(attr = "inner")]
value: String,
}#[derive(FromHtml)]
struct Website {
#[html(default = "10")]
age: u8,#[html(default = "Github")]
link: Link,
}let website = Website::from_html("
").unwrap();
let link = website.link;
assert_eq!(10u8, website.age);
assert_eq!("https://github.com", &link.href);
assert_eq!("Github", &link.value);
```##### default behavior
return a Err(unhtml::failure::Error) when selected nothing
```rust,should_panic
// panic
#[macro_use]
extern crate unhtml_derive;
extern crate unhtml;
use unhtml::{self, FromHtml};#[derive(FromHtml)]
struct Link {
// no default
#[html(attr = "href")]
href: String,#[html(attr = "inner")]
value: String,
}let link = Link::from_html(r#"Github"#).unwrap();
```### Field Type
##### any type implemented FromHtml, without generics
```rust,should_panic
// panic
#[macro_use]
extern crate unhtml_derive;
extern crate unhtml;
use unhtml::{self, FromHtml};#[derive(FromHtml)]
struct Link {
// no default
#[html(attr = "href")]
href: &str,
}
``````rust,should_panic
// panic
#[macro_use]
extern crate unhtml_derive;
extern crate unhtml;
use unhtml::{self, FromHtml};#[derive(FromHtml)]
struct Website {
// no default
#[html(attr = "href")]
hrefs: std::collections::LinkedList,
}
``````rust,should_panic
// panic
#[macro_use]
extern crate unhtml_derive;
extern crate unhtml;
use unhtml::{self, FromHtml};#[derive(FromHtml)]
struct Website {
// no default
#[html(attr = "href")]
hrefs: [String],
}
```##### Vec
> Should `use unhtml::VecFromHtml`
```rust
extern crate unhtml_derive;
extern crate unhtml;
use unhtml::{self, FromHtml, VecFromHtml};#[derive(FromHtml)]
struct TestUser {
#[html(selector = "p:nth-child(1)", attr = "inner")]
name: String,#[html(selector = "p:nth-child(2)", attr = "inner")]
age: u8,#[html(selector = "p:nth-child(3)", attr = "inner")]
like_lemon: bool,
}#[derive(FromHtml)]
#[html(selector = "#test")]
struct TestUsers {
#[html(selector = "div")]
users: Vec,
}let users = TestUsers::from_html(r#"
Title
Hexilee
20
true
BigBrother
21
false
"#).unwrap();
let hexilee = &users.users[0];
let big_brother = &users.users[1];
assert_eq!("Hexilee", &hexilee.name);
assert_eq!(20, hexilee.age);
assert!(hexilee.like_lemon);
assert_eq!("BigBrother", &big_brother.name);
assert_eq!(21, big_brother.age);
assert!(!big_brother.like_lemon);
```as the documentation of crate `unhtml`, if you want `Vec` straightly, you can just:
```rust
extern crate unhtml_derive;
extern crate unhtml;
use unhtml::{self, VecFromHtml};#[derive(FromHtml)]
struct TestUser {
#[html(selector = "p:nth-child(1)", attr = "inner")]
name: String,#[html(selector = "p:nth-child(2)", attr = "inner")]
age: u8,#[html(selector = "p:nth-child(3)", attr = "inner")]
like_lemon: bool,
}let users = Vec::::from_html("#test > div", r#"
Title
Hexilee
20
true
BigBrother
21
false
"#).unwrap();
let hexilee = &users[0];
let big_brother = &users[1];
assert_eq!("Hexilee", &hexilee.name);
assert_eq!(20, hexilee.age);
assert!(hexilee.like_lemon);
assert_eq!("BigBrother", &big_brother.name);
assert_eq!(21, big_brother.age);
assert!(!big_brother.like_lemon);
```### Source HTML
##### with top selector
all source html will be parsed as `fragment`. The top element is `html` and there is no `DOCTYPE`, `head` or `body`.```html,ignore
Title
Hexilee
20
true
```
will be parsed as:
```html,ignore
Title
Hexilee
20
true
```
and
```html,ignore
Hexilee
```will be parsed as:
```html,ignore
Hexilee
``````rust,should_panic
// panic
extern crate unhtml_derive;
extern crate unhtml;
use unhtml::{self, FromHtml};#[derive(FromHtml)]
struct Document {
// no default
#[html(selector = "head")]
head: String,#[html(selector = "body")]
body: String,
}let dicument = Document::from_html(r#"
Title
Hexilee
20
true
"#).unwrap();
```##### without top selector
when derived struct doesn't have `top selector`, all source html will be parsed as `pure fragment`. There is no `DOCTYPE`, `html`, `head` or `body`.
```html,ignore
Title
Hexilee
20
true
```
will be parsed as:
```html,ignore
Title
Hexilee
20
true
```and
```html,ignore
Hexilee
```will be parsed as:
```html,ignore
Hexilee
``````rust
#[macro_use]
extern crate unhtml_derive;
extern crate unhtml;
use unhtml::{self, FromHtml};#[derive(FromHtml)]
struct Link {
#[html(attr = "href")]
href: String,#[html(attr = "inner")]
value: String,
}let link = Link::from_html(r#"Github"#).unwrap();
assert_eq!("https://github.com", &link.href);
assert_eq!("Github", &link.value);
```