# Skyscraper - HTML scraping with XPath

[![Dependency Status](https://deps.rs/repo/github/James-LG/Skyscraper/status.svg)](https://deps.rs/repo/github/James-LG/Skyscraper)
[![License MIT](https://img.shields.io/badge/license-MIT-blue.svg)](https://github.com/James-LG/Skyscraper/blob/master/LICENSE)
[![Crates.io](https://img.shields.io/crates/v/skyscraper.svg)](https://crates.io/crates/skyscraper)
[![doc.rs](https://docs.rs/skyscraper/badge.svg)](https://docs.rs/skyscraper)

Rust library to scrape HTML documents with XPath expressions.

> This library is major-version 0 because there are still `todo!` calls for many XPath features.
> If you encounter one that you feel should be prioritized, open an issue on [GitHub](https://github.com/James-LG/Skyscraper/issues).
>
> See the [Supported XPath Features](#supported-xpath-features) section for details.

## HTML Parsing

Skyscraper has its own HTML parser implementation. The parser outputs a
tree structure that can be traversed manually with parent/child relationships.

### Example: Simple HTML Parsing

```rust
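use skyscraper::html::{self, parse::ParseError};

fn main() -> Result<(), ParseError> {
    // A minimal document containing a single <div>
    let html_text = r##"
<html>
    <body>
        <div>Hello world</div>
    </body>
</html>
"##;

    // Parse the text into a document tree
    let document = html::parse(html_text)?;
    Ok(())
}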
```

### Example: Traversing Parent/Child Relationships

```rust
use skyscraper::html::{self, DocumentNode, parse::ParseError};

fn main() -> Result<(), ParseError> {
    // Parse the HTML text into a document: one root element with two children
    let text = r#"<parent><child/><child/></parent>"#;
    let document = html::parse(text)?;

    // Get the children of the root node
    let parent_node: DocumentNode = document.root_node;
    let children: Vec<DocumentNode> = parent_node.children(&document).collect();
    assert_eq!(2, children.len());

    // Get the parent of both child nodes
    let parent_of_child0: DocumentNode = children[0].parent(&document).expect("parent of child 0 missing");
    let parent_of_child1: DocumentNode = children[1].parent(&document).expect("parent of child 1 missing");

    // Both children report the root node as their parent
    assert_eq!(parent_node, parent_of_child0);
    assert_eq!(parent_node, parent_of_child1);
    Ok(())
}
```
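
The traversal calls above take `&document` as an argument, which is the usual shape of an arena-backed tree: nodes are small copyable handles, and a separate owner holds the actual storage. The following is a minimal sketch of that pattern using only the standard library. `Arena`, `NodeId`, `NodeData`, and `add` are hypothetical names chosen for illustration; they are not part of Skyscraper's API, and this is not a description of how Skyscraper stores its nodes.

```rust
/// Copyable handle to a node; only meaningful together with its `Arena`.
/// (Hypothetical type for illustration, not a Skyscraper type.)
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct NodeId(usize);

/// Per-node storage: a tag name plus links expressed as handles.
struct NodeData {
    name: String,
    parent: Option<NodeId>,
    children: Vec<NodeId>,
}

/// Owns every node of the tree (plays the role of the document).
pub struct Arena {
    nodes: Vec<NodeData>,
}

impl Arena {
    pub fn new() -> Self {
        Arena { nodes: Vec::new() }
    }

    /// Adds a node and, when a parent is given, links it as that parent's last child.
    pub fn add(&mut self, name: &str, parent: Option<NodeId>) -> NodeId {
        let id = NodeId(self.nodes.len());
        self.nodes.push(NodeData {
            name: name.to_string(),
            parent,
            children: Vec::new(),
        });
        if let Some(p) = parent {
            self.nodes[p.0].children.push(id);
        }
        id
    }
}

impl NodeId {
    /// Tag name of this node, looked up in the arena.
    pub fn name(self, arena: &Arena) -> &str {
        &arena.nodes[self.0].name
    }

    /// Parent handle, or `None` for a root node.
    pub fn parent(self, arena: &Arena) -> Option<NodeId> {
        arena.nodes[self.0].parent
    }

    /// Child handles in insertion order.
    pub fn children(self, arena: &Arena) -> impl Iterator<Item = NodeId> + '_ {
        arena.nodes[self.0].children.iter().copied()
    }
}
```

With this layout, `root.children(&arena)` and `child.parent(&arena)` read the same way as the `children(&document)` and `parent(&document)` calls in the example above, and handles can be compared with `assert_eq!` because they are plain indices.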

## XPath Expressions

Skyscraper is capable of parsing XPath strings and applying them to HTML documents.

Below is a basic XPath example. See the [docs](https://docs.rs/skyscraper/latest/skyscraper/xpath/index.html) for more examples.

```rust
use skyscraper::html;
use skyscraper::xpath::{self, XpathItemTree, grammar::{XpathItemTreeNodeData, data_model::{Node, XpathItem}}};
use std::error::Error;

fn main() -> Result<(), Box<dyn Error>> {
    let html_text = r##"
<html>
    <body>
        <div>Hello world</div>
    </body>
</html>
"##;

    // Parse the HTML and build the tree that XPath expressions run against
    let document = html::parse(html_text)?;
    let xpath_item_tree = XpathItemTree::from(&document);
    let xpath = xpath::parse("//div")?;

    // Apply the expression; the result is a set of matching items
    let item_set = xpath.apply(&xpath_item_tree)?;

    assert_eq!(item_set.len(), 1);

    let mut items = item_set.into_iter();

    let item = items
        .next()
        .unwrap();

    // Narrow the item down to an element node
    let element = item
        .as_node()?
        .as_tree_node()?
        .data
        .as_element_node()?;

    assert_eq!(element.name, "div");
    Ok(())
}
```

### Supported XPath Features

Below is a non-exhaustive list of the features that are currently supported.

1. Basic XPath steps: `/html/body/div`, `//div/table//span`
1. Attribute selection: `//div/@class`
1. Text selection: `//div/text()`
1. Wildcard node selection: `//body/*`
1. Predicates:
    1. Attributes: `//div[@class='hi']`
    1. Indexing: `//div[1]`
1. Functions:
    1. `fn:root()`
    1. `contains(haystack, needle)`
1. Forward axes:
    1. Child: `child::*`
    1. Descendant: `descendant::*`
    1. Attribute: `attribute::*`
    1. DescendantOrSelf: `descendant-or-self::*`
    1. (more coming soon)
1. Reverse axes:
    1. Parent: `parent::*`
    1. (more coming soon)
1. Treat expressions: `/html treat as node()`

This should cover most XPath use cases.
If your use case requires an unimplemented feature,
please open an issue on [GitHub](https://github.com/James-LG/Skyscraper/issues).
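
To make the difference between child steps (`/html/body/div`), descendant searches (`//div`), and the `*` wildcard concrete, here is a small standard-library-only sketch that evaluates those three forms over a toy element tree. It only illustrates the selection semantics and is not how Skyscraper evaluates expressions. `Element`, `name_matches`, `select_child_path`, and `select_descendants` are hypothetical names introduced for this example.

```rust
/// Toy element tree (hypothetical type, unrelated to Skyscraper's data model).
#[derive(Debug)]
pub struct Element {
    pub name: String,
    pub children: Vec<Element>,
}

impl Element {
    pub fn new(name: &str, children: Vec<Element>) -> Self {
        Element { name: name.to_string(), children }
    }
}

/// Name test: `*` matches any element, otherwise the names must be equal.
fn name_matches(element: &Element, test: &str) -> bool {
    test == "*" || element.name == test
}

/// Child steps such as `/html/body/div`: each step keeps only the direct
/// children of the previous matches. The first step is tested against the
/// root element itself.
pub fn select_child_path<'a>(root: &'a Element, steps: &[&str]) -> Vec<&'a Element> {
    let (first, rest) = match steps.split_first() {
        Some(parts) => parts,
        None => return Vec::new(),
    };
    if !name_matches(root, first) {
        return Vec::new();
    }
    let mut current = vec![root];
    for step in rest {
        current = current
            .into_iter()
            .flat_map(|element| element.children.iter())
            .filter(|child| name_matches(child, step))
            .collect();
    }
    current
}

/// Descendant search such as `//div`: visits the root and every descendant
/// in document (pre-order) order and keeps the elements passing the name test.
pub fn select_descendants<'a>(root: &'a Element, test: &str) -> Vec<&'a Element> {
    let mut found = Vec::new();
    let mut stack = vec![root];
    while let Some(element) = stack.pop() {
        if name_matches(element, test) {
            found.push(element);
        }
        // Push children in reverse so the leftmost child is visited first.
        stack.extend(element.children.iter().rev());
    }
    found
}
```

In this sketch, `select_child_path(&root, &["html", "body", "div"])` only finds `div` elements sitting directly under `body`, while `select_descendants(&root, "div")` also finds `div` elements nested deeper, for example inside a `table`.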