https://github.com/James-LG/Skyscraper

Rust library for scraping HTML using XPath expressions
https://github.com/James-LG/Skyscraper

html rust scraper xpath

Last synced: 5 months ago
JSON representation

Rust library for scraping HTML using XPath expressions

Host: GitHub
URL: https://github.com/James-LG/Skyscraper
Owner: James-LG
License: mit
Created: 2021-05-19T00:16:18.000Z (about 5 years ago)
Default Branch: master
Last Pushed: 2024-12-07T02:57:25.000Z (over 1 year ago)
Last Synced: 2024-12-07T03:25:11.479Z (over 1 year ago)
Topics: html, rust, scraper, xpath
Language: Rust
Homepage:
Size: 793 KB
Stars: 31
Watchers: 1
Forks: 4
Open Issues: 8
Metadata Files:
- Readme: README.md
- License: LICENSE

Awesome Lists containing this project

README

          # Skyscraper - HTML scraping with XPath

[![Dependency Status](https://deps.rs/repo/github/James-LG/Skyscraper/status.svg)](https://deps.rs/repo/github/James-LG/Skyscraper)

[![License MIT](https://img.shields.io/badge/license-MIT-blue.svg)](https://github.com/James-LG/Skyscraper/blob/master/LICENSE)

[![Crates.io](https://img.shields.io/crates/v/skyscraper.svg)](https://crates.io/crates/skyscraper)

[![doc.rs](https://docs.rs/skyscraper/badge.svg)](https://docs.rs/skyscraper)

Rust library to scrape HTML documents with XPath expressions.

> This library is major-version 0 because there are still `todo!` calls for many xpath features.

>If you encounter one that you feel should be prioritized, open an issue on [GitHub](https://github.com/James-LG/Skyscraper/issues).

>

> See the [Supported XPath Features](#supported-xpath-features) section for details.

## HTML Parsing

Skyscraper has its own HTML parser implementation. The parser outputs a

tree structure that can be traversed manually with parent/child relationships.

### Example: Simple HTML Parsing

```rust

use skyscraper::html::{self, parse::ParseError};

let html_text = r##"

    

        
Hello world

    

"##;

 

let document = html::parse(html_text)?;

```

### Example: Traversing Parent/Child Relationships

```rust

// Parse the HTML text into a document

let text = r#""#;

let document = html::parse(text)?;

 

// Get the children of the root node

let parent_node: DocumentNode = document.root_node;

let children: Vec = parent_node.children(&document).collect();

assert_eq!(2, children.len());

 

// Get the parent of both child nodes

let parent_of_child0: DocumentNode = children[0].parent(&document).expect("parent of child 0 missing");

let parent_of_child1: DocumentNode = children[1].parent(&document).expect("parent of child 1 missing");

 

assert_eq!(parent_node, parent_of_child0);

assert_eq!(parent_node, parent_of_child1);

```

## XPath Expressions

Skyscraper is capable of parsing XPath strings and applying them to HTML documents.

Below is a basic xpath example. Please see the [docs](https://docs.rs/skyscraper/latest/skyscraper/xpath/index.html) for more examples.

```rust

use skyscraper::html;

use skyscraper::xpath::{self, XpathItemTree, grammar::{XpathItemTreeNodeData, data_model::{Node, XpathItem}}};

use std::error::Error;

fn main() -> Result<(), Box> {

    let html_text = r##"

    

        

            
Hello world

        

    "##;

    let document = html::parse(html_text)?;

    let xpath_item_tree = XpathItemTree::from(&document);

    let xpath = xpath::parse("//div")?;

   

    let item_set = xpath.apply(&xpath_item_tree)?;

   

    assert_eq!(item_set.len(), 1);

   

    let mut items = item_set.into_iter();

   

    let item = items

        .next()

        .unwrap();

    let element = item

        .as_node()?

        .as_tree_node()?

        .data

        .as_element_node()?;

    assert_eq!(element.name, "div");

    Ok(())

}

```

### Supported XPath Features

Below is a non-exhaustive list of all the features that are currently supported.

1. Basic xpath steps: `/html/body/div`, `//div/table//span`

1. Attribute selection: `//div/@class`

1. Text selection: `//div/text()`

1. Wildcard node selection: `//body/*`

1. Predicates:

    1. Attributes: `//div[@class='hi']`

    1. Indexing: `//div[1]`

1. Functions:

    1. `fn:root()`

    1. `contains(haystack, needle)`

1. Forward axes:

    1. Child: `child::*`

    1. Descendant: `descendant::*`

    1. Attribute: `attribute::*`

    1. DescendentOrSelf: `descendant-or-self::*`

    1. (more coming soon)

1. Reverse axes:

    1. Parent:  `parent::*`

    1. (more coming soon)

1. Treat expressions: `/html treat as node()`

This should cover most XPath use-cases.

If your use case requires an unimplemented feature,

please open an issue on [GitHub](https://github.com/James-LG/Skyscraper/issues).

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/James-LG/Skyscraper

Awesome Lists containing this project

README