https://github.com/niklak/dom_query

A Flexible Rust Crate for DOM Querying and Manipulation
https://github.com/niklak/dom_query
css-selectors html html5ever parser scraping selectors web-scraping
Last synced: 3 months ago
JSON representation
A Flexible Rust Crate for DOM Querying and Manipulation
Host: GitHub
URL: https://github.com/niklak/dom_query
Owner: niklak
License: other
Created: 2023-12-22T08:09:20.000Z (over 2 years ago)
Default Branch: main
Last Pushed: 2026-04-06T09:40:10.000Z (3 months ago)
Last Synced: 2026-04-06T11:26:30.913Z (3 months ago)
Topics: css-selectors, html, html5ever, parser, scraping, selectors, web-scraping
Language: Rust
Homepage: https://docs.rs/dom_query
Size: 1.19 MB
Stars: 84
Watchers: 2
Forks: 10
Open Issues: 3
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE
Awesome Lists containing this project

README

          
# DOM_QUERY: A Flexible Rust Crate for DOM Querying and Manipulation

[![Crates.io version](https://img.shields.io/crates/v/dom_query.svg?style=flat)](https://crates.io/crates/dom_query)

[![Download](https://img.shields.io/crates/d/dom_query.svg?style=flat)](https://crates.io/crates/dom_query)

[![docs.rs docs](https://img.shields.io/badge/docs-latest-blue.svg?style=flat)](https://docs.rs/dom_query)

[![codecov](https://codecov.io/gh/niklak/dom_query/graph/badge.svg?token=CFAVOIE61O)](https://codecov.io/gh/niklak/dom_query)

[![Build Status](https://github.com/niklak/dom_query/actions/workflows/rust.yml/badge.svg?branch=main)](https://github.com/niklak/dom_query/actions/workflows/rust.yml)

[![Rust CI ARM64](https://github.com/niklak/dom_query/actions/workflows/rust-arm64.yml/badge.svg)](https://github.com/niklak/dom_query/actions/workflows/rust-arm64.yml)

[![wasm ci](https://github.com/niklak/dom_query/actions/workflows/wasm.yml/badge.svg)](https://github.com/niklak/dom_query/actions/workflows/wasm.yml)

DOM_QUERY is a flexible Rust crate that simplifies HTML parsing, DOM querying and manipulation by providing a high-level jQuery-like API. It uses the `html5ever` crate for HTML parsing and the `selectors` crate for efficient DOM traversal and element selection.

## Features

- Parse HTML documents and fragments

- Query DOM elements using CSS selectors

- Traverse the DOM tree (ancestors, parents, children, siblings)

- Manipulate elements and their attributes:

  - Add/remove/modify attributes

  - Change element content

  - Add/remove elements

  - Rename elements

  - Move elements within the DOM tree

> [!NOTE]

> This crate is a significantly enhanced fork of [nipper](https://crates.io/crates/nipper),

> featuring expanded CSS selector support, enhanced DOM traversal  and improved DOM manipulation capabilities.

## Examples

Parsing a document

```rust

use dom_query::Document;

use tendril::StrTendril;

// Document may consume &str, String, StrTendril

let contents_str = r#"

Test Page"#;

let doc = Document::from(contents_str);

let contents_string = contents_str.to_string();

let doc = Document::from(contents_string);

let contents_tendril = StrTendril::from(contents_str);

let doc = Document::from(contents_tendril);

// The root element for the `Document` is a Document

assert!(doc.root().is_document());

// if the source has DocType, then the Document will also have one

// as a first child.

assert!(doc.root().first_child().unwrap().is_doctype());

//both of them are not elements.

```

Parsing a fragment

```rust

use dom_query::Document;

use tendril::StrTendril;

// fragment can be created with Document::fragment(), which accepts &str, String, StrTendril

let contents_str = r#"

Test Page"#;

let fragment = Document::fragment(contents_str);

let contents_string = contents_str.to_string();

let fragment = Document::fragment(contents_string);

let contents_tendril = StrTendril::from(contents_str);

let fragment = Document::fragment(contents_tendril);

// The root element for the  fragment is not a Document but a Fragment

assert!(!fragment.root().is_document());

assert!(fragment.root().is_fragment());

// and when it parses a fragment, it drops Doctype

assert!(!fragment.root().first_child().unwrap().is_doctype());

```

Selecting elements

```rust

use dom_query::Document;

let html = r#"

    

        

        Test Page

    

    

        
Test Page

        

            One

            Two

            Three

        

    

"#;

let document = Document::from(html);

// select a single element

let a = document.select("ul li:nth-child(2)");

let text = a.text().to_string();

assert!(text == "Two");

// selecting multiple elements

document.select("ul > li:has(a)").iter().for_each(|el| {

    assert!(el.is("li"));

})

// there is also `try_select` which returns an Option

let no_sel = document.try_select("p");

assert!(no_sel.is_none());

```

Selecting a single match and multiple matches

```rust

use dom_query::Document;

let doc: Document = r#"

    


        1

2

3

    

    

        4

5

6

    


"#

    .into();

// if you need to select only the first, single match, you can use following:

let single_selection = doc.select_single(".list");

// access is only for the first matching:

assert_eq!(single_selection.length(), 1);

assert_eq!(single_selection.inner_html().to_string().trim(), "
1
2
3
");

// simple selection contains all matches:

let selection = doc.select(".list");

assert_eq!(selection.length(), 2);

// but if you call inner_html() on it, you will get the inner_html of the first match:

assert_eq!(selection.inner_html().to_string().trim(), "
1
2
3
");

//this approach is using the first node from nodes vec and `select_single` consumes one iteration instead.

let first_selection = doc.select(".list").first();

assert_eq!(first_selection.length(), 1);

assert_eq!(first_selection.inner_html().to_string().trim(), "
1
2
3
");

// this approach is consuming all nodes into vec at first, and then you can call `iter().next()` to get the first one.

let next_selection = doc.select(".list").iter().next().unwrap();

assert_eq!(next_selection.length(), 1);

assert_eq!(next_selection.inner_html().to_string().trim(), "
1
2
3
");

// currently, to get data from all matches you need to iterate over them, either:

let all_matched: String = selection.iter().map(|s| s.inner_html().trim().to_string()).collect();

assert_eq!(

all_matched,

"
1
2
3
4
5
6"

);

// or:

let all_matched: String = selection.nodes().iter().map(|s| s.inner_html().trim().to_string()).collect();

// which is more efficient.

assert_eq!(

all_matched,

"
1
2
3
4
5
6"

);

```

Selecting descendent elements

```rust

 use dom_query::Document;

 let html = r#"

 

     

         

         Test Page

     

     

         
Test Page

         

             One

             Two

             Three

         

         

             Four

         

     

 "#;

 let document = Document::from(html);

 // select a parent element

 let ul = document.select("ul");

 // selecting multiple elements

 ul.select("li").iter().for_each(|el| {

     assert!(el.is("li"));

 });

 // also descendant selector may be specified starting from the parent elements

 let el = ul.select("body ul.list-b li").first();

 let text = el.text();

 assert_eq!("Four", text.to_string());

```

    Selecting ancestors

```rust

use dom_query::Document;

let doc: Document = r#"

    Test

    

        


            

                

                    Child

                

            

        

    

"#.into();

// selecting an element

let child_sel = doc.select("#child");

assert!(child_sel.exists());

let child_node = child_sel.nodes().first().unwrap();

// getting all ancestors

let ancestors = child_node.ancestors(None);

let ancestor_sel = Selection::from(ancestors);

// or just: let ancestor_sel = child_sel.ancestors(None);

// in this case ancestors includes all ancestral nodes including html

// the root html element is presented in the ancestor selection

assert!(ancestor_sel.is("html"));

// also the direct parent of our starting node is presented

assert!(ancestor_sel.is("#parent"));

// `Selection::is` matches only the current selection without descending down the tree,

// so it won't match the #child node.

assert!(!ancestor_sel.is("#child"));

// if you don't require all ancestors, you can specify a number of ancestors you need -- `max_limit`

let ancestors = child_node.ancestors(Some(2));

let ancestor_sel = Selection::from(ancestors);

// in this case ancestors includes only two ancestral nodes: #grand-parent and #parent

assert!(ancestor_sel.is("#grand-parent #parent"));

assert!(!ancestor_sel.is("#great-ancestor"));

```

Selecting with precompiled matchers (for reuse)

```rust

use dom_query::{Document, Matcher};

let html1 = r#"Test Page 1"#;

let html2 = r#"Test Page 2"#;

let doc1 = Document::from(html1);

let doc2 = Document::from(html2);

// create a matcher once, reuse on different documents

let title_matcher = Matcher::new("title").unwrap();

let title_el1 = doc1.select_matcher(&title_matcher);

assert_eq!(&title_el1.text(), "Test Page 1");

let title_el2 = doc2.select_matcher(&title_matcher);

assert_eq!(&title_el2.text(), "Test Page 2");

// selecting a single match

let title_single = doc1.select_single_matcher(&title_matcher);

assert_eq!(&title_single.text(), "Test Page 1");

```

Selecting with pseudo-classes (:has, :has-text, :contains, :only-text)

```rust

use dom_query::Document;

let html = include_str!("../test-pages/rustwiki_2024.html");

let doc = Document::from(html);

// searching list items inside a `tr` element which has a `a` element

// with title="Programming paradigm"

let paradigm_selection =

    doc.select(

        r#"table tr:has(a[title="Programming paradigm"]) td.infobox-data ul > li"#

    );

println!("Rust programming paradigms:");

for item in paradigm_selection.iter() {

    println!(" {}", item.text());

}

println!("{:-<50}", "");

//since `th` contains text "Paradigms" without sibling tags, we can use `:has-text` pseudo class

let influenced_by_selection =

    doc.select(r#"table tr:has-text("Influenced by") + tr td  ul > li > a"#);

println!("Rust influenced by:");

for item in influenced_by_selection.iter() {

    println!(" {}", item.text());

}

println!("{:-<50}", "");

// Extract all links from the block that contains certain text.

// Since `foreign function interface` located in its own tag,

// we have to use `:contains` pseudo class

let links_selection =

    doc.select(

        r#"p:contains("Rust has a foreign function interface") a[href^="/"]"#

    );

println!("Links in the FFI block:");

for item in links_selection.iter() {

    println!(" {}", item.attr("href").unwrap());

}

println!("{:-<50}", "");

// :only-text selects an element that contains only a single text node,

// with no child elements.

// It can be combined with other pseudo-classes to achieve more specific selections.

// For example, to select a 
 inside an 

//that has no siblings and no child elements other than text.

println!("Single  inside an  with text only:");

for el in doc.select("a div:only-text:only-child").iter() {

    println!("{}", el.text().trim());

}

```

Serializing to HTML

```rust

use dom_query::Document;

let html = r#"

    Test

    
Test Page

"#;

let doc = Document::from(html);

let heading_selector = doc.select("div.content");

// serializing including the outer html tag

let content = heading_selector.html();

assert_eq!(content.to_string(), r#"Test Page"#);

// serializing without the outer html tag

let inner_content = heading_selector.inner_html();

assert_eq!(inner_content.to_string(), "Test Page
");

// there is also `try_html()` method, which returns an `Option`,

// and if there is no matching selection it returns None

let opt_no_content = doc.select("div.no-content").try_html();

assert_eq!(opt_no_content, None);

//`html()` method will return an empty `StrTendril` if there is no matching selection

let no_content = doc.select("div.no-content").html();

assert_eq!(&no_content, "");

//Same things works for `inner_html()` and `try_inner_html()` method.

assert_eq!(doc.select("div.no-content").try_inner_html(), None);

assert_eq!(&doc.select("div.no-content").inner_html(), "");

```

Accessing descendent text

```rust

use dom_query::Document;

let html = r#"

    Test

    
Test Page



"#;

let doc = Document::from(html);

let body_selection = doc.select("body div").first();

let text = body_selection.text();

assert_eq!(text.to_string(), "Test Page");

```

Accessing immediate text

```rust

use dom_query::Document;

let html = r#"

    Test

    
Test Page



"#;

let doc = Document::from(html);

let body_selection = doc.select("body div h1").first();

// accessing immediate text without descendants

let text = body_selection.immediate_text();

assert_eq!(text.to_string(), "Test ");

```

Manipulating the attribute of an HTML element

```rust

use dom_query::Document;

let html = r#"

    Test

    

"#;

let doc = Document::from(html);

let mut input_selection = doc.select("input[name=k]");

// get the value of attribute "data-k"

let val = input_selection.attr("data-k").unwrap();

assert_eq!(val.to_string(), "100");

// remove the attribute "data-k" from the element

input_selection.remove_attr("data-k");

// get the value of attribute "data-k", if missing, return default value

let val_or = input_selection.attr_or("data-k", "0");

assert_eq!(val_or.to_string(), "0");

// remove a list of attributes from the element

input_selection.remove_attrs(&["id", "class"]);

// set a attribute "data-k" with value "200"

input_selection.set_attr("data-k", "200");

assert_eq!(&input_selection.html(), r#""#);

// check if attribute "hidden" exists on the element

let is_hidden = input_selection.has_attr("hidden");

assert!(is_hidden);

let has_title = input_selection.has_attr("title");

assert!(!has_title);

// remove all attributes from the element

input_selection.remove_all_attrs();

assert_eq!(&input_selection.html(), r#""#);

```

    Manipulating the DOM

```rust

use dom_query::Document;

let html_contents = r#"

    

        Test

        

            


            

            

                Remove me

            

            

                Replace me

            

        

    "#;;

let doc = Document::from(html_contents);

let mut content_selection = doc.select("body .content");

// append a new html node to the selection

content_selection.append_html(r#"
inner block"#);

assert!(doc.select("body .content .inner").exists());

// set a new content to the selection, replacing existing content

let mut set_selection = doc.select(".inner");

set_selection.set_html(r#"
1,2,3"#);

assert_eq!(&doc.select(".inner").html(), r#"1,2,3
"#);

// remove the selection

doc.select(".remove-it").remove();

assert!(!doc.select(".remove-it").exists());

// replace the selection with a new html, current selection will not change.

let mut replace_selection = doc.select(".replace-it");

replace_selection.replace_with_html(r#"
Replaced"#);

assert_eq!(replace_selection.text().trim(), "Replace me");

//but the document will change

assert_eq!(&doc.select(".replaced").text(),"Replaced");

//instead of appending content, you can prepend it

let mut content_selection = doc.select_single("body .content");

// you can prepend one element or,

content_selection.prepend_html(r#"
3"#);

// more:

content_selection.prepend_html(r#"2

2
"#);

// Also you can insert html before selection:

let first = content_selection.select(".first");

first.before_html(r#"
None"#);

// or after:

let third = content_selection.select(".third");

third.after_html(r#"4
"#);

// now the added paragraphs standing in front of `div`

assert!(doc.select(r#".content > .none + .first + .second + .third + .fourth + div:has-text("1,2,3")"#).exists());

// to set a text to the selection you can use `set_html` but `set_text` is preferable:

let p_sel = content_selection.select("p");

let total_p = p_sel.length();

p_sel.set_text("test content");

assert_eq!(doc.select(r#"p:has-text("test content")"#).length(), total_p);

```

    Node manipulations: Creating an empty element, adding a single element to a single node

```rust

use dom_query::Document;

let doc: Document = r#"

    


        It's

    


"#.into();

// selecting a node we want to attach a new element

let main_sel = doc.select_single("#main");

let main_node = main_sel.nodes().first().unwrap();

// if you need just to create a simple element, then you can use the following:

let el = doc.tree.new_element("p");

// you still able to deal with element's attributes:

el.set_attr("id", "second");

// and set text

el.set_text("test");

main_node.append_child(&el);

// also main_node.append_child(&el);

assert!(doc.select(r#"#main #second:has-text("test")"#).exists());

// because this method doesn't parse anything it is much more cheaper than following approaches.

// if you need to add a more complex element, you can use `node.append_html`,

// which is much more convenient, then previous approach:

main_node.append_html(r#"
Wonderful"#);

assert_eq!(doc.select("#main #third").text().as_ref(), "Wonderful");

// There is also a `prepend_child` and `prepend_html` methods which allows

// to insert content to the begging of the node.

main_node.prepend_html(r#"-1

0"#);

assert!(doc.select("#main > #minus-one + #zero + #first + #second + #third").exists());

// if we need to replace existing element content inside a node with a new one, then use `node.set_html`.

// It changes the inner html contents of the node.

main_node.set_html(r#"
Wonderful"#);

assert_eq!(doc.select("#main #the-only").text().as_ref(), "Wonderful");

assert!(!doc.select("#first").exists());

// To completely replace contents of the node,

// including itself use `node.replace_with_html`.

// Also we can specify more than one element in the string for methods

// like `replace_with_html`, `set_html` and `append_html`.

main_node.replace_with_html(r#"Tweedledum and Tweedledee"#);

assert!(!doc.select("#main").exists());

assert_eq!(doc.select("span + span").text().as_ref(), "Tweedledee");

```

    Renaming selected elements without changing the contents

```rust

use dom_query::Document;

let doc: Document = r#"

Test

    


        1

        2

        3

        4

    


"#

.into();

let mut sel = doc.select("div.content > div, div.content > span");

// before renaming, there are 3 `div` and 1 `span`

assert_eq!(sel.length(), 4);

sel.rename("p");

// after renaming, there are no `div` and `span` elements

assert_eq!(doc.select("div.content > div, div.content > span").length(), 0);

// but there are four `p` elements

assert_eq!(doc.select("div.content > p").length(), 4);

```

    Retrieving The Base URI

```rust

use dom_query::Document;

let contents: &str = r#"

    

        

        Test

    

    

        


    

"#;

let doc = Document::from(contents);

// This method is a much faster alternative to

// `doc.select("html > head > base").attr("href")`.

// Currently, it does not cache the result, so each time you call it,

// it will traverse the tree again.

// The reason it is not cached is to keep `Document` implementing the `Send` trait.

// It may be called from the document level.

let base_uri = doc.base_uri().unwrap();

assert_eq!(base_uri.as_ref(), "https://www.example.com/");

let sel = doc.select_single("#main");

let node = sel.nodes().first().unwrap();

// Also it is accessible from any node of the tree.

let base_uri = node.base_uri().unwrap();

assert_eq!(base_uri.as_ref(), "https://www.example.com/");

```

    Verifying Selection and Node Matches

```rust

use dom_query::Document;

let contents: &str = r#"

    

        Test

    

    

        


        

    

"#;

let doc = Document::from(contents);

let main_sel = doc.select_single("#main");

let extra_sel = doc.select_single("#extra");

// The `is()` method is available for `Selection` and `NodeRef`.

// For `Selection`, it verifies that at least one of the nodes in the selection

// matches the selector.

assert!(main_sel.is("div#main"));

assert!(!extra_sel.is("div#main"));

// For `NodeRef`, the `is` method verifies that the node matches the selector.

// This method is useful if you need to combine several checks into one expression.

// It can check for having a certain position in the DOM tree,

// having a certain attribute, or a certain element name all at once.

let main_node = main_sel.nodes().first().unwrap();

let extra_node = extra_sel.nodes().first().unwrap();

assert!(main_node.is("html > body > div#main[dir=ltr]"));

assert!(extra_node.is("html > body > div#main + div"));

```

    Fast Finding Child Elements

```rust

use dom_query::Document;

let doc: Document = r#"

    Test

    

        


    

"#.into();

let main_sel = doc.select_single("#main");

let main_node = main_sel.nodes().first().unwrap();

// create 10 child blocks with links

let total_links = 10usize;

for i in 0..total_links {

    let content = format!(r#"{0} link"#, i);

    main_node.append_html(content);

}

let selected_count = doc.select("html body a").nodes().len();

assert_eq!(selected_count, total_links);

// `find` currently can deal only with paths that start after the current node.

// In the following example, `&["html", "body", "div", "a"]` will fail,

// while `&["a"]` or `&["div", "a"]` are okay.

let found_count = main_node.find(&["div", "a"]).len();

assert_eq!(found_count, total_links);

```

    Serializing a document to Markdown

*This example requires `markdown` feature.*

```rust

use dom_query::Document;

let contents = "

p {color: blue;}

I really like using Markdown.


I think I'll use it to format all of my documents from now on.
";

let expected = "I really like using **Markdown**\\.\n\n\

I think I'll use it to format all of my documents from now on\\.";

let doc = Document::from(contents);

// Passing `None` into md allows to use default skip tags, which are:

// `["script", "style", "meta", "head"]`.

let got = doc.md(None);

assert_eq!(got.as_ref(), expected);

// If you need the full text content of the elements, pass `Some(&vec![])` to `md`.

// If you pass content like the example below to `Document::from`,

// `html5ever` will create a `` element and place your `` element inside it.

// To preserve the original order, use `Document::fragment`.

let contents = "<style>p {color: blue;}\

Content Heading


\

I really like using Markdown.\

I think I'll use it to format all of my documents from now on.
";

let expected = "p \\{color: blue;\\}\n\

I really like using Markdown\\.\n\n\

I think I'll use it to format all of my documents from now on\\.";

let doc = Document::fragment(contents);

let got = doc.md(Some(&["div"]));

assert_eq!(got.as_ref(), expected);

```

- **[more examples](./examples/)**

- **[dom_query by example](https://niklak.github.io/dom_query_by_example/)**

## Crate features

- `hashbrown` — optional,replaces standard hashmaps and hashsets with `hashbrown` hashmaps and hashsets.

- `atomic` — optional, switches `NodeData` from using `StrTendril` to `Tendril`.

This allows `NodeData` and all ascending structures, including `Document`, to implement the `Send` trait;

- `markdown` — optional, enables the `Document::md` and `NodeRef::md` methods, allowing serialization of a document or node to `Markdown` text.

- `mini_selector` — optional, provides a lightweight and faster alternative for element matching with limited CSS selector support.

  This includes additional `NodeRef` methods: `find_descendants`, `try_find_descendants`, `mini_is`, and `mini_match`.

  *This is an experimental feature that may change in future releases.*

## Possible issues

* [wasm32 compilation](https://niklak.github.io/dom_query_by_example/WASM32-compilation.html)

## Changelog

[Changelog](./CHANGELOG.md)

## License

Licensed under MIT ([LICENSE](LICENSE) or http://opensource.org/licenses/MIT)

## Contribution

Any contribution intentionally submitted for inclusion in the work by you, shall be

licensed with MIT license, without any additional terms or conditions.
ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/niklak/dom_query

Awesome Lists containing this project

README

Test Page

Test Page

Test Page

Test Page

Test Page

Test Page

Test Page

Content Heading