https://github.com/niklak/dom_query
A Flexible Rust Crate for DOM Querying and Manipulation
css-selectors html html5ever parser scraping selectors web-scraping
Last synced: about 2 months ago
- Host: GitHub
- URL: https://github.com/niklak/dom_query
- Owner: niklak
- License: other
- Created: 2023-12-22T08:09:20.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2026-04-06T09:40:10.000Z (about 2 months ago)
- Last Synced: 2026-04-06T11:26:30.913Z (about 2 months ago)
- Topics: css-selectors, html, html5ever, parser, scraping, selectors, web-scraping
- Language: Rust
- Homepage: https://docs.rs/dom_query
- Size: 1.19 MB
- Stars: 84
- Watchers: 2
- Forks: 10
- Open Issues: 3
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE
# DOM_QUERY: A Flexible Rust Crate for DOM Querying and Manipulation
DOM_QUERY is a flexible Rust crate that simplifies HTML parsing, DOM querying and manipulation by providing a high-level jQuery-like API. It uses the `html5ever` crate for HTML parsing and the `selectors` crate for efficient DOM traversal and element selection.
## Features
- Parse HTML documents and fragments
- Query DOM elements using CSS selectors
- Traverse the DOM tree (ancestors, parents, children, siblings)
- Manipulate elements and their attributes:
  - Add/remove/modify attributes
  - Change element content
  - Add/remove elements
  - Rename elements
  - Move elements within the DOM tree
> [!NOTE]
> This crate is a significantly enhanced fork of [nipper](https://crates.io/crates/nipper),
> featuring expanded CSS selector support, enhanced DOM traversal and improved DOM manipulation capabilities.
## Examples
Parsing a document
```rust
use dom_query::Document;
use tendril::StrTendril;
// `Document` can be built from `&str`, `String` or `StrTendril`
let contents_str = r#"<!DOCTYPE html>
<html><head><title>Test Page</title></head><body></body></html>"#;
let doc = Document::from(contents_str);
let contents_string = contents_str.to_string();
let doc = Document::from(contents_string);
let contents_tendril = StrTendril::from(contents_str);
let doc = Document::from(contents_tendril);
// The root node of a `Document` is a document node
assert!(doc.root().is_document());
// If the source has a doctype, the document node gets a doctype node
// as its first child.
assert!(doc.root().first_child().unwrap().is_doctype());
// Neither the document node nor the doctype node is an element.
```
Parsing a fragment
```rust
use dom_query::Document;
use tendril::StrTendril;
// A fragment is created with `Document::fragment()`, which accepts `&str`, `String` or `StrTendril`
let contents_str = r#"<!DOCTYPE html>
<html><head><title>Test Page</title></head><body></body></html>"#;
let fragment = Document::fragment(contents_str);
let contents_string = contents_str.to_string();
let fragment = Document::fragment(contents_string);
let contents_tendril = StrTendril::from(contents_str);
let fragment = Document::fragment(contents_tendril);
// The root node of a fragment is a fragment node, not a document node
assert!(!fragment.root().is_document());
assert!(fragment.root().is_fragment());
// When parsing a fragment, the doctype is dropped
assert!(!fragment.root().first_child().unwrap().is_doctype());
```
Selecting elements
```rust
use dom_query::Document;
let html = r#"<!DOCTYPE html>
<html>
    <head><title>Test Page</title></head>
    <body>
        <h1>Test Page</h1>
        <ul>
            <li>One</li>
            <li><a href="/2">Two</a></li>
            <li><a href="/3">Three</a></li>
        </ul>
    </body>
</html>"#;
let document = Document::from(html);
// select a single element
let a = document.select("ul li:nth-child(2)");
let text = a.text().to_string();
assert_eq!(text, "Two");
// select multiple elements
document.select("ul > li:has(a)").iter().for_each(|el| {
    assert!(el.is("li"));
});
// `try_select` returns an `Option`, which is `None` when nothing matches
let no_sel = document.try_select("p");
assert!(no_sel.is_none());
```
Selecting a single match and multiple matches
```rust
use dom_query::Document;
let doc: Document = r#"
<ul class="list"><li>1</li><li>2</li><li>3</li></ul>
<ul class="list"><li>4</li><li>5</li><li>6</li></ul>
"#
.into();
// if you need only the first match, use `select_single`:
let single_selection = doc.select_single(".list");
// it gives access to the first matching element only:
assert_eq!(single_selection.length(), 1);
assert_eq!(single_selection.inner_html().to_string().trim(), "<li>1</li><li>2</li><li>3</li>");
// a regular selection contains all matches:
let selection = doc.select(".list");
assert_eq!(selection.length(), 2);
// but calling `inner_html()` on it returns the inner HTML of the first match only:
assert_eq!(selection.inner_html().to_string().trim(), "<li>1</li><li>2</li><li>3</li>");
// `first()` collects every match and then keeps the first node,
// whereas `select_single` stops after the first match.
let first_selection = doc.select(".list").first();
assert_eq!(first_selection.length(), 1);
assert_eq!(first_selection.inner_html().to_string().trim(), "<li>1</li><li>2</li><li>3</li>");
// this approach also collects all nodes first; `iter().next()` then yields the first one.
let next_selection = doc.select(".list").iter().next().unwrap();
assert_eq!(next_selection.length(), 1);
assert_eq!(next_selection.inner_html().to_string().trim(), "<li>1</li><li>2</li><li>3</li>");
// currently, to get data from all matches you need to iterate over them, either:
let all_matched: String = selection.iter().map(|s| s.inner_html().trim().to_string()).collect();
assert_eq!(
    all_matched,
    "<li>1</li><li>2</li><li>3</li><li>4</li><li>5</li><li>6</li>"
);
// or, more efficiently, over the underlying nodes:
let all_matched: String = selection.nodes().iter().map(|s| s.inner_html().trim().to_string()).collect();
assert_eq!(
    all_matched,
    "<li>1</li><li>2</li><li>3</li><li>4</li><li>5</li><li>6</li>"
);
```
Selecting descendant elements
```rust
use dom_query::Document;
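let html = r#"<!DOCTYPE html>
<html>
    <head><title>Test Page</title></head>
    <body>
        <h1>Test Page</h1>
        <ul class="list-a">
            <li>One</li>
            <li><a href="/2">Two</a></li>
            <li><a href="/3">Three</a></li>
        </ul>
        <ul class="list-b">
            <li><a href="/4">Four</a></li>
        </ul>
    </body>
</html>"#;
let document = Document::from(html);
// select the parent elements
let ul = document.select("ul");
// select multiple descendants
ul.select("li").iter().for_each(|el| {
    assert!(el.is("li"));
});
// a descendant selector may also name elements above the current selection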
let el = ul.select("body ul.list-b li").first();
let text = el.text();
assert_eq!("Four", text.to_string());
```
Selecting ancestors
```rust
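use dom_query::{Document, Selection};
let doc: Document = r#"<!DOCTYPE html>
<html>
    <head><title>Test</title></head>
    <body>
        <div id="great-ancestor">
            <div id="grand-parent">
                <div id="parent">
                    <div id="child"></div>
                </div>
            </div>
        </div>
    </body>
</html>
"#.into();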
// selecting an element
let child_sel = doc.select("#child");
assert!(child_sel.exists());
let child_node = child_sel.nodes().first().unwrap();
// getting all ancestors
let ancestors = child_node.ancestors(None);
let ancestor_sel = Selection::from(ancestors);
// or just: let ancestor_sel = child_sel.ancestors(None);
// in this case the selection includes every ancestor node, up to and including `html`
// the root html element is present in the ancestor selection
assert!(ancestor_sel.is("html"));
// the direct parent of the starting node is present as well
assert!(ancestor_sel.is("#parent"));
// `Selection::is` matches only the nodes in the current selection without descending down the tree,
// so it won't match the #child node.
assert!(!ancestor_sel.is("#child"));
// if you don't need all ancestors, you can limit how many are returned with `max_limit`
let ancestors = child_node.ancestors(Some(2));
let ancestor_sel = Selection::from(ancestors);
// in this case the selection includes only the two nearest ancestors: #parent and #grand-parent
assert!(ancestor_sel.is("#grand-parent #parent"));
assert!(!ancestor_sel.is("#great-ancestor"));
```
Selecting with precompiled matchers (for reuse)
```rust
use dom_query::{Document, Matcher};
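let html1 = r#"<!DOCTYPE html><html><head><title>Test Page 1</title></head><body></body></html>"#;
let html2 = r#"<!DOCTYPE html><html><head><title>Test Page 2</title></head><body></body></html>"#;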
let doc1 = Document::from(html1);
let doc2 = Document::from(html2);
// create a matcher once, reuse on different documents
let title_matcher = Matcher::new("title").unwrap();
let title_el1 = doc1.select_matcher(&title_matcher);
assert_eq!(&title_el1.text(), "Test Page 1");
let title_el2 = doc2.select_matcher(&title_matcher);
assert_eq!(&title_el2.text(), "Test Page 2");
// selecting a single match
let title_single = doc1.select_single_matcher(&title_matcher);
assert_eq!(&title_single.text(), "Test Page 1");
```
Selecting with pseudo-classes (:has, :has-text, :contains, :only-text)
```rust
use dom_query::Document;
let html = include_str!("../test-pages/rustwiki_2024.html");
let doc = Document::from(html);
// search for list items inside a `tr` element that has an `a` element
// with title="Programming paradigm"
let paradigm_selection =
doc.select(
r#"table tr:has(a[title="Programming paradigm"]) td.infobox-data ul > li"#
);
println!("Rust programming paradigms:");
for item in paradigm_selection.iter() {
println!(" {}", item.text());
}
println!("{:-<50}", "");
// since `th` contains the text "Influenced by" without sibling tags, we can use the `:has-text` pseudo-class
let influenced_by_selection =
doc.select(r#"table tr:has-text("Influenced by") + tr td ul > li > a"#);
println!("Rust influenced by:");
for item in influenced_by_selection.iter() {
println!(" {}", item.text());
}
println!("{:-<50}", "");
// Extract all links from the block that contains certain text.
// Since `foreign function interface` is located in its own tag,
// we have to use the `:contains` pseudo-class
let links_selection =
doc.select(
r#"p:contains("Rust has a foreign function interface") a[href^="/"]"#
);
println!("Links in the FFI block:");
for item in links_selection.iter() {
println!(" {}", item.attr("href").unwrap());
}
println!("{:-<50}", "");
// :only-text selects an element that contains only a single text node,
// with no child elements.
// It can be combined with other pseudo-classes to achieve more specific selections.
// For example, to select a
//that has no siblings and no child elements other than text.
println!("Single