https://github.com/softcircuits/htmlmonkey
Lightweight HTML/XML parser written in C#.
- Host: GitHub
- URL: https://github.com/softcircuits/htmlmonkey
- Owner: SoftCircuits
- License: other
- Created: 2017-11-05T21:00:41.000Z (about 7 years ago)
- Default Branch: master
- Last Pushed: 2024-03-17T19:16:58.000Z (8 months ago)
- Last Synced: 2024-10-29T05:17:07.473Z (16 days ago)
- Topics: csharp, dotnet, html, html-parser, parser
- Language: C#
- Homepage:
- Size: 436 KB
- Stars: 55
- Watchers: 3
- Forks: 9
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: License.txt
# HtmlMonkey
[![NuGet version (SoftCircuits.HtmlMonkey)](https://img.shields.io/nuget/v/SoftCircuits.HtmlMonkey.svg?style=flat-square)](https://www.nuget.org/packages/SoftCircuits.HtmlMonkey/)
```
Install-Package SoftCircuits.HtmlMonkey
```
## Overview
HtmlMonkey is a lightweight HTML/XML parser written in C#. It parses HTML or XML into a hierarchy of node objects, which can then be traversed or searched using jQuery-like selectors. The library can also create and modify nodes, and generate new HTML or XML from the current nodes.
## Getting Started
You can use either of the static methods `HtmlDocument.FromHtml()` or `HtmlDocument.FromFile()` to parse HTML and create an `HtmlDocument` object. (Note: If you're using WinForms, watch out for conflict with `System.Windows.Forms.HtmlDocument`.)
#### Parse an HTML Document
```cs
string html = "..."; // HTML markup
HtmlDocument document = HtmlDocument.FromHtml(html);
```
This code parses the HTML document into a hierarchy of nodes and returns a new `HtmlDocument` object. The `HtmlDocument.RootNodes` property contains the top-level nodes that were parsed.
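As a minimal sketch of the two points above (the WinForms name clash and the `RootNodes` property), the following helper parses a string and lists the runtime type of each top-level node. It assumes the library's namespace is `SoftCircuits.HtmlMonkey` (inferred from the package name) and that `RootNodes` enumerates `HtmlNode` objects. `ParseExample` and `TopLevelNodeTypes` are hypothetical names used only for illustration.

```cs
using System;
using System.Collections.Generic;
// Aliases avoid ambiguity with System.Windows.Forms.HtmlDocument.
// Assumption: the namespace matches the NuGet package name.
using HtmlDocument = SoftCircuits.HtmlMonkey.HtmlDocument;
using HtmlNode = SoftCircuits.HtmlMonkey.HtmlNode;

public static class ParseExample // hypothetical helper, not part of the library
{
    // Returns the runtime type name of each top-level node, in document order.
    public static List<string> TopLevelNodeTypes(string html)
    {
        HtmlDocument document = HtmlDocument.FromHtml(html);
        var names = new List<string>();
        foreach (HtmlNode node in document.RootNodes)
            names.Add(node.GetType().Name);
        return names;
    }
}
```

Calling `ParseExample.TopLevelNodeTypes(html)` on your own markup is a quick way to see which of the node types described below the parser produced at the top level.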
#### Types of Nodes
The parsed nodes can include several different types of nodes, as outlined in the table below. All node types derive from the abstract class `HtmlNode`.
| Node Type | Description |
| --------- | ----------- |
| `HtmlElementNode` | Represents an HTML element, or tag. This is the only node type that can contain child nodes. |
| `HtmlTextNode` | Represents raw text in the document. |
| `HtmlCDataNode` | Represents any block of data like a comment or CDATA section. The library creates a node for these blocks but does not parse their contents. |
| `HtmlHeaderNode` | Represents an HTML document header. |
| `XmlHeaderNode` | Represents an XML document header. |
## Navigating Parsed Nodes
HtmlMonkey provides several ways to navigate parsed nodes. Each `HtmlElementNode` node includes a `Children` property, which can be used to access that node's children. In addition, all nodes have `NextNode`, `PrevNode`, and `ParentNode` properties, which you can use to navigate the nodes in every direction.
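The navigation properties can be combined into a simple tree walk. The sketch below is one possible approach, not the library's own traversal code: it descends through `Children` and climbs through `ParentNode`. It assumes the namespace `SoftCircuits.HtmlMonkey` and that `RootNodes` and `Children` can be enumerated as `HtmlNode` sequences. `TreeWalk`, `Descend`, `Depth`, and `Print` are hypothetical names.

```cs
using System;
using System.Collections.Generic;
using SoftCircuits.HtmlMonkey; // assumed namespace, inferred from the package name

public static class TreeWalk // hypothetical helper, not part of the library
{
    // Depth-first walk: yields every node, each parent before its children.
    public static IEnumerable<HtmlNode> Descend(IEnumerable<HtmlNode> nodes)
    {
        foreach (HtmlNode node in nodes)
        {
            yield return node;
            // Only element nodes can contain child nodes.
            if (node is HtmlElementNode element)
                foreach (HtmlNode child in Descend(element.Children))
                    yield return child;
        }
    }

    // Counts a node's ancestors by following ParentNode until it is null.
    public static int Depth(HtmlNode node)
    {
        int depth = 0;
        for (var parent = node.ParentNode; parent != null; parent = parent.ParentNode)
            depth++;
        return depth;
    }

    // Writes one line per node, indented by its depth in the tree.
    public static void Print(string html)
    {
        HtmlDocument document = HtmlDocument.FromHtml(html);
        foreach (HtmlNode node in Descend(document.RootNodes))
            Console.WriteLine(new string(' ', 2 * Depth(node)) + node.GetType().Name);
    }
}
```

For everyday searches the built-in `Find()` methods are likely the better choice. A hand-written walk like this is mainly useful when you need the traversal order or depth information as well.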
The `HtmlDocument` class also includes a `Find()` method, which accepts a predicate argument. This method will recursively find all the nodes in the document for which the predicate returns true, and return those nodes in a flat list.
```cs
// Returns all nodes that are the first node of their parent
IEnumerable<HtmlNode> nodes = document.Find(n => n.PrevNode == null);
```
You can also use the `FindOfType()` method. This method traverses the entire document tree to find all the nodes of the specified type.
```cs
// Returns all text nodes
IEnumerable<HtmlTextNode> nodes = document.FindOfType<HtmlTextNode>();
```
The `FindOfType()` method is also overloaded to accept an optional predicate argument.
```cs
// Returns all HtmlElementNodes that have children
IEnumerable<HtmlElementNode> nodes = document.FindOfType<HtmlElementNode>(n => n.Children.Any());
```
## Using Selectors
The `HtmlDocument.Find()` method also has an overload that supports using jQuery-like selectors to find nodes. Selectors provide a powerful and flexible way to locate nodes.
#### Specifying Tag Names
You can specify a tag name to return all the nodes with that tag.
```cs
// Get all <p> tags in the document
// Search is not case-sensitive
IEnumerable<HtmlElementNode> nodes = document.Find("p");

// Get all HtmlElementNode nodes (tags) in the document
// Same result as not specifying the tag name
// Also the same result as document.FindOfType<HtmlElementNode>();
nodes = document.Find("*");
```
#### Specifying Attributes
There are several ways to search for nodes with specific attributes. You can use the pound (#), period (.) or colon (:) to specify a value for the `id`, `class` or `type` attribute, respectively.
```cs
// Get any nodes with the attribute id="center-ad"
IEnumerable<HtmlElementNode> nodes = document.Find("#center-ad");

// Get any <div> tags with the attribute class="align-right"
nodes = document.Find("div.align-right");

// Returns all <input> tags with the attribute type="button"
nodes = document.Find("input:button");
```
For greater control over attributes, you can use square brackets ([]). This is similar to specifying attributes in jQuery, with some differences. HtmlMonkey does not support jQuery's variations for matching at the start, middle, or end of a value. Instead, the `:=` operator specifies that the value is a regular expression, and a node matches if its attribute value matches that expression.
```cs
// Get any <p> tags with the attribute id="center-ad"
IEnumerable<HtmlElementNode> nodes = document.Find("p[id=\"center-ad\"]");

// Get any <p> tags that have both attributes id="center-ad" and class="align-right"
// Quotes within the square brackets are optional if the value contains no whitespace or most punctuation.
nodes = document.Find("p[id=center-ad][class=align-right]");

// Returns all <a> tags that have an href attribute
// The value of that attribute does not matter
nodes = document.Find("a[href]");

// Get any <p> tags with the attribute data-id with a value that matches the regular
// expression "abc-\d+"
// Not case-sensitive
nodes = document.Find("p[data-id:=\"abc-\\d+\"]");
```
Note that there is one key difference when using square brackets. When using a pound (#), period (.) or colon (:) to specify an attribute value, it is considered a match if it matches any value within that attribute. For example, the selector `div.right-align` would match the attribute `class="main-content right-align"`. When using square brackets, the selector must match the entire value (although there are exceptions to this when using regular expressions).
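The three matching behaviors described above can be modeled with plain .NET string and regex calls. This is an illustrative sketch only, not HtmlMonkey's implementation. It assumes that "any value within that attribute" means any whitespace-separated token, that literal comparisons are case-sensitive, and that the regular expression may match anywhere in the value. `AttributeMatchSketch` and its methods are hypothetical names.

```cs
using System;
using System.Text.RegularExpressions;

public static class AttributeMatchSketch // hypothetical, for illustration only
{
    private static readonly char[] Whitespace = { ' ', '\t', '\r', '\n' };

    // Shorthand style (#, ., :): true if any whitespace-separated token equals `wanted`.
    // Assumption: case-sensitive comparison.
    public static bool MatchesAnyToken(string attributeValue, string wanted) =>
        Array.Exists(
            attributeValue.Split(Whitespace, StringSplitOptions.RemoveEmptyEntries),
            token => string.Equals(token, wanted, StringComparison.Ordinal));

    // Square-bracket style with '=': true only if the entire value equals `wanted`.
    public static bool MatchesWholeValue(string attributeValue, string wanted) =>
        string.Equals(attributeValue, wanted, StringComparison.Ordinal);

    // Square-bracket style with ':=': case-insensitive regular expression match.
    // Regex.IsMatch is unanchored, so the pattern may match part of the value.
    public static bool MatchesRegex(string attributeValue, string pattern) =>
        Regex.IsMatch(attributeValue, pattern, RegexOptions.IgnoreCase);
}
```

Under these assumptions, `class="main-content right-align"` passes the token check for `right-align` but fails the whole-value check, which mirrors the difference between `div.right-align` and `div[class=right-align]`.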
#### Multiple Selectors
There are several cases where you can specify multiple selectors.