{"id":19626592,"url":"https://github.com/softcircuits/htmlmonkey","last_synced_at":"2025-10-04T08:00:04.936Z","repository":{"id":46657310,"uuid":"109615936","full_name":"SoftCircuits/HtmlMonkey","owner":"SoftCircuits","description":"Lightweight HTML/XML parser written in C#.","archived":false,"fork":false,"pushed_at":"2025-07-16T22:37:01.000Z","size":523,"stargazers_count":58,"open_issues_count":0,"forks_count":9,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-09-03T01:11:36.012Z","etag":null,"topics":["csharp","dotnet","html","html-parser","parser"],"latest_commit_sha":null,"homepage":"","language":"C#","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/SoftCircuits.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"License.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2017-11-05T21:00:41.000Z","updated_at":"2025-07-16T22:37:05.000Z","dependencies_parsed_at":"2024-02-11T20:13:05.292Z","dependency_job_id":"df13cdbe-7e8e-42b4-9b28-8dc9474801ed","html_url":"https://github.com/SoftCircuits/HtmlMonkey","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/SoftCircuits/HtmlMonkey","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SoftCircuits%2FHtmlMonkey","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SoftCircuits%2FHtmlMonkey/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SoftCircuits%2FHtmlMonkey/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SoftCircuits%2FHtmlMonkey/manifests","
owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/SoftCircuits","download_url":"https://codeload.github.com/SoftCircuits/HtmlMonkey/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SoftCircuits%2FHtmlMonkey/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":278283511,"owners_count":25961311,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-04T02:00:05.491Z","response_time":63,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["csharp","dotnet","html","html-parser","parser"],"created_at":"2024-11-11T11:47:06.111Z","updated_at":"2025-10-04T08:00:04.924Z","avatar_url":"https://github.com/SoftCircuits.png","language":"C#","funding_links":[],"categories":[],"sub_categories":[],"readme":"# HtmlMonkey\n\n[![NuGet version (SoftCircuits.HtmlMonkey)](https://img.shields.io/nuget/v/SoftCircuits.HtmlMonkey.svg?style=flat-square)](https://www.nuget.org/packages/SoftCircuits.HtmlMonkey/)\n\n```\nInstall-Package SoftCircuits.HtmlMonkey\n```\n\n## Overview\n\nHtmlMonkey is a lightweight HTML/XML parser written in C#. It parses HTML or XML into a hierarchy of node objects, which can then be traversed. It also supports searching those nodes using jQuery-like selectors. The library can also be used to create and modify the nodes. 
And it can generate new HTML or XML from the current nodes.\n\n## Getting Started\n\nYou can use either of the static methods `HtmlDocument.FromHtml()` or `HtmlDocument.FromFile()` to parse HTML and create an `HtmlDocument` object. (Note: If you're using WinForms, watch out for conflict with `System.Windows.Forms.HtmlDocument`.)\n\n#### Parse an HTML Document\n\n```cs\nstring html = \"...\";   // HTML markup\nHtmlDocument document = HtmlDocument.FromHtml(html);\n```\n\nThis code parses the HTML document into a hierarchy of nodes and returns a new `HtmlDocument` object. The `HtmlDocument.RootNodes` property contains the top-level nodes that were parsed.\n\n#### Types of Nodes\n\nThe parsed nodes can include several different types of nodes, as outlined in the table below. All node types derive from the abstract class `HtmlNode`.\n\n| Node Type | Description |\n| --------- | ----------- |\n| `HtmlElementNode` | Represents an HTML element, or tag. This is the only node type that can contain child nodes. |\n| `HtmlTextNode` | Represents raw text in the document. |\n| `HtmlCDataNode` | Represents any block of data like a comment or CDATA section. The library creates a node for these blocks but does not parse their contents. |\n| `HtmlHeaderNode` | Represents an HTML document header. |\n| `XmlHeaderNode` | Represents an XML document header. |\n\n## Navigating Parsed Nodes\n\nHtmlMonkey provides several ways to navigate parsed nodes. Each `HtmlElementNode` node includes a `Children` property, which can be used to access that node's children. In addition, all nodes have `NextNode`, `PrevNode`, and `ParentNode` properties, which you can use to navigate the nodes in every direction.\n\nThe `HtmlDocument` class also includes a `Find()` method, which accepts a predicate argument. 
This method will recursively find all the nodes in the document for which the predicate returns true, and return those nodes in a flat list.\n\n```cs\n// Returns all nodes that are the first node of its parent\nIEnumerable\u003cHtmlNode\u003e nodes = document.Find(n =\u003e n.PrevNode == null);\n```\n\nYou can also use the `FindOfType()` method. This method traverses the entire document tree to find all the nodes of the specified type.\n\n```cs\n// Returns all text nodes\nIEnumerable\u003cHtmlTextNode\u003e nodes = document.FindOfType\u003cHtmlTextNode\u003e();\n```\n\nThe `FindOfType()` method is also overloaded to accept an optional predicate argument.\n\n```cs\n// Returns all HtmlElementNodes that have children\nIEnumerable\u003cHtmlElementNode\u003e nodes = document.FindOfType\u003cHtmlElementNode\u003e(n =\u003e n.Children.Any());\n```\n\n## Using Selectors\n\nThe `HtmlDocument.Find()` method also has an overload that supports using jQuery-like selectors to find nodes. Selectors provide a powerful and flexible way to locate nodes.\n\n#### Specifying Tag Names\n\nYou can specify a tag name to return all the nodes with that tag.\n\n```cs\n// Get all \u003cp\u003e tags in the document\n// Search is not case-sensitive\nIEnumerable\u003cHtmlElementNode\u003e nodes = document.Find(\"p\");\n\n// Get all HtmlElementNode nodes (tags) in the document\n// Same result as not specifying the tag name\n// Also the same result as document.FindOfType\u003cHtmlElementNode\u003e();\nnodes = document.Find(\"*\");\n```\n\n#### Specifying Attributes\n\nThere are several ways to search for nodes with specific attributes. You can use the pound (#), period (.) 
or colon (:) to specify a value for the `id`, `class` or `type` attribute, respectively.\n\n```cs\n// Get any nodes with the attribute id=\"center-ad\"\nIEnumerable\u003cHtmlElementNode\u003e nodes = document.Find(\"#center-ad\");\n\n// Get any \u003cdiv\u003e tags with the attribute class=\"align-right\"\nnodes = document.Find(\"div.align-right\");\n\n// Returns all \u003cinput\u003e tags with the attribute type=\"button\"\nnodes = document.Find(\"input:button\");\n```\n\nFor greater control over attributes, you can use square brackets ([]). This is similar to specifying attributes in jQuery, but there are some differences. The first difference is that all the variations for finding a match at the start, middle or end are not supported by HtmlMonkey. Instead, HtmlMonkey allows you to use the `:=` operator to specify that the value is a regular expression and the code will match if the attribute value matches that regular expression.\n\n```cs\n// Get any \u003cp\u003e tags with the attribute id=\"center-ad\"\nIEnumerable\u003cHtmlElementNode\u003e nodes = document.Find(\"p[id=\\\"center-ad\\\"]\");\n\n// Get any \u003cp\u003e tags that have both attributes id=\"center-ad\" and class=\"align-right\"\n// Quotes within the square brackets are optional if the value contains no whitespace or most punctuation.\nnodes = document.Find(\"p[id=center-ad][class=align-right]\");\n\n// Returns all \u003ca\u003e tags that have an href attribute\n// The value of that attribute does not matter\nnodes = document.Find(\"a[href]\");\n\n// Get any \u003cp\u003e tags with the attribute data-id with a value that matches the regular\n// expression \"abc-\\d+\"\n// Not case-sensitive\nnodes = document.Find(\"p[data-id:=\\\"abc-\\\\d+\\\"]\");\n\n// Finds all \u003ca\u003e links that link to blackbeltcoder.com\n// Uses a regular expression to allow optional http:// or https://, and www. 
prefix\n// This example is also not case-sensitive\nnodes = document.Find(\"a[href:=\\\"^(http:\\\\/\\\\/|https:\\\\/\\\\/)?(www\\\\.)?blackbeltcoder.com\\\"]\");\n```\n\nNote that there is one key difference when using square brackets. When using a pound (#), period (.) or colon (:) to specify an attribute value, it is considered a match if it matches any value within that attribute. For example, the selector `div.right-align` would match the attribute `class=\"main-content right-align\"`. When using square brackets, it must match the entire value (although there are exceptions to this when using regular expressions).\n\n#### Multiple Selectors\n\nThere are several cases where you can specify multiple selectors.\n\n```cs\n// Returns all \u003ca\u003e, \u003cdiv\u003e and \u003cp\u003e tags\nIEnumerable\u003cHtmlElementNode\u003e nodes = document.Find(\"a, div, p\");\n\n// Returns all \u003cspan\u003e tags that are descendants of a \u003cdiv\u003e tag\nnodes = document.Find(\"div span\");\n\n// Returns all \u003cspan\u003e tags that are a direct descendant of a \u003cdiv\u003e tag\nnodes = document.Find(\"div \u003e span\");\n```\n\n#### Selector Performance\n\nObviously, there is some overhead parsing selectors. If you want to use the same selectors more than once, you can optimize your code by parsing the selectors into data structures and then passing those data structures to the find methods. 
The following code is further optimized by first finding a set of container nodes, and then potentially performing multiple searches against those container nodes.\n\n```cs\n// Parse selectors into SelectorCollections\nSelectorCollection containerSelectors = Selector.ParseSelector(\"div.container\");\nSelectorCollection itemSelectors = Selector.ParseSelector(\"p.item\");\n\n// Search document for container nodes\nIEnumerable\u003cHtmlElementNode\u003e containerNodes = containerSelectors.Find(document.RootNodes);\n\n// Finally, search container nodes for item nodes\nIEnumerable\u003cHtmlElementNode\u003e itemNodes = itemSelectors.Find(containerNodes);\n```\n\n#### HTML Rules\n\nThere are a lot of rules that can apply to HTML and XML documents. These rules can determine how the markup is parsed. For example,\n`\u003ca\u003e` tags cannot be nested. And `\u003cli\u003e` tags must be a child of either an `\u003col\u003e` or `\u003cul\u003e` tag. If these rules are set, HtmlMonkey\nwill terminate the previous tag before starting the new tag when that new tag is not valid as a child of the previous tag.\n\nThese rules can be accessed and modified using the `TagRules` property of the `HtmlRules` class. `TagRules` tracks two kinds of rules:\nIt defines attributes of HTML tags, and it defines nesting rules for HTML tags. The attributes include whether the tag is a self-closing\ntag, whether it can have children, whether it can be nested, etc. The nesting rules define which tags can be nested within other tags.\nYou can specify all the tags that a particular HTML tag can be a child of.\n\nTags with no attributes set default to `HtmlTagAttributes.None`. Tags with no nesting rules set default to no restrictions on which\ntags those tags can be a child of. (Note that this is different from having an empty list of nesting rules, which means that the tag\ncannot be a child of any other tag.)\n\nBy default, attributes are set for common HTML tags, and no nesting rules are set. 
This means that, by default, all HTML tags can be\nnested within any other HTML tag. You can modify these rules to suit your needs. For example, the following code clears any existing\nnesting rules and then sets some common HTML nesting rules.\n\n```cs\nHtmlRules.TagRules.ClearNestingRules();\nHtmlRules.TagRules.SetNestingRule(\"html\", []);\nHtmlRules.TagRules.SetNestingRule(\"head\", [\"html\"]);\nHtmlRules.TagRules.SetNestingRule(\"body\", [\"html\"]);\nHtmlRules.TagRules.SetNestingRule(\"thead\", [\"table\"]);\nHtmlRules.TagRules.SetNestingRule(\"tbody\", [\"table\"]);\nHtmlRules.TagRules.SetNestingRule(\"tfoot\", [\"table\"]);\nHtmlRules.TagRules.SetNestingRule(\"tr\", [\"table\", \"thead\", \"tbody\"]);\nHtmlRules.TagRules.SetNestingRule(\"td\", [\"tr\"]);\nHtmlRules.TagRules.SetNestingRule(\"th\", [\"tr\"]);\nHtmlRules.TagRules.SetNestingRule(\"li\", [\"ol\", \"ul\"]);\nHtmlRules.TagRules.SetNestingRule(\"option\", [\"select\", \"optgroup\"]);\nHtmlRules.TagRules.SetNestingRule(\"optgroup\", [\"select\"]);\nHtmlRules.TagRules.SetNestingRule(\"dt\", [\"dl\"]);\nHtmlRules.TagRules.SetNestingRule(\"dd\", [\"dl\"]);\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsoftcircuits%2Fhtmlmonkey","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsoftcircuits%2Fhtmlmonkey","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsoftcircuits%2Fhtmlmonkey/lists"}