https://github.com/Chi-EEE/html-parser
C++ HTML parser that generates a simple DOM tree in C++17
boost cpp cpp-html-parser cpp17 css html html-parser html-parser-library html5 library parser scraping xmake
- Host: GitHub
- URL: https://github.com/Chi-EEE/html-parser
- Owner: Chi-EEE
- License: unlicense
- Fork: true (Menci/html-parser)
- Created: 2023-02-18T09:45:55.000Z (almost 2 years ago)
- Default Branch: master
- Last Pushed: 2023-12-19T10:29:57.000Z (about 1 year ago)
- Last Synced: 2024-07-30T21:05:49.278Z (6 months ago)
- Topics: boost, cpp, cpp-html-parser, cpp17, css, html, html-parser, html-parser-library, html5, library, parser, scraping, xmake
- Language: C++
- Homepage:
- Size: 95.7 KB
- Stars: 1
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# HTML Parser
# Requirements
* [XMake](https://xmake.io)

# How to install
Using [XMake](https://xmake.io), run `xmake install` in the repository to install the library. To skip installing the [Boost](https://github.com/boostorg/boost) library, run `xmake f --boost=n` beforehand.

# API
Include `html-parser/HTMLDocument.h`.

## HTMLDocument
The interface for parsing an HTML string and getting data from it.

### `HTMLDocument::HTMLDocument`
Construct an `HTMLDocument` object from a `std::istream` or a string.

```cpp
using namespace html_parser;

// explicit HTMLDocument::HTMLDocument(std::istream &)
HTMLDocument document1(std::cin);

// explicit HTMLDocument::HTMLDocument(std::istream &&)
HTMLDocument document2(std::ifstream("index.html"));

// explicit HTMLDocument::HTMLDocument(const std::string &)
HTMLDocument document3("a ≤ b");
```

### `HTMLDocument::parse`
Parse an HTML document from a new string, replacing the current document if one exists.

```cpp
using namespace html_parser;

HTMLDocument document(std::cin);

// void HTMLDocument::parse(const std::string &)
document.parse("a ≤ b");
```

### `HTMLDocument::inspect`
Print the colorized DOM tree of the HTML document to the terminal.

```cpp
using namespace html_parser;

HTMLDocument document("a ≤ b");

// void HTMLDocument::inspect()
document.inspect();
```

### `HTMLDocument::getTextContent`
Get all text in the document.

```cpp
using namespace html_parser;

HTMLDocument document("a ≤ bqwq");

// std::string HTMLDocument::getTextContent()
std::string textContent = document.getTextContent();
// textContent = "a ≤ bqwq"
```

### `HTMLDocument::getDirectTextContent`
Get the direct text of the document, that is, text not nested inside a child element.

```cpp
using namespace html_parser;

HTMLDocument document(" Don't want this text I want this text");

// std::string HTMLDocument::getDirectTextContent()
std::string directTextContent = document.getDirectTextContent();
// directTextContent = "I want this text"
```

### `HTMLDocument::getElementById`
Get the element whose `id` attribute equals a given string. Returns an `HTMLDocument::Element` object if found, or a null `HTMLDocument::Element` object if not found.

```cpp
using namespace html_parser;

HTMLDocument document("a ≤ b");

// HTMLDocument::Element HTMLDocument::getElementById(const std::string &)
HTMLDocument::Element div = document.getElementById("my-div");
```

### `HTMLDocument::getElementsByName`
Get all elements whose `name` attribute equals a given string. Returns a `std::vector<HTMLDocument::Element>` that contains all matching elements.

```cpp
using namespace html_parser;

HTMLDocument document("a ≤ bqwq");

// std::vector<HTMLDocument::Element> HTMLDocument::getElementsByName(const std::string &)
std::vector<HTMLDocument::Element> elements = document.getElementsByName("my");
```

### `HTMLDocument::getElementsByTagName`
Get all elements whose tag name equals a given string. Returns a `std::vector<HTMLDocument::Element>` that contains all matching elements.

```cpp
using namespace html_parser;

HTMLDocument document("a ≤ bqwq");

// std::vector<HTMLDocument::Element> HTMLDocument::getElementsByTagName(const std::string &)
std::vector<HTMLDocument::Element> elements = document.getElementsByTagName("div");
```

### `HTMLDocument::getElementsByClassName`
Get all elements that have a given class. Returns a `std::vector<HTMLDocument::Element>` that contains all matching elements.

```cpp
using namespace html_parser;

HTMLDocument document("a ≤ bqwq");

// std::vector<HTMLDocument::Element> HTMLDocument::getElementsByClassName(const std::string &)
std::vector<HTMLDocument::Element> elements = document.getElementsByClassName("my-class");
```

### `HTMLDocument::getChildren`
Get all child elements of the document on which it is called. Returns a `std::vector<HTMLDocument::Element>` that contains all the child elements.

```cpp
using namespace html_parser;

HTMLDocument document("FirstSecond");

// std::vector<HTMLDocument::Element> HTMLDocument::getChildren()
std::vector<HTMLDocument::Element> elements = document.getChildren();
```

## HTMLDocument::Element
The interface to get data from an HTML element or its subtree.

The default constructor constructs an empty element; any operation on an empty element throws a `std::invalid_argument` exception. Check it with `if (element)` first.

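The check-before-use rule above can be sketched in isolation. `MockElement`, `textOr`, and `demo` are hypothetical names written for this illustration and are not part of the library; they only mimic the documented behaviour (an `if (element)` test, and `std::invalid_argument` when an empty element is used).

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical stand-in for HTMLDocument::Element; NOT the library's class.
class MockElement {
public:
    MockElement() = default;  // empty ("null") element
    explicit MockElement(std::string text)
        : text_(std::move(text)), valid_(true) {}

    // Allows `if (element)` before any other call.
    explicit operator bool() const { return valid_; }

    // Mirrors the documented rule: using an empty element throws.
    std::string getTextContent() const {
        if (!valid_) throw std::invalid_argument("operation on empty element");
        return text_;
    }

private:
    std::string text_;
    bool valid_ = false;
};

// Guarded access: test the element first and fall back if it is empty.
inline std::string textOr(const MockElement &element, const std::string &fallback) {
    if (element) return element.getTextContent();
    return fallback;
}

// Small usage example for the guard.
inline void demo() {
    MockElement missing;  // e.g. what a failed lookup might hand back
    assert(!missing);
    assert(textOr(missing, "n/a") == "n/a");
}
```

With the real library, the same guard would wrap the result of a lookup such as `getElementById` before calling any member function on it.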
### `HTMLDocument::Element::inspect`
Print the colorized DOM tree of this element to the terminal.

```cpp
using namespace html_parser;

HTMLDocument document("a ≤ b");
HTMLDocument::Element element = document.getElementById("wrapper");

// void HTMLDocument::Element::inspect()
element.inspect();
```

### `HTMLDocument::Element::getTextContent`
Get all text in the element.

```cpp
using namespace html_parser;

HTMLDocument document("a ≤ bqwq");
HTMLDocument::Element element = document.getElementById("wrapper");

// std::string HTMLDocument::Element::getTextContent()
std::string textContent = element.getTextContent();
// textContent = "a ≤ b"
```

### `HTMLDocument::Element::getDirectTextContent`
Get the direct text of the element, that is, text not nested inside a child element.

```cpp
using namespace html_parser;

HTMLDocument document("Don't want this text I want this text");
HTMLDocument::Element element = document.getElementById("wrapper");

// std::string HTMLDocument::Element::getDirectTextContent()
std::string directTextContent = element.getDirectTextContent();
// directTextContent = "I want this text"
```

### `HTMLDocument::Element::getAttribute`
Get the attribute of the element with the specified name. Returns an empty string if not found.

```cpp
using namespace html_parser;

HTMLDocument document("");
HTMLDocument::Element element = document.getElementById("wrapper");

// std::string HTMLDocument::Element::getAttribute(const std::string &)
std::string value = element.getAttribute("data-url");
// value = "/qwq"
```

### `HTMLDocument::Element::getElementsByTagName`
Get all elements whose tag name equals a given string. Returns a `std::vector<HTMLDocument::Element>` that contains all matching elements.

```cpp
using namespace html_parser;

HTMLDocument document("a ≤ bqwq");
HTMLDocument::Element element = document.getElementById("wrapper");

// std::vector<HTMLDocument::Element> HTMLDocument::Element::getElementsByTagName(const std::string &)
std::vector<HTMLDocument::Element> elements = element.getElementsByTagName("div");
```

### `HTMLDocument::Element::getElementsByClassName`
Get all elements that have a given class. Returns a `std::vector<HTMLDocument::Element>` that contains all matching elements.

```cpp
using namespace html_parser;

HTMLDocument document("a ≤ bqwq");
HTMLDocument::Element element = document.getElementById("wrapper");

// std::vector<HTMLDocument::Element> HTMLDocument::Element::getElementsByClassName(const std::string &)
std::vector<HTMLDocument::Element> elements = element.getElementsByClassName("my-class");
```

### `HTMLDocument::Element::getChildren`
Get all child elements of the element on which it is called. Returns a `std::vector<HTMLDocument::Element>` that contains all the child elements.

```cpp
using namespace html_parser;

HTMLDocument document("FirstSecond");
HTMLDocument::Element element = document.getElementById("wrapper");

// std::vector<HTMLDocument::Element> HTMLDocument::Element::getChildren()
std::vector<HTMLDocument::Element> elements = element.getChildren();
```
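
The difference between `getChildren` (direct children only) and a lookup such as `getElementsByTagName` can be illustrated on a tiny stand-alone tree. This is a sketch under assumptions: `Node`, `directChildren`, `collectByTag`, and `demo` are hypothetical names, the tree is not the library's internal representation, and it assumes that tag-name lookup searches the whole subtree, as in the standard DOM.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical minimal DOM node; NOT the library's internal representation.
struct Node {
    std::string tag;
    std::vector<Node> children;
};

// Direct children only, analogous to getChildren().
inline std::vector<const Node *> directChildren(const Node &node) {
    std::vector<const Node *> out;
    for (const Node &child : node.children) out.push_back(&child);
    return out;
}

// Depth-first search over the whole subtree (assumed behaviour of a
// tag-name lookup); matches are appended to `out` in document order.
inline void collectByTag(const Node &node, const std::string &tag,
                         std::vector<const Node *> &out) {
    for (const Node &child : node.children) {
        if (child.tag == tag) out.push_back(&child);
        collectByTag(child, tag, out);
    }
}

// Small usage example on the tree div -> (p, section -> p).
inline void demo() {
    Node root{"div", {Node{"p", {}}, Node{"section", {Node{"p", {}}}}}};

    // Two direct children: the first p and the section.
    assert(directChildren(root).size() == 2);

    // Two p elements in the subtree, one of them nested in the section.
    std::vector<const Node *> found;
    collectByTag(root, "p", found);
    assert(found.size() == 2);
}
```

The nested `p` is reachable only through the subtree search, which is why a children-only call and a tag-name lookup can return different sets for the same element.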