https://github.com/Chi-EEE/html-parser
C++ HTML parser that generates a simple DOM tree in C++17
boost cpp cpp-html-parser cpp17 css html html-parser html-parser-library html5 library parser scraping xmake
- Host: GitHub
- URL: https://github.com/Chi-EEE/html-parser
- Owner: Chi-EEE
- License: unlicense
- Fork: true (Menci/html-parser)
- Created: 2023-02-18T09:45:55.000Z (almost 3 years ago)
- Default Branch: master
- Last Pushed: 2023-12-19T10:29:57.000Z (almost 2 years ago)
- Last Synced: 2024-10-24T13:58:59.603Z (about 1 year ago)
- Topics: boost, cpp, cpp-html-parser, cpp17, css, html, html-parser, html-parser-library, html5, library, parser, scraping, xmake
- Language: C++
- Homepage:
- Size: 95.7 KB
- Stars: 1
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# HTML Parser
# Requirements
* [XMake](https://xmake.io)
# How to install
Using [XMake](https://xmake.io), run `xmake install` in the repository to install the library.
(Run `xmake f --boost=n` beforehand to disable installing the [Boost](https://github.com/boostorg/boost) library.)
# API
Include `html-parser/HTMLDocument.h`.
## HTMLDocument
The interface for parsing an HTML string and getting data from it.
### `HTMLDocument::HTMLDocument`
Construct an `HTMLDocument` object from a `std::istream` or a string.
```cpp
using namespace html_parser;
// explicit HTMLDocument::HTMLDocument(std::istream &)
HTMLDocument document1(std::cin);
// explicit HTMLDocument::HTMLDocument(std::istream &&)
HTMLDocument document2(std::ifstream("index.html"));
// explicit HTMLDocument::HTMLDocument(const std::string &)
HTMLDocument document3("<div>a ≤ b</div>");
```
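The stream constructors imply that the whole input is read before parsing. The README does not say how the library does this, so the following is only a minimal sketch of one common way to do it. `readAll` is a hypothetical helper, not part of html-parser.

```cpp
#include <istream>
#include <iterator>
#include <sstream>
#include <string>

// Hypothetical helper (not part of html-parser): read everything that is
// left in a stream into a single string.
std::string readAll(std::istream &in) {
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}

// Overload for temporaries such as std::ifstream("index.html").
// Inside the function `in` is an lvalue, so this forwards to the overload above.
std::string readAll(std::istream &&in) {
    return readAll(in);
}
```

Providing both overloads mirrors the `std::istream &` and `std::istream &&` constructor signatures shown above, so that named streams and temporaries are both accepted.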
### `HTMLDocument::parse`
Parse an HTML document from a new string, replacing the current document if one exists.
```cpp
using namespace html_parser;
HTMLDocument document(std::cin);
// void HTMLDocument::parse(const std::string &)
document.parse("<div>a ≤ b</div>");
```
### `HTMLDocument::inspect`
Print the colorized DOM tree of the HTML document to the terminal.
```cpp
using namespace html_parser;
HTMLDocument document("<div>a ≤ b</div>");
// void HTMLDocument::inspect()
document.inspect();
```
### `HTMLDocument::getTextContent`
Get all text in the document.
```cpp
using namespace html_parser;
HTMLDocument document("<div>a ≤ b</div><span>qwq</span>");
// std::string HTMLDocument::getTextContent()
std::string textContent = document.getTextContent();
// textContent = "a ≤ bqwq"
```
### `HTMLDocument::getDirectTextContent`
Get the text that sits directly in the document, excluding text inside nested elements.
```cpp
using namespace html_parser;
HTMLDocument document(" Don't want this text I want this text");
// std::string HTMLDocument::getDirectTextContent()
std::string directTextContent = document.getDirectTextContent();
// directTextContent = "I want this text"
```
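The difference between `getTextContent` and `getDirectTextContent` can be sketched on a tiny tree. The `Node` type and the helper functions below are hypothetical and exist only for illustration. The library's internal representation and its whitespace handling are not described in this README and may differ.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical node type for illustration only.
struct Node {
    bool isText = false;
    std::string text;            // used when isText is true
    std::vector<Node> children;  // used when isText is false
};

Node text(std::string s) {
    Node n;
    n.isText = true;
    n.text = std::move(s);
    return n;
}

Node element(std::vector<Node> kids) {
    Node n;
    n.children = std::move(kids);
    return n;
}

// All text in the subtree, in document order.
std::string textContent(const Node &n) {
    if (n.isText) return n.text;
    std::string out;
    for (const Node &child : n.children) out += textContent(child);
    return out;
}

// Only the text nodes that are immediate children of n.
std::string directTextContent(const Node &n) {
    std::string out;
    for (const Node &child : n.children)
        if (child.isText) out += child.text;
    return out;
}
```

In this sketch the full text content recurses into every child, while the direct text content looks only one level down and skips element children.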
### `HTMLDocument::getElementById`
Get the element whose `id` attribute equals a given string. Return an `HTMLDocument::Element` object if found, or a null `HTMLDocument::Element` object if not found.
```cpp
using namespace html_parser;
HTMLDocument document("<div id=\"my-div\">a ≤ b</div>");
// HTMLDocument::Element HTMLDocument::getElementById(const std::string &)
HTMLDocument::Element div = document.getElementById("my-div");
```
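An `id` lookup can be thought of as a depth-first search that stops at the first match. The sketch below uses a hypothetical `Elem` struct and a hypothetical `findById` function. It is not the library's implementation and returns a raw pointer instead of an `HTMLDocument::Element`.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical element type for illustration only.
struct Elem {
    std::string tag;
    std::map<std::string, std::string> attributes;
    std::vector<Elem> children;
};

// Depth-first search: return the first element whose id matches,
// or nullptr when no element in the subtree has that id.
const Elem *findById(const Elem &root, const std::string &id) {
    auto it = root.attributes.find("id");
    if (it != root.attributes.end() && it->second == id) return &root;
    for (const Elem &child : root.children)
        if (const Elem *hit = findById(child, id)) return hit;
    return nullptr;
}
```

The `nullptr` result here plays the role of the null `HTMLDocument::Element` that the real API returns when nothing matches.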
### `HTMLDocument::getElementsByName`
Get all elements whose `name` attribute equals a given string. Return a `std::vector` that contains all matching elements.
```cpp
using namespace html_parser;
HTMLDocument document("<div name=\"my\">a ≤ b</div><span name=\"my\">qwq</span>");
// std::vector<HTMLDocument::Element> HTMLDocument::getElementsByName(const std::string &)
std::vector<HTMLDocument::Element> elements = document.getElementsByName("my");
```
### `HTMLDocument::getElementsByTagName`
Get all elements whose tag name equals a given string. Return a `std::vector` that contains all matching elements.
```cpp
using namespace html_parser;
HTMLDocument document("<div>a ≤ b</div><span>qwq</span>");
// std::vector<HTMLDocument::Element> HTMLDocument::getElementsByTagName(const std::string &)
std::vector<HTMLDocument::Element> elements = document.getElementsByTagName("div");
```
### `HTMLDocument::getElementsByClassName`
Get all elements that have a given class. Return a `std::vector` that contains all matching elements.
```cpp
using namespace html_parser;
HTMLDocument document("<div class=\"my-class\">a ≤ b</div><span>qwq</span>");
// std::vector<HTMLDocument::Element> HTMLDocument::getElementsByClassName(const std::string &)
std::vector<HTMLDocument::Element> elements = document.getElementsByClassName("my-class");
```
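In HTML, a `class` attribute holds a whitespace-separated list of class names, so "having a certain class" usually means matching one token in that list rather than the whole attribute value. The README does not state how html-parser performs the comparison. The hypothetical `hasClass` helper below is a sketch of the token-based check under that assumption.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper (not part of html-parser): true when className is one
// of the whitespace-separated tokens in the value of a class attribute.
bool hasClass(const std::string &classAttribute, const std::string &className) {
    std::istringstream tokens(classAttribute);
    std::string token;
    while (tokens >> token)
        if (token == className) return true;
    return false;
}
```

With token matching, `class="btn my-class"` matches `my-class`, while `class="my-class-2"` does not.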
### `HTMLDocument::getChildren`
Get all child elements of the element upon which it was called. Return a `std::vector` that contains all the child elements.
```cpp
using namespace html_parser;
HTMLDocument document("FirstSecond");
// std::vector<HTMLDocument::Element> HTMLDocument::getChildren()
std::vector<HTMLDocument::Element> elements = document.getChildren();
```
## HTMLDocument::Element
The interface for getting data from an HTML element or its subtree.
The default constructor constructs an empty element; any operation on an empty element throws a `std::invalid_argument` exception. Check an element with `if (element)` before using it.
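The empty-element behavior described above resembles a handle with an `explicit operator bool`. The sketch below shows that pattern in isolation. `RawNode` and `ElementHandle` are hypothetical names, and the real `HTMLDocument::Element` may be implemented differently.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical stand-in for whatever node an element refers to.
struct RawNode {
    std::string text;
};

// Hypothetical handle type showing the "check before use" pattern.
class ElementHandle {
    const RawNode *node = nullptr;

public:
    ElementHandle() = default;  // empty handle
    explicit ElementHandle(const RawNode *n) : node(n) {}

    // Allows `if (element)` without permitting implicit conversions elsewhere.
    explicit operator bool() const { return node != nullptr; }

    std::string getTextContent() const {
        if (!node) throw std::invalid_argument("operation on an empty element");
        return node->text;
    }
};
```

Callers that skip the `if (element)` check still get a well-defined failure in the form of an exception instead of undefined behavior.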
### `HTMLDocument::Element::inspect`
Print the colorized DOM tree of this element to the terminal.
```cpp
using namespace html_parser;
HTMLDocument document("<div id=\"wrapper\">a ≤ b</div>");
HTMLDocument::Element element = document.getElementById("wrapper");
// void HTMLDocument::Element::inspect()
element.inspect();
```
### `HTMLDocument::Element::getTextContent`
Get all text in the element.
```cpp
using namespace html_parser;
HTMLDocument document("<div id=\"wrapper\">a ≤ b</div><div>qwq</div>");
HTMLDocument::Element element = document.getElementById("wrapper");
// std::string HTMLDocument::Element::getTextContent()
std::string textContent = element.getTextContent();
// textContent = "a ≤ b"
```
### `HTMLDocument::Element::getDirectTextContent`
Get the text that sits directly in the element, excluding text inside nested elements.
```cpp
using namespace html_parser;
HTMLDocument document("<div id=\"wrapper\"><span>Don't want this text </span>I want this text</div>");
HTMLDocument::Element element = document.getElementById("wrapper");
// std::string HTMLDocument::Element::getDirectTextContent()
std::string directTextContent = element.getDirectTextContent();
// directTextContent = "I want this text"
```
### `HTMLDocument::Element::getAttribute`
Get the attribute with the specified name from the element. Return an empty string if the attribute is not found.
```cpp
using namespace html_parser;
HTMLDocument document("<div id=\"wrapper\" data-url=\"/qwq\"></div>");
HTMLDocument::Element element = document.getElementById("wrapper");
// std::string HTMLDocument::Element::getAttribute(const std::string &)
std::string value = element.getAttribute("data-url");
// value = "/qwq"
```
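Returning an empty string for a missing attribute can be sketched as a map lookup with a default. The `attributeOrEmpty` function below is hypothetical, and storing attributes in a `std::map` is an assumption made only for this example.

```cpp
#include <map>
#include <string>

// Hypothetical helper (not part of html-parser): look up an attribute by
// name and fall back to an empty string when it is absent.
std::string attributeOrEmpty(const std::map<std::string, std::string> &attributes,
                             const std::string &name) {
    auto it = attributes.find(name);
    return it == attributes.end() ? std::string() : it->second;
}
```

One consequence of this convention is that a caller cannot tell a missing attribute apart from an attribute whose value is the empty string.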
### `HTMLDocument::Element::getElementsByTagName`
Get all elements in this element's subtree whose tag name equals a given string. Return a `std::vector` that contains all matching elements.
```cpp
using namespace html_parser;
HTMLDocument document("<div id=\"wrapper\"><div>a ≤ b</div><div>qwq</div></div>");
HTMLDocument::Element element = document.getElementById("wrapper");
// std::vector<HTMLDocument::Element> HTMLDocument::Element::getElementsByTagName(const std::string &)
std::vector<HTMLDocument::Element> elements = element.getElementsByTagName("div");
```
### `HTMLDocument::Element::getElementsByClassName`
Get all elements in this element's subtree that have a given class. Return a `std::vector` that contains all matching elements.
```cpp
using namespace html_parser;
HTMLDocument document("<div id=\"wrapper\"><div class=\"my-class\">a ≤ b</div><div>qwq</div></div>");
HTMLDocument::Element element = document.getElementById("wrapper");
// std::vector<HTMLDocument::Element> HTMLDocument::Element::getElementsByClassName(const std::string &)
std::vector<HTMLDocument::Element> elements = element.getElementsByClassName("my-class");
```
### `HTMLDocument::Element::getChildren`
Get all child elements of the element upon which it was called. Return a `std::vector` that contains all the child elements.
```cpp
using namespace html_parser;
HTMLDocument document("<div id=\"wrapper\"><span>First</span><span>Second</span></div>");
HTMLDocument::Element element = document.getElementById("wrapper");
// std::vector<HTMLDocument::Element> HTMLDocument::Element::getChildren()
std::vector<HTMLDocument::Element> elements = element.getChildren();
```