C++ HTML parser that generates a simple DOM tree.
https://github.com/Chi-EEE/html-parser


# HTML Parser
# Requirements
* [XMake](https://xmake.io)

# How to install
Using [XMake](https://xmake.io), run `xmake install` in the repository root to install the library.

(To skip installing the [Boost](https://github.com/boostorg/boost) library, run `xmake f --boost=n` beforehand.)

# API
Include `html-parser/HTMLDocument.h`.

## HTMLDocument
The interface for parsing an HTML string and retrieving data from it.

### `HTMLDocument::HTMLDocument`
Construct an `HTMLDocument` object from a `std::istream` or a string.

```cpp
using namespace html_parser;

// explicit HTMLDocument::HTMLDocument(std::istream &)
HTMLDocument document1(std::cin);

// explicit HTMLDocument::HTMLDocument(std::istream &&)
HTMLDocument document2(std::ifstream("index.html"));

// explicit HTMLDocument::HTMLDocument(const std::string &)
HTMLDocument document3("<div>a ≤ b</div>");
```

### `HTMLDocument::parse`
Parse an HTML document from a new string, replacing the current document if one exists.

```cpp
using namespace html_parser;

HTMLDocument document(std::cin);

// void HTMLDocument::parse(const std::string &)
document.parse("<div>a ≤ b</div>");
```

### `HTMLDocument::inspect`
Print the colorized DOM tree of the HTML document to the terminal.

```cpp
using namespace html_parser;

HTMLDocument document("<div>a ≤ b</div>");

// void HTMLDocument::inspect()
document.inspect();
```

### `HTMLDocument::getTextContent`
Get all text in the document.

```cpp
using namespace html_parser;

HTMLDocument document("<div>a ≤ b</div><div>qwq</div>");

// std::string HTMLDocument::getTextContent()
std::string textContent = document.getTextContent();
// textContent = "a ≤ bqwq"
```

### `HTMLDocument::getDirectTextContent`
Get the direct text of the document, that is, text that is not nested inside a child element.

```cpp
using namespace html_parser;

HTMLDocument document("<span>Don't want this text</span>I want this text");

// std::string HTMLDocument::getDirectTextContent()
std::string directTextContent = document.getDirectTextContent();
// directTextContent = "I want this text"
```
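
To make the difference between the two text getters concrete, here is a minimal, self-contained sketch on a hypothetical tree. `Node`, `textContent`, `directTextContent`, and the helper functions are illustrative names only, not part of this library, and the library's actual behavior (for example around whitespace) may differ.

```cpp
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical node type for illustration; not the library's internal representation.
struct Node {
    std::string tag;                              // empty for text nodes
    std::string text;                             // used only by text nodes
    std::vector<std::unique_ptr<Node>> children;  // child nodes in document order
};

// Build a text node.
std::unique_ptr<Node> makeText(const std::string &text) {
    auto node = std::make_unique<Node>();
    node->text = text;
    return node;
}

// Build an element node with the given tag.
std::unique_ptr<Node> makeElement(const std::string &tag) {
    auto node = std::make_unique<Node>();
    node->tag = tag;
    return node;
}

// Concatenate the text of every text node in the subtree, depth-first.
std::string textContent(const Node &node) {
    if (node.tag.empty()) return node.text;
    std::string result;
    for (const auto &child : node.children) result += textContent(*child);
    return result;
}

// Concatenate only the text nodes that are immediate children of `node`.
std::string directTextContent(const Node &node) {
    std::string result;
    for (const auto &child : node.children)
        if (child->tag.empty()) result += child->text;
    return result;
}

// Tree for: <div><span>Don't want this text</span>I want this text</div>
std::unique_ptr<Node> buildSample() {
    auto root = makeElement("div");
    auto span = makeElement("span");
    span->children.push_back(makeText("Don't want this text"));
    root->children.push_back(std::move(span));
    root->children.push_back(makeText("I want this text"));
    return root;
}
```

On the sample tree, `textContent` includes the text nested inside the `span`, while `directTextContent` returns only the text that sits directly under the root.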

### `HTMLDocument::getElementById`
Get the element whose `id` attribute equals the given string. Returns an `HTMLDocument::Element` object if found, or a null `HTMLDocument::Element` object if not found.

```cpp
using namespace html_parser;

HTMLDocument document(R"(<div id="my-div">a ≤ b</div>)");

// HTMLDocument::Element HTMLDocument::getElementById(const std::string &)
HTMLDocument::Element div = document.getElementById("my-div");
```

### `HTMLDocument::getElementsByName`
Get all elements whose `name` attribute equals the given string. Returns a `std::vector` that contains all matching elements.

```cpp
using namespace html_parser;

HTMLDocument document(R"(<div name="my">a ≤ b</div><span name="my">qwq</span>)");

// std::vector<HTMLDocument::Element> HTMLDocument::getElementsByName(const std::string &)
std::vector<HTMLDocument::Element> elements = document.getElementsByName("my");
```

### `HTMLDocument::getElementsByTagName`
Get all elements whose tag name equals the given string. Returns a `std::vector` that contains all matching elements.

```cpp
using namespace html_parser;

HTMLDocument document("<div>a ≤ b</div><div>qwq</div>");

// std::vector<HTMLDocument::Element> HTMLDocument::getElementsByTagName(const std::string &)
std::vector<HTMLDocument::Element> elements = document.getElementsByTagName("div");
```
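
A tag-name lookup of this kind is typically a depth-first walk over the DOM tree. The sketch below shows that idea on a hypothetical `Node` type; `Node`, `collectByTagName`, `elementsByTagName`, and `buildSample` are illustrative names, not the library's API. It compares tag names exactly and does not claim to match how the library handles letter case.

```cpp
#include <string>
#include <vector>

// Hypothetical element type for illustration only.
struct Node {
    std::string tag;
    std::vector<Node> children;  // std::vector of an incomplete type is allowed since C++17
};

// Append every descendant of `node` whose tag equals `tagName`, in document order.
void collectByTagName(const Node &node, const std::string &tagName,
                      std::vector<const Node *> &out) {
    for (const Node &child : node.children) {
        if (child.tag == tagName) out.push_back(&child);
        collectByTagName(child, tagName, out);  // recurse into the subtree
    }
}

// Convenience wrapper that returns the matches as a vector of pointers.
std::vector<const Node *> elementsByTagName(const Node &root, const std::string &tagName) {
    std::vector<const Node *> result;
    collectByTagName(root, tagName, result);
    return result;
}

// Sample tree: <body><div><span></span><div></div></div><p></p></body>
Node buildSample() {
    return Node{"body", {
        Node{"div", {Node{"span", {}}, Node{"div", {}}}},
        Node{"p", {}},
    }};
}
```

Searching the sample tree for `"div"` finds both the outer `div` and the one nested inside it, in document order.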

### `HTMLDocument::getElementsByClassName`
Get all elements that have the given class. Returns a `std::vector` that contains all matching elements.

```cpp
using namespace html_parser;

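HTMLDocument document(R"(<div class="my-class">a ≤ b</div><div class="my-class">qwq</div>)");

// std::vector<HTMLDocument::Element> HTMLDocument::getElementsByClassName(const std::string &)
std::vector<HTMLDocument::Element> elements = document.getElementsByClassName("my-class");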
```
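
In HTML, a `class` attribute holds a whitespace-separated list of class names, so "having a class" usually means matching one whole token rather than a substring. The following sketch illustrates that matching rule; `hasClass` is a hypothetical helper, and whether this library matches classes in exactly this way is an assumption.

```cpp
#include <sstream>
#include <string>

// Return true if the whitespace-separated `classAttribute` contains
// `className` as a complete token (not merely as a substring).
bool hasClass(const std::string &classAttribute, const std::string &className) {
    std::istringstream tokens(classAttribute);
    std::string token;
    while (tokens >> token) {  // operator>> skips any run of whitespace
        if (token == className) return true;
    }
    return false;
}
```

With this rule, `class="box my-class"` matches `my-class`, but `class="my-class-extra"` does not.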

### `HTMLDocument::getChildren`
Get all child elements of the document. Returns a `std::vector` that contains all the child elements.

```cpp
using namespace html_parser;

HTMLDocument document("<span>First</span><span>Second</span>");

// std::vector<HTMLDocument::Element> HTMLDocument::getChildren()
std::vector<HTMLDocument::Element> elements = document.getChildren();
```

## HTMLDocument::Element
The interface for retrieving data from an HTML element or its subtree.

The default constructor constructs an empty element. Any operation on an empty element throws a `std::invalid_argument` exception, so check it with `if (element)` first.
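
The empty-element behavior described above can be pictured with a small stand-in class. This is only a sketch of the pattern (an explicit `operator bool` plus accessors that throw on an empty handle); `ElementHandle` and `textOr` are hypothetical names and do not reflect the library's actual implementation.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical stand-in for an element handle; for illustration only.
class ElementHandle {
public:
    ElementHandle() = default;  // constructs an empty handle
    explicit ElementHandle(const std::string &text) : text_(text), valid_(true) {}

    // Lets callers write `if (element)` to test whether the handle is non-empty.
    explicit operator bool() const { return valid_; }

    // Accessors refuse to operate on an empty handle.
    std::string getTextContent() const {
        if (!valid_) throw std::invalid_argument("operation on an empty element");
        return text_;
    }

private:
    std::string text_;
    bool valid_ = false;
};

// Return the text of `element`, or `fallback` when the handle is empty.
std::string textOr(const ElementHandle &element, const std::string &fallback) {
    if (element) return element.getTextContent();
    return fallback;
}
```

Checking the handle before use, as `textOr` does, avoids the exception entirely; calling an accessor on a default-constructed handle throws `std::invalid_argument`.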

### `HTMLDocument::Element::inspect`
Print the colorized DOM tree of this element to the terminal.

```cpp
using namespace html_parser;

HTMLDocument document(R"(<div id="wrapper">a ≤ b</div>)");
HTMLDocument::Element element = document.getElementById("wrapper");

// void HTMLDocument::Element::inspect()
element.inspect();
```

### `HTMLDocument::Element::getTextContent`
Get all text in the element.

```cpp
using namespace html_parser;

HTMLDocument document(R"(<div><div id="wrapper">a ≤ b</div><div>qwq</div></div>)");
HTMLDocument::Element element = document.getElementById("wrapper");

// std::string HTMLDocument::Element::getTextContent()
std::string textContent = element.getTextContent();
// textContent = "a ≤ b"
```

### `HTMLDocument::Element::getDirectTextContent`
Get the direct text of the element, that is, text that is not nested inside a child element.

```cpp
using namespace html_parser;

HTMLDocument document(R"(<div id="wrapper"><span>Don't want this text</span>I want this text</div>)");
HTMLDocument::Element element = document.getElementById("wrapper");

// std::string HTMLDocument::Element::getDirectTextContent()
std::string directTextContent = element.getDirectTextContent();
// directTextContent = "I want this text"
```

### `HTMLDocument::Element::getAttribute`
Get the attribute with the specified name from the element. Returns an empty string if the attribute is not found.

```cpp
using namespace html_parser;

HTMLDocument document(R"(<div id="wrapper" data-url="/qwq"></div>)");
HTMLDocument::Element element = document.getElementById("wrapper");

// std::string HTMLDocument::Element::getAttribute(const std::string &)
std::string value = element.getAttribute("data-url");
// value = "/qwq"
```
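
Returning an empty string for a missing attribute means a caller cannot tell "absent" from "present but empty" using the return value alone. The sketch below shows that lookup convention on a plain `std::map`; `AttributeMap` and `attributeOrEmpty` are hypothetical names, and how the library stores attributes internally is not something this README states.

```cpp
#include <map>
#include <string>

// Hypothetical attribute storage for illustration: name -> value.
using AttributeMap = std::map<std::string, std::string>;

// Look up `name`; return its value, or an empty string when it is absent.
std::string attributeOrEmpty(const AttributeMap &attributes, const std::string &name) {
    auto it = attributes.find(name);
    return it == attributes.end() ? std::string() : it->second;
}
```

Under this convention, both a missing attribute and an attribute with an empty value produce `""`.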

### `HTMLDocument::Element::getElementsByTagName`
Get all elements in this element's subtree whose tag name equals the given string. Returns a `std::vector` that contains all matching elements.

```cpp
using namespace html_parser;

HTMLDocument document(R"(<div id="wrapper"><div>a ≤ b</div><div>qwq</div></div>)");
HTMLDocument::Element element = document.getElementById("wrapper");

// std::vector<HTMLDocument::Element> HTMLDocument::Element::getElementsByTagName(const std::string &)
std::vector<HTMLDocument::Element> elements = element.getElementsByTagName("div");
```

### `HTMLDocument::Element::getElementsByClassName`
Get all elements in this element's subtree that have the given class. Returns a `std::vector` that contains all matching elements.

```cpp
using namespace html_parser;

HTMLDocument document(R"(<div id="wrapper"><div class="my-class">a ≤ b</div><div class="my-class">qwq</div></div>)");
HTMLDocument::Element element = document.getElementById("wrapper");

// std::vector<HTMLDocument::Element> HTMLDocument::Element::getElementsByClassName(const std::string &)
std::vector<HTMLDocument::Element> elements = element.getElementsByClassName("my-class");
```

### `HTMLDocument::Element::getChildren`
Get all child elements of the element on which it is called. Returns a `std::vector` that contains all the child elements.

```cpp
using namespace html_parser;

HTMLDocument document(R"(<div id="wrapper"><span>First</span><span>Second</span></div>)");
HTMLDocument::Element element = document.getElementById("wrapper");

// std::vector<HTMLDocument::Element> HTMLDocument::Element::getChildren()
std::vector<HTMLDocument::Element> elements = element.getChildren();
```