# HPScraper

HPScraper is a high-performance, event-driven utility for asynchronously fetching and processing web content. It manages many requests concurrently while remaining efficient and robust.

[![Build](https://github.com/shantanu-verma-salpro/HPScrapper/actions/workflows/build.yml/badge.svg)](https://github.com/shantanu-verma-salpro/HPScrapper/actions/workflows/build.yml)
![GitHub commit activity (branch)](https://img.shields.io/github/commit-activity/m/shantanu-verma-salpro/HPScrapper)
![GitHub](https://img.shields.io/github/license/shantanu-verma-salpro/HPScrapper)

## 🌟 Features

- **Flexible DOM Handling**: Offers a wrapper over `lexbor`, providing a JS-like API for the HTML DOM.
- **Highly Asynchronous**: Designed for non-blocking operations.
- **Protocol Support**: Works with HTTP/1, HTTP/1.1, and HTTP/2.
- **Advanced Network Features**: Supports proxies, authentication, and POST requests.
- **Optimized**: Benefits from a preallocated connection pool.
- **User-Friendly**: Clean and straightforward API.
- **Platform Compatibility**: Cross-platform support.
- **Highly Efficient**: Uses epoll, kqueue, Windows IOCP, Solaris event ports, and Linux io_uring via `libuv`.
- **Extensible**: Provides wrappers for `curl`, `libuv`, and `lexbor` for independent use.
- **Modern C++**: Harnesses modern C++ capabilities for optimized memory management and a more intuitive API.

## 🚀 Remaining Tasks

Here's our prioritized roadmap:

1. โš™๏ธ **Error Management**: Better error insights and debugging.
2. ๐ŸŒ **Headless Chrome**: Access JS-rendered pages efficiently.
3. ๐Ÿ“š **Expand Documentation**: Cover all features and use-cases.
4. ๐Ÿงช **CI/CD Improvements**: Streamline updates.
5. ๐Ÿ **Performance Benchmarks**: Compare against competitors.

## 📚 Table of Contents

- [Prerequisites](#prerequisites)
- [Getting Started](#getting-started)
- [Advanced Options](#advanced-options)
- [Example Usage](#example-usage)
- [Contributing](#contributing)
- [License](#license)

## 🛠 Prerequisites

Before diving in, ensure you have installed the following libraries:

- `libcurl`
- `libuv`
- `liblexbor`

## 🚀 Getting Started

### 1. **Compilation**:

`g++ your_source_file.cpp -o scraper -lcurl -luv -llexbor -Ofast`
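
Depending on where the libraries are installed, you may also need `-I`/`-L` flags (or your platform's `pkg-config` output) on the command line.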

### 2. **Initialization**:

Initialize the `Async` instance:

```cpp
constexpr int concurrent_connections = 200, max_host_connections = 10;
std::unique_ptr<Async> scraper = std::make_unique<Async>(concurrent_connections, max_host_connections);
```
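
Judging by the parameter names, `concurrent_connections` caps the total number of in-flight transfers and `max_host_connections` limits how many of them may target the same host at once.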

### 3. **Configuration**:

Customize as needed:

```cpp
scraper->setUserAgent("Scraper/1.1");
scraper->setMultiplexing(true);
scraper->setHttpVersion(HTTP::HTTP2);
```
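
With `HTTP::HTTP2` selected, enabling multiplexing should let concurrent requests to the same host share a single connection rather than opening one per request.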

For additional settings:

```cpp
scraper->setVerbose(true);
scraper->setProxy("188.87.102.128", 3128);
```

### 4. **Seed URL**:

Start your scraping journey:

```cpp
scraper->seed("https://www.google.com/");
```

### 5. **Event Management**:

Incorporate custom event handlers:

```cpp
scraper->onSuccess([](const CurlEasyHandle::Response& response, Async& instance, Document& page) {
// Process the response...
});
```
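
For instance, a handler can turn the scraper into a crawler by enqueuing the links it finds. The sketch below is illustrative: the `instance.seed(...)` call is an assumption based on the `Async&` parameter the handler receives, and `onException` (shown in the full example later) catches errors thrown during processing:

```cpp
scraper->onSuccess([](const CurlEasyHandle::Response& response, Async& instance, Document& page) {
    // Hypothetical crawl pattern: enqueue every link discovered on the page.
    // Assumes seed() can be called on the Async& passed into the handler.
    auto links = page.rootElement()->getLinksMatching("");
    for (const auto& link : *links) instance.seed(link);
});

scraper->onException([](const std::exception& e, Async& instance) {
    std::cerr << "Exception encountered: " << e.what() << '\n';
});
```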

### 6. **Execution**:

Get the scraper running:

```cpp
scraper->run();
```
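
`run()` presumably drives the underlying libuv event loop, blocking until all queued transfers and their handlers have completed.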

## โš™๏ธ Advanced Options

HPScraper offers a myriad of options to fine-tune your scraping experience:

- `setMultiplexing(bool)`: Enable or disable HTTP/2 multiplexing.
- `setHttpVersion(HTTP)`: Opt for your preferred HTTP version.
- More options are available in the detailed documentation.

## 📌 Example Usage

### Using Scraper

```cpp
int main() {
    constexpr int concurrent_connections = 200, max_host_connections = 10;
    std::unique_ptr<Async> scraper = std::make_unique<Async>(concurrent_connections, max_host_connections);
    scraper->setUserAgent("Scraper/1.1");
    scraper->setMultiplexing(true);
    scraper->setHttpVersion(HTTP::HTTP2);
    //scraper->setVerbose(true);
    //scraper->setProxy("188.87.102.128", 3128);
    scraper->seed("https://www.google.com/");

    scraper->onSuccess([](const CurlEasyHandle::Response& response, Async& instance, Document& page) {
        std::cout << "URL: " << response.url << '\n';
        std::cout << "Received: " << response.bytesRecieved << " bytes\n";
        std::cout << "Content Type: " << response.contentType << '\n';
        std::cout << "Total Time: " << response.totalTime << '\n';
        std::cout << "HTTP Version: " << response.httpVersion << '\n';
        std::cout << "HTTP Method: " << response.httpMethod << '\n';
        std::cout << "Download speed: " << response.bytesPerSecondR << " bytes/sec\n";
        std::cout << "Header Size: " << response.headerSize << " bytes\n";

        auto body = page.rootElement();
        auto div = body->getElementsByTagName("div")->item(0);
        auto links = div->getLinksMatching("");

        for (auto i : *links.get()) std::cout << i << '\n';
    });

    scraper->onIdle([](long pending, Async& instance) {
        // Called when the loop has no pending work to process.
    });

    scraper->onException([](const std::exception& e, Async& instance) {
        std::cerr << "Exception encountered: " << e.what() << std::endl;
    });

    scraper->onFailure([](const CurlEasyHandle::Response& response, Async& instance) {
        // Inspect the response here to log or retry failed requests.
    });

    scraper->run();
}
```

### Using Parser

```cpp
int main() {
    // Illustrative markup, reconstructed to fit the queries below;
    // tag structure and attribute values are assumed.
    std::string htmlContent = R"(
        <html>
        <head><title>Test Page</title></head>
        <body>
            <div class="col-md">
                <div data-custom="value1">Text 1 inside col-md</div>
                <a href="http://example.com">Example Link</a>
                <div>Text 2 inside col-md</div>
            </div>
            <div class="col-md">
                <div>Text 3 inside col-md</div>
            </div>
        </body>
        </html>
    )";

    Parser parser;
    Document doc = parser.createDOM(htmlContent);

    auto root = doc.rootElement();
    auto colMdElements = root->getElementsByClassName("col-md");

    for (std::size_t i = 0; i < colMdElements->length(); ++i) {
        auto colMdNode = colMdElements->item(i);
        auto divElements = colMdNode->getElementsByTagName("div");

        for (std::size_t j = 0; j < divElements->length(); ++j) {
            auto divNode = divElements->item(j);
            std::cout << divNode->text() << '\n';

            if (divNode->hasAttributes()) {
                auto attributes = divNode->getAttributes();
                for (const auto& [attr, value] : *attributes) {
                    std::cout << "Attribute: " << attr << ", Value: " << value << std::endl;
                }
            }

            if (divNode->hasAttribute("data-custom")) {
                std::cout << "Data-custom attribute: " << divNode->getAttribute("data-custom") << '\n';
            }
        }

        if (colMdNode->hasChildElements()) {
            auto firstChild = colMdNode->firstChild();
            auto lastChild = colMdNode->lastChild();

            std::cout << "First child's text content: " << firstChild->text() << '\n';
            std::cout << "Last child's text content: " << lastChild->text() << '\n';
        }

        auto links = colMdNode->getLinksMatching("http://example.com");
        for (const auto& link : *links) {
            std::cout << "Matching Link: " << link << '\n';
        }
    }

    return 0;
}
```
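
For quick one-off queries, the same API condenses to a few lines. A minimal sketch using only calls from the example above; the inline markup is illustrative:

```cpp
Parser parser;
Document doc = parser.createDOM(R"(<div class="col-md">hello</div>)");

// Print the text of the first element with class "col-md".
std::cout << doc.rootElement()->getElementsByClassName("col-md")->item(0)->text() << '\n';
```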

### Using Eventloop

```cpp
int main() {
    EventLoop eventLoop;

    // Fires once after a 2-second delay.
    TimerWrapper timer(eventLoop);
    timer.on([](const TimerEvent&, TimerWrapper&) {
        std::cout << "Timer triggered after 2 seconds!" << std::endl;
    });
    timer.start(2000, 0); // timeout of 2 seconds, no repeat

    // Runs on every loop iteration while the loop is otherwise idle.
    IdleWrapper idle(eventLoop);
    idle.on([](const IdleEvent&, IdleWrapper&) {
        std::cout << "Idle handler running..." << std::endl;
    });
    idle.start();

    // Stops the loop after 5 seconds.
    TimerWrapper stopTimer(eventLoop);
    stopTimer.on([&eventLoop](const TimerEvent&, TimerWrapper&) {
        std::cout << "Stopping event loop after 5 seconds..." << std::endl;
        eventLoop.stop();
    });
    stopTimer.start(5000, 0);

    eventLoop.run();

    return 0;
}

```
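
Judging by the names, `TimerWrapper` and `IdleWrapper` wrap libuv's `uv_timer_t` and `uv_idle_t` handles, so all callbacks execute single-threaded on the loop driven by `eventLoop.run()`.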

### Using Fetcher

```cpp
int main() {
    int buffer_size = 1024;
    long timeout = 5000;
    CurlEasyHandle curlHandle(buffer_size, timeout);

    curlHandle.setUrl("www.google.com");
    curlHandle.fetch([](CurlEasyHandle::Response* response) {
        std::cout << response->message;
    });
    return 0;
}

```
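
`CurlEasyHandle` appears to wrap a single libcurl easy transfer; judging by the constructor arguments, `buffer_size` sizes the receive buffer and `timeout` caps the transfer time, presumably in milliseconds.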

See the `examples` directory for more complete programs.

## ๐Ÿค Contributing

We appreciate contributions! For significant changes, please open an issue first to start a discussion.

## 📄 License

HPScraper is licensed under the [MIT License](https://choosealicense.com/licenses/mit/).

## Acknowledgements

This software uses the following libraries:

- `libcurl`: licensed under the curl license (MIT-style).
- `libuv`: licensed under the MIT License.
- `liblexbor`: licensed under the Apache License, Version 2.0.

When using `HPScraper`, please ensure you comply with the requirements and conditions of all included licenses.