Web Scraper in C++ built on libuv, libcurl and lexbor
https://github.com/shantanu-verma-salpro/hpscrapper
- Host: GitHub
- URL: https://github.com/shantanu-verma-salpro/hpscrapper
- Owner: shantanu-verma-salpro
- License: MIT
- Created: 2023-10-05T16:18:30.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2023-10-07T10:56:23.000Z (about 1 year ago)
- Last Synced: 2024-10-21T02:55:24.763Z (2 months ago)
- Topics: cplusplus, lexbor, libcurl, libuv, webscraping
- Language: C++
- Size: 3.23 MB
- Stars: 2
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# HPScraper
HPScraper is a high-performance, event-driven web scraper that fetches and processes web content asynchronously. Built on `libuv`, `libcurl`, and `lexbor`, it handles many requests concurrently on a single event loop.
[![Build](https://github.com/shantanu-verma-salpro/HPScrapper/actions/workflows/build.yml/badge.svg)](https://github.com/shantanu-verma-salpro/HPScrapper/actions/workflows/build.yml)
![GitHub commit activity (branch)](https://img.shields.io/github/commit-activity/m/shantanu-verma-salpro/HPScrapper)
![GitHub](https://img.shields.io/github/license/shantanu-verma-salpro/HPScrapper)

## Features
- **Flexible DOM Handling**: Offers a wrapper over `lexbor`, providing a JS-like API for HTML DOM.
- **Highly Asynchronous**: Designed for non-blocking operations.
- **Protocol Support**: Works with HTTP/1, HTTP/1.1, and HTTP/2.
- **Advanced Network Features**: Supports proxies, authentication, and post requests.
- **Optimized**: Benefits from a preallocated connection pool.
- **User-Friendly**: Clean and straightforward API.
- **Platform Compatibility**: Cross-platform support.
- **Highly Efficient**: Uses epoll, kqueue, Windows IOCP, Solaris event ports, and Linux io_uring via `libuv`.
- **Extensible**: Provides wrappers for `curl`, `libuv`, and `lexbor` that can be used independently (see the Parser, Eventloop, and Fetcher examples below).
- **Modern C++**: Harnesses modern C++ capabilities for optimized memory management and a more intuitive API.

## Remaining Tasks
Here's our prioritized roadmap:
1. **Error Management**: Better error insights and debugging.
2. **Headless Chrome**: Access JS-rendered pages efficiently.
3. **Expand Documentation**: Cover all features and use-cases.
4. **CI/CD Improvements**: Streamline updates.
5. **Performance Benchmarks**: Compare against competitors.

## Table of Contents
- [Prerequisites](#prerequisites)
- [Getting Started](#getting-started)
- [Advanced Options](#advanced-options)
- [Contributing](#contributing)
- [License](#license)

## Prerequisites
Before diving in, ensure you have installed the following libraries:
- `libcurl`
- `libuv`
- `liblexbor`

## Getting Started
### 1. **Compilation**:
```sh
$ g++ your_source_file.cpp -o scraper -lcurl -luv -llexbor -Ofast
```
### 2. **Initialization**:
Initialize the `Async` instance:
```cpp
constexpr int concurrent_connections = 200, max_host_connections = 10;
std::unique_ptr<Async> scraper = std::make_unique<Async>(concurrent_connections, max_host_connections);
```

### 3. **Configuration**:
Customize as needed:
```cpp
scraper->setUserAgent("Scraper/1.1");
scraper->setMultiplexing(true);
scraper->setHttpVersion(HTTP::HTTP2);
```

For additional settings:
```cpp
scraper->setVerbose(true);
scraper->setProxy("188.87.102.128", 3128);
```

### 4. **Seed URL**:
Start your scraping journey:
```cpp
scraper->seed("https://www.google.com/");
```

### 5. **Event Management**:
Incorporate custom event handlers:
```cpp
scraper->onSuccess([](const CurlEasyHandle::Response& response, Async& instance, Document& page) {
// Process the response...
});
```
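`onSuccess` is only one of the available hooks. The full example below also registers failure, exception, and idle handlers; here is a short sketch with the same signatures (the handler bodies are illustrative):

```cpp
scraper->onFailure([](const CurlEasyHandle::Response& response, Async& instance) {
    // Illustrative: log the URL that failed.
    std::cerr << "Failed: " << response.url << '\n';
});
scraper->onException([](const std::exception& e, Async& instance) {
    std::cerr << "Exception encountered: " << e.what() << '\n';
});
scraper->onIdle([](long pending, Async& instance) {
    // Illustrative: report how many transfers are still in flight.
    std::cout << pending << " requests pending\n";
});
```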
### 6. **Execution**:
Get the scraper running:
```cpp
scraper->run();
```

## Advanced Options
HPScraper offers a myriad of options to fine-tune your scraping experience:
- `setMultiplexing(bool)`: Enable or disable HTTP/2 multiplexing.
- `setHttpVersion(HTTP)`: Opt for your preferred HTTP version.
- More options available in our detailed documentation; a short sketch combining the options above follows.
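As a quick reference, a sketch combining the calls shown in Getting Started (the proxy address and the chosen HTTP version are illustrative):

```cpp
scraper->setUserAgent("Scraper/1.1");
scraper->setHttpVersion(HTTP::HTTP2);      // opt in to HTTP/2...
scraper->setMultiplexing(true);            // ...and multiplex requests over shared connections
scraper->setVerbose(true);                 // verbose transfer logging while debugging
scraper->setProxy("188.87.102.128", 3128); // route traffic through a proxy
```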
## Example Usage

### Using Scraper
```cpp
int main() {
    constexpr int concurrent_connections = 200, max_host_connections = 10;
    std::unique_ptr<Async> scraper = std::make_unique<Async>(concurrent_connections, max_host_connections);

    scraper->setUserAgent("Scraper/1.1");
    scraper->setMultiplexing(true);
    scraper->setHttpVersion(HTTP::HTTP2);
    // scraper->setVerbose(true);
    // scraper->setProxy("188.87.102.128", 3128);

    scraper->seed("https://www.google.com/");

    scraper->onSuccess([](const CurlEasyHandle::Response& response, Async& instance, Document& page) {
        std::cout << "URL: " << response.url << '\n';
        std::cout << "Received: " << response.bytesRecieved << " bytes\n";
        std::cout << "Content Type: " << response.contentType << '\n';
        std::cout << "Total Time: " << response.totalTime << '\n';
        std::cout << "HTTP Version: " << response.httpVersion << '\n';
        std::cout << "HTTP Method: " << response.httpMethod << '\n';
        std::cout << "Download speed: " << response.bytesPerSecondR << " bytes/sec\n";
        std::cout << "Header Size: " << response.headerSize << " bytes\n";

        auto body = page.rootElement();
        auto div = body->getElementsByTagName("div")->item(0);
        auto links = div->getLinksMatching("");
        for (auto i : *links.get()) std::cout << i << '\n';
    });

    scraper->onIdle([](long pending, Async& instance) {
    });

    scraper->onException([](const std::exception& e, Async& instance) {
        std::cerr << "Exception encountered: " << e.what() << std::endl;
    });

    scraper->onFailure([](const CurlEasyHandle::Response& response, Async& instance) {
    });

    scraper->run();
}
```
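Note that every handler receives the `Async&` instance, which suggests handlers can queue further URLs as pages are processed. A hedged sketch of that crawl pattern (it assumes `seed` may be called while the loop is running; check the repo's examples to confirm):

```cpp
scraper->onSuccess([](const CurlEasyHandle::Response& response, Async& instance, Document& page) {
    // Hypothetical crawl pattern: enqueue every link found on the page.
    // Assumes seed() is safe to call from inside a handler.
    auto links = page.rootElement()->getLinksMatching("");
    for (const auto& link : *links) instance.seed(link);
});
```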
### Using Parser
```cpp
int main() {
    // Sample document: a .col-md container holding divs (one with a
    // data-custom attribute) and a link, matching the selectors below.
    std::string htmlContent = R"(
        <html>
          <head><title>Test Page</title></head>
          <body>
            <div class="col-md">
              <div data-custom="value">Text 3 inside col-md</div>
              <a href="http://example.com">Example link</a>
            </div>
          </body>
        </html>
    )";

    Parser parser;
    Document doc = parser.createDOM(htmlContent);

    auto root = doc.rootElement();
    auto colMdElements = root->getElementsByClassName("col-md");

    for (std::size_t i = 0; i < colMdElements->length(); ++i) {
        auto colMdNode = colMdElements->item(i);
        auto divElements = colMdNode->getElementsByTagName("div");

        for (std::size_t j = 0; j < divElements->length(); ++j) {
            auto divNode = divElements->item(j);
            std::cout << divNode->text() << '\n';

            if (divNode->hasAttributes()) {
                auto attributes = divNode->getAttributes();
                for (const auto& [attr, value] : *attributes) {
                    std::cout << "Attribute: " << attr << ", Value: " << value << std::endl;
                }
            }

            if (divNode->hasAttribute("data-custom")) {
                std::cout << "Data-custom attribute: " << divNode->getAttribute("data-custom") << '\n';
            }
        }

        if (colMdNode->hasChildElements()) {
            auto firstChild = colMdNode->firstChild();
            auto lastChild = colMdNode->lastChild();
            std::cout << "First child's text content: " << firstChild->text() << '\n';
            std::cout << "Last child's text content: " << lastChild->text() << '\n';
        }

        auto links = colMdNode->getLinksMatching("http://example.com");
        for (const auto& link : *links) {
            std::cout << "Matching Link: " << link << '\n';
        }
    }
    return 0;
}
```

### Using Eventloop
```cpp
int main() {
    EventLoop eventLoop;

    TimerWrapper timer(eventLoop);
    timer.on([](const TimerEvent&, TimerWrapper&) {
        std::cout << "Timer triggered after 2 seconds!" << std::endl;
    });
    timer.start(2000, 0); // Timeout of 2 seconds, no repeat

    IdleWrapper idle(eventLoop);
    idle.on([](const IdleEvent&, IdleWrapper&) {
        std::cout << "Idle handler running..." << std::endl;
    });
    idle.start();

    TimerWrapper stopTimer(eventLoop);
    stopTimer.on([&eventLoop](const TimerEvent&, TimerWrapper&) {
        std::cout << "Stopping event loop after 5 seconds..." << std::endl;
        eventLoop.stop();
    });
    stopTimer.start(5000, 0);

    eventLoop.run();
    return 0;
}
```
### Using Fetcher
```cpp
int main() {
    int buffer_size = 1024;
    long timeout = 5000;

    CurlEasyHandle curlHandle(buffer_size, timeout);
    curlHandle.setUrl("www.google.com");
    curlHandle.fetch([](CurlEasyHandle::Response* response) {
        std::cout << response->message;
    });
    return 0;
}
```
More examples are available in the `examples` directory.
## Contributing
We appreciate contributions! If you're considering significant modifications, kindly initiate a discussion by opening an issue first.
## License
HPScraper is licensed under the [MIT](https://choosealicense.com/licenses/mit/) License.
## Acknowledgements
This software uses the following libraries:
- `libcurl`: Licensed under the MIT License.
- `libuv`: Licensed under the MIT License.
- `liblexbor`: Licensed under the Apache License, Version 2.0.
When using `HPScraper`, please ensure you comply with the requirements and conditions of all included licenses.