https://github.com/zoroxide/spiderman
simple web scraping library for C++
- Host: GitHub
- URL: https://github.com/zoroxide/spiderman
- Owner: zoroxide
- Created: 2025-01-16T16:26:57.000Z (12 months ago)
- Default Branch: main
- Last Pushed: 2025-01-16T18:14:15.000Z (12 months ago)
- Last Synced: 2025-01-26T00:57:26.110Z (12 months ago)
- Language: C++
- Size: 6.84 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Spiderman Web Scraper Framework
Spiderman is a lightweight C++ framework for web scraping. It fetches the HTML content of a given URL using `libcurl` and tokenizes it into `TAG` and `TEXT` tokens, providing a foundation for further processing.
## Features
- Fetch HTML content from a URL.
- Tokenize HTML into `TAG` and `TEXT` components.
- Simple and easy-to-use API.
## Requirements
- C++17 or later
- `libcurl`
## Installation
### *Windows support and testing are coming soon :)*
1. Clone the repository:
```bash
git clone https://github.com/zoroxide/spiderman.git
cd spiderman
```
2. Install `libcurl` (if not already installed):
```bash
# For Debian/Ubuntu
sudo apt update && sudo apt install libcurl4-openssl-dev
```
3. Compile the code using a C++ compiler:
```bash
g++ -std=c++17 -o spiderman main.cpp spiderman.cpp -lcurl
```
## Usage
### Example Code
```cpp
#include <iostream>
#include <string>
#include <vector>
#include "spiderman.hpp"

int main() {
    std::string url = "https://zoroxide.pages.dev";
    spiderman scraper(url);

    // Download the page, then split it into TAG and TEXT tokens.
    std::string content = scraper.fetch();
    std::vector<Token> tokens = scraper.parse();

    if (!content.empty()) {
        for (const auto& token : tokens) {
            std::cout << "Type: " << token.type << ", Value: " << token.value << std::endl;
        }
    } else {
        std::cerr << "Failed to fetch content from " << url << std::endl;
    }
    return 0;
}
```
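Once the tokens are available, further processing is mostly a matter of filtering on `type`. The sketch below collects the visible text of a page by joining every `TEXT` token. `visible_text` is a hypothetical helper written for illustration and is not part of the library. `Token` is redeclared here so the snippet compiles on its own; in a real program you would include `spiderman.hpp` instead.

```cpp
#include <string>
#include <vector>

// Local copy of the Token struct documented in this README,
// repeated only so that this sketch is self-contained.
struct Token {
    std::string type;   // "TAG" or "TEXT"
    std::string value;  // The HTML tag or text content
};

// Hypothetical helper (not part of spiderman): joins the values of all
// TEXT tokens with single spaces and ignores TAG tokens.
std::string visible_text(const std::vector<Token>& tokens) {
    std::string out;
    for (const auto& token : tokens) {
        if (token.type != "TEXT") {
            continue;
        }
        if (!out.empty()) {
            out += ' ';
        }
        out += token.value;
    }
    return out;
}
```

Calling `visible_text(scraper.parse())` would then give a rough plain-text view of the page, assuming `parse()` returns a `std::vector<Token>` as in the example above.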
### Token Structure
The `Token` structure represents the components of the parsed HTML:
```cpp
struct Token {
    std::string type;  // "TAG" or "TEXT"
    std::string value; // The HTML tag or text content
};
```
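To make the `TAG`/`TEXT` split concrete, here is a minimal sketch of how such a tokenizer could work. It is not the library's actual implementation: `tokenize_html` is a hypothetical function, and it makes two assumptions of its own, namely that a `TAG` value keeps its angle brackets and that whitespace-only text is skipped.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Same layout as the Token struct shown above, repeated so the sketch compiles alone.
struct Token {
    std::string type;   // "TAG" or "TEXT"
    std::string value;  // The HTML tag or text content
};

// Hypothetical tokenizer (an illustration, not spiderman's parser):
// everything from '<' to the next '>' becomes a TAG token,
// everything between tags becomes a TEXT token.
std::vector<Token> tokenize_html(const std::string& html) {
    std::vector<Token> tokens;
    std::size_t pos = 0;
    while (pos < html.size()) {
        if (html[pos] == '<') {
            std::size_t end = html.find('>', pos);
            if (end == std::string::npos) {
                // Unterminated tag: keep the remainder as text and stop.
                tokens.push_back({"TEXT", html.substr(pos)});
                break;
            }
            tokens.push_back({"TAG", html.substr(pos, end - pos + 1)});
            pos = end + 1;
        } else {
            std::size_t end = html.find('<', pos);
            if (end == std::string::npos) {
                end = html.size();
            }
            std::string text = html.substr(pos, end - pos);
            // Assumption: whitespace-only runs carry no content and are dropped.
            if (text.find_first_not_of(" \t\r\n") != std::string::npos) {
                tokens.push_back({"TEXT", text});
            }
            pos = end;
        }
    }
    return tokens;
}
```

For the input `<p>Hello</p>` this sketch yields three tokens: a `TAG` with value `<p>`, a `TEXT` with value `Hello`, and a `TAG` with value `</p>`. It deliberately ignores harder cases such as comments, `<script>` bodies, and `>` characters inside attribute values.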
## File Structure
```
├── examples # Examples
├── spiderman.hpp # Header file
└── spiderman.cpp # Implementation file
```
## Compilation
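To compile and run one of the examples, replace `{example}` with the name of a file in `examples/`. Root privileges are not needed to run the resulting binary:
```bash
g++ -std=c++17 examples/{example}.cpp spiderman.cpp -o example -lcurl
./example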
```
## License
This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details.
## Contribution
Contributions are welcome! Please feel free to open issues or submit pull requests.
## Author
[Loay Mohamed](https://github.com/zoroxide)