An open API service indexing awesome lists of open source software.

https://github.com/heinrichb/scrapey-cli

Scrapey CLI is a lightweight, modular command-line tool built in Go for web crawling and scraping. It allows users to collect and parse HTML data based on customizable configuration files or command-line flags, with plans to support multiple storage options such as JSON, XML, and various databases.
https://github.com/heinrichb/scrapey-cli

cicd cli-tool configurable data-extraction github-actions golang html-parsing lightweight mit-license modular-design web-crawler web-scraping

Last synced: 3 months ago
JSON representation

Scrapey CLI is a lightweight, modular command-line tool built in Go for web crawling and scraping. It allows users to collect and parse HTML data based on customizable configuration files or command-line flags, with plans to support multiple storage options such as JSON, XML, and various databases.

Awesome Lists containing this project

README

          

# โœจ Scrapey CLI

[![Build & Test](https://github.com/heinrichb/scrapey-cli/actions/workflows/ci.yml/badge.svg?branch=develop)](https://github.com/heinrichb/scrapey-cli/actions/workflows/ci.yml)
[![codecov](https://codecov.io/gh/heinrichb/scrapey-cli/branch/develop/graph/badge.svg?token=P45PASDIKF)](https://codecov.io/gh/heinrichb/scrapey-cli)
[![Go Reference](https://pkg.go.dev/badge/github.com/heinrichb/scrapey-cli.svg)](https://pkg.go.dev/github.com/heinrichb/scrapey-cli)

Scrapey CLI is a lightweight, configurable web crawler and scraper. It collects data from websites based on rules defined in a config file. It can handle HTML parsing, data extraction, and plans to offer multiple storage options (JSON, XML, Excel, databases, etc.).

---

## ๐Ÿš€ Features

- **Lightweight & Modular CLI:** Built with clean, DRY code principles.
- **Configurable Input:** Accepts configuration via a JSON file or command-line flags.
- **Extensible Parsing:** Customizable HTML parsing logic.
- **Planned Storage Options:** Future support for multiple output formats including JSON, XML, Excel, MongoDB, MySQL.

---

## ๐ŸŒฑ Getting Started

1. **Clone the Repo**

git clone https://github.com/heinrichb/scrapey-cli.git

2. **Initialize Go Modules & Build the CLI**

- **Option 1:** Using the Makefile (recommended)

make build

- This command runs `go mod tidy` and then builds the binary into the `build` folder.

- **Option 2:** Directly via Go

go mod tidy
go build -o build/scrapeycli ./cmd/scrapeycli

3. **Run the CLI**

- **Direct Execution:**

./build/scrapeycli --config configs/default.json

- **Using the Makefile:**

The Makefile provides a `run` target which allows you to pass in optional variables:

- **Default Run:**

make run

- This uses the default configuration file (`configs/default.json`).

- **Override Config:**

make run CONFIG=configs/other.json

- **Pass a URL:**

make run URL=https://example.org

- **Combined:**

make run CONFIG=configs/other.json URL=https://example.org

---

## โš™๏ธ Project Structure

```
scrapey-cli/
โ”œโ”€โ”€ .github/
โ”‚ โ””โ”€โ”€ workflows/
โ”‚ โ””โ”€โ”€ ci.yml
โ”œโ”€โ”€ .vscode/
โ”‚ โ””โ”€โ”€ settings.json # VS Code settings (format on save for Go)
โ”œโ”€โ”€ cmd/
โ”‚ โ””โ”€โ”€ scrapeycli/
โ”‚ โ””โ”€โ”€ main.go
โ”œโ”€โ”€ configs/
โ”‚ โ””โ”€โ”€ default.json # Default/example configuration file
โ”œโ”€โ”€ pkg/
โ”‚ โ”œโ”€โ”€ config/
โ”‚ โ”‚ โ””โ”€โ”€ config.go # Config loading logic
โ”‚ โ”œโ”€โ”€ crawler/
โ”‚ โ”‚ โ””โ”€โ”€ crawler.go # Core web crawling logic
โ”‚ โ”œโ”€โ”€ parser/
โ”‚ โ”‚ โ””โ”€โ”€ parser.go # HTML parsing logic
โ”‚ โ”œโ”€โ”€ storage/
โ”‚ โ”‚ โ””โ”€โ”€ storage.go # Storage logic
โ”‚ โ””โ”€โ”€ utils/
โ”‚ โ”œโ”€โ”€ printcolor.go # Colorized terminal output utility
โ”‚ โ””โ”€โ”€ printstruct.go # Utility for printing non-empty struct fields
โ”œโ”€โ”€ scripts/
โ”‚ โ””โ”€โ”€ coverage_formatter.go # Formats and colorizes Go test coverage output
โ”œโ”€โ”€ test/ # Optional integration tests
โ”‚ โ””โ”€โ”€ fail_test.go # Test case designed to always fail, used to debug test output
โ”œโ”€โ”€ .gitignore
โ”œโ”€โ”€ LICENSE # MIT License file
โ”œโ”€โ”€ Makefile # Build & run script for CLI (includes targets for build, run, and test)
โ”œโ”€โ”€ go.mod
โ”œโ”€โ”€ go.sum
โ””โ”€โ”€ README.md
```

---

## ๐Ÿ”ง Configuration Options

Scrapey CLI is configured using a JSON file that defines how websites are crawled and scraped. Below is a detailed breakdown of the available configuration options.

### ๐ŸŒ URL Configuration

```json
"url": {
"base": "https://example.com",
"routes": [
"/route1",
"/route2",
"*"
],
"includeBase": false
}
```

- **base**: The primary domain to scrape.
- **routes**: List of specific paths to scrape. Supports `*` as a wildcard for full site crawling.
- **includeBase**: Whether to include the base URL in the scrape.

### ๐Ÿ” Parsing Rules

```json
"parseRules": {
"title": "title",
"metaDescription": "meta[name='description']",
"articleContent": "article",
"author": ".author-name",
"datePublished": "meta[property='article:published_time']"
}
```

- **title**: Extracts the page title.
- **metaDescription**: Extracts the meta description.
- **articleContent**: Defines the main article section.
- **author**: Selector for extracting author names.
- **datePublished**: Extracts the publication date from meta properties.

### ๐Ÿ’พ Storage Options

```json
"storage": {
"outputFormats": ["json", "csv", "xml"],
"savePath": "output/",
"fileName": "scraped_data"
}
```

- **outputFormats**: List of formats in which data will be stored.
- **savePath**: Directory where scraped content is saved.
- **fileName**: Base name for output files.

### โšก Scraping Behavior

```json
"scrapingOptions": {
"maxDepth": 2,
"rateLimit": 1.5,
"retryAttempts": 3,
"userAgent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
}
```

- **maxDepth**: Defines how deep the scraper should follow links.
- **rateLimit**: Time delay (in seconds) between requests to avoid rate-limiting.
- **retryAttempts**: Number of retries for failed requests.
- **userAgent**: Custom user-agent string to mimic a browser.

### ๐Ÿ›  Data Formatting

```json
"dataFormatting": {
"cleanWhitespace": true,
"removeHTML": true
}
```

- **cleanWhitespace**: Removes unnecessary whitespace in extracted content.
- **removeHTML**: Strips HTML tags from extracted content for cleaner output.

This configuration file allows fine-tuning of scraping behavior, data extraction, and storage formats for ultimate flexibility in web scraping.

---

## ๐Ÿ›  Usage

- **Basic Execution:**

./build/scrapeycli --url https://example.com

- **With a Config File:**

./build/scrapeycli --config configs/default.json

- **Using the Makefile:**

- Run with defaults:

make run

- Override configuration and/or URL:

make run CONFIG=configs/other.json URL=https://example.org

- **Future Enhancements:**
- Save scraped data to JSON.
- Support for scraping multiple URLs simultaneously.
- Concurrency and rate-limiting.

---

## ๐Ÿงช Tests

- **Run Unit Tests Locally:**
To run tests for all modules and the test folder (if it exists), use:

make test

This command first runs "go test ./..." to execute tests in all packages, and then, if the "test" folder exists and contains Go files, it will run tests in that folder as well.

- **Automated Tests on GitHub Actions:**
- Tests are triggered on every push and pull request to the "main" or "develop" branches.
- See Build & Test (https://github.com/heinrichb/scrapey-cli/actions) for logs and results.

---

## ๐Ÿค Contributing

1. Fork the project.
2. Create your feature branch:

git checkout -b feature/amazing-feature

3. Commit your changes:

git commit -m 'Add some amazing feature'

4. Push to the branch:

git push origin feature/amazing-feature

5. Open a Pull Request.

---

## ๐Ÿ“„ License

This project is licensed under the MIT License ([LICENSE](LICENSE)).

---

Happy Scraping!