https://github.com/heinrichb/scrapey-cli
Scrapey CLI is a lightweight, modular command-line tool built in Go for web crawling and scraping. It allows users to collect and parse HTML data based on customizable configuration files or command-line flags, with plans to support multiple storage options such as JSON, XML, and various databases.
cicd cli-tool configurable data-extraction github-actions golang html-parsing lightweight mit-license modular-design web-crawler web-scraping
- Host: GitHub
- URL: https://github.com/heinrichb/scrapey-cli
- Owner: heinrichb
- License: mit
- Created: 2025-02-14T00:33:06.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2025-03-06T04:24:32.000Z (about 1 year ago)
- Last Synced: 2025-04-11T17:51:13.277Z (about 1 year ago)
- Topics: cicd, cli-tool, configurable, data-extraction, github-actions, golang, html-parsing, lightweight, mit-license, modular-design, web-crawler, web-scraping
- Language: Go
- Homepage:
- Size: 1.48 MB
- Stars: 2
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# ✨ Scrapey CLI
[CI workflow](https://github.com/heinrichb/scrapey-cli/actions/workflows/ci.yml)
[Codecov](https://codecov.io/gh/heinrichb/scrapey-cli)
[Go package documentation](https://pkg.go.dev/github.com/heinrichb/scrapey-cli)
Scrapey CLI is a lightweight, configurable web crawler and scraper. It collects data from websites based on rules defined in a config file. It handles HTML parsing and data extraction, and multiple storage options (JSON, XML, Excel, databases, etc.) are planned.
---
## Features
- **Lightweight & Modular CLI:** Built with clean, DRY code principles.
- **Configurable Input:** Accepts configuration via a JSON file or command-line flags.
- **Extensible Parsing:** Customizable HTML parsing logic.
- **Planned Storage Options:** Future support for multiple output targets, including JSON, XML, Excel, MongoDB, and MySQL.
---
## Getting Started
1. **Clone the Repo**
   - `git clone https://github.com/heinrichb/scrapey-cli.git`
2. **Initialize Go Modules & Build the CLI**
   - **Option 1:** Using the Makefile (recommended): `make build`
     - This command runs `go mod tidy` and then builds the binary into the `build` folder.
   - **Option 2:** Directly via Go: `go mod tidy`, then `go build -o build/scrapeycli ./cmd/scrapeycli`
3. **Run the CLI**
   - **Direct Execution:** `./build/scrapeycli --config configs/default.json`
   - **Using the Makefile:** The Makefile provides a `run` target which allows you to pass in optional variables:
     - **Default Run:** `make run`
       - This uses the default configuration file (`configs/default.json`).
     - **Override Config:** `make run CONFIG=configs/other.json`
     - **Pass a URL:** `make run URL=https://example.org`
     - **Combined:** `make run CONFIG=configs/other.json URL=https://example.org`
---
## Project Structure
```
scrapey-cli/
├── .github/
│   └── workflows/
│       └── ci.yml
├── .vscode/
│   └── settings.json         # VS Code settings (format on save for Go)
├── cmd/
│   └── scrapeycli/
│       └── main.go
├── configs/
│   └── default.json          # Default/example configuration file
├── pkg/
│   ├── config/
│   │   └── config.go         # Config loading logic
│   ├── crawler/
│   │   └── crawler.go        # Core web crawling logic
│   ├── parser/
│   │   └── parser.go         # HTML parsing logic
│   ├── storage/
│   │   └── storage.go        # Storage logic
│   └── utils/
│       ├── printcolor.go     # Colorized terminal output utility
│       └── printstruct.go    # Utility for printing non-empty struct fields
├── scripts/
│   └── coverage_formatter.go # Formats and colorizes Go test coverage output
├── test/                     # Optional integration tests
│   └── fail_test.go          # Test case designed to always fail, used to debug test output
├── .gitignore
├── LICENSE                   # MIT License file
├── Makefile                  # Build & run script for CLI (includes targets for build, run, and test)
├── go.mod
├── go.sum
└── README.md
```
---
## Configuration Options
Scrapey CLI is configured using a JSON file that defines how websites are crawled and scraped. Below is a detailed breakdown of the available configuration options.
### URL Configuration
```json
"url": {
  "base": "https://example.com",
  "routes": [
    "/route1",
    "/route2",
    "*"
  ],
  "includeBase": false
}
```
- **base**: The primary domain to scrape.
- **routes**: List of specific paths to scrape. Supports `*` as a wildcard for full site crawling.
- **includeBase**: Whether to include the base URL in the scrape.
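As a rough illustration of how these three fields might combine, the sketch below expands `base` and `routes` into a list of URLs to fetch. The `URLConfig` type and the `Targets` function are hypothetical names, not taken from the Scrapey CLI source, and treating `*` as a "crawl everything" flag rather than a literal route is an assumption about how the wildcard is meant to work.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// URLConfig mirrors the "url" object above. The Go type and field names are
// hypothetical; only the JSON keys come from the README.
type URLConfig struct {
	Base        string   `json:"base"`
	Routes      []string `json:"routes"`
	IncludeBase bool     `json:"includeBase"`
}

// Targets returns the concrete URLs to fetch. The second result reports
// whether the "*" wildcard was present; the site-wide crawl itself is left
// to the caller (assumed behavior).
func Targets(c URLConfig) ([]string, bool, error) {
	base, err := url.Parse(c.Base)
	if err != nil {
		return nil, false, fmt.Errorf("parse base %q: %w", c.Base, err)
	}
	if base.Scheme == "" || base.Host == "" {
		return nil, false, fmt.Errorf("base %q must be an absolute URL", c.Base)
	}

	var targets []string
	crawlAll := false
	if c.IncludeBase {
		targets = append(targets, c.Base)
	}
	root := strings.TrimRight(c.Base, "/")
	for _, route := range c.Routes {
		if route == "*" {
			crawlAll = true
			continue
		}
		// Join with exactly one slash regardless of how the route is written.
		targets = append(targets, root+"/"+strings.TrimLeft(route, "/"))
	}
	return targets, crawlAll, nil
}

func main() {
	cfg := URLConfig{
		Base:   "https://example.com",
		Routes: []string{"/route1", "/route2", "*"},
	}
	targets, crawlAll, err := Targets(cfg)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(targets, "crawl whole site:", crawlAll)
}
```

Returning the wildcard as a separate boolean keeps URL expansion independent of link discovery, which would depend on `maxDepth`.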
### Parsing Rules
```json
"parseRules": {
  "title": "title",
  "metaDescription": "meta[name='description']",
  "articleContent": "article",
  "author": ".author-name",
  "datePublished": "meta[property='article:published_time']"
}
```
- **title**: Extracts the page title.
- **metaDescription**: Extracts the meta description.
- **articleContent**: Defines the main article section.
- **author**: Selector for extracting author names.
- **datePublished**: Extracts the publication date from meta properties.
### Storage Options
```json
"storage": {
  "outputFormats": ["json", "csv", "xml"],
  "savePath": "output/",
  "fileName": "scraped_data"
}
```
- **outputFormats**: List of formats in which data will be stored.
- **savePath**: Directory where scraped content is saved.
- **fileName**: Base name for output files.
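One plausible reading of these fields is that each output format produces one file named `<fileName>.<format>` inside `savePath`. The sketch below illustrates that reading only; `StorageConfig` and `OutputPaths` are hypothetical names, and using the format name as the file extension is an assumption, since storage is still listed as planned.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// StorageConfig mirrors the "storage" object above. The Go names are
// hypothetical; only the JSON keys come from the README.
type StorageConfig struct {
	OutputFormats []string `json:"outputFormats"`
	SavePath      string   `json:"savePath"`
	FileName      string   `json:"fileName"`
}

// OutputPaths returns one path per configured format, assuming the format
// name doubles as the file extension.
func OutputPaths(c StorageConfig) []string {
	paths := make([]string, 0, len(c.OutputFormats))
	for _, format := range c.OutputFormats {
		paths = append(paths, filepath.Join(c.SavePath, c.FileName+"."+format))
	}
	return paths
}

func main() {
	cfg := StorageConfig{
		OutputFormats: []string{"json", "csv", "xml"},
		SavePath:      "output/",
		FileName:      "scraped_data",
	}
	for _, p := range OutputPaths(cfg) {
		fmt.Println(p)
	}
}
```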
### ⚡ Scraping Behavior
```json
"scrapingOptions": {
  "maxDepth": 2,
  "rateLimit": 1.5,
  "retryAttempts": 3,
  "userAgent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
}
```
- **maxDepth**: Defines how deep the scraper should follow links.
- **rateLimit**: Time delay (in seconds) between requests to avoid rate-limiting.
- **retryAttempts**: Number of retries for failed requests.
- **userAgent**: Custom user-agent string to mimic a browser.
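To make the interplay of `rateLimit`, `retryAttempts`, and `userAgent` concrete, here is a minimal sketch of a fetch loop built on Go's standard `net/http` package. It is not the project's crawler: `ScrapingOptions`, `WithRetries`, `Get`, and `ErrBadStatus` are hypothetical names, reusing `rateLimit` as the pause between retries is an assumption, and `maxDepth` is carried in the struct but not used here. The example talks to a local test server so it runs without network access.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
	"time"
)

// ScrapingOptions mirrors the "scrapingOptions" object above. The Go names
// are hypothetical; only the JSON keys come from the README.
type ScrapingOptions struct {
	MaxDepth      int     `json:"maxDepth"`
	RateLimit     float64 `json:"rateLimit"` // seconds between requests
	RetryAttempts int     `json:"retryAttempts"`
	UserAgent     string  `json:"userAgent"`
}

// ErrBadStatus marks responses whose status code is not 200 OK.
var ErrBadStatus = errors.New("unexpected HTTP status")

// WithRetries runs do once and then up to opts.RetryAttempts more times,
// calling wait(opts.RateLimit) before each retry (assumed semantics).
func WithRetries(opts ScrapingOptions, wait func(seconds float64), do func() error) error {
	var err error
	for attempt := 0; attempt <= opts.RetryAttempts; attempt++ {
		if attempt > 0 {
			wait(opts.RateLimit)
		}
		if err = do(); err == nil {
			return nil
		}
	}
	return fmt.Errorf("giving up after %d attempts: %w", opts.RetryAttempts+1, err)
}

// Get fetches target once, sending the configured User-Agent header.
func Get(client *http.Client, target, userAgent string) ([]byte, error) {
	req, err := http.NewRequest(http.MethodGet, target, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("User-Agent", userAgent)
	resp, err := client.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("%w: %s", ErrBadStatus, resp.Status)
	}
	return io.ReadAll(resp.Body)
}

func main() {
	// Local server that fails twice before succeeding.
	var calls atomic.Int32
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if calls.Add(1) < 3 {
			http.Error(w, "try again", http.StatusServiceUnavailable)
			return
		}
		fmt.Fprintf(w, "hello, %s", r.UserAgent())
	}))
	defer srv.Close()

	opts := ScrapingOptions{RateLimit: 0.05, RetryAttempts: 3, UserAgent: "scrapey-example/0.1"}
	sleep := func(seconds float64) {
		time.Sleep(time.Duration(seconds * float64(time.Second)))
	}

	var body []byte
	err := WithRetries(opts, sleep, func() error {
		var getErr error
		body, getErr = Get(srv.Client(), srv.URL, opts.UserAgent)
		return getErr
	})
	fmt.Println(string(body), err, "requests:", calls.Load())
}
```

Passing the wait function in as a parameter keeps the retry logic testable without real delays.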
### Data Formatting
```json
"dataFormatting": {
  "cleanWhitespace": true,
  "removeHTML": true
}
```
- **cleanWhitespace**: Removes unnecessary whitespace in extracted content.
- **removeHTML**: Strips HTML tags from extracted content for cleaner output.
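The two flags could be applied to an extracted string roughly as follows. This is a sketch under stated assumptions, not the project's implementation: `DataFormatting` and `Format` are hypothetical names, and the regular expression is a deliberately naive tag matcher that does not handle comments, scripts, or `>` inside attribute values. A real implementation would more likely walk a parsed HTML tree.

```go
package main

import (
	"fmt"
	"html"
	"regexp"
	"strings"
)

// DataFormatting mirrors the "dataFormatting" object above. The Go names
// are hypothetical; only the JSON keys come from the README.
type DataFormatting struct {
	CleanWhitespace bool `json:"cleanWhitespace"`
	RemoveHTML      bool `json:"removeHTML"`
}

// tagPattern naively matches anything between "<" and the next ">".
var tagPattern = regexp.MustCompile(`<[^>]*>`)

// Format applies the enabled clean-up steps to s.
func Format(s string, f DataFormatting) string {
	if f.RemoveHTML {
		s = tagPattern.ReplaceAllString(s, "")
		// Decode entities such as &amp; left behind after stripping tags.
		s = html.UnescapeString(s)
	}
	if f.CleanWhitespace {
		// Collapse runs of whitespace and trim both ends.
		s = strings.Join(strings.Fields(s), " ")
	}
	return s
}

func main() {
	raw := "<p>  Hello   <b>world</b> &amp; all </p>"
	fmt.Printf("%q\n", Format(raw, DataFormatting{CleanWhitespace: true, RemoveHTML: true}))
}
```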
Together, these options control scraping behavior, data extraction, and storage formats.
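For orientation, the sections above could be decoded into a single Go struct along the following lines. This assumes the five fragments are top-level keys of one JSON object, which the README implies but does not state. The `Config` type and `ParseConfig` function are hypothetical and may differ from what `pkg/config/config.go` actually does. Rejecting unknown keys is a design choice made here to surface typos early.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// Config gathers the sections described above. All Go names are
// hypothetical; only the JSON keys come from the README.
type Config struct {
	URL struct {
		Base        string   `json:"base"`
		Routes      []string `json:"routes"`
		IncludeBase bool     `json:"includeBase"`
	} `json:"url"`
	// ParseRules maps an output field name to a CSS selector.
	ParseRules map[string]string `json:"parseRules"`
	Storage    struct {
		OutputFormats []string `json:"outputFormats"`
		SavePath      string   `json:"savePath"`
		FileName      string   `json:"fileName"`
	} `json:"storage"`
	ScrapingOptions struct {
		MaxDepth      int     `json:"maxDepth"`
		RateLimit     float64 `json:"rateLimit"`
		RetryAttempts int     `json:"retryAttempts"`
		UserAgent     string  `json:"userAgent"`
	} `json:"scrapingOptions"`
	DataFormatting struct {
		CleanWhitespace bool `json:"cleanWhitespace"`
		RemoveHTML      bool `json:"removeHTML"`
	} `json:"dataFormatting"`
}

// ParseConfig decodes a JSON document into Config and rejects unknown keys.
func ParseConfig(data string) (Config, error) {
	var cfg Config
	dec := json.NewDecoder(strings.NewReader(data))
	dec.DisallowUnknownFields()
	if err := dec.Decode(&cfg); err != nil {
		return Config{}, fmt.Errorf("decode config: %w", err)
	}
	return cfg, nil
}

// sample is an abbreviated config assembled from the fragments above.
const sample = `{
  "url": {"base": "https://example.com", "routes": ["/route1", "*"], "includeBase": false},
  "parseRules": {"title": "title", "author": ".author-name"},
  "storage": {"outputFormats": ["json"], "savePath": "output/", "fileName": "scraped_data"},
  "scrapingOptions": {"maxDepth": 2, "rateLimit": 1.5, "retryAttempts": 3},
  "dataFormatting": {"cleanWhitespace": true, "removeHTML": true}
}`

func main() {
	cfg, err := ParseConfig(sample)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(cfg.URL.Base, cfg.ParseRules["author"], cfg.ScrapingOptions.RateLimit)
}
```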
---
## Usage
- **Basic Execution:** `./build/scrapeycli --url https://example.com`
- **With a Config File:** `./build/scrapeycli --config configs/default.json`
- **Using the Makefile:**
  - Run with defaults: `make run`
  - Override configuration and/or URL: `make run CONFIG=configs/other.json URL=https://example.org`
- **Future Enhancements:**
  - Save scraped data to JSON.
  - Support for scraping multiple URLs simultaneously.
  - Concurrency and rate-limiting.
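The `--config` and `--url` flags used above could be handled with Go's standard `flag` package, which accepts both the `-name` and `--name` spellings. The following is a minimal sketch rather than the contents of `cmd/scrapeycli/main.go`: `Options` and `ParseArgs` are hypothetical names, and defaulting the binary itself to `configs/default.json` is an assumption (the README only states that `make run` uses that file).

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// Options holds the two command-line flags shown in the Usage section.
// The type itself is hypothetical.
type Options struct {
	ConfigPath string
	URL        string
}

// ParseArgs parses args (typically os.Args[1:]) into Options.
func ParseArgs(args []string) (Options, error) {
	var opts Options
	fs := flag.NewFlagSet("scrapeycli", flag.ContinueOnError)
	// The default value below is an assumption, not confirmed by the README.
	fs.StringVar(&opts.ConfigPath, "config", "configs/default.json", "path to the JSON config file")
	fs.StringVar(&opts.URL, "url", "", "URL to scrape")
	if err := fs.Parse(args); err != nil {
		return Options{}, err
	}
	return opts, nil
}

func main() {
	opts, err := ParseArgs(os.Args[1:])
	if err != nil {
		// The flag package has already printed the error and usage text.
		os.Exit(2)
	}
	fmt.Printf("config=%s url=%s\n", opts.ConfigPath, opts.URL)
}
```

Using a dedicated `FlagSet` instead of the package-level flags keeps `ParseArgs` easy to call from tests with arbitrary argument slices.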
---
## 🧪 Tests
- **Run Unit Tests Locally:**
  To run tests for all modules and the test folder (if it exists), use `make test`.
  This command first runs `go test ./...` to execute tests in all packages and then, if the `test` folder exists and contains Go files, runs the tests in that folder as well.
- **Automated Tests on GitHub Actions:**
  - Tests are triggered on every push and pull request to the `main` or `develop` branches.
  - See [Build & Test](https://github.com/heinrichb/scrapey-cli/actions) for logs and results.
---
## Contributing
1. Fork the project.
2. Create your feature branch: `git checkout -b feature/amazing-feature`
3. Commit your changes: `git commit -m 'Add some amazing feature'`
4. Push to the branch: `git push origin feature/amazing-feature`
5. Open a Pull Request.
---
## License
This project is licensed under the MIT License ([LICENSE](LICENSE)).
---
Happy Scraping!