https://github.com/taiizor/gocrawler
A high-performance web crawler with concurrent processing capabilities written in Go.
- Host: GitHub
- URL: https://github.com/taiizor/gocrawler
- Owner: Taiizor
- License: MIT
- Created: 2025-05-07T19:21:08.000Z (11 months ago)
- Default Branch: develop
- Last Pushed: 2025-05-07T20:04:42.000Z (11 months ago)
- Last Synced: 2025-05-11T20:45:34.941Z (11 months ago)
- Topics: crawler, csv, go, golang, golang-application, golang-library, json, storage, url, web
- Language: Go
- Homepage:
- Size: 26.4 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# Parallel Web Crawler
A high-performance web crawler with concurrent processing capabilities written in Go.
## Features
- URL filtering and normalization
- Link extraction from HTML pages
- Rate limiting and timeout support
- Logging and graceful error handling
- Results export to JSON or CSV formats
- Configurable crawl depth and concurrency
- Parallel crawling using a worker pool architecture
- Domain-specific crawling (stays within the same domain)
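The worker-pool architecture mentioned above can be sketched as below. This is a minimal illustration of the pattern, not the project's actual code: the `crawlResult` type, `runPool` helper, and channel layout are all assumptions.

```go
package main

import (
	"fmt"
	"sync"
)

// crawlResult is a hypothetical per-page result type.
type crawlResult struct {
	URL   string
	Depth int
}

// runPool starts n workers that consume URLs from a jobs channel
// and emit results. A real crawler worker would fetch each URL,
// extract links, and enqueue them up to the depth limit.
func runPool(n int, urls []string) []crawlResult {
	jobs := make(chan string)
	out := make(chan crawlResult)
	var wg sync.WaitGroup

	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for u := range jobs {
				out <- crawlResult{URL: u, Depth: 1}
			}
		}()
	}

	// Feed the jobs channel, then close it so workers exit.
	go func() {
		for _, u := range urls {
			jobs <- u
		}
		close(jobs)
	}()

	// Close the results channel once every worker has finished.
	go func() {
		wg.Wait()
		close(out)
	}()

	var results []crawlResult
	for r := range out {
		results = append(results, r)
	}
	return results
}

func main() {
	res := runPool(3, []string{"https://a.example", "https://b.example"})
	fmt.Println(len(res))
}
```

The fixed number of goroutines is what bounds concurrency: the `-workers` flag would set `n` here.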
## Installation
### Prerequisites
- Go 1.20 or higher
### Steps
1. Clone this repository:
```bash
git clone https://github.com/Taiizor/goCrawler.git
cd goCrawler
```
2. Build the application:
```bash
go build -o goCrawler
```
## Usage
Run the crawler with the following command:
```bash
./goCrawler -url "https://www.vegalya.com" -depth 3 -workers 10 -output results.json
```
### Command Line Flags
| Flag | Description | Default |
|------|-------------|---------|
| `-depth` | Maximum crawling depth | 2 |
| `-timeout` | HTTP request timeout | 10s |
| `-rate` | Rate limit between requests | 100ms |
| `-workers` | Number of concurrent workers | 5 |
| `-url` | Starting URL for crawling | (required) |
| `-output` | Output file name (CSV or JSON) | results.json |
## Examples
Crawl a website with 10 workers to a depth of 3, saving output as JSON:
```bash
./goCrawler -url "https://www.vegalya.com" -depth 3 -workers 10 -output results.json
```
Crawl a website and save results as CSV:
```bash
./goCrawler -url "https://www.vegalya.com" -output results.csv
```
Crawl with custom timeout and rate limiting:
```bash
./goCrawler -url "https://www.vegalya.com" -timeout 5s -rate 200ms
```
## Output Format
### JSON Output
The JSON output contains:
- `results`: Array of crawled pages
- `count`: Number of pages crawled
- `timestamp`: When the crawl completed
Each page result includes:
- `title`: Page title
- `url`: The page URL
- `status_code`: HTTP status code
- `depth`: Crawl depth of this page
- `links`: Array of links found on the page
- `timestamp`: When this page was crawled
- `content_length`: Content length in bytes
### CSV Output
The CSV output contains one row per page with columns:
- URL
- Title
- Depth
- Timestamp
- StatusCode
- ContentLength
- LinksCount (number of links found)
## License
MIT
## Contributing
Contributions are welcome! Please feel free to submit a Pull Request.