# Parallel Web Crawler

A high-performance web crawler with concurrent processing capabilities written in Go.

## Features

- URL filtering and normalization
- Link extraction from HTML pages
- Rate limiting and timeout support
- Logging and graceful error handling
- Results export to JSON or CSV formats
- Configurable crawl depth and concurrency
- Parallel crawling using a worker pool architecture
- Domain-specific crawling (stays within the same domain)
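
The worker-pool model mentioned above can be sketched as follows. This is a minimal illustration, not the project's actual code: `fetch` stands in for the real HTTP fetch and link extraction, and `crawlAll` is an invented name showing how a fixed number of goroutines can drain a shared job channel.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// fetch stands in for the real HTTP fetch + link-extraction step.
func fetch(url string) string { return "fetched:" + url }

// crawlAll fans the seed URLs out to a fixed pool of workers
// (corresponding to the -workers flag) and collects their results.
func crawlAll(urls []string, workers int) []string {
	jobs := make(chan string)
	results := make(chan string)
	var wg sync.WaitGroup

	// Start the worker pool: each worker pulls URLs until jobs closes.
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for u := range jobs {
				results <- fetch(u)
			}
		}()
	}

	// Feed the seed URLs, then close the channel to stop the workers.
	go func() {
		for _, u := range urls {
			jobs <- u
		}
		close(jobs)
	}()

	// Close results once every worker has exited.
	go func() {
		wg.Wait()
		close(results)
	}()

	var out []string
	for r := range results {
		out = append(out, r)
	}
	sort.Strings(out) // deterministic ordering for display
	return out
}

func main() {
	fmt.Println(crawlAll([]string{"https://example.com/a", "https://example.com/b"}, 3))
}
```

A depth-limited crawler would additionally track visited URLs and re-enqueue extracted links until the depth limit is reached; this sketch only shows the fan-out/fan-in core.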

## Installation

### Prerequisites

- Go 1.20 or higher

### Steps

1. Clone this repository:
```bash
git clone https://github.com/Taiizor/goCrawler.git
cd goCrawler
```

2. Build the application:
```bash
go build -o goCrawler
```

## Usage

Run the crawler with the following command:

```bash
./goCrawler -url "https://www.vegalya.com" -depth 3 -workers 10 -output results.json
```

### Command Line Flags

| Flag | Description | Default |
|------|-------------|---------|
| `-url` | Starting URL for crawling | (required) |
| `-depth` | Maximum crawling depth | 2 |
| `-workers` | Number of concurrent workers | 5 |
| `-timeout` | HTTP request timeout | 10s |
| `-rate` | Rate limit between requests | 100ms |
| `-output` | Output file name (JSON or CSV) | results.json |

## Examples

Crawl a website with 10 workers to a depth of 3, saving output as JSON:
```bash
./goCrawler -url "https://www.vegalya.com" -depth 3 -workers 10 -output results.json
```

Crawl a website and save results as CSV:
```bash
./goCrawler -url "https://www.vegalya.com" -output results.csv
```

Crawl with custom timeout and rate limiting:
```bash
./goCrawler -url "https://www.vegalya.com" -timeout 5s -rate 200ms
```

## Output Format

### JSON Output

The JSON output contains:
- `results`: Array of crawled pages
- `count`: Number of pages crawled
- `timestamp`: When the crawl completed

Each page result includes:
- `title`: Page title
- `url`: The page URL
- `status_code`: HTTP status code
- `depth`: Crawl depth of this page
- `links`: Array of links found on the page
- `timestamp`: When this page was crawled
- `content_length`: Content length in bytes

### CSV Output

The CSV output contains one row per page with columns:
- URL
- Title
- Depth
- Timestamp
- StatusCode
- ContentLength
- LinksCount (number of links found)
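
Writing rows in that column order is straightforward with the standard `encoding/csv` package. The sketch below is an assumption about the shape of the code, not the project's implementation; the `row` struct and `toCSV` function are invented names.

```go
package main

import (
	"bytes"
	"encoding/csv"
	"fmt"
	"strconv"
)

// row mirrors the CSV columns listed above; field names are illustrative.
type row struct {
	URL, Title    string
	Depth         int
	Timestamp     string
	StatusCode    int
	ContentLength int
	LinksCount    int
}

// toCSV renders a header plus one record per page in the documented order.
func toCSV(rows []row) (string, error) {
	var buf bytes.Buffer
	w := csv.NewWriter(&buf)
	if err := w.Write([]string{"URL", "Title", "Depth", "Timestamp", "StatusCode", "ContentLength", "LinksCount"}); err != nil {
		return "", err
	}
	for _, r := range rows {
		rec := []string{
			r.URL, r.Title, strconv.Itoa(r.Depth), r.Timestamp,
			strconv.Itoa(r.StatusCode), strconv.Itoa(r.ContentLength), strconv.Itoa(r.LinksCount),
		}
		if err := w.Write(rec); err != nil {
			return "", err
		}
	}
	w.Flush()
	return buf.String(), w.Error()
}

func main() {
	s, err := toCSV([]row{{"https://example.com", "Example Domain", 0, "2024-01-01T00:00:00Z", 200, 1256, 12}})
	if err != nil {
		panic(err)
	}
	fmt.Print(s)
}
```

`csv.Writer` handles quoting automatically, so titles containing commas or quotes come out as valid CSV without extra escaping.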

## License

This project is licensed under the MIT License.

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.