# Parallel Web Crawler

A high-performance web crawler with concurrent processing capabilities written in Go.

## Features

- URL filtering and normalization
- Link extraction from HTML pages
- Rate limiting and timeout support
- Logging and graceful error handling
- Results export to JSON or CSV formats
- Configurable crawl depth and concurrency
- Parallel crawling using a worker pool architecture
- Domain-specific crawling (stays within the same domain)
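
The worker-pool model mentioned above can be sketched as follows. This is a minimal illustration, not the project's actual code: `fetch` stands in for the real HTTP fetch and link extraction, and `crawlAll` is an invented name showing how a fixed number of goroutines can drain a shared job channel.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// fetch stands in for the real HTTP fetch + link-extraction step.
func fetch(url string) string { return "fetched:" + url }

// crawlAll fans the seed URLs out to a fixed pool of workers
// (corresponding to the -workers flag) and collects their results.
func crawlAll(urls []string, workers int) []string {
	jobs := make(chan string)
	results := make(chan string)
	var wg sync.WaitGroup

	// Start the worker pool: each worker pulls URLs until jobs closes.
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for u := range jobs {
				results <- fetch(u)
			}
		}()
	}

	// Feed the seed URLs, then close the channel to stop the workers.
	go func() {
		for _, u := range urls {
			jobs <- u
		}
		close(jobs)
	}()

	// Close results once every worker has exited.
	go func() {
		wg.Wait()
		close(results)
	}()

	var out []string
	for r := range results {
		out = append(out, r)
	}
	sort.Strings(out) // deterministic ordering for display
	return out
}

func main() {
	fmt.Println(crawlAll([]string{"https://example.com/a", "https://example.com/b"}, 3))
}
```

A depth-limited crawler would additionally track visited URLs and re-enqueue extracted links until the depth limit is reached; this sketch only shows the fan-out/fan-in core.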

## Installation

### Prerequisites

- Go 1.20 or higher

### Steps

1. Clone this repository:
```bash
git clone https://github.com/Taiizor/goCrawler.git
cd goCrawler
```

2. Build the application:
```bash
go build -o goCrawler
```

## Usage

Run the crawler with the following command:

```bash
./goCrawler -url "https://www.vegalya.com" -depth 3 -workers 10 -output results.json
```

### Command Line Flags

| Flag | Description | Default |
|------|-------------|---------|
| `-url` | Starting URL for crawling | (required) |
| `-depth` | Maximum crawling depth | 2 |
| `-workers` | Number of concurrent workers | 5 |
| `-timeout` | HTTP request timeout | 10s |
| `-rate` | Rate limit between requests | 100ms |
| `-output` | Output file name (JSON or CSV) | results.json |

## Examples

Crawl a website with 10 workers to a depth of 3, saving output as JSON:
```bash
./goCrawler -url "https://www.vegalya.com" -depth 3 -workers 10 -output results.json
```

Crawl a website and save results as CSV:
```bash
./goCrawler -url "https://www.vegalya.com" -output results.csv
```

Crawl with custom timeout and rate limiting:
```bash
./goCrawler -url "https://www.vegalya.com" -timeout 5s -rate 200ms
```

## Output Format

### JSON Output

The JSON output contains:
- `results`: Array of crawled pages
- `count`: Number of pages crawled
- `timestamp`: When the crawl completed

Each page result includes:
- `title`: Page title
- `url`: The page URL
- `status_code`: HTTP status code
- `depth`: Crawl depth of this page
- `links`: Array of links found on the page
- `timestamp`: When this page was crawled
- `content_length`: Content length in bytes

### CSV Output

The CSV output contains one row per page with columns:
- URL
- Title
- Depth
- Timestamp
- StatusCode
- ContentLength
- LinksCount (number of links found)
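
Writing rows in that column order is straightforward with the standard `encoding/csv` package. The sketch below is an assumption about the shape of the code, not the project's implementation; the `row` struct and `toCSV` function are invented names.

```go
package main

import (
	"bytes"
	"encoding/csv"
	"fmt"
	"strconv"
)

// row mirrors the CSV columns listed above; field names are illustrative.
type row struct {
	URL, Title    string
	Depth         int
	Timestamp     string
	StatusCode    int
	ContentLength int
	LinksCount    int
}

// toCSV renders a header plus one record per page in the documented order.
func toCSV(rows []row) (string, error) {
	var buf bytes.Buffer
	w := csv.NewWriter(&buf)
	if err := w.Write([]string{"URL", "Title", "Depth", "Timestamp", "StatusCode", "ContentLength", "LinksCount"}); err != nil {
		return "", err
	}
	for _, r := range rows {
		rec := []string{
			r.URL, r.Title, strconv.Itoa(r.Depth), r.Timestamp,
			strconv.Itoa(r.StatusCode), strconv.Itoa(r.ContentLength), strconv.Itoa(r.LinksCount),
		}
		if err := w.Write(rec); err != nil {
			return "", err
		}
	}
	w.Flush()
	return buf.String(), w.Error()
}

func main() {
	s, err := toCSV([]row{{"https://example.com", "Example Domain", 0, "2024-01-01T00:00:00Z", 200, 1256, 12}})
	if err != nil {
		panic(err)
	}
	fmt.Print(s)
}
```

`csv.Writer` handles quoting automatically, so titles containing commas or quotes come out as valid CSV without extra escaping.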

## License

This project is licensed under the MIT License.

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.