An open API service indexing awesome lists of open source software.

https://github.com/pranav11024/smart-crawler

A high-performance, context-aware web crawler in Go with PostgreSQL backend. Features intelligent prioritization, duplicate detection, and benchmarking against traditional crawler
https://github.com/pranav11024/smart-crawler

concurrency context-aware data-mining go golang postgresql scraping web-crawler

Last synced: about 2 months ago
JSON representation

A high-performance, context-aware web crawler in Go with PostgreSQL backend. Features intelligent prioritization, duplicate detection, and benchmarking against traditional crawler

Awesome Lists containing this project

README

          

# Smart-Go-WebCrawler

A high-performance, context-aware web crawler built in Go with PostgreSQL backend. Features intelligent duplicate detection, content quality analysis, and priority-based crawling.

## 🚀 Features

### Smart Crawling
- **Context-Aware**: Analyzes page content and structure to make intelligent crawling decisions
- **Priority-Based**: Uses sophisticated algorithms to prioritize high-value content
- **Duplicate Detection**: Advanced hash-based duplicate detection to avoid redundant crawling
- **Content Quality Analysis**: Evaluates page quality using multiple metrics
- **Adaptive Rate Limiting**: Dynamic rate limiting based on page priority and server response

### Performance Optimizations
- **Concurrent Workers**: Configurable worker pool for parallel processing
- **Database Optimization**: Efficient PostgreSQL schema with proper indexing
- **Memory Management**: Optimized memory usage for large-scale crawling
- **Connection Pooling**: HTTP connection reuse for better performance

### Traditional vs Smart Comparison
The project includes comprehensive benchmarking between traditional breadth-first crawling and the smart approach, typically showing:
- **30-50% faster crawling** due to better prioritization
- **60-80% reduction in duplicate content** processing
- **40-60% improvement in content quality** of crawled pages
- **Reduced server load** through smarter request patterns

## 🛠️ Setup (Windows)

### Prerequisites
- Go 1.21 or later
- PostgreSQL 12 or later
-
### Installation

1. **Install PostgreSQL**
```bash
# Download and install PostgreSQL from https://www.postgresql.org/download/windows/
# Or use Chocolatey:
choco install postgresql
```

2. **Create Database**
```sql
-- Connect to PostgreSQL as superuser
psql -U postgres

-- Create database and user
CREATE DATABASE smart_crawler;
CREATE USER crawler_user WITH PASSWORD 'your_password';
GRANT ALL PRIVILEGES ON DATABASE smart_crawler TO crawler_user;
```

3. **Build**
```bash

go mod init smart-crawler
go mod tidy

# Build the project
go build -o smart-crawler.exe main.go
```

4. **Environment Configuration**
Create a `.env` file in the project root:
```env
DATABASE_URL=postgres://crawler_user:your_password@localhost/smart_crawler?sslmode=disable
USER_AGENT=SmartCrawler/1.0
REQUEST_TIMEOUT=30
RATE_LIMIT=100
```

## 🏃‍♂️ Usage

### Basic Crawling

```bash
# Smart crawling (default)
./smart-crawler.exe -url="https://example.com" -depth=3 -workers=10

# Traditional crawling
./smart-crawler.exe -mode=traditional -url="https://example.com" -depth=3 -workers=10

# Performance benchmark
./smart-crawler.exe -mode=benchmark -url="https://example.com" -depth=2 -workers=5
```

### Command Line Options

- `-mode`: Crawler mode (`smart`, `traditional`, `benchmark`)
- `-url`: Starting URL to crawl
- `-depth`: Maximum crawl depth (default: 3)
- `-workers`: Number of concurrent workers (default: 10)

## 🏗️ Architecture

### Project Structure
```
smart-crawler/
├── main.go # Application entry point
├── config/
│ └── config.go # Configuration management
├── models/
│ └── models.go # Data models and structures
├── crawler/
│ ├── traditional.go # Traditional BFS crawler
│ └── smart.go # Smart context-aware crawler
├── database/
│ └── postgres.go # PostgreSQL operations
├── utils/
│ └── utils.go # Utility functions
├── benchmark/
│ └── benchmark.go # Performance benchmarking
└── README.md
```

### Database Schema

```sql
-- Pages table stores crawled content
pages (
id SERIAL PRIMARY KEY,
url TEXT UNIQUE NOT NULL,
title TEXT,
content TEXT,
status_code INTEGER,
content_type TEXT,
size BIGINT,
load_time_ms BIGINT,
depth INTEGER,
parent_url TEXT,
crawled_at TIMESTAMP,
hash TEXT,
importance_score FLOAT,
content_quality FLOAT,
link_density FLOAT
);

-- Links table stores page relationships
links (
id SERIAL PRIMARY KEY,
source_id BIGINT REFERENCES pages(id),
target_id BIGINT REFERENCES pages(id),
url TEXT NOT NULL,
anchor TEXT,
rel TEXT
);

-- Crawl queue for smart crawler
crawl_queue (
id SERIAL PRIMARY KEY,
url TEXT UNIQUE NOT NULL,
priority INTEGER,
depth INTEGER,
parent_url TEXT,
scheduled_at TIMESTAMP,
attempts INTEGER,
status TEXT
);
```

## 🧠 Smart Crawler Algorithm

### 1. Content Analysis
- **Quality Scoring**: Analyzes text length, structure, meta tags
- **Importance Calculation**: Considers semantic content, navigation depth
- **Link Density**: Evaluates ratio of links to content

### 2. Priority Calculation
```go
priority = base_priority +
anchor_text_bonus +
semantic_bonus +
page_importance_bonus -
navigation_penalty
```

### 3. Duplicate Detection
- **Content Hashing**: MD5 hash comparison for exact duplicates
- **Similarity Detection**: Future enhancement for near-duplicate detection

### 4. Adaptive Rate Limiting
- **Priority-Based**: Higher priority pages get faster processing
- **Server-Respectful**: Dynamic delays based on server response

## 🚀 Performance Optimizations

### Go-Specific Optimizations
- **Goroutine Pool**: Efficient worker management
- **Channel-Based Communication**: Non-blocking inter-goroutine communication
- **Connection Pooling**: HTTP client reuse
- **Memory Management**: Efficient string handling and buffer reuse

### Database Optimizations
- **Indexed Queries**: Strategic indexing on frequently queried columns
- **Batch Operations**: Bulk inserts for better performance
- **Connection Pooling**: Database connection reuse
- **Prepared Statements**: Query optimization

## 🔧 Configuration Options

### Environment Variables
```env
DATABASE_URL=postgres://user:pass@localhost/db?sslmode=disable
USER_AGENT=SmartCrawler/1.0
REQUEST_TIMEOUT=30
RATE_LIMIT=100
```

### Crawler Parameters
- **Workers**: 1-50 (optimal: 5-15 for most sites)
- **Depth**: 1-10 (optimal: 2-5 for comprehensive crawling)
- **Rate Limit**: 1-1000 requests/minute

## 🛡️ Best Practices

### Respectful Crawling
- **robots.txt**: Respects robots.txt directives
- **Rate Limiting**: Configurable delays between requests
- **User Agent**: Clear identification in headers
- **Error Handling**: Graceful handling of server errors

### Resource Management
- **Memory Usage**: Efficient memory management for large crawls
- **Connection Limits**: Respects server connection limits
- **Timeout Management**: Proper timeout handling
- **Graceful Shutdown**: Clean shutdown on interruption

## 🔍 Monitoring and Logging

### Built-in Metrics
- Pages processed per second
- Error rates and types
- Memory usage statistics
- Database performance metrics

### Logging Levels
- **Info**: General crawling progress
- **Warning**: Recoverable errors
- **Error**: Critical failures
- **Debug**: Detailed debugging information

## 🚧 Future Enhancements

### Planned Features
- **JavaScript Rendering**: Headless browser integration
- **Machine Learning**: AI-powered content classification
- **Distributed Crawling**: Multi-node crawling support
- **Real-time Analytics**: Live crawling dashboard
- **API Interface**: RESTful API for remote control

### Scalability Improvements
- **Horizontal Scaling**: Multi-instance coordination
- **Cloud Storage**: S3/GCS integration
- **Message Queues**: Redis/RabbitMQ integration
- **Microservices**: Service decomposition