https://github.com/lifeislearningforever/wikipedia-crawler-hive

Production-ready Wikipedia crawler with PySpark and Apache Hive integration. Extracts article data and stores it in Hive with Parquet format and date partitioning.

# Amazon Crawler Safe Example

A production-ready web crawler built with PySpark, targeting Wikipedia articles for educational and demonstration purposes.

## Project Overview

This project implements a modular, scalable web crawler following SOLID design principles. It fetches, parses, transforms, and stores Wikipedia article metadata using PySpark 3.4.1 for distributed data processing.

### Why Wikipedia?

Wikipedia is chosen as a safe, legal target for web crawling demonstrations:
- Permissive robots.txt policies for reasonable crawling
- Freely licensed content (CC BY-SA) with clear reuse terms
- Stable HTML structure ideal for parsing examples
- Educational use is explicitly supported

**Legal Note:** Always review and respect robots.txt policies. This crawler implements robots.txt parsing and rate limiting. Use responsibly and only for educational purposes.
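
The two safeguards named in the note, robots.txt checks and rate limiting, can be sketched with the standard library alone. This is a minimal illustration, not the project's `fetcher.py`: the `USER_AGENT` value, the `build_robots` helper, the `RateLimiter` class, and the sample rules are all assumptions made for the example.

```python
import time
from urllib.robotparser import RobotFileParser

USER_AGENT = "example-crawler"  # hypothetical user agent string


def build_robots(lines):
    """Parse robots.txt rules from an iterable of lines (no network access)."""
    parser = RobotFileParser()
    parser.parse(lines)
    return parser


class RateLimiter:
    """Enforce a minimum delay between consecutive calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


# Sample rules for illustration only; a real crawler would load the
# site's actual robots.txt instead.
robots = build_robots([
    "User-agent: *",
    "Disallow: /w/",
    "Allow: /wiki/",
])
```

A fetcher would call `robots.can_fetch(USER_AGENT, url)` before each request and `RateLimiter.wait()` between requests. Injecting `clock` and `sleep` keeps the limiter testable without real delays.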

## SOLID Design Principles Mapping

| Principle | Implementation |
|-----------|----------------|
| **Single Responsibility** | Each module has one clear purpose: `fetcher.py` (HTTP requests), `parser.py` (HTML parsing), `transformer.py` (data validation), `writer.py` (persistence), `orchestrator.py` (workflow coordination) |
| **Open/Closed** | Abstract `Fetcher` base class allows extension without modifying core logic. New parsers can be added by implementing the parser interface |
| **Liskov Substitution** | `RequestsFetcher` can replace abstract `Fetcher` without breaking functionality. Any SparkSession can be injected into `SparkWriter` |
| **Interface Segregation** | Small, focused interfaces - parsers return simple dicts, transformers accept/return typed dicts, writers accept standard lists |
| **Dependency Inversion** | High-level `orchestrator` depends on abstractions (fetcher interface) not concrete implementations. SparkSession is injected, not created internally |
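
As a rough illustration of the Liskov Substitution and Dependency Inversion rows, the sketch below shows a caller that depends only on an abstract `Fetcher`. The `fetch` method name and signature, the `StaticFetcher` class, and the `crawl` function are assumptions for this example; the README names only `Fetcher` and `RequestsFetcher`.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Optional


class Fetcher(ABC):
    """Abstraction that high-level code depends on (method name assumed)."""

    @abstractmethod
    def fetch(self, url: str) -> Optional[str]:
        """Return the page HTML, or None if the page could not be fetched."""


class StaticFetcher(Fetcher):
    """Hypothetical in-memory fetcher, useful as a test double."""

    def __init__(self, pages: Dict[str, str]):
        self._pages = pages

    def fetch(self, url: str) -> Optional[str]:
        return self._pages.get(url)


def crawl(urls: List[str], fetcher: Fetcher) -> Dict[str, str]:
    """Fetch each URL through the injected fetcher, skipping failures."""
    results = {}
    for url in urls:
        html = fetcher.fetch(url)
        if html is not None:
            results[url] = html
    return results
```

Because `crawl` only sees the `Fetcher` interface, an HTTP-backed implementation and an in-memory one are interchangeable without touching the calling code.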

## Architecture

```
┌─────────────┐
│ Orchestrator│  (CLI entry point, coordinates workflow)
└──────┬──────┘
       │
       ├──> Fetcher (HTTP + robots.txt + rate limiting)
       │
       ├──> Parser (BeautifulSoup HTML extraction)
       │
       ├──> Transformer (Validation + schema mapping)
       │
       └──> Writer (PySpark DataFrame + Parquet/Hive)
```

## Prerequisites

- Python 3.9 or higher (Python 3.11+ recommended for full compatibility)
- Java 8 or 11 (required for PySpark)
- pip and virtualenv

## Quick Start

### 1. Create Virtual Environment and Install Dependencies

```bash
cd amazon_crawler_safe_example
make install
```

Or manually:

```bash
python3 -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
pip install --upgrade pip
pip install -r requirements.txt
```

### 2. Run Tests

```bash
make test
```

Or manually:

```bash
source venv/bin/activate
pytest -v tests/
```

### 3. Run Locally (Dry Run Mode)

```bash
make run-local
```

Or manually:

```bash
source venv/bin/activate
python -m src.orchestrator --seed seed_urls.txt --dry_run
```

This will:
- Read URLs from `seed_urls.txt`
- Fetch and parse Wikipedia articles
- Write results to `output/wikipedia_articles_YYYYMMDD_HHMMSS.parquet`
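
The timestamped output name above can be produced with `strftime`. This is a small sketch under the assumption that the timestamp is taken once per run; the function name `dry_run_output_path` is hypothetical.

```python
from datetime import datetime
from pathlib import Path


def dry_run_output_path(now: datetime, output_dir: str = "output") -> Path:
    """Build output/wikipedia_articles_YYYYMMDD_HHMMSS.parquet for a run."""
    stamp = now.strftime("%Y%m%d_%H%M%S")
    return Path(output_dir) / f"wikipedia_articles_{stamp}.parquet"
```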

### 4. Run with spark-submit (Production Mode)

For production deployment with Hive:

```bash
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 4 \
  --executor-memory 2G \
  --executor-cores 2 \
  src/orchestrator.py \
  --seed seed_urls.txt \
  --batch_size 100 \
  --rate_limit 2.0 \
  --dry_run False

**Note:** Ensure the Hive table exists before running in production mode (see Hive Setup below).

## Configuration Options

| Flag | Default | Description |
|------|---------|-------------|
| `--seed` | `seed_urls.txt` | Path to file containing seed URLs (one per line) |
| `--batch_size` | `50` | Number of records to batch before writing |
| `--rate_limit` | `1.0` | Seconds to wait between requests (respect robots.txt) |
| `--dry_run` | `True` | If True, writes to local parquet; if False, writes to Hive |
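
One way these flags could be declared with `argparse` is sketched below. It is an assumption about how the CLI might be wired, not the project's actual `orchestrator.py`; `str_to_bool` and `build_arg_parser` are hypothetical names. Using `nargs="?"` lets both the bare `--dry_run` form and the explicit `--dry_run False` form parse.

```python
import argparse


def str_to_bool(value: str) -> bool:
    """Convert common textual booleans ('true', 'False', '1', ...) to bool."""
    lowered = value.strip().lower()
    if lowered in {"true", "1", "yes"}:
        return True
    if lowered in {"false", "0", "no"}:
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


def build_arg_parser() -> argparse.ArgumentParser:
    """Declare the four flags from the table with their documented defaults."""
    parser = argparse.ArgumentParser(description="Wikipedia crawler (sketch)")
    parser.add_argument("--seed", default="seed_urls.txt",
                        help="file with one seed URL per line")
    parser.add_argument("--batch_size", type=int, default=50,
                        help="records to batch before writing")
    parser.add_argument("--rate_limit", type=float, default=1.0,
                        help="seconds to wait between requests")
    # A bare `--dry_run` uses const=True; `--dry_run False` goes through str_to_bool.
    parser.add_argument("--dry_run", type=str_to_bool, nargs="?",
                        const=True, default=True,
                        help="True: local Parquet; False: Hive")
    return parser
```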

## Hive Setup

Before running in production mode (`--dry_run False`), create the Hive table:

```sql
CREATE DATABASE IF NOT EXISTS default;

-- LOCATION below is the author's local path; point it at your own warehouse directory.
CREATE EXTERNAL TABLE IF NOT EXISTS default.wikipedia_articles (
  id STRING COMMENT 'Unique identifier (UUID)',
  title STRING COMMENT 'Article title',
  summary STRING COMMENT 'First paragraph summary',
  last_edited TIMESTAMP COMMENT 'Last edit timestamp from Wikipedia',
  source_url STRING COMMENT 'Original Wikipedia URL',
  crawl_ts TIMESTAMP COMMENT 'Timestamp when crawled'
)
PARTITIONED BY (dt STRING COMMENT 'Partition date YYYY-MM-DD')
STORED AS PARQUET
LOCATION '/Users/prakashhosalli/Personal_Data/Code/PythonProjects/amazon_crawler_safe_example/warehouse'
TBLPROPERTIES (
  'parquet.compression'='SNAPPY',
  'created_by'='amazon_crawler_safe_example'
);

-- After data is written, recover partitions:
MSCK REPAIR TABLE default.wikipedia_articles;
```
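
`MSCK REPAIR TABLE` discovers partitions from directories named `key=value` under the table location, so files for this table need to land in `dt=YYYY-MM-DD` subdirectories. The helper below sketches that layout; the function name `partition_dir` and the example location are hypothetical.

```python
from datetime import datetime
from pathlib import PurePosixPath


def partition_dir(table_location: str, crawl_ts: datetime) -> PurePosixPath:
    """Return the Hive-style partition directory for a crawl timestamp."""
    dt = crawl_ts.strftime("%Y-%m-%d")
    return PurePosixPath(table_location) / f"dt={dt}"


# Example with a made-up table location:
example = partition_dir("/warehouse/wikipedia_articles", datetime(2025, 12, 21, 10, 30))
```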

## Output Schema

| Column | Type | Description |
|--------|------|-------------|
| id | STRING | UUID v4 unique identifier |
| title | STRING | Wikipedia article title |
| summary | STRING | First paragraph text |
| last_edited | TIMESTAMP | Last modified timestamp |
| source_url | STRING | Source Wikipedia URL |
| crawl_ts | TIMESTAMP | Crawl execution timestamp |
| dt | STRING | Partition key (YYYY-MM-DD) |
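
The sketch below shows how a parsed article might be mapped onto this schema. It is an illustration under assumptions, not the project's `transformer.py`: the `ArticleRecord` type, the `to_record` function, and the rule that a missing title is rejected are all hypothetical.

```python
import uuid
from datetime import datetime, timezone
from typing import Optional, TypedDict


class ArticleRecord(TypedDict):
    id: str
    title: str
    summary: str
    last_edited: Optional[datetime]
    source_url: str
    crawl_ts: datetime
    dt: str


def to_record(parsed: dict, source_url: str,
              crawl_ts: Optional[datetime] = None) -> ArticleRecord:
    """Validate a parsed article and map it onto the output schema."""
    title = (parsed.get("title") or "").strip()
    if not title:
        raise ValueError("title is required")
    crawl_ts = crawl_ts or datetime.now(timezone.utc)
    return ArticleRecord(
        id=str(uuid.uuid4()),                      # UUID v4 identifier
        title=title,
        summary=(parsed.get("summary") or "").strip(),
        last_edited=parsed.get("last_edited"),
        source_url=source_url,
        crawl_ts=crawl_ts,
        dt=crawl_ts.strftime("%Y-%m-%d"),          # partition key
    )
```

Deriving `dt` from `crawl_ts` keeps the partition key consistent with the crawl timestamp stored in the same row.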

## Project Structure

```
amazon_crawler_safe_example/
├── README.md              # This file
├── requirements.txt       # Python dependencies
├── .gitignore             # Git ignore patterns
├── Makefile               # Build automation
├── seed_urls.txt          # Input URLs
├── src/
│   ├── __init__.py
│   ├── logger.py          # Structured JSON logging
│   ├── fetcher.py         # HTTP fetcher with robots.txt
│   ├── parser.py          # Wikipedia HTML parser
│   ├── transformer.py     # Data validation and transformation
│   ├── writer.py          # PySpark writer (Parquet/Hive)
│   ├── orchestrator.py    # Main CLI orchestrator
│   └── utils.py           # Utility functions
└── tests/
    ├── __init__.py
    ├── test_parser.py     # Parser unit tests
    ├── test_transformer.py # Transformer unit tests
    ├── test_fetcher.py    # Fetcher unit tests
    └── test_writer.py     # Writer unit tests
```

## Development

### Running Individual Tests

```bash
pytest tests/test_parser.py -v
pytest tests/test_transformer.py::test_transform_success -v
```

### Adding New Seed URLs

Edit `seed_urls.txt` and add Wikipedia URLs (one per line):

```
https://en.wikipedia.org/wiki/Python_(programming_language)
https://en.wikipedia.org/wiki/Apache_Spark
https://en.wikipedia.org/wiki/Data_science
https://en.wikipedia.org/wiki/Machine_learning
```
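
A seed file in this format could be loaded as sketched below. The function name `load_seed_urls` is hypothetical, and skipping `#` comment lines, dropping duplicates, and ignoring non-Wikipedia URLs are assumptions of this example rather than documented behavior.

```python
from typing import Iterable, List
from urllib.parse import urlparse


def load_seed_urls(lines: Iterable[str]) -> List[str]:
    """Return unique Wikipedia article URLs, preserving input order."""
    urls = []
    seen = set()
    for raw in lines:
        url = raw.strip()
        if not url or url.startswith("#"):
            continue  # skip blank lines and comments
        parsed = urlparse(url)
        host = parsed.hostname or ""
        if parsed.scheme != "https" or not host.endswith(".wikipedia.org"):
            continue  # only accept HTTPS Wikipedia hosts
        if not parsed.path.startswith("/wiki/"):
            continue  # only accept article paths
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls
```

Since an open file iterates line by line, `load_seed_urls(open_file)` works directly on a file handle for `seed_urls.txt`.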

### Viewing Logs

Logs are output in structured JSON format to stdout:

```json
{"ts": "2025-12-21T10:30:45.123456", "level": "INFO", "name": "orchestrator", "msg": "Starting crawl with 3 seed URLs"}
```
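
A formatter that emits this shape can be built on the standard `logging` module. The sketch below is one possible approach, not necessarily how `src/logger.py` is written; `JsonFormatter` and `get_logger` are hypothetical names.

```python
import json
import logging
import sys
from datetime import datetime


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": datetime.fromtimestamp(record.created).isoformat(timespec="microseconds"),
            "level": record.levelname,
            "name": record.name,
            "msg": record.getMessage(),
        }
        return json.dumps(payload)


def get_logger(name: str) -> logging.Logger:
    """Return a logger that writes JSON lines to stdout."""
    logger = logging.getLogger(name)
    if not logger.handlers:  # avoid attaching duplicate handlers
        handler = logging.StreamHandler(sys.stdout)
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

With this in place, `get_logger("orchestrator").info("Starting crawl with %d seed URLs", 3)` would print one JSON object with the `ts`, `level`, `name`, and `msg` keys.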

### Extending the Crawler

To add support for new websites:

1. Create a new parser class in `src/parser.py` implementing the same interface
2. Update `orchestrator.py` to use the appropriate parser based on URL domain
3. Follow the same structure: extract relevant fields, return dict with consistent keys
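
The steps above can be sketched as a small domain-to-parser lookup. Everything here is illustrative: the `Parser` protocol, the `parse` method name, `ExampleSiteParser`, the `PARSERS` mapping, and `parser_for` are assumptions, and the parser bodies are placeholders rather than real extraction logic.

```python
from typing import Dict, Protocol
from urllib.parse import urlparse


class Parser(Protocol):
    def parse(self, html: str) -> dict:
        """Return a dict with the keys title, summary, last_edited."""
        ...


class WikipediaParser:
    def parse(self, html: str) -> dict:
        # Placeholder: real extraction logic would go here.
        return {"title": None, "summary": None, "last_edited": None}


class ExampleSiteParser:
    """Hypothetical parser for a second site, returning the same keys."""

    def parse(self, html: str) -> dict:
        return {"title": None, "summary": None, "last_edited": None}


# Map a domain suffix to the parser responsible for it.
PARSERS: Dict[str, Parser] = {
    "wikipedia.org": WikipediaParser(),
    "example.org": ExampleSiteParser(),
}


def parser_for(url: str) -> Parser:
    """Pick a parser based on the URL's host name."""
    host = urlparse(url).hostname or ""
    for suffix, parser in PARSERS.items():
        if host == suffix or host.endswith("." + suffix):
            return parser
    raise LookupError(f"no parser registered for {host!r}")
```

Keeping the returned keys identical across parsers means the transformer and writer stages do not need to know which site a record came from.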

## Troubleshooting

### PySpark Issues

**Error:** `JAVA_HOME is not set`

```bash
# Run only the line that matches your OS
export JAVA_HOME=$(/usr/libexec/java_home) # macOS
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk # Linux (path varies by distribution)
```

### Robots.txt Blocked

If the fetcher returns "Disallowed by robots.txt":
- Verify the URL is a Wikipedia article
- Check robots.txt manually: https://en.wikipedia.org/robots.txt
- Increase `--rate_limit` to be more conservative

### Hive Table Not Found

Ensure you've created the table (see Hive Setup) and that your Spark session has Hive support enabled.

## Performance Tuning

- **Batch size:** Increase `--batch_size` for better write performance (diminishing returns above 500)
- **Rate limiting:** Decrease `--rate_limit` only if allowed by robots.txt (minimum 0.5s recommended)
- **Spark resources:** Adjust executor memory/cores based on workload and cluster capacity
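
To make the batch-size trade-off concrete, the sketch below groups records into fixed-size batches so that each write handles up to `batch_size` rows. It is a generic illustration, not the project's code; the `batched` helper is a hypothetical name.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch
```

Larger batches mean fewer write calls and fewer small output files, at the cost of holding more records in memory before each write.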

## License

This project is for educational purposes only. Wikipedia content is licensed under CC BY-SA 4.0.

## Contributing

This is a demonstration project. For production use, consider:
- Distributed fetching with Scrapy or similar
- Deduplication logic
- Incremental crawling with state tracking
- Monitoring and alerting
- Error recovery and checkpointing