https://github.com/lifeislearningforever/wikipedia-crawler-hive
Production-ready Wikipedia crawler with PySpark and Apache Hive integration. Extracts article data and stores it in Hive with Parquet format and date partitioning.
apache-hive data-engineering data-pipeline parquet pyspark python web-scraping wikipedia
- Host: GitHub
- URL: https://github.com/lifeislearningforever/wikipedia-crawler-hive
- Owner: lifeislearningforever
- Created: 2025-12-22T12:12:53.000Z (6 months ago)
- Default Branch: main
- Last Pushed: 2025-12-22T12:14:22.000Z (6 months ago)
- Last Synced: 2025-12-23T23:21:23.472Z (6 months ago)
- Topics: apache-hive, data-engineering, data-pipeline, parquet, pyspark, python, web-scraping, wikipedia
- Language: Python
- Size: 60.5 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Amazon Crawler Safe Example
A production-ready web crawler built with PySpark, targeting Wikipedia articles for educational and demonstration purposes.
## Project Overview
This project implements a modular, scalable web crawler following SOLID design principles. It fetches, parses, transforms, and stores Wikipedia article metadata using PySpark 3.4.1 for distributed data processing.
### Why Wikipedia?
Wikipedia is chosen as a safe, legal target for web crawling demonstrations:
- Permissive robots.txt policies for reasonable crawling
- Freely licensed content (CC BY-SA, not public domain) with clear reuse terms
- Stable HTML structure ideal for parsing examples
- Educational use is explicitly supported
**Legal Note:** Always review and respect robots.txt policies. This crawler implements robots.txt parsing and rate limiting. Use responsibly and only for educational purposes.
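The robots.txt check and rate limiting mentioned above can be sketched with the standard library alone. This is a minimal illustration, not the project's `fetcher.py`: the `PoliteGate` class, the `example-crawler` user agent, and the sample robots.txt body are all hypothetical names chosen for the example.

```python
import time
import urllib.robotparser

# Hypothetical robots.txt body, used only for this illustration.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""


class PoliteGate:
    """Checks robots.txt rules and enforces a minimum delay between requests."""

    def __init__(self, robots_lines, user_agent="example-crawler", min_delay=1.0):
        self.parser = urllib.robotparser.RobotFileParser()
        self.parser.parse(robots_lines)  # parse rules that are already in memory
        self.user_agent = user_agent
        self.min_delay = min_delay
        self._last_request = None

    def allowed(self, url):
        """Return True if robots.txt permits this user agent to fetch the URL."""
        return self.parser.can_fetch(self.user_agent, url)

    def wait(self, clock=time.monotonic, sleep=time.sleep):
        """Sleep just long enough to keep min_delay seconds between calls."""
        now = clock()
        if self._last_request is not None:
            remaining = self.min_delay - (now - self._last_request)
            if remaining > 0:
                sleep(remaining)
                now = clock()
        self._last_request = now


gate = PoliteGate(SAMPLE_ROBOTS.splitlines(), min_delay=1.0)
```

A caller would check `gate.allowed(url)` and then call `gate.wait()` before each request. The `clock` and `sleep` parameters exist only so the delay logic can be tested without real waiting.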
## SOLID Design Principles Mapping
| Principle | Implementation |
|-----------|----------------|
| **Single Responsibility** | Each module has one clear purpose: `fetcher.py` (HTTP requests), `parser.py` (HTML parsing), `transformer.py` (data validation), `writer.py` (persistence), `orchestrator.py` (workflow coordination) |
| **Open/Closed** | Abstract `Fetcher` base class allows extension without modifying core logic. New parsers can be added by implementing the parser interface |
| **Liskov Substitution** | `RequestsFetcher` can replace abstract `Fetcher` without breaking functionality. Any SparkSession can be injected into `SparkWriter` |
| **Interface Segregation** | Small, focused interfaces - parsers return simple dicts, transformers accept/return typed dicts, writers accept standard lists |
| **Dependency Inversion** | High-level `orchestrator` depends on abstractions (fetcher interface) not concrete implementations. SparkSession is injected, not created internally |
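As a rough illustration of the Open/Closed, Liskov Substitution, and Dependency Inversion rows, the sketch below defines an abstract `Fetcher` with two interchangeable implementations. The `fetch` signature, `UrllibFetcher`, `StaticFetcher`, and `check_reachable` are assumptions made for this example; the project's real `RequestsFetcher` may look different.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Optional
import urllib.request


class Fetcher(ABC):
    """Abstraction that high-level code depends on."""

    @abstractmethod
    def fetch(self, url: str) -> Optional[str]:
        """Return the page HTML, or None if the page could not be fetched."""


class UrllibFetcher(Fetcher):
    """Standard-library stand-in for the project's RequestsFetcher."""

    def fetch(self, url: str) -> Optional[str]:
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                return resp.read().decode("utf-8", errors="replace")
        except OSError:  # URLError and socket timeouts are OSError subclasses
            return None


class StaticFetcher(Fetcher):
    """Test double that serves canned HTML instead of using the network."""

    def __init__(self, pages: Dict[str, str]):
        self.pages = pages

    def fetch(self, url: str) -> Optional[str]:
        return self.pages.get(url)


def check_reachable(fetcher: Fetcher, urls: List[str]) -> Dict[str, bool]:
    """High-level code sees only the Fetcher interface, never a concrete class."""
    return {url: fetcher.fetch(url) is not None for url in urls}
```

Because `check_reachable` accepts any `Fetcher`, tests can inject `StaticFetcher` while a real run injects a network-backed implementation.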
## Architecture
```
┌─────────────┐
│ Orchestrator│  (CLI entry point, coordinates workflow)
└──────┬──────┘
       │
       ├──> Fetcher (HTTP + robots.txt + rate limiting)
       │
       ├──> Parser (BeautifulSoup HTML extraction)
       │
       ├──> Transformer (Validation + schema mapping)
       │
       └──> Writer (PySpark DataFrame + Parquet/Hive)
```
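The data flow in the diagram can be sketched as a single loop that passes each URL through the four stages. The `run_pipeline` function and its callable signatures are hypothetical, chosen to keep the example free of third-party dependencies; they are not taken from `orchestrator.py`.

```python
from typing import Callable, Iterable, List, Optional


def run_pipeline(
    urls: Iterable[str],
    fetch: Callable[[str], Optional[str]],
    parse: Callable[[str], dict],
    transform: Callable[[dict, str], Optional[dict]],
    write: Callable[[List[dict]], None],
) -> int:
    """Push each URL through fetch -> parse -> transform, then write once."""
    records = []
    for url in urls:
        html = fetch(url)
        if html is None:  # fetch failed or was disallowed
            continue
        record = transform(parse(html), url)
        if record is not None:  # transformer rejected the parsed data
            records.append(record)
    write(records)
    return len(records)
```

Failures at the fetch or transform stage skip that URL rather than aborting the run, which is one reasonable policy for a crawler; the actual project may handle errors differently.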
## Prerequisites
- Python 3.9 or higher (Python 3.11+ recommended for full compatibility)
- Java 8 or 11 (required for PySpark)
- pip and virtualenv
## Quick Start
### 1. Create Virtual Environment and Install Dependencies
```bash
cd amazon_crawler_safe_example
make install
```
Or manually:
```bash
python3 -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
pip install --upgrade pip
pip install -r requirements.txt
```
### 2. Run Tests
```bash
make test
```
Or manually:
```bash
source venv/bin/activate
pytest -v tests/
```
### 3. Run Locally (Dry Run Mode)
```bash
make run-local
```
Or manually:
```bash
source venv/bin/activate
python -m src.orchestrator --seed seed_urls.txt --dry_run
```
This will:
- Read URLs from `seed_urls.txt`
- Fetch and parse Wikipedia articles
- Write results to `output/wikipedia_articles_YYYYMMDD_HHMMSS.parquet`
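The timestamped output name can be produced with `strftime`. The helper below is a small sketch under the assumption that the timestamp is taken at write time; `dry_run_output_path` is a hypothetical name.

```python
from datetime import datetime
from pathlib import PurePosixPath


def dry_run_output_path(now: datetime, base_dir: str = "output") -> str:
    """Build output/wikipedia_articles_YYYYMMDD_HHMMSS.parquet for a given time."""
    stamp = now.strftime("%Y%m%d_%H%M%S")
    return str(PurePosixPath(base_dir) / f"wikipedia_articles_{stamp}.parquet")


print(dry_run_output_path(datetime(2025, 12, 21, 10, 30, 45)))
# output/wikipedia_articles_20251221_103045.parquet
```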
### 4. Run with spark-submit (Production Mode)
For production deployment with Hive:
```bash
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 4 \
  --executor-memory 2G \
  --executor-cores 2 \
  src/orchestrator.py \
  --seed seed_urls.txt \
  --batch_size 100 \
  --rate_limit 2.0 \
  --dry_run False
**Note:** Ensure the Hive table exists before running in production mode (see Hive Setup below).
## Configuration Options
| Flag | Default | Description |
|------|---------|-------------|
| `--seed` | `seed_urls.txt` | Path to file containing seed URLs (one per line) |
| `--batch_size` | `50` | Number of records to batch before writing |
| `--rate_limit` | `1.0` | Seconds to wait between requests (respect robots.txt) |
| `--dry_run` | `True` | If True, writes to local parquet; if False, writes to Hive |
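One way to wire these flags with `argparse` is sketched below. It is an assumption about how the CLI could be built, not a copy of `orchestrator.py`. The `nargs="?"` and `const=True` combination lets both `--dry_run` (no value) and `--dry_run False` work, matching the two invocation styles used in this README.

```python
import argparse


def str2bool(value: str) -> bool:
    """Convert common true/false spellings to a bool for argparse."""
    lowered = value.strip().lower()
    if lowered in {"true", "1", "yes"}:
        return True
    if lowered in {"false", "0", "no"}:
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Wikipedia crawler (sketch)")
    parser.add_argument("--seed", default="seed_urls.txt",
                        help="file with seed URLs, one per line")
    parser.add_argument("--batch_size", type=int, default=50,
                        help="records to batch before writing")
    parser.add_argument("--rate_limit", type=float, default=1.0,
                        help="seconds to wait between requests")
    # A bare --dry_run means True; an explicit value such as False is parsed.
    parser.add_argument("--dry_run", type=str2bool, nargs="?", const=True,
                        default=True, help="write local Parquet instead of Hive")
    return parser
```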
## Hive Setup
Before running in production mode (`--dry_run False`), create the Hive table:
```sql
CREATE DATABASE IF NOT EXISTS default;

-- NOTE: the LOCATION below is the author's local path; replace it with your own warehouse directory.
CREATE EXTERNAL TABLE IF NOT EXISTS default.wikipedia_articles (
    id          STRING    COMMENT 'Unique identifier (UUID)',
    title       STRING    COMMENT 'Article title',
    summary     STRING    COMMENT 'First paragraph summary',
    last_edited TIMESTAMP COMMENT 'Last edit timestamp from Wikipedia',
    source_url  STRING    COMMENT 'Original Wikipedia URL',
    crawl_ts    TIMESTAMP COMMENT 'Timestamp when crawled'
)
PARTITIONED BY (dt STRING COMMENT 'Partition date YYYY-MM-DD')
STORED AS PARQUET
LOCATION '/Users/prakashhosalli/Personal_Data/Code/PythonProjects/amazon_crawler_safe_example/warehouse'
TBLPROPERTIES (
    'parquet.compression'='SNAPPY',
    'created_by'='amazon_crawler_safe_example'
);

-- After data is written, recover partitions:
MSCK REPAIR TABLE default.wikipedia_articles;
```
## Output Schema
| Column | Type | Description |
|--------|------|-------------|
| id | STRING | UUID v4 unique identifier |
| title | STRING | Wikipedia article title |
| summary | STRING | First paragraph text |
| last_edited | TIMESTAMP | Last modified timestamp |
| source_url | STRING | Source Wikipedia URL |
| crawl_ts | TIMESTAMP | Crawl execution timestamp |
| dt | STRING | Partition key (YYYY-MM-DD) |
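A record matching this schema could be assembled as in the sketch below. Two points are assumptions rather than facts from the project: that `dt` is derived from `crawl_ts`, and that a missing title makes a record invalid. The `to_record` name is hypothetical.

```python
import uuid
from datetime import datetime, timezone
from typing import Optional


def to_record(parsed: dict, source_url: str, crawl_ts: datetime) -> Optional[dict]:
    """Map a parsed article onto the output schema; return None if invalid."""
    title = (parsed.get("title") or "").strip()
    if not title:
        return None  # a title is treated as mandatory in this sketch
    return {
        "id": str(uuid.uuid4()),
        "title": title,
        "summary": (parsed.get("summary") or "").strip(),
        "last_edited": parsed.get("last_edited"),  # datetime or None
        "source_url": source_url,
        "crawl_ts": crawl_ts,
        "dt": crawl_ts.strftime("%Y-%m-%d"),  # partition key (assumed: crawl date)
    }


example = to_record(
    {"title": " Apache Spark ", "summary": "An analytics engine."},
    "https://en.wikipedia.org/wiki/Apache_Spark",
    datetime(2025, 12, 21, 10, 30, 45, tzinfo=timezone.utc),
)
```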
## Project Structure
```
amazon_crawler_safe_example/
├── README.md # This file
├── requirements.txt # Python dependencies
├── .gitignore # Git ignore patterns
├── Makefile # Build automation
├── seed_urls.txt # Input URLs
├── src/
│ ├── __init__.py
│ ├── logger.py # Structured JSON logging
│ ├── fetcher.py # HTTP fetcher with robots.txt
│ ├── parser.py # Wikipedia HTML parser
│ ├── transformer.py # Data validation and transformation
│ ├── writer.py # PySpark writer (Parquet/Hive)
│ ├── orchestrator.py # Main CLI orchestrator
│ └── utils.py # Utility functions
└── tests/
    ├── __init__.py
    ├── test_parser.py       # Parser unit tests
    ├── test_transformer.py  # Transformer unit tests
    ├── test_fetcher.py      # Fetcher unit tests
    └── test_writer.py       # Writer unit tests
```
## Development
### Running Individual Tests
```bash
pytest tests/test_parser.py -v
pytest tests/test_transformer.py::test_transform_success -v
```
### Adding New Seed URLs
Edit `seed_urls.txt` and add Wikipedia URLs (one per line):
```
https://en.wikipedia.org/wiki/Python_(programming_language)
https://en.wikipedia.org/wiki/Apache_Spark
https://en.wikipedia.org/wiki/Data_science
https://en.wikipedia.org/wiki/Machine_learning
```
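A seed file in this format could be loaded and filtered as follows. This is a sketch only: skipping `#` comment lines and rejecting non-Wikipedia hosts are assumptions about sensible behaviour, and `load_seed_urls` is a hypothetical name.

```python
from typing import Iterable, List
from urllib.parse import urlparse


def load_seed_urls(lines: Iterable[str]) -> List[str]:
    """Return Wikipedia article URLs, skipping blanks, comments, and other hosts."""
    urls = []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        parts = urlparse(line)
        host = parts.hostname or ""
        if (
            parts.scheme == "https"
            and host.endswith(".wikipedia.org")
            and parts.path.startswith("/wiki/")
        ):
            urls.append(line)
    return urls
```

With a real file, the call would be `load_seed_urls(open("seed_urls.txt", encoding="utf-8"))`, since a file object iterates line by line.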
### Viewing Logs
Logs are output in structured JSON format to stdout:
```json
{"ts": "2025-12-21T10:30:45.123456", "level": "INFO", "name": "orchestrator", "msg": "Starting crawl with 3 seed URLs"}
```
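A formatter that produces lines with these four keys can be built on the standard `logging` module. The sketch below is one possible approach and is not the project's `logger.py`; `JsonFormatter` and `get_logger` are hypothetical names.

```python
import json
import logging
import sys
from datetime import datetime


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": datetime.fromtimestamp(record.created).isoformat(),
            "level": record.levelname,
            "name": record.name,
            "msg": record.getMessage(),
        }
        return json.dumps(payload)


def get_logger(name: str, stream=None) -> logging.Logger:
    """Return a logger that writes JSON lines to stdout (or the given stream)."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler(stream if stream is not None else sys.stdout)
    handler.setFormatter(JsonFormatter())
    logger.handlers.clear()   # avoid duplicate lines on repeated calls
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep the root logger from re-emitting the record
    return logger
```

Calling `get_logger("orchestrator").info("Starting crawl with %d seed URLs", 3)` emits one JSON line per message.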
### Extending the Crawler
To add support for new websites:
1. Create a new parser class in `src/parser.py` implementing the same interface
2. Update `orchestrator.py` to use the appropriate parser based on URL domain
3. Follow the same structure: extract relevant fields, return dict with consistent keys
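The steps above can be sketched as a small registry that picks a parser by URL domain. Everything here is illustrative: the `parse(html) -> dict` interface, the class names, and the placeholder return values are assumptions, and real parsers would extract fields from the HTML.

```python
from typing import Dict
from urllib.parse import urlparse


class WikipediaParser:
    """Placeholder for the existing Wikipedia parser."""

    def parse(self, html: str) -> dict:
        return {"title": None, "summary": None, "last_edited": None}


class ExampleOrgParser:
    """Hypothetical parser for a new site; it returns the same keys."""

    def parse(self, html: str) -> dict:
        return {"title": None, "summary": None, "last_edited": None}


# Map a domain suffix to the parser instance that handles it.
PARSERS: Dict[str, object] = {
    "wikipedia.org": WikipediaParser(),
    "example.org": ExampleOrgParser(),
}


def parser_for(url: str):
    """Return the parser registered for the URL's domain or a parent domain."""
    host = urlparse(url).hostname or ""
    for suffix, parser in PARSERS.items():
        if host == suffix or host.endswith("." + suffix):
            return parser
    raise LookupError(f"no parser registered for {host!r}")
```

Matching on `"." + suffix` rather than a bare suffix keeps a host such as `notwikipedia.org` from being routed to the Wikipedia parser.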
## Troubleshooting
### PySpark Issues
**Error:** `JAVA_HOME is not set`
```bash
export JAVA_HOME=$(/usr/libexec/java_home) # macOS
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk # Linux
```
### Robots.txt Blocked
If the fetcher returns "Disallowed by robots.txt":
- Verify the URL is a Wikipedia article
- Check robots.txt manually: https://en.wikipedia.org/robots.txt
- Increase `--rate_limit` to be more conservative
### Hive Table Not Found
Ensure you've created the table (see Hive Setup) and that your Spark session has Hive support enabled.
## Performance Tuning
- **Batch size:** Increase `--batch_size` for better write performance (diminishing returns above 500)
- **Rate limiting:** Decrease `--rate_limit` only if allowed by robots.txt (minimum 0.5s recommended)
- **Spark resources:** Adjust executor memory/cores based on workload and cluster capacity
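To make the batch-size setting concrete, the sketch below groups records into fixed-size batches before each write. It illustrates the general technique only; `batched` is a hypothetical helper and the project may batch differently.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def batched(items: Iterable, batch_size: int) -> Iterator[List]:
    """Yield lists of up to batch_size items; the last batch may be shorter."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch
```

Larger batches mean fewer write calls, at the cost of holding more records in memory between writes.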
## License
This project is for educational purposes only. Wikipedia text is licensed under CC BY-SA 4.0.
## Contributing
This is a demonstration project. For production use, consider:
- Distributed fetching with Scrapy or similar
- Deduplication logic
- Incremental crawling with state tracking
- Monitoring and alerting
- Error recovery and checkpointing