{"id":34501330,"url":"https://github.com/lifeislearningforever/wikipedia-crawler-hive","last_synced_at":"2026-06-06T14:31:49.923Z","repository":{"id":329942812,"uuid":"1121084639","full_name":"lifeislearningforever/wikipedia-crawler-hive","owner":"lifeislearningforever","description":"Production-ready Wikipedia crawler with PySpark and Apache Hive integration. Extracts article data and stores it in Hive with Parquet format and date partitioning.","archived":false,"fork":false,"pushed_at":"2025-12-22T12:14:22.000Z","size":62,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-12-23T23:21:23.472Z","etag":null,"topics":["apache-hive","data-engineering","data-pipeline","parquet","pyspark","python","web-scraping","wikipedia"],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/lifeislearningforever.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-12-22T12:12:53.000Z","updated_at":"2025-12-22T12:14:44.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/lifeislearningforever/wikipedia-crawler-hive","commit_stats":null,"previous_names":["lifeislearningforever/wikipedia-crawler-hive"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/lifeislearningforever/wikipedia-crawler-hive","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/rep
ositories/lifeislearningforever%2Fwikipedia-crawler-hive","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lifeislearningforever%2Fwikipedia-crawler-hive/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lifeislearningforever%2Fwikipedia-crawler-hive/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lifeislearningforever%2Fwikipedia-crawler-hive/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/lifeislearningforever","download_url":"https://codeload.github.com/lifeislearningforever/wikipedia-crawler-hive/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lifeislearningforever%2Fwikipedia-crawler-hive/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":27992996,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-12-24T02:00:07.193Z","response_time":83,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["apache-hive","data-engineering","data-pipeline","parquet","pyspark","python","web-scraping","wikipedia"],"created_at":"2025-12-24T02:00:58.741Z","updated_at":"2025-12-24T02:01:45.342Z","avatar_url":"https://github.com/lifeislearningforever.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Amazon Crawler Safe Example\n\nA 
production-ready web crawler built with PySpark, targeting Wikipedia articles for educational and demonstration purposes.\n\n## Project Overview\n\nThis project implements a modular, scalable web crawler following SOLID design principles. It fetches, parses, transforms, and stores Wikipedia article metadata using PySpark 3.4.1 for distributed data processing.\n\n### Why Wikipedia?\n\nWikipedia is chosen as a safe, legal target for web crawling demonstrations:\n- Permissive robots.txt policies for reasonable crawling\n- Freely licensed content (CC BY-SA) with clear reuse terms\n- Stable HTML structure ideal for parsing examples\n- Educational use is explicitly supported\n\n**Legal Note:** Always review and respect robots.txt policies. This crawler implements robots.txt parsing and rate limiting. Use responsibly and only for educational purposes.\n\n## SOLID Design Principles Mapping\n\n| Principle | Implementation |\n|-----------|----------------|\n| **Single Responsibility** | Each module has one clear purpose: `fetcher.py` (HTTP requests), `parser.py` (HTML parsing), `transformer.py` (data validation), `writer.py` (persistence), `orchestrator.py` (workflow coordination) |\n| **Open/Closed** | Abstract `Fetcher` base class allows extension without modifying core logic. New parsers can be added by implementing the parser interface |\n| **Liskov Substitution** | `RequestsFetcher` can replace abstract `Fetcher` without breaking functionality. Any SparkSession can be injected into `SparkWriter` |\n| **Interface Segregation** | Small, focused interfaces - parsers return simple dicts, transformers accept/return typed dicts, writers accept standard lists |\n| **Dependency Inversion** | High-level `orchestrator` depends on abstractions (fetcher interface), not concrete implementations. 
SparkSession is injected, not created internally |\n\n## Architecture\n\n```\n┌─────────────┐\n│ Orchestrator│ (CLI entry point, coordinates workflow)\n└──────┬──────┘\n       │\n       ├──\u003e Fetcher (HTTP + robots.txt + rate limiting)\n       │\n       ├──\u003e Parser (BeautifulSoup HTML extraction)\n       │\n       ├──\u003e Transformer (Validation + schema mapping)\n       │\n       └──\u003e Writer (PySpark DataFrame + Parquet/Hive)\n```\n\n## Prerequisites\n\n- Python 3.9 or higher (Python 3.11+ recommended for full compatibility)\n- Java 8 or 11 (required for PySpark)\n- pip and virtualenv\n\n## Quick Start\n\n### 1. Create Virtual Environment and Install Dependencies\n\n```bash\ncd amazon_crawler_safe_example\nmake install\n```\n\nOr manually:\n\n```bash\npython3 -m venv venv\nsource venv/bin/activate  # On Windows: venv\\Scripts\\activate\npip install --upgrade pip\npip install -r requirements.txt\n```\n\n### 2. Run Tests\n\n```bash\nmake test\n```\n\nOr manually:\n\n```bash\nsource venv/bin/activate\npytest -v tests/\n```\n\n### 3. Run Locally (Dry Run Mode)\n\n```bash\nmake run-local\n```\n\nOr manually:\n\n```bash\nsource venv/bin/activate\npython -m src.orchestrator --seed seed_urls.txt --dry_run\n```\n\nThis will:\n- Read URLs from `seed_urls.txt`\n- Fetch and parse Wikipedia articles\n- Write results to `output/wikipedia_articles_YYYYMMDD_HHMMSS.parquet`\n\n### 4. 
Run with spark-submit (Production Mode)\n\nFor production deployment with Hive:\n\n```bash\nspark-submit \\\n  --master yarn \\\n  --deploy-mode cluster \\\n  --num-executors 4 \\\n  --executor-memory 2G \\\n  --executor-cores 2 \\\n  src/orchestrator.py \\\n  --seed seed_urls.txt \\\n  --batch_size 100 \\\n  --rate_limit 2.0\n```\n\n**Note:** Ensure Hive table exists before running in production mode (see Hive Setup below).\n\n## Configuration Options\n\n| Flag | Default | Description |\n|------|---------|-------------|\n| `--seed` | `seed_urls.txt` | Path to file containing seed URLs (one per line) |\n| `--batch_size` | `50` | Number of records to batch before writing |\n| `--rate_limit` | `1.0` | Seconds to wait between requests (respect robots.txt) |\n| `--dry_run` | `True` | If True, writes to local parquet; if False, writes to Hive |\n\n## Hive Setup\n\nBefore running in production mode (`--dry_run False`), create the Hive table:\n\n```sql\nCREATE DATABASE IF NOT EXISTS default;\n\nCREATE EXTERNAL TABLE IF NOT EXISTS default.wikipedia_articles (\n    id STRING COMMENT 'Unique identifier (UUID)',\n    title STRING COMMENT 'Article title',\n    summary STRING COMMENT 'First paragraph summary',\n    last_edited TIMESTAMP COMMENT 'Last edit timestamp from Wikipedia',\n    source_url STRING COMMENT 'Original Wikipedia URL',\n    crawl_ts TIMESTAMP COMMENT 'Timestamp when crawled'\n)\nPARTITIONED BY (dt STRING COMMENT 'Partition date YYYY-MM-DD')\nSTORED AS PARQUET\nLOCATION '/Users/prakashhosalli/Personal_Data/Code/PythonProjects/amazon_crawler_safe_example/warehouse'\nTBLPROPERTIES (\n    'parquet.compression'='SNAPPY',\n    'created_by'='amazon_crawler_safe_example'\n);\n\n-- After data is written, recover partitions:\nMSCK REPAIR TABLE default.wikipedia_articles;\n```\n\n## Output Schema\n\n| Column | Type | Description |\n|--------|------|-------------|\n| id | STRING | UUID v4 unique identifier |\n| title | STRING | Wikipedia article title |\n| summary | 
STRING | First paragraph text |\n| last_edited | TIMESTAMP | Last modified timestamp |\n| source_url | STRING | Source Wikipedia URL |\n| crawl_ts | TIMESTAMP | Crawl execution timestamp |\n| dt | STRING | Partition key (YYYY-MM-DD) |\n\n## Project Structure\n\n```\namazon_crawler_safe_example/\n├── README.md                 # This file\n├── requirements.txt          # Python dependencies\n├── .gitignore               # Git ignore patterns\n├── Makefile                 # Build automation\n├── seed_urls.txt            # Input URLs\n├── src/\n│   ├── __init__.py\n│   ├── logger.py            # Structured JSON logging\n│   ├── fetcher.py           # HTTP fetcher with robots.txt\n│   ├── parser.py            # Wikipedia HTML parser\n│   ├── transformer.py       # Data validation and transformation\n│   ├── writer.py            # PySpark writer (Parquet/Hive)\n│   ├── orchestrator.py      # Main CLI orchestrator\n│   └── utils.py             # Utility functions\n└── tests/\n    ├── __init__.py\n    ├── test_parser.py       # Parser unit tests\n    ├── test_transformer.py  # Transformer unit tests\n    ├── test_fetcher.py      # Fetcher unit tests\n    └── test_writer.py       # Writer unit tests\n```\n\n## Development\n\n### Running Individual Tests\n\n```bash\npytest tests/test_parser.py -v\npytest tests/test_transformer.py::test_transform_success -v\n```\n\n### Adding New Seed URLs\n\nEdit `seed_urls.txt` and add Wikipedia URLs (one per line):\n\n```\nhttps://en.wikipedia.org/wiki/Python_(programming_language)\nhttps://en.wikipedia.org/wiki/Apache_Spark\nhttps://en.wikipedia.org/wiki/Data_science\nhttps://en.wikipedia.org/wiki/Machine_learning\n```\n\n### Viewing Logs\n\nLogs are output in structured JSON format to stdout:\n\n```json\n{\"ts\": \"2025-12-21T10:30:45.123456\", \"level\": \"INFO\", \"name\": \"orchestrator\", \"msg\": \"Starting crawl with 3 seed URLs\"}\n```\n\n### Extending the Crawler\n\nTo add support for new websites:\n\n1. 
Create a new parser class in `src/parser.py` implementing the same interface\n2. Update `orchestrator.py` to use the appropriate parser based on URL domain\n3. Follow the same structure: extract relevant fields, return a dict with consistent keys\n\n## Troubleshooting\n\n### PySpark Issues\n\n**Error:** `JAVA_HOME is not set`\n\n```bash\nexport JAVA_HOME=$(/usr/libexec/java_home)  # macOS\nexport JAVA_HOME=/usr/lib/jvm/java-11-openjdk  # Linux (path varies by distribution)\n```\n\n### Robots.txt Blocked\n\nIf the fetcher returns \"Disallowed by robots.txt\":\n- Verify the URL is a Wikipedia article\n- Check robots.txt manually: https://en.wikipedia.org/robots.txt\n- Increase `--rate_limit` to be more conservative\n\n### Hive Table Not Found\n\nEnsure you've created the table (see Hive Setup) and that your Spark session has Hive support enabled.\n\n## Performance Tuning\n\n- **Batch size:** Increase `--batch_size` for better write performance (diminishing returns above 500)\n- **Rate limiting:** Decrease `--rate_limit` only if allowed by robots.txt (minimum 0.5s recommended)\n- **Spark resources:** Adjust executor memory/cores based on workload and cluster capacity\n\n## License\n\nThis project is for educational purposes only. Wikipedia text is licensed under CC BY-SA 4.0.\n\n## Contributing\n\nThis is a demonstration project. For production use, consider:\n- Distributed fetching with Scrapy or similar\n- Deduplication logic\n- Incremental crawling with state tracking\n- Monitoring and alerting\n- Error recovery and checkpointing\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flifeislearningforever%2Fwikipedia-crawler-hive","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Flifeislearningforever%2Fwikipedia-crawler-hive","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flifeislearningforever%2Fwikipedia-crawler-hive/lists"}