https://github.com/dscmatter/adaptive-qlearning-web-crawler
qlearning reinforcement-learning search-engine web-crawler
- Host: GitHub
- URL: https://github.com/dscmatter/adaptive-qlearning-web-crawler
- Owner: DSCmatter
- License: mit
- Created: 2026-01-17T11:51:47.000Z (5 months ago)
- Default Branch: main
- Last Pushed: 2026-04-03T19:28:12.000Z (3 months ago)
- Last Synced: 2026-04-13T10:17:37.748Z (3 months ago)
- Topics: qlearning, reinforcement-learning, search-engine, web-crawler
- Language: Python
- Homepage:
- Size: 376 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Adaptive Q-Learning Web Crawler with Contextual Bandits and GNNs
This project implements a **novel hybrid approach** to focused web crawling that combines three complementary techniques:
- **Q-Learning**: Provides high-level navigation strategy and long-term reward optimization
- **Contextual Bandits (LinUCB)**: Handles intelligent link selection with efficient exploration-exploitation balance
- **Graph Neural Networks (GNNs)**: Captures web graph structure for informed decision-making
The crawler learns to navigate the web by selecting links that maximize topical relevance while minimizing crawl cost. It receives rewards for discovering target-domain pages and penalties for inefficient navigation, enabling adaptive link selection over time. Its performance is evaluated against static heuristic-based crawlers and traditional RL approaches to analyze efficiency, coverage, and convergence behavior.
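The reward scheme described above can be illustrated with a minimal tabular Q-learning sketch. Everything named here is an assumption made for illustration: the reward and cost values, the learning rate, the discount factor, and the state and action labels are hypothetical and are not taken from the project's code, which may use a neural Q-network rather than a table.

```python
from collections import defaultdict

ALPHA = 0.1            # learning rate (assumed value)
GAMMA = 0.9            # discount factor (assumed value)
RELEVANT_REWARD = 1.0  # assumed reward for fetching an on-topic page
STEP_COST = 0.1        # assumed penalty charged for every fetch


def crawl_reward(is_relevant: bool) -> float:
    """Reward for one fetch: topical payoff minus a fixed crawl cost."""
    return (RELEVANT_REWARD if is_relevant else 0.0) - STEP_COST


def q_update(q, state, action, reward, next_state, next_actions):
    """Apply one Q-learning update and return the new Q(state, action)."""
    # Value of the best action available from the next state (0.0 if none).
    best_next = max((q[(next_state, a)] for a in next_actions), default=0.0)
    td_target = reward + GAMMA * best_next
    q[(state, action)] += ALPHA * (td_target - q[(state, action)])
    return q[(state, action)]


# Hypothetical transition: following a navigation link led to a relevant page.
q = defaultdict(float)
reward = crawl_reward(is_relevant=True)
new_value = q_update(
    q, "hub_page", "follow_nav_link", reward, "article_page", ["follow_body_link"]
)
```

Charging a small cost on every fetch means an off-topic page yields a negative reward, which is one simple way to express "minimizing crawl cost" in the reward signal.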
## Research Innovation
This project addresses limitations in existing RL-based crawlers by:
- **Leveraging graph topology** via GNN-based node embeddings
- **Using contextual information** for faster convergence on link selection
- **Combining value-based and bandit approaches** for hierarchical decision-making
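To make the bandit side of this hierarchy concrete, here is a minimal LinUCB sketch in pure Python. It assumes a single shared linear model that scores every candidate link from its context vector, and it keeps the inverse design matrix current with the Sherman-Morrison identity. The class name, the `alpha` value, and the 2-dimensional toy contexts are hypothetical; the project's own bandit may be structured differently.

```python
import math


class LinUCB:
    """Shared-parameter LinUCB over fixed-length link context vectors."""

    def __init__(self, dim: int, alpha: float = 1.0):
        self.dim = dim
        self.alpha = alpha  # exploration strength (assumed value)
        # Inverse of A = I + sum(x x^T); starts as the identity matrix.
        self.a_inv = [[1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]
        self.b = [0.0] * dim  # running sum of reward * x

    def _matvec(self, matrix, vec):
        return [sum(row[j] * vec[j] for j in range(self.dim)) for row in matrix]

    def score(self, x):
        """Upper confidence bound: estimated reward plus exploration bonus."""
        a_inv_x = self._matvec(self.a_inv, x)
        theta = self._matvec(self.a_inv, self.b)
        mean = sum(t * xi for t, xi in zip(theta, x))
        width = math.sqrt(max(sum(xi * v for xi, v in zip(x, a_inv_x)), 0.0))
        return mean + self.alpha * width

    def select(self, contexts):
        """Return the index of the candidate link with the highest score."""
        return max(range(len(contexts)), key=lambda i: self.score(contexts[i]))

    def update(self, x, reward):
        """Fold one observed (context, reward) pair into the model."""
        # Sherman-Morrison: (A + x x^T)^-1 = A^-1 - (A^-1 x)(A^-1 x)^T / (1 + x^T A^-1 x)
        a_inv_x = self._matvec(self.a_inv, x)
        denom = 1.0 + sum(xi * v for xi, v in zip(x, a_inv_x))
        for i in range(self.dim):
            for j in range(self.dim):
                self.a_inv[i][j] -= a_inv_x[i] * a_inv_x[j] / denom
        for i in range(self.dim):
            self.b[i] += reward * x[i]


# Toy contexts: the first feature marks on-topic anchor text (hypothetical).
bandit = LinUCB(dim=2, alpha=0.5)
on_topic, off_topic = [1.0, 0.0], [0.0, 1.0]
for _ in range(20):
    bandit.update(on_topic, 1.0)
    bandit.update(off_topic, 0.0)
choice = bandit.select([off_topic, on_topic])  # index of the preferred link
```

In a hierarchical setup like the one described above, the Q-learning component would decide the high-level strategy, and a bandit of this kind would rank the outgoing links on the current page.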
## Relevant Research Papers
### Key Papers on RL-based Web Crawling:
1. [**Tree-based Focused Web Crawling with Reinforcement Learning**](https://arxiv.org/abs/2112.07620) (2021) - Kontogiannis et al.
2. [**Deep Reinforcement Learning for Web Crawling**](https://ieeexplore.ieee.org/abstract/document/9703160/) (2021) - Avrachenkov, Borkar, Patil
3. [**Efficient Deep Web Crawling Using Reinforcement Learning**](https://link.springer.com/chapter/10.1007/978-3-642-13657-3_46) (2010) - Jiang et al. (Cited 59 times)
4. [**Learning to Crawl Deep Web**](https://www.sciencedirect.com/science/article/pii/S0306437913000288) (2013) - Zheng et al. (Cited 71 times)
## Documentation
### Quick Start Guides
- **[STUDENT_BUDGET_GUIDE.md](docs/STUDENT_BUDGET_GUIDE.md)** - **START HERE!** Student-friendly quick start guide
- **[WALKTHROUGH.md](docs/WALKTHROUGH.md)** - Complete implementation guide (9.5-week timeline, optimized for students)
- **[PRACTICAL_GUIDE.md](docs/PRACTICAL_GUIDE.md)** - Simplified architecture and steps (recommended for finding your way around)
### Technical Documentation
- **[DESIGN.md](docs/DESIGN.md)** - Technical design document with architecture and algorithms
### Phase-by-Phase Implementation Docs
Detailed documentation for each completed phase with step-by-step instructions:
- **[Phase 1: Project Setup](docs/phases/PHASE_1.md)** ✅ Complete
  - Environment setup, dependencies, project structure
  - Core component skeletons (GNN, Bandit, Q-learning)
  - Testing & validation (38% baseline harvest rate)
- **[Phase 2: Data Collection & Preprocessing](docs/phases/PHASE_2.md)** ✅ Complete
  - Seed URL collection (3 topics: ML, Climate, Blockchain)
  - Bootstrap graph crawling (60 nodes, 600 edges)
  - Feature extraction pipeline (174-dim context vectors)
  - Labeled training data (42 train / 9 val / 9 test)
- **[Phase 3: GNN Pre-training](docs/phases/PHASE_3.md)** ✅ Complete
  - Node feature integration
  - SAGEConv structural embeddings
  - Offline pre-training and freezing validation
- **[Phase 4: Q-Learning Integration](docs/phases/PHASE_4.md)** ✅ Complete
  - Offline simulation environment setup
  - Joint Q-Network and LinUCB Bandit optimization
  - High coverage metric evaluation testing
- **[Phase 5: Hybrid System Integration](docs/phases/PHASE_5.md)** ✅ Complete
  - Integration of Q-Agent, LinUCB, and frozen GraphSAGE
  - Live HTTP web crawling tests on standard seed targets
- **[Phase 6: Evaluation & Baselines](docs/phases/PHASE_6.md)** ✅ Complete
  - Baseline comparisons and strict topic-aware evaluation across 3 domains
- **[Phase 7: Finalization, Diagnostics, and Project Closure](docs/phases/PHASE_7.md)** ✅ Complete
  - Strict diagnosis and final benchmark completed with documented final policy selection
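Phase 3 mentions SAGEConv, the GraphSAGE layer as named in PyTorch Geometric. The sketch below is not that library's code; it is a dependency-free illustration of the idea behind a mean-aggregation GraphSAGE layer, where each node's new embedding combines its own features with the average of its neighbors' features. The function names, the toy graph, and the identity weight matrices are all assumptions for illustration.

```python
def mean_vector(vectors, dim):
    """Element-wise mean of a list of vectors; zeros if the list is empty."""
    if not vectors:
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def sage_layer(features, neighbors, w_self, w_neigh):
    """One layer: ReLU(W_self * h_v + W_neigh * mean of neighbor features)."""
    dim = len(next(iter(features.values())))
    out = {}
    for node, h in features.items():
        agg = mean_vector([features[u] for u in neighbors.get(node, [])], dim)
        combined = [a + b for a, b in zip(matvec(w_self, h), matvec(w_neigh, agg))]
        out[node] = [max(0.0, c) for c in combined]  # ReLU
    return out


# Toy web graph: page "a" links to "b" and "c", "b" links to "a", "c" has no links.
features = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
neighbors = {"a": ["b", "c"], "b": ["a"], "c": []}
identity = [[1.0, 0.0], [0.0, 1.0]]  # stand-in for learned weights
embeddings = sage_layer(features, neighbors, identity, identity)
```

In a trained model the two weight matrices are learned; once pre-training finishes they can be frozen so that the embeddings serve as fixed inputs to the Q-agent and the bandit, which matches the "frozen GraphSAGE" wording in Phase 5.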
This project is optimized for broke students:
- **$0.10 Total Cost** (just electricity)
- **No GPU Required** (CPU-only works great)
- **No Cloud Costs** (runs on your laptop)
- **3-4 Days Training** (run overnight)
- **8GB RAM Sufficient** (works on old laptops)
- **60-70% Harvest Rate** (publishable results!)
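Harvest rate in focused crawling is conventionally the fraction of fetched pages that are judged relevant to the target topic. The snippet below is a minimal sketch of that calculation, assuming this project follows the conventional definition; the function name and the sample crawl log are hypothetical.

```python
def harvest_rate(relevance_flags):
    """Fraction of fetched pages judged on-topic (0.0 for an empty crawl)."""
    flags = list(relevance_flags)
    if not flags:
        return 0.0
    return sum(1 for flag in flags if flag) / len(flags)


# Hypothetical crawl log: one boolean per fetched page, True if on-topic.
crawl_log = [True, False, True, True, False, False, True, False]
rate = harvest_rate(crawl_log)
print(f"Harvest rate: {rate:.0%}")
```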
## Quick Start
```bash
# 1. Clone repo
git clone https://github.com/dscmatter/adaptive-qlearning-web-crawler
cd adaptive-qlearning-web-crawler
# 2. Create virtual environment
python -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate
# 3. Install dependencies (CPU-only, ~2GB)
pip install -r requirements.txt
```