An open API service indexing awesome lists of open source software.

https://github.com/dscmatter/adaptive-qlearning-web-crawler


https://github.com/dscmatter/adaptive-qlearning-web-crawler

qlearning reinforcement-learning search-engine web-crawler

Last synced: 2 months ago
JSON representation

Awesome Lists containing this project

README

          

# Adaptive Q-Learning Web Crawler with Contextual Bandits and GNNs

This project implements a **novel hybrid approach** to focused web crawling that combines three complementary techniques:

- **Q-Learning**: Provides high-level navigation strategy and long-term reward optimization
- **Contextual Bandits (LinUCB)**: Handles intelligent link selection with efficient exploration-exploitation balance
- **Graph Neural Networks (GNNs)**: Captures web graph structure for informed decision-making

The crawler learns to navigate the web by selecting links that maximize topical relevance while minimizing crawl cost. It receives rewards for discovering target-domain pages and penalties for inefficient navigation, enabling adaptive link selection over time. Its performance is evaluated against static heuristic-based crawlers and traditional RL approaches to analyze efficiency, coverage, and convergence behavior.

## Research Innovation

This project addresses limitations in existing RL-based crawlers by:
- **Leveraging graph topology** via GNN-based node embeddings
- **Using contextual information** for faster convergence on link selection
- **Combining value-based and bandit approaches** for hierarchical decision-making

## Relevant Research Papers

### Key Papers on RL-based Web Crawling:
1. [**Tree-based Focused Web Crawling with Reinforcement Learning**](https://arxiv.org/abs/2112.07620) (2021) - Kontogiannis et al.
2. [**Deep Reinforcement Learning for Web Crawling**](https://ieeexplore.ieee.org/abstract/document/9703160/) (2021) - Avrachenkov, Borkar, Patil
3. [**Efficient Deep Web Crawling Using Reinforcement Learning**](https://link.springer.com/chapter/10.1007/978-3-642-13657-3_46) (2010) - Jiang et al. (Cited 59 times)
4. [**Learning to Crawl Deep Web**](https://www.sciencedirect.com/science/article/pii/S0306437913000288) (2013) - Zheng et al. (Cited 71 times)

## Documentation

### Quick Start Guides
- **[STUDENT_BUDGET_GUIDE.md](docs/STUDENT_BUDGET_GUIDE.md)** - **START HERE!** Student-friendly quick start guide
- **[WALKTHROUGH.md](docs/WALKTHROUGH.md)** - Complete implementation guide (9.5-week timeline, optimized for students)
- **[PRACTICAL_GUIDE.md](docs/PRACTICAL_GUIDE.md)** - Simplified architecture and steps (recommended to get your way around)

### Technical Documentation
- **[DESIGN.md](docs/DESIGN.md)** - Technical design document with architecture and algorithms

### Phase-by-Phase Implementation Docs
Detailed documentation for each completed phase with step-by-step instructions:

- **[Phase 1: Project Setup](docs/phases/PHASE_1.md)** ✅ Complete
- Environment setup, dependencies, project structure
- Core component skeletons (GNN, Bandit, Q-learning)
- Testing & validation (38% baseline harvest rate)

- **[Phase 2: Data Collection & Preprocessing](docs/phases/PHASE_2.md)** ✅ Complete
- Seed URL collection (3 topics: ML, Climate, Blockchain)
- Bootstrap graph crawling (60 nodes, 600 edges)
- Feature extraction pipeline (174-dim context vectors)
- Labeled training data (42 train / 9 val / 9 test)

- **[Phase 3: GNN Pre-training](docs/phases/PHASE_3.md)** ✅ Complete
- Node feature integration
- SAGEConv structural embeddings
- Offline pre-training and freezing validation

- **[Phase 4: Q-Learning Integration](docs/phases/PHASE_4.md)** ✅ Complete
- Offline simulation environment setup
- Joint Q-Network and LinUCB Bandit optimization
- High coverage metric evaluation testing

- **[Phase 5: Hybrid System Integration](docs/phases/PHASE_5.md)** ✅ Complete
- Integration of Q-Agent, LinUCB, and frozen GraphSAGE.
- Live HTTP web crawling tests on standard seed targets.

- **[Phase 6: Evaluation & Baselines](docs/phases/PHASE_6.md)** ✅ Complete
- Baseline comparisons and strict topic-aware evaluation across 3 domains

- **[Phase 7: Finalization, Diagnostics, and Project Closure](docs/phases/PHASE_7.md)** ✅ Complete
- Strict diagnosis and final benchmark completed with documented final policy selection

This project is optimized for broke students:
- **$0.10 Total Cost** (just electricity)
- **No GPU Required** (CPU-only works great)
- **No Cloud Costs** (runs on your laptop)
- **3-4 Days Training** (run overnight)
- **8GB RAM Sufficient** (works on old laptops)
- **60-70% Harvest Rate** (publishable results!)

## Quick Start

```bash
# 1. Clone repo
git clone https://github.com/yourusername/adaptive-qlearning-web-crawler
cd adaptive-qlearning-web-crawler

# 2. Create virtual environment
python -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate

# 3. Install dependencies (CPU-only, ~2GB)
pip install -r requirements.txt
```