{"id":48768586,"url":"https://github.com/dscmatter/adaptive-qlearning-web-crawler","last_synced_at":"2026-04-25T20:03:33.819Z","repository":{"id":333119170,"uuid":"1136277271","full_name":"DSCmatter/adaptive-qlearning-web-crawler","owner":"DSCmatter","description":null,"archived":false,"fork":false,"pushed_at":"2026-04-03T19:28:12.000Z","size":385,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-04-13T10:17:37.748Z","etag":null,"topics":["qlearning","reinforcement-learning","search-engine","web-crawler"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/DSCmatter.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-01-17T11:51:47.000Z","updated_at":"2026-04-03T19:28:17.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/DSCmatter/adaptive-qlearning-web-crawler","commit_stats":null,"previous_names":["dscmatter/adaptive-qlearning-web-crawler"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/DSCmatter/adaptive-qlearning-web-crawler","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DSCmatter%2Fadaptive-qlearning-web-crawler","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DSCmatter%2Fadaptive-qlearning-web-crawler/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repos
itories/DSCmatter%2Fadaptive-qlearning-web-crawler/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DSCmatter%2Fadaptive-qlearning-web-crawler/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/DSCmatter","download_url":"https://codeload.github.com/DSCmatter/adaptive-qlearning-web-crawler/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DSCmatter%2Fadaptive-qlearning-web-crawler/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32274987,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-25T18:29:39.964Z","status":"ssl_error","status_checked_at":"2026-04-25T18:29:32.149Z","response_time":59,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["qlearning","reinforcement-learning","search-engine","web-crawler"],"created_at":"2026-04-13T09:02:31.136Z","updated_at":"2026-04-25T20:03:33.808Z","avatar_url":"https://github.com/DSCmatter.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Adaptive Q-Learning Web Crawler with Contextual Bandits and GNNs\n\nThis project implements a **novel hybrid approach** to focused web crawling that combines three complementary techniques:\n\n- **Q-Learning**: Provides high-level navigation strategy and long-term reward optimization\n- **Contextual Bandits (LinUCB)**: 
Handles intelligent link selection with efficient exploration-exploitation balance\n- **Graph Neural Networks (GNNs)**: Captures web graph structure for informed decision-making\n\nThe crawler learns to navigate the web by selecting links that maximize topical relevance while minimizing crawl cost. It receives rewards for discovering target-domain pages and penalties for inefficient navigation, enabling adaptive link selection over time. Its performance is evaluated against static heuristic-based crawlers and traditional RL approaches to analyze efficiency, coverage, and convergence behavior.\n\n## Research Innovation\n\nThis project addresses limitations in existing RL-based crawlers by:\n- **Leveraging graph topology** via GNN-based node embeddings\n- **Using contextual information** for faster convergence on link selection\n- **Combining value-based and bandit approaches** for hierarchical decision-making\n\n## Relevant Research Papers\n\n### Key Papers on RL-based Web Crawling:\n1. [**Tree-based Focused Web Crawling with Reinforcement Learning**](https://arxiv.org/abs/2112.07620) (2021) - Kontogiannis et al.\n2. [**Deep Reinforcement Learning for Web Crawling**](https://ieeexplore.ieee.org/abstract/document/9703160/) (2021) - Avrachenkov, Borkar, Patil\n3. [**Efficient Deep Web Crawling Using Reinforcement Learning**](https://link.springer.com/chapter/10.1007/978-3-642-13657-3_46) (2010) - Jiang et al. (Cited 59 times)\n4. [**Learning to Crawl Deep Web**](https://www.sciencedirect.com/science/article/pii/S0306437913000288) (2013) - Zheng et al. 
(Cited 71 times)\n\n## Documentation\n\n### Quick Start Guides\n- **[STUDENT_BUDGET_GUIDE.md](docs/STUDENT_BUDGET_GUIDE.md)** - **START HERE!** Student-friendly quick start guide\n- **[WALKTHROUGH.md](docs/WALKTHROUGH.md)** - Complete implementation guide (9.5-week timeline, optimized for students)\n- **[PRACTICAL_GUIDE.md](docs/PRACTICAL_GUIDE.md)** - Simplified architecture and steps (recommended for finding your way around)\n\n### Technical Documentation\n- **[DESIGN.md](docs/DESIGN.md)** - Technical design document with architecture and algorithms\n\n### Phase-by-Phase Implementation Docs\nDetailed documentation for each completed phase with step-by-step instructions:\n\n- **[Phase 1: Project Setup](docs/phases/PHASE_1.md)** ✅ Complete\n  - Environment setup, dependencies, project structure\n  - Core component skeletons (GNN, Bandit, Q-learning)\n  - Testing \u0026 validation (38% baseline harvest rate)\n  \n- **[Phase 2: Data Collection \u0026 Preprocessing](docs/phases/PHASE_2.md)** ✅ Complete\n  - Seed URL collection (3 topics: ML, Climate, Blockchain)\n  - Bootstrap graph crawling (60 nodes, 600 edges)\n  - Feature extraction pipeline (174-dim context vectors)\n  - Labeled training data (42 train / 9 val / 9 test)\n\n- **[Phase 3: GNN Pre-training](docs/phases/PHASE_3.md)** ✅ Complete\n  - Node feature integration\n  - SAGEConv structural embeddings\n  - Offline pre-training and freezing validation\n\n- **[Phase 4: Q-Learning Integration](docs/phases/PHASE_4.md)** ✅ Complete\n  - Offline simulation environment setup\n  - Joint Q-Network and LinUCB Bandit optimization\n  - High coverage metric evaluation testing\n\n- **[Phase 5: Hybrid System Integration](docs/phases/PHASE_5.md)** ✅ Complete\n  - Integration of Q-Agent, LinUCB, and frozen GraphSAGE\n  - Live HTTP web crawling tests on standard seed targets\n\n- **[Phase 6: Evaluation \u0026 Baselines](docs/phases/PHASE_6.md)** ✅ Complete\n  - Baseline comparisons and strict topic-aware evaluation across 3 
domains\n\n- **[Phase 7: Finalization, Diagnostics, and Project Closure](docs/phases/PHASE_7.md)** ✅ Complete\n  - Strict diagnostics and final benchmark completed, with the final policy selection documented\n\nThis project is optimized for broke students:\n- **$0.10 Total Cost** (just electricity)\n- **No GPU Required** (CPU-only works great)\n- **No Cloud Costs** (runs on your laptop)\n- **3-4 Days Training** (run overnight)\n- **8GB RAM Sufficient** (works on old laptops)\n- **60-70% Harvest Rate** (publishable results!)\n\n## Quick Start\n\n```bash\n# 1. Clone repo\ngit clone https://github.com/DSCmatter/adaptive-qlearning-web-crawler\ncd adaptive-qlearning-web-crawler\n\n# 2. Create virtual environment\npython -m venv venv\nsource venv/bin/activate  # Windows: venv\\Scripts\\activate\n\n# 3. Install dependencies (CPU-only, ~2GB)\npip install -r requirements.txt\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdscmatter%2Fadaptive-qlearning-web-crawler","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdscmatter%2Fadaptive-qlearning-web-crawler","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdscmatter%2Fadaptive-qlearning-web-crawler/lists"}