{"id":49563197,"url":"https://github.com/alexnthnz/web-crawler","last_synced_at":"2026-05-03T10:47:17.771Z","repository":{"id":306053404,"uuid":"1024188999","full_name":"alexnthnz/web-crawler","owner":"alexnthnz","description":" Scalable web crawler built with Python, Redis, and Cassandra, inspired by Alex Xu's design. Crawls, indexes, and stores web content with robots.txt compliance and duplicate detection.","archived":false,"fork":false,"pushed_at":"2025-08-07T10:00:50.000Z","size":38,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-05-03T10:47:14.765Z","etag":null,"topics":["crawler","python"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/alexnthnz.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-07-22T10:09:23.000Z","updated_at":"2025-08-07T10:02:05.000Z","dependencies_parsed_at":null,"dependency_job_id":"e7d6a264-0256-4b67-967c-251316922864","html_url":"https://github.com/alexnthnz/web-crawler","commit_stats":null,"previous_names":["alexnthnz/web-crawler"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/alexnthnz/web-crawler","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/alexnthnz%2Fweb-crawler","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/alexnthnz%2Fweb-crawler/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/alexnthnz%2Fweb-crawler/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/alexnthnz%2Fweb-crawler/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/alexnthnz","download_url":"https://codeload.github.com/alexnthnz/web-crawler/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/alexnthnz%2Fweb-crawler/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32566444,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-03T06:36:36.687Z","status":"ssl_error","status_checked_at":"2026-05-03T06:36:09.306Z","response_time":103,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["crawler","python"],"created_at":"2026-05-03T10:47:17.040Z","updated_at":"2026-05-03T10:47:17.757Z","avatar_url":"https://github.com/alexnthnz.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Web Crawler System\n\nA scalable web crawler system designed to efficiently crawl, index, and store web page data while handling large-scale web crawling tasks with robustness and scalability. This implementation follows system design principles outlined by Alex Xu and industry best practices for distributed systems.\n\n## 🚀 Features\n\n- **Distributed Architecture**: Scalable design with multiple components working in coordination\n- **Politeness Policies**: Respects robots.txt and implements rate limiting per domain\n- **Duplicate Detection**: Multiple strategies for detecting and avoiding duplicate content\n- **Storage Flexibility**: Support for both Cassandra (production) and file-based storage (development)\n- **Fault Tolerance**: Comprehensive error handling and retry mechanisms\n- **Real-time Monitoring**: Prometheus metrics and comprehensive logging\n- **Configurable**: YAML-based configuration for easy customization\n\n## 📋 Table of Contents\n\n- [Architecture](#architecture)\n- [System Requirements](#system-requirements)\n- [Installation](#installation)\n- [Configuration](#configuration)\n- [Usage](#usage)\n- [Components](#components)\n- [Monitoring](#monitoring)\n- [Scaling Considerations](#scaling-considerations)\n- [Development](#development)\n- [Contributing](#contributing)\n\n## 🏗️ Architecture\n\nThe system follows a distributed architecture with the following components:\n\n```\n┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐\n│   URL Frontier  │    │     Fetcher     │    │     Parser      │\n│   (Redis-based) │───▶│  (HTTP Client)  │───▶│  (Content Ext.) │\n└─────────────────┘    └─────────────────┘    └─────────────────┘\n         ▲                        │                        │\n         │                        │                        ▼\n         │                        │              ┌─────────────────┐\n         │                        │              │ Duplicate       │\n         │                        │              │ Detector        │\n         │                        │              └─────────────────┘\n         │                        │                        │\n         │                        ▼                        ▼\n┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐\n│   Scheduler     │    │   Monitoring    │    │    Storage      │\n│  (Coordinator)  │───▶│  (Prometheus)   │    │ (Cassandra/File)│\n└─────────────────┘    └─────────────────┘    └─────────────────┘\n```\n\n### Data Flow\n\n1. **Seed URLs** are added to the URL Frontier\n2. **Fetcher** retrieves URLs from the Frontier and downloads web pages\n3. **Parser** processes pages, extracts content and links\n4. **Duplicate Detector** filters out duplicate content\n5. **Storage** persists crawled data\n6. **Scheduler** coordinates the entire process and manages re-crawling\n\n## 🔧 System Requirements\n\n### Functional Requirements\n- Crawl web pages starting from seed URLs\n- Extract and store content (text, metadata, links)\n- Respect robots.txt and crawl politely\n- Detect and avoid duplicate content\n- Support continuous crawling and re-crawling\n\n### Non-Functional Requirements\n- **Scalability**: Handle millions of web pages\n- **Reliability**: Fault-tolerant with minimal data loss\n- **Performance**: Efficient crawling with low latency\n- **Extensibility**: Easy to add new features\n\n### Prerequisites\n\n- Python 3.8+\n- Redis (for URL frontier and duplicate detection)\n- Apache Cassandra (optional, for production storage)\n- 4GB+ RAM recommended\n- Network connectivity\n\n## 📦 Installation\n\n### 1. Clone the Repository\n\n```bash\ngit clone https://github.com/alexnthnz/web-crawler.git\ncd web-crawler\n```\n\n### 2. Install Dependencies\n\n```bash\npip install -r requirements.txt\n```\n\n### 3. Setup Infrastructure\n\n**Redis Installation:**\n```bash\n# macOS\nbrew install redis\nbrew services start redis\n\n# Ubuntu\nsudo apt-get install redis-server\nsudo systemctl start redis\n\n# Docker\ndocker run -d -p 6379:6379 redis:alpine\n```\n\n**Cassandra Installation (Optional):**\n```bash\n# Docker\ndocker run -d -p 9042:9042 cassandra:latest\n\n# Or use file-based storage for development\n```\n\n### 4. Configure the System\n\nCopy and modify the configuration file:\n```bash\ncp config.yaml config-local.yaml\n# Edit config-local.yaml with your settings\n```\n\n## ⚙️ Configuration\n\nThe system is configured via YAML files. Here's the configuration structure:\n\n```yaml\ncrawler:\n  seed_urls:\n    - https://example.com\n    - https://example.org\n  max_depth: 5\n  politeness_delay: 1.0\n  max_concurrent_requests: 10\n  request_timeout: 30\n  user_agent: \"WebCrawler/1.0\"\n  respect_robots_txt: true\n\ndatabase:\n  type: \"file\"  # or \"cassandra\"\n  file:\n    data_directory: \"./data\"\n  cassandra:\n    hosts: [\"localhost\"]\n    port: 9042\n    keyspace: \"crawler_data\"\n\nredis:\n  host: \"localhost\"\n  port: 6379\n  db: 0\n\nlogging:\n  level: \"INFO\"\n  file: \"./logs/crawler.log\"\n\nmonitoring:\n  prometheus_port: 8000\n  metrics_enabled: true\n```\n\n## 🚀 Usage\n\n### Basic Usage\n\n```bash\n# Run with default configuration\npython main.py\n\n# Run with custom configuration\npython main.py --config config-local.yaml\n\n# Limit crawling\npython main.py --max-pages 1000\npython main.py --max-duration 3600  # 1 hour\n\n# Test configuration\npython main.py --dry-run\n```\n\n### Advanced Usage\n\n```bash\n# Run with Prometheus monitoring\npython main.py --config config-production.yaml\n\n# Monitor metrics at http://localhost:8000/metrics\n```\n\n### Example Output\n\n```\n2024-01-20 10:30:15 - INFO - === WEB CRAWLER STARTING ===\n2024-01-20 10:30:15 - INFO - Configuration loaded from: config.yaml\n2024-01-20 10:30:15 - INFO - Seed URLs: ['https://example.com']\n2024-01-20 10:30:15 - INFO - Max depth: 5\n2024-01-20 10:30:16 - INFO - Crawler scheduler initialized successfully\n2024-01-20 10:30:16 - INFO - Added 1 seed URLs to frontier\n2024-01-20 10:30:16 - INFO - Started crawling with 10 workers\n2024-01-20 10:30:46 - INFO - Crawl Progress: Crawled=45, Stored=42, Queued=127, Errors=3, Rate=90.0 pages/min\n```\n\n## 🧩 Components\n\n### URL Frontier\n- **Purpose**: Manages URLs to be crawled with politeness policies\n- **Features**: Per-domain queues, priority scheduling, rate limiting\n- **Storage**: Redis-backed for persistence and distribution\n\n### Fetcher\n- **Purpose**: Downloads web pages efficiently\n- **Features**: Robots.txt compliance, concurrent requests, timeout handling\n- **Technology**: aiohttp for async HTTP operations\n\n### Parser\n- **Purpose**: Extracts structured data from HTML content\n- **Features**: Content extraction, metadata parsing, link discovery\n- **Technology**: BeautifulSoup for HTML parsing\n\n### Duplicate Detector\n- **Purpose**: Identifies and filters duplicate content\n- **Strategies**: URL normalization, content hashing, fuzzy matching\n- **Storage**: Redis-based hash storage\n\n### Storage Layer\n- **Purpose**: Persists crawled data\n- **Backends**: Cassandra (production), File system (development)\n- **Features**: Scalable storage, query optimization\n\n### Scheduler\n- **Purpose**: Coordinates crawling process\n- **Features**: Worker management, statistics, graceful shutdown\n- **Monitoring**: Real-time progress tracking\n\n## 📊 Monitoring\n\n### Prometheus Metrics\n\nAvailable at `http://localhost:8000/metrics`:\n\n- `crawler_urls_crawled_total`: Total URLs crawled\n- `crawler_pages_stored_total`: Total pages stored\n- `crawler_errors_total`: Total errors by type\n- `crawler_response_time_seconds`: HTTP response times\n- `crawler_queue_size`: URLs in queue\n- `crawler_bytes_downloaded_total`: Total bytes downloaded\n\n### Logging\n\n- **Application logs**: `logs/crawler.log`\n- **Error logs**: `logs/errors.log`\n- **Structured logging**: JSON format available\n- **Log rotation**: Automatic rotation and retention\n\n### Statistics\n\nReal-time statistics include:\n- Crawl rate (pages/minute)\n- Error rates by type\n- Queue depth\n- Storage metrics\n- Response time percentiles\n\n## 📈 Scaling Considerations\n\n### Horizontal Scaling\n- **Multiple Crawlers**: Run multiple crawler instances\n- **Load Balancing**: Use Redis for distributed queues\n- **Database Sharding**: Partition data by domain or date\n\n### Performance Optimization\n- **Concurrent Workers**: Adjust based on available resources\n- **Request Batching**: Group requests for efficiency\n- **Caching**: Cache robots.txt and DNS lookups\n- **Storage Optimization**: Use appropriate database schemas\n\n### Infrastructure Requirements\n\n| Scale | URLs/day | Workers | Memory | Storage |\n|-------|----------|---------|--------|---------|\n| Small | \u003c100K | 5-10 | 2GB | 10GB |\n| Medium | 1M | 20-50 | 8GB | 100GB |\n| Large | 10M+ | 100+ | 32GB+ | 1TB+ |\n\n## 🛠️ Development\n\n### Project Structure\n\n```\nweb-crawler/\n├── src/\n│   ├── crawler/          # Core crawler components\n│   │   ├── url_frontier.py\n│   │   ├── fetcher.py\n│   │   ├── parser.py\n│   │   └── scheduler.py\n│   ├── storage/          # Storage layer\n│   │   ├── database.py\n│   │   └── duplicate_detector.py\n│   └── utils/            # Utilities\n│       ├── config.py\n│       ├── monitoring.py\n│       └── logger.py\n├── main.py               # Entry point\n├── config.yaml          # Configuration\n├── requirements.txt      # Dependencies\n└── README.md\n```\n\n### Adding New Features\n\n1. **New Parser**: Extend `ContentParser` class\n2. **New Storage Backend**: Implement `StorageBackend` interface\n3. **New Metrics**: Add to `MetricsCollector`\n4. **New Filters**: Extend URL filtering logic\n\n### Testing\n\n```bash\n# Test configuration\npython main.py --dry-run\n\n# Test with limited scope\npython main.py --max-pages 10 --config test-config.yaml\n```\n\n## 🔧 Troubleshooting\n\n### Common Issues\n\n**Redis Connection Failed**\n```bash\n# Check Redis is running\nredis-cli ping\n# Should return PONG\n```\n\n**High Memory Usage**\n- Reduce `max_concurrent_requests`\n- Enable duplicate detection\n- Check for memory leaks in parsing\n\n**Slow Crawling**\n- Increase `max_concurrent_requests`\n- Reduce `politeness_delay`\n- Check network connectivity\n\n**Storage Errors**\n- Verify database connectivity\n- Check disk space\n- Review database configuration\n\n### Performance Tuning\n\n1. **Adjust concurrency** based on target site capacity\n2. **Tune politeness delay** for respectful crawling\n3. **Optimize duplicate detection** for your use case\n4. **Configure appropriate timeouts**\n\n## 🤝 Contributing\n\n1. Fork the repository\n2. Create a feature branch (`git checkout -b feature/amazing-feature`)\n3. Commit your changes (`git commit -m 'Add amazing feature'`)\n4. Push to the branch (`git push origin feature/amazing-feature`)\n5. Open a Pull Request\n\n### Development Guidelines\n\n- Follow PEP 8 style guidelines\n- Add comprehensive logging\n- Include error handling\n- Write clear documentation\n- Add appropriate tests\n\n## 📄 License\n\nThis project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.\n\n## 🙏 Acknowledgments\n\n- Inspired by \"System Design Interview\" by Alex Xu\n- Built with modern Python async/await patterns\n- Uses industry-standard tools and practices\n\n## 📞 Support\n\n- **Issues**: [GitHub Issues](https://github.com/alexnthnz/web-crawler/issues)\n- **Documentation**: This README and inline code documentation\n- **Community**: Contributions and discussions welcome","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Falexnthnz%2Fweb-crawler","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Falexnthnz%2Fweb-crawler","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Falexnthnz%2Fweb-crawler/lists"}