Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/xenoswarlocks/multicrawl
MultiCrawl is a powerful and flexible web crawling framework that provides multiple crawling strategies to suit different use cases and performance requirements. The library supports sequential, threaded, and asynchronous crawling methods, making it adaptable to various data extraction needs.
python threading web webcrawler
Last synced: 12 days ago
- Host: GitHub
- URL: https://github.com/xenoswarlocks/multicrawl
- Owner: XenosWarlocks
- License: apache-2.0
- Created: 2024-12-15T12:48:24.000Z (26 days ago)
- Default Branch: main
- Last Pushed: 2024-12-15T12:50:58.000Z (26 days ago)
- Last Synced: 2024-12-15T13:40:48.210Z (26 days ago)
- Topics: python, threading, web, webcrawler
- Language: Python
- Homepage:
- Size: 0 Bytes
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# MultiCrawl
## Overview
MultiCrawl is a powerful and flexible web crawling framework that provides multiple crawling strategies to suit different use cases and performance requirements. The library supports sequential, threaded, and asynchronous crawling methods, making it adaptable to various data extraction needs.
## Features
- **Multiple Crawling Strategies**
  - Sequential Crawling: simple, straightforward approach
  - Threaded Crawling: improved performance with concurrent processing
  - Asynchronous Crawling: high-performance, non-blocking I/O operations
- **Advanced Data Processing**
  - Intelligent parsing for HTML, JSON, and text content
  - Metadata extraction
  - Keyword detection
  - Language identification
- **Robust Error Handling** (see the sketch after this list)
  - URL validation
  - Retry mechanisms
  - Rate limiting
  - Comprehensive error logging
- **Performance Benchmarking**
  - Built-in benchmarking tools
  - Performance comparison across the different crawling strategies
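
To make the error-handling bullets concrete, here is a minimal, self-contained sketch of rate limiting combined with retries. The `SimpleRateLimiter` and `fetch_with_retries` names are illustrative assumptions, not MultiCrawl's actual API (its own helpers live under `src/utils/`); the example only shows the general pattern of spacing out requests and retrying transient failures.

```python
import time
import urllib.request
from urllib.error import URLError


class SimpleRateLimiter:
    """Illustrative limiter: allow at most `requests_per_second` requests."""

    def __init__(self, requests_per_second: float = 2.0):
        self.interval = 1.0 / requests_per_second
        self._last_call = 0.0

    def wait(self) -> None:
        # Sleep just long enough to respect the configured request rate.
        elapsed = time.monotonic() - self._last_call
        if elapsed < self.interval:
            time.sleep(self.interval - elapsed)
        self._last_call = time.monotonic()


def fetch_with_retries(url: str, limiter: SimpleRateLimiter,
                       max_retries: int = 3, backoff: float = 1.0) -> bytes:
    """Fetch a URL, retrying transient failures with exponential backoff."""
    for attempt in range(max_retries):
        limiter.wait()
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                return resp.read()
        except URLError:
            if attempt == max_retries - 1:
                raise
            time.sleep(backoff * (2 ** attempt))
```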
## Installation

```bash
# Clone the repository
git clone https://github.com/XenosWarlocks/MultiCrawl.git
cd MultiCrawl

# Install dependencies
pip install -r requirements.txt
```

## Quick Start
```python
import asyncio

from src.web_crawler_app import WebCrawlerApp


async def main():
    urls = [
        'https://example.com/jobs',
        'https://another-jobs-site.com/listings'
    ]
    app = WebCrawlerApp(urls, mode='async')
    results = await app.run()
    print(results['report'])


asyncio.run(main())
```

## Crawling Strategies
### 1. Sequential Crawler
- Simple, single-threaded approach
- Best for small datasets or when order matters
- Lowest computational overhead

### 2. Threaded Crawler
- Uses multiple threads for concurrent processing
- Good balance between complexity and performance
- Suitable for I/O-bound tasks

### 3. Async Crawler
- Non-blocking, event-driven architecture
- Highest performance for large numbers of URLs
- Minimal resource consumption
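
To make the trade-offs concrete, here is a minimal, standard-library-only sketch of the three approaches. It is not MultiCrawl's internal implementation (which lives in `src/crawler/`); `fetch` is a placeholder for whatever per-URL work the crawler performs.

```python
import asyncio
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def fetch(url: str) -> int:
    """Placeholder per-URL work: return the size of the response body."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return len(resp.read())


def crawl_sequential(urls):
    # One URL at a time, in order; simplest, but slowest for many URLs.
    return [fetch(u) for u in urls]


def crawl_threaded(urls, workers: int = 8):
    # Threads overlap the time spent waiting on the network.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch, urls))


async def crawl_async(urls):
    # The event loop schedules all fetches concurrently; the blocking I/O
    # is pushed to a worker thread pool via asyncio.to_thread here.
    return await asyncio.gather(*(asyncio.to_thread(fetch, u) for u in urls))
```

For example, `asyncio.run(crawl_async(urls))` would drive the asynchronous variant, while the other two are plain function calls.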
## Running Benchmarks

```bash
python benchmark.py
```

This will generate performance metrics and a visualization comparing the different crawling strategies.
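
If you just want rough numbers without the bundled script, a small timing helper of the kind `benchmark.py` presumably uses could look like the sketch below; the commented usage refers to the illustrative crawlers from the previous section, not MultiCrawl's own entry points.

```python
import asyncio
import time


def time_strategy(label: str, run) -> None:
    """Time a single crawling strategy; `run` is any zero-argument callable."""
    start = time.perf_counter()
    run()
    print(f"{label:>10}: {time.perf_counter() - start:.2f}s")


# Example usage with the illustrative crawlers sketched above:
# time_strategy('sequential', lambda: crawl_sequential(urls))
# time_strategy('threaded',   lambda: crawl_threaded(urls))
# time_strategy('async',      lambda: asyncio.run(crawl_async(urls)))
```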
## Running Tests

```bash
pytest tests/
```
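
The test files themselves are not shown here, but a representative pytest case might look like the following self-contained sketch; the `parse_html` stand-in and its return shape are assumptions for illustration, not the actual `src/data_processing/parser.py` API.

```python
# tests/test_parsers_example.py -- illustrative only


def parse_html(html: str) -> dict:
    """Stand-in for a real parser: extract the <title> text if present."""
    start = html.find('<title>')
    end = html.find('</title>')
    title = html[start + len('<title>'):end] if start != -1 and end != -1 else None
    return {'title': title}


def test_parse_html_extracts_title():
    result = parse_html('<html><head><title>Jobs</title></head></html>')
    assert result['title'] == 'Jobs'


def test_parse_html_handles_missing_title():
    assert parse_html('<html></html>')['title'] is None
```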
## Project Structure

```
MultiCrawl/
│
├── src/
│   ├── __init__.py
│   ├── crawler/
│   │   ├── __init__.py
│   │   ├── base_crawler.py
│   │   ├── sequential_crawler.py
│   │   ├── threaded_crawler.py
│   │   └── async_crawler.py
│   │
│   ├── data_processing/
│   │   ├── __init__.py
│   │   ├── parser.py
│   │   ├── aggregator.py
│   │   └── report_generator.py
│   │
│   ├── utils/
│   │   ├── __init__.py
│   │   ├── rate_limiter.py
│   │   ├── error_handler.py
│   │   └── config.py
│   │
│   └── main.py
│
├── tests/
│   ├── __init__.py
│   ├── test_crawlers.py
│   ├── test_parsers.py
│   └── test_aggregators.py
│
├── requirements.txt
├── README.md
└── benchmark.py
```

## Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
## Disclaimer
Respect website terms of service and robots.txt when using this crawler. Always ensure you have permission to crawl a website.