# MultiCrawl πŸ•ΈοΈ

## Overview

MultiCrawl is a powerful and flexible web crawling framework that provides multiple crawling strategies to suit different use cases and performance requirements. The library supports sequential, threaded, and asynchronous crawling methods, making it adaptable to various data extraction needs.

## Features

- πŸš€ **Multiple Crawling Strategies**
  - Sequential crawling: simple, straightforward approach
  - Threaded crawling: improved performance through concurrent processing
  - Asynchronous crawling: high-performance, non-blocking I/O

- πŸ“Š **Advanced Data Processing**
  - Intelligent parsing of HTML, JSON, and plain-text content
  - Metadata extraction
  - Keyword detection
  - Language identification

- πŸ›‘οΈ **Robust Error Handling**
  - URL validation
  - Retry mechanisms
  - Rate limiting (a sketch of both follows this list)
  - Comprehensive error logging

- πŸ“ˆ **Performance Benchmarking**
  - Built-in benchmarking tools
  - Performance comparison across crawling strategies
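
The repository includes `src/utils/rate_limiter.py` and `src/utils/error_handler.py`, but their interfaces are not documented here. As a rough illustration of how retries and rate limiting can be combined around a fetch, here is a minimal, self-contained sketch; the function and parameter names are hypothetical, and this is not MultiCrawl's actual implementation:

```python
# Illustrative only: a simple delay-based rate limiter with retries.
# fetch_with_retry and min_interval are hypothetical names, not MultiCrawl APIs.
import time
import urllib.request
from urllib.error import URLError


def fetch_with_retry(url: str, retries: int = 3, min_interval: float = 1.0) -> str:
    """Fetch a URL, waiting at least min_interval seconds between attempts."""
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            with urllib.request.urlopen(url, timeout=10) as response:
                return response.read().decode("utf-8", errors="replace")
        except URLError as error:
            last_error = error
            time.sleep(min_interval * attempt)  # back off a little more each attempt
    raise RuntimeError(f"Failed to fetch {url} after {retries} attempts") from last_error
```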

## Installation

```bash
# Clone the repository
git clone https://github.com/XenosWarlocks/MultiCrawl.git
cd MultiCrawl

# Install dependencies
pip install -r requirements.txt
```

## Quick Start

```python
import asyncio

from src.web_crawler_app import WebCrawlerApp


async def main():
    urls = [
        'https://example.com/jobs',
        'https://another-jobs-site.com/listings',
    ]

    # mode='async' selects the asynchronous crawler (see "Crawling Strategies" below)
    app = WebCrawlerApp(urls, mode='async')
    results = await app.run()
    print(results['report'])


asyncio.run(main())
```

## Crawling Strategies

### 1. Sequential Crawler
- Simple, single-threaded approach
- Best for small datasets or when order matters
- Lowest computational overhead

### 2. Threaded Crawler
- Uses multiple threads for concurrent processing
- Good balance between complexity and performance
- Suitable for I/O-bound tasks

### 3. Async Crawler
- Non-blocking, event-driven architecture
- Highest performance for large numbers of URLs
- Lower per-request resource overhead than thread-based crawling
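
The Quick Start above selects the asynchronous crawler with `mode='async'`. Assuming the same constructor argument also selects the other two strategies (the exact option strings for the sequential and threaded crawlers are not documented here and are an assumption), switching strategies might look like this:

```python
import asyncio

from src.web_crawler_app import WebCrawlerApp

urls = ['https://example.com/jobs']

# 'async' is the only mode shown in the Quick Start; 'sequential' and
# 'threaded' are assumed option names for the other two strategies.
app = WebCrawlerApp(urls, mode='threaded')

# As in the Quick Start, run() is a coroutine and is awaited.
results = asyncio.run(app.run())
print(results['report'])
```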

## Running Benchmarks

```bash
python benchmark.py
```

This will generate performance metrics and a visualization comparing different crawling strategies.
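
For a rough, standalone timing comparison (independent of `benchmark.py`), a sketch like the following can be used; it assumes the `WebCrawlerApp` interface from the Quick Start and the same hypothetical mode names as above:

```python
import asyncio
import time

from src.web_crawler_app import WebCrawlerApp

urls = ['https://example.com/jobs']  # substitute URLs you have permission to crawl

# 'async' is confirmed by the Quick Start; the other mode names are assumptions.
for mode in ('sequential', 'threaded', 'async'):
    app = WebCrawlerApp(urls, mode=mode)
    start = time.perf_counter()
    asyncio.run(app.run())  # run() is awaited, as in the Quick Start
    print(f"{mode:>10}: {time.perf_counter() - start:.2f}s")
```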

## Running Tests

```bash
pytest tests/
```

## Project Structure
```
MultiCrawl/
β”‚
β”œβ”€β”€ src/
β”‚   β”œβ”€β”€ __init__.py
β”‚   β”œβ”€β”€ crawler/
β”‚   β”‚   β”œβ”€β”€ __init__.py
β”‚   β”‚   β”œβ”€β”€ base_crawler.py
β”‚   β”‚   β”œβ”€β”€ sequential_crawler.py
β”‚   β”‚   β”œβ”€β”€ threaded_crawler.py
β”‚   β”‚   └── async_crawler.py
β”‚   β”‚
β”‚   β”œβ”€β”€ data_processing/
β”‚   β”‚   β”œβ”€β”€ __init__.py
β”‚   β”‚   β”œβ”€β”€ parser.py
β”‚   β”‚   β”œβ”€β”€ aggregator.py
β”‚   β”‚   └── report_generator.py
β”‚   β”‚
β”‚   β”œβ”€β”€ utils/
β”‚   β”‚   β”œβ”€β”€ __init__.py
β”‚   β”‚   β”œβ”€β”€ rate_limiter.py
β”‚   β”‚   β”œβ”€β”€ error_handler.py
β”‚   β”‚   └── config.py
β”‚   β”‚
β”‚   └── main.py
β”‚
β”œβ”€β”€ tests/
β”‚   β”œβ”€β”€ __init__.py
β”‚   β”œβ”€β”€ test_crawlers.py
β”‚   β”œβ”€β”€ test_parsers.py
β”‚   └── test_aggregators.py
β”‚
β”œβ”€β”€ requirements.txt
β”œβ”€β”€ README.md
└── benchmark.py
```

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

## Disclaimer

Respect website terms of service and robots.txt when using this crawler. Always ensure you have permission to crawl a website.
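
For example, Python's standard library can check a site's `robots.txt` before a URL is crawled. This is a general-purpose sketch, not a built-in MultiCrawl feature, and the user-agent string is a placeholder:

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser


def allowed_by_robots(url: str, user_agent: str = "MultiCrawlBot") -> bool:
    """Return True if the site's robots.txt permits user_agent to fetch url."""
    parts = urlparse(url)
    robots = RobotFileParser()
    robots.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    robots.read()  # fetch and parse robots.txt
    return robots.can_fetch(user_agent, url)


print(allowed_by_robots("https://example.com/jobs"))
```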