# MultiCrawl πŸ•ΈοΈ

## Overview

MultiCrawl is a powerful and flexible web crawling framework that provides multiple crawling strategies to suit different use cases and performance requirements. The library supports sequential, threaded, and asynchronous crawling methods, making it adaptable to various data extraction needs.

## Features

- πŸš€ **Multiple Crawling Strategies**
  - Sequential crawling: simple, straightforward approach
  - Threaded crawling: improved performance through concurrent processing
  - Asynchronous crawling: high-performance, non-blocking I/O

- πŸ“Š **Advanced Data Processing**
  - Intelligent parsing of HTML, JSON, and plain-text content
  - Metadata extraction
  - Keyword detection
  - Language identification

- πŸ›‘οΈ **Robust Error Handling**
  - URL validation
  - Retry mechanisms
  - Rate limiting (a sketch of both follows this list)
  - Comprehensive error logging

- πŸ“ˆ **Performance Benchmarking**
  - Built-in benchmarking tools
  - Performance comparison across crawling strategies
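
The repository includes `src/utils/rate_limiter.py` and `src/utils/error_handler.py`, but their interfaces are not documented here. As a rough illustration of how retries and rate limiting can be combined around a fetch, here is a minimal, self-contained sketch; the function and parameter names are hypothetical, and this is not MultiCrawl's actual implementation:

```python
# Illustrative only: a simple delay-based rate limiter with retries.
# fetch_with_retry and min_interval are hypothetical names, not MultiCrawl APIs.
import time
import urllib.request
from urllib.error import URLError


def fetch_with_retry(url: str, retries: int = 3, min_interval: float = 1.0) -> str:
    """Fetch a URL, waiting at least min_interval seconds between attempts."""
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            with urllib.request.urlopen(url, timeout=10) as response:
                return response.read().decode("utf-8", errors="replace")
        except URLError as error:
            last_error = error
            time.sleep(min_interval * attempt)  # back off a little more each attempt
    raise RuntimeError(f"Failed to fetch {url} after {retries} attempts") from last_error
```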

## Installation

```bash
# Clone the repository
git clone https://github.com/XenosWarlocks/MultiCrawl.git
cd MultiCrawl

# Install dependencies
pip install -r requirements.txt
```

## Quick Start

```python
import asyncio

from src.web_crawler_app import WebCrawlerApp


async def main():
    urls = [
        'https://example.com/jobs',
        'https://another-jobs-site.com/listings',
    ]

    # mode='async' selects the asynchronous crawler (see "Crawling Strategies" below)
    app = WebCrawlerApp(urls, mode='async')
    results = await app.run()
    print(results['report'])


asyncio.run(main())
```

## Crawling Strategies

### 1. Sequential Crawler
- Simple, single-threaded approach
- Best for small datasets or when order matters
- Lowest computational overhead

### 2. Threaded Crawler
- Uses multiple threads for concurrent processing
- Good balance between complexity and performance
- Suitable for I/O-bound tasks

### 3. Async Crawler
- Non-blocking, event-driven architecture
- Highest performance for large numbers of URLs
- Lower per-request resource overhead than thread-based crawling
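
The Quick Start above selects the asynchronous crawler with `mode='async'`. Assuming the same constructor argument also selects the other two strategies (the exact option strings for the sequential and threaded crawlers are not documented here and are an assumption), switching strategies might look like this:

```python
import asyncio

from src.web_crawler_app import WebCrawlerApp

urls = ['https://example.com/jobs']

# 'async' is the only mode shown in the Quick Start; 'sequential' and
# 'threaded' are assumed option names for the other two strategies.
app = WebCrawlerApp(urls, mode='threaded')

# As in the Quick Start, run() is a coroutine and is awaited.
results = asyncio.run(app.run())
print(results['report'])
```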

## Running Benchmarks

```bash
python benchmark.py
```

This will generate performance metrics and a visualization comparing different crawling strategies.
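
For a rough, standalone timing comparison (independent of `benchmark.py`), a sketch like the following can be used; it assumes the `WebCrawlerApp` interface from the Quick Start and the same hypothetical mode names as above:

```python
import asyncio
import time

from src.web_crawler_app import WebCrawlerApp

urls = ['https://example.com/jobs']  # substitute URLs you have permission to crawl

# 'async' is confirmed by the Quick Start; the other mode names are assumptions.
for mode in ('sequential', 'threaded', 'async'):
    app = WebCrawlerApp(urls, mode=mode)
    start = time.perf_counter()
    asyncio.run(app.run())  # run() is awaited, as in the Quick Start
    print(f"{mode:>10}: {time.perf_counter() - start:.2f}s")
```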

## Running Tests

```bash
pytest tests/
```

## Project Structure
```
MultiCrawl/
β”‚
β”œβ”€β”€ src/
β”‚   β”œβ”€β”€ __init__.py
β”‚   β”œβ”€β”€ crawler/
β”‚   β”‚   β”œβ”€β”€ __init__.py
β”‚   β”‚   β”œβ”€β”€ base_crawler.py
β”‚   β”‚   β”œβ”€β”€ sequential_crawler.py
β”‚   β”‚   β”œβ”€β”€ threaded_crawler.py
β”‚   β”‚   └── async_crawler.py
β”‚   β”‚
β”‚   β”œβ”€β”€ data_processing/
β”‚   β”‚   β”œβ”€β”€ __init__.py
β”‚   β”‚   β”œβ”€β”€ parser.py
β”‚   β”‚   β”œβ”€β”€ aggregator.py
β”‚   β”‚   └── report_generator.py
β”‚   β”‚
β”‚   β”œβ”€β”€ utils/
β”‚   β”‚   β”œβ”€β”€ __init__.py
β”‚   β”‚   β”œβ”€β”€ rate_limiter.py
β”‚   β”‚   β”œβ”€β”€ error_handler.py
β”‚   β”‚   └── config.py
β”‚   β”‚
β”‚   └── main.py
β”‚
β”œβ”€β”€ tests/
β”‚   β”œβ”€β”€ __init__.py
β”‚   β”œβ”€β”€ test_crawlers.py
β”‚   β”œβ”€β”€ test_parsers.py
β”‚   └── test_aggregators.py
β”‚
β”œβ”€β”€ requirements.txt
β”œβ”€β”€ README.md
└── benchmark.py
```

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

## Disclaimer

Respect website terms of service and robots.txt when using this crawler. Always ensure you have permission to crawl a website.
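
For example, Python's standard library can check a site's `robots.txt` before a URL is crawled. This is a general-purpose sketch, not a built-in MultiCrawl feature, and the user-agent string is a placeholder:

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser


def allowed_by_robots(url: str, user_agent: str = "MultiCrawlBot") -> bool:
    """Return True if the site's robots.txt permits user_agent to fetch url."""
    parts = urlparse(url)
    robots = RobotFileParser()
    robots.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    robots.read()  # fetch and parse robots.txt
    return robots.can_fetch(user_agent, url)


print(allowed_by_robots("https://example.com/jobs"))
```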