{"id":23659044,"url":"https://github.com/xenoswarlocks/multicrawl","last_synced_at":"2025-10-14T22:11:44.012Z","repository":{"id":268240828,"uuid":"903739282","full_name":"XenosWarlocks/MultiCrawl","owner":"XenosWarlocks","description":"MultiCrawl is a powerful and flexible web crawling framework that provides multiple crawling strategies to suit different use cases and performance requirements. The library supports sequential, threaded, and asynchronous crawling methods, making it adaptable to various data extraction needs.","archived":false,"fork":false,"pushed_at":"2025-03-30T13:30:36.000Z","size":581,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-05-20T13:56:08.412Z","etag":null,"topics":["python","threading","web","webcrawler"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/XenosWarlocks.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2024-12-15T12:48:24.000Z","updated_at":"2025-03-30T13:30:39.000Z","dependencies_parsed_at":"2024-12-15T13:40:51.931Z","dependency_job_id":"8b50ac33-4696-417b-9643-31b2c6a55e82","html_url":"https://github.com/XenosWarlocks/MultiCrawl","commit_stats":null,"previous_names":["xenoswarlocks/multicrawl"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/XenosWarlocks/MultiCrawl","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/XenosWarlocks%2FMultiCrawl","tags_url":"https://repos.ecosyste.
ms/api/v1/hosts/GitHub/repositories/XenosWarlocks%2FMultiCrawl/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/XenosWarlocks%2FMultiCrawl/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/XenosWarlocks%2FMultiCrawl/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/XenosWarlocks","download_url":"https://codeload.github.com/XenosWarlocks/MultiCrawl/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/XenosWarlocks%2FMultiCrawl/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":279021742,"owners_count":26087053,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-14T02:00:06.444Z","response_time":60,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["python","threading","web","webcrawler"],"created_at":"2024-12-29T02:02:19.162Z","updated_at":"2025-10-14T22:11:44.007Z","avatar_url":"https://github.com/XenosWarlocks.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\r\n# MultiCrawl 🕸️\r\n\r\n## Overview\r\n\r\nMultiCrawl is a powerful and flexible web crawling framework that provides multiple crawling strategies to suit different use cases and performance requirements. 
The library supports sequential, threaded, and asynchronous crawling methods, making it adaptable to various data extraction needs.\r\n\r\n## Features\r\n\r\n- 🚀 **Multiple Crawling Strategies**\r\n  - Sequential Crawling: Simple, straightforward approach\r\n  - Threaded Crawling: Improved performance with concurrent processing\r\n  - Asynchronous Crawling: High-performance, non-blocking I/O operations\r\n\r\n- 📊 **Advanced Data Processing**\r\n  - Intelligent parsing for HTML, JSON, and text content\r\n  - Metadata extraction\r\n  - Keyword detection\r\n  - Language identification\r\n\r\n- 🛡️ **Robust Error Handling**\r\n  - URL validation\r\n  - Retry mechanisms\r\n  - Rate limiting\r\n  - Comprehensive error logging\r\n\r\n- 📈 **Performance Benchmarking**\r\n  - Built-in benchmarking tools\r\n  - Performance comparison across different crawling strategies\r\n\r\n## Installation\r\n\r\n```bash\r\n# Clone the repository and enter it\r\ngit clone https://github.com/XenosWarlocks/MultiCrawl.git\r\ncd MultiCrawl\r\n\r\n# Install dependencies\r\npip install -r requirements.txt\r\n```\r\n\r\n## Quick Start\r\n\r\n```python\r\nimport asyncio\r\n\r\nfrom src.web_crawler_app import WebCrawlerApp\r\n\r\nasync def main():\r\n    # URLs to crawl (placeholders; replace with your own targets)\r\n    urls = [\r\n        'https://example.com/jobs',\r\n        'https://another-jobs-site.com/listings'\r\n    ]\r\n\r\n    # Crawl with the asynchronous strategy and print the generated report\r\n    app = WebCrawlerApp(urls, mode='async')\r\n    results = await app.run()\r\n    print(results['report'])\r\n\r\nasyncio.run(main())\r\n```\r\n\r\n## Crawling Strategies\r\n\r\n### 1. Sequential Crawler\r\n- Simple, single-threaded approach\r\n- Best for small datasets or when order matters\r\n- Lowest computational overhead\r\n\r\n### 2. Threaded Crawler\r\n- Uses multiple threads for concurrent processing\r\n- Good balance between complexity and performance\r\n- Suitable for I/O-bound tasks\r\n\r\n### 3. 
Async Crawler\r\n- Non-blocking, event-driven architecture\r\n- Highest performance for large numbers of URLs\r\n- Minimal resource consumption\r\n\r\n## Running Benchmarks\r\n\r\n```bash\r\npython benchmark.py\r\n```\r\n\r\nThis generates performance metrics and a visualization comparing the crawling strategies.\r\n\r\n## Running Tests\r\n\r\n```bash\r\npytest tests/\r\n```\r\n\r\n## Project Structure\r\n\r\n```\r\nMultiCrawl/\r\n│\r\n├── src/\r\n│   ├── __init__.py\r\n│   ├── crawler/\r\n│   │   ├── __init__.py\r\n│   │   ├── base_crawler.py\r\n│   │   ├── sequential_crawler.py\r\n│   │   ├── threaded_crawler.py\r\n│   │   └── async_crawler.py\r\n│   │\r\n│   ├── data_processing/\r\n│   │   ├── __init__.py\r\n│   │   ├── parser.py\r\n│   │   ├── aggregator.py\r\n│   │   └── report_generator.py\r\n│   │\r\n│   ├── utils/\r\n│   │   ├── __init__.py\r\n│   │   ├── rate_limiter.py\r\n│   │   ├── error_handler.py\r\n│   │   └── config.py\r\n│   │\r\n│   └── main.py\r\n│\r\n├── tests/\r\n│   ├── __init__.py\r\n│   ├── test_crawlers.py\r\n│   ├── test_parsers.py\r\n│   └── test_aggregators.py\r\n│\r\n├── requirements.txt\r\n├── README.md\r\n└── benchmark.py\r\n```\r\n\r\n## Workflow\r\n\r\n![Diagram](Assets/diagram.png)\r\n\r\n## Contributing\r\n\r\nContributions are welcome! Please feel free to submit a Pull Request.\r\n\r\n## Disclaimer\r\n\r\nRespect website terms of service and robots.txt when using this crawler. Always ensure you have permission to crawl a website.\r\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fxenoswarlocks%2Fmulticrawl","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fxenoswarlocks%2Fmulticrawl","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fxenoswarlocks%2Fmulticrawl/lists"}