Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/mohammadrezaamani/squirrel
Squirrel is a web crawler designed to collect all pages from Iranian websites, enabling you to download and store web page content in a structured format.
crawler iran python
Last synced: 16 days ago
- Host: GitHub
- URL: https://github.com/mohammadrezaamani/squirrel
- Owner: MohammadrezaAmani
- License: mit
- Created: 2024-11-03T09:45:59.000Z (2 months ago)
- Default Branch: main
- Last Pushed: 2024-11-03T10:14:30.000Z (2 months ago)
- Last Synced: 2024-12-21T17:12:39.256Z (16 days ago)
- Topics: crawler, iran, python
- Language: Python
- Homepage:
- Size: 534 KB
- Stars: 2
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# 🐿️ Squirrel: Iran Web Crawler
Squirrel is a web crawler that allows you to collect all pages from Iranian websites. Using this crawler, you can download the content of web pages and store it in a specified structure.
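To picture how a crawler of this kind can combine a cap on concurrent requests with batched URL processing (the two knobs this README exposes as `--max-concurrent` and `--batch-size`), here is a minimal sketch using only the standard library. It is not Squirrel's actual implementation: `crawl_in_batches`, `bounded_fetch`, and `fake_fetch` are hypothetical names, and the fetch function is a stand-in so the example runs without network access.

```python
import asyncio
from collections.abc import Awaitable, Callable


async def crawl_in_batches(
    urls: list[str],
    fetch: Callable[[str], Awaitable[str]],
    max_concurrent: int = 50,
    batch_size: int = 10,
) -> dict[str, str]:
    """Fetch URLs batch by batch, with at most max_concurrent fetches in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)
    results: dict[str, str] = {}

    async def bounded_fetch(url: str) -> None:
        # The semaphore caps the number of simultaneous requests.
        async with semaphore:
            results[url] = await fetch(url)

    # Walk the URL list one batch at a time.
    for start in range(0, len(urls), batch_size):
        batch = urls[start:start + batch_size]
        await asyncio.gather(*(bounded_fetch(url) for url in batch))
    return results


async def fake_fetch(url: str) -> str:
    """Hypothetical stand-in for a real HTTP request (no network access)."""
    await asyncio.sleep(0)
    return f"<html>{url}</html>"


urls = [f"https://example.com/page{i}" for i in range(7)]
pages = asyncio.run(crawl_in_batches(urls, fake_fetch, max_concurrent=3, batch_size=4))
print(len(pages))
```

In this sketch the semaphore only matters when `max_concurrent` is smaller than `batch_size`; how Squirrel itself combines the two settings is not documented here.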
## Features
- Support for concurrent requests to optimize crawling speed
- Ability to set batch size for processing URLs
- Test mode to verify URLs without downloading content
- Content storage in a specified path
## Requirements
- Python 3.11 or higher
- Libraries: `aiohttp` and `aiofiles`
## Installation
To install the required libraries, use pip:
```bash
pip install aiohttp aiofiles
```
## Usage
To run the crawler, use the following command:
```bash
python -m squirrel [OPTIONS]
```
### Command-line Parameters
- `--max-concurrent`: Maximum number of concurrent requests (default: 50)
- `--batch-size`: Number of URLs processed in each batch (default: 10)
- `--skip-test`: Skip the test-mode run (default: disabled)
- `--base-path`: Storage path for downloaded content (default: `./data`)
### Example
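The parameters listed above map naturally onto an `argparse` parser. The sketch below is an assumption about how such a command line could be defined, not Squirrel's actual source; `build_parser` is a hypothetical name, and only the flag names and defaults come from this README.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the options documented in this README."""
    parser = argparse.ArgumentParser(prog="squirrel")
    parser.add_argument("--max-concurrent", type=int, default=50,
                        help="Maximum number of concurrent requests")
    parser.add_argument("--batch-size", type=int, default=10,
                        help="Number of URLs processed in each batch")
    # A boolean switch: absent means the test-mode run is not skipped.
    parser.add_argument("--skip-test", action="store_true",
                        help="Skip the test-mode run")
    parser.add_argument("--base-path", default="./data",
                        help="Storage path for downloaded content")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch is runnable anywhere.
args = build_parser().parse_args(
    ["--max-concurrent", "20", "--batch-size", "5", "--skip-test", "--base-path", "./my_data"]
)
print(args.max_concurrent, args.batch_size, args.skip_test, args.base_path)
# prints: 20 5 True ./my_data
```

Note that `argparse` turns `--max-concurrent` into the attribute `max_concurrent`, which matches the keyword names used when calling the crawler from Python.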
To run the crawler with 20 concurrent requests and a batch size of 5, skip test mode, and save data in the `my_data` directory, use the following command:
```bash
python -m squirrel --max-concurrent 20 --batch-size 5 --skip-test --base-path "./my_data"
```
Or call the entry point from Python:
```python
from asyncio import run

from squirrel import main

if __name__ == "__main__":
    # main() is awaited via asyncio.run; the keywords mirror the CLI options.
    run(
        main(
            max_concurrent=50,
            batch_size=5,
            skip_test=False,
            base_path="./data",
        )
    )
```
## Contributing
If you would like to contribute to the development of this project, please create an issue or submit a pull request. All feedback and suggestions are welcome!
## License
This project is licensed under the MIT License. For more information, please refer to the LICENSE file.