Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/mohammadrezaamani/squirrel
Squirrel is a web crawler designed to collect all pages from Iranian websites, enabling you to download and store web page content in a structured format.
crawler iran python
Last synced: 4 days ago
- Host: GitHub
- URL: https://github.com/mohammadrezaamani/squirrel
- Owner: MohammadrezaAmani
- License: MIT
- Created: 2024-11-03T09:45:59.000Z (4 days ago)
- Default Branch: main
- Last Pushed: 2024-11-03T10:14:30.000Z (4 days ago)
- Last Synced: 2024-11-03T10:24:03.473Z (4 days ago)
- Topics: crawler, iran, python
- Language: Python
- Homepage:
- Size: 0 Bytes
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# 🐿️ Squirrel: Iran Web Crawler
Squirrel is a web crawler that allows you to collect all pages from Iranian websites. Using this crawler, you can download the content of web pages and store it in a specified structure.
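To make "download and store in a specified structure" concrete, here is a minimal sketch built on the two libraries the README requires, `aiohttp` and `aiofiles`. The `fetch_and_store` helper and the `host/path.html` storage layout are assumptions for illustration, not Squirrel's actual internals.

```python
# Minimal sketch: fetch one page and store it under a structured path.
# The layout (base_path/host/path.html) is an assumption, not Squirrel's.
import asyncio
from pathlib import Path
from urllib.parse import urlparse

import aiofiles
import aiohttp


async def fetch_and_store(session: aiohttp.ClientSession, url: str, base_path: str) -> None:
    async with session.get(url) as response:
        html = await response.text()

    # Derive a structured file path from the URL, e.g.
    # https://example.ir/news/1 -> ./data/example.ir/news/1.html
    parsed = urlparse(url)
    target = Path(base_path) / parsed.netloc / (parsed.path.strip("/") or "index")
    target = target.with_suffix(".html")
    target.parent.mkdir(parents=True, exist_ok=True)

    async with aiofiles.open(target, "w", encoding="utf-8") as f:
        await f.write(html)


async def demo() -> None:
    async with aiohttp.ClientSession() as session:
        await fetch_and_store(session, "https://example.ir/", "./data")


if __name__ == "__main__":
    asyncio.run(demo())
```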
## Features
- Support for concurrent requests to optimize crawling speed (see the sketch after this list)
- Ability to set batch size for processing URLs
- Test mode to verify URLs without downloading content
- Content storage in a specified path
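The first three features combine naturally in async Python: a semaphore caps the number of in-flight requests, the URL list is walked in fixed-size batches, and test mode can probe each URL with a lightweight `HEAD` request instead of downloading its body. The sketch below shows that combination under those assumptions; `check_url` and `crawl` are hypothetical names, not Squirrel's API.

```python
# Hedged sketch of concurrency + batching + test mode; illustrative only.
import asyncio

import aiohttp

MAX_CONCURRENT = 50  # mirrors the --max-concurrent flag
BATCH_SIZE = 10      # mirrors the --batch-size flag


async def check_url(session: aiohttp.ClientSession,
                    semaphore: asyncio.Semaphore, url: str):
    """Test mode: verify a URL responds without downloading its body."""
    async with semaphore:  # cap the number of in-flight requests
        async with session.head(url, allow_redirects=True) as response:
            return url, response.status


async def crawl(urls: list[str]) -> None:
    semaphore = asyncio.Semaphore(MAX_CONCURRENT)
    async with aiohttp.ClientSession() as session:
        # Walk the URL list in fixed-size batches so progress is
        # observable and memory use stays bounded.
        for start in range(0, len(urls), BATCH_SIZE):
            batch = urls[start:start + BATCH_SIZE]
            results = await asyncio.gather(
                *(check_url(session, semaphore, u) for u in batch),
                return_exceptions=True,
            )
            for result in results:
                print(result)


if __name__ == "__main__":
    asyncio.run(crawl(["https://example.ir/"]))
```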
## Requirements

- Python 3.11 or higher
- Libraries: `aiohttp` and `aiofiles`
## Installation

To install Squirrel's dependencies, simply use pip:
```bash
pip install aiohttp aiofiles
```
## Usage

To run the crawler, use the following command:
```bash
python -m squirrel [OPTIONS]
```
### Command-line Parameters

- `--max-concurrent`: Maximum number of concurrent requests (default: 50)
- `--batch-size`: Number of URLs processed in each batch (default: 10)
- `--skip-test`: Skip running the crawler in test mode (default: disabled)
- `--base-path`: Storage path for downloaded content (default: `./data`)
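These flags map naturally onto Python's `argparse`, with the documented defaults. The sketch below is one plausible wiring, a guess at the shape of the entry point rather than the project's actual CLI code.

```python
# One plausible argparse wiring for the documented flags; the real
# CLI inside the squirrel package may differ.
import argparse


def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(prog="squirrel")
    parser.add_argument("--max-concurrent", type=int, default=50,
                        help="Maximum number of concurrent requests")
    parser.add_argument("--batch-size", type=int, default=10,
                        help="Number of URLs processed in each batch")
    parser.add_argument("--skip-test", action="store_true",
                        help="Skip running the crawler in test mode")
    parser.add_argument("--base-path", default="./data",
                        help="Storage path for downloaded content")
    return parser.parse_args()


if __name__ == "__main__":
    print(parse_args())
```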
### Example

To run the crawler with 20 concurrent requests and save data in the `my_data` directory, use the following command:
```bash
python -m squirrel --max-concurrent 20 --batch-size 5 --skip-test --base-path "./my_data"
```

or, equivalently, from Python:
```python
from asyncio import run

from squirrel import main

if __name__ == "__main__":
    run(
        main(
            max_concurrent=50,
            batch_size=5,
            skip_test=False,
            base_path="./data",
        )
    )
```
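The keyword arguments to `main()` mirror the CLI flags one for one (`max_concurrent` for `--max-concurrent`, and so on), so anything you can do from the shell you can also drive from an async Python program, for example to embed the crawler in a larger pipeline.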
## Contributing

If you would like to contribute to this project, please open an issue or submit a pull request. All feedback and suggestions are welcome!
## License
This project is licensed under the MIT License. For more information, please refer to the LICENSE file.