# Simple Web Crawler

A performant yet simple web crawler that's easy to use and extend. It supports sync and async requests with in-memory and disk caching for high performance.

## Installation

```bash
pip install --upgrade git+https://github.com/raj457036/Simple-Web-Crawler.git@main
```
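
To verify the install, you can import the package (the module name `simple_crawler` matches the usage example below):

```bash
python -c "import simple_crawler"
```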

## Usage

```python
from pydantic import HttpUrl

from simple_crawler import SimpleSyncCrawler, CrawlerConfig, InMemoryPageStorage

# Create a crawler with a config and storage
config = CrawlerConfig(
    entrypoint=HttpUrl("https://example.com"),
    content_type="md",
    from_root=True,
)

# In-memory storage
# you can also try `DiskPageStorage` for persistent storage and heavy volumes.
storage = InMemoryPageStorage()

# Run the crawler
crawler = SimpleSyncCrawler(config=config, storage=storage)
crawler.run()

# Print the crawled URLs or do something else with them
print(*crawler.page_storage.keys, sep="\n")
```
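
The in-memory backend keeps every crawled page in RAM. For persistence or heavy volumes, the comment in the snippet above points to `DiskPageStorage`. A minimal sketch of the swap, reusing `config` from above; the directory argument is an assumption for illustration, not a documented signature:

```python
from simple_crawler import SimpleSyncCrawler, DiskPageStorage

# `DiskPageStorage` is named in the comment above; the directory
# argument here is an assumption for illustration only.
storage = DiskPageStorage("./crawl_cache")

crawler = SimpleSyncCrawler(config=config, storage=storage)
crawler.run()
```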

- Async Example: [test/simple_async_crawler.py](test/simple_async_crawler.py)
- Sync Example: [test/simple_sync_crawler.py](test/simple_sync_crawler.py)
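
Building on those examples, here is a minimal sketch of async usage. `SimpleAsyncCrawler` and the awaitable `run()` are assumptions inferred from the example filename; treat `test/simple_async_crawler.py` as authoritative:

```python
import asyncio

from pydantic import HttpUrl

from simple_crawler import SimpleAsyncCrawler, CrawlerConfig, InMemoryPageStorage

# NOTE: `SimpleAsyncCrawler` and the awaitable `run()` are assumptions
# inferred from the example filename; see test/simple_async_crawler.py
# for the authoritative version.
async def main() -> None:
    config = CrawlerConfig(
        entrypoint=HttpUrl("https://example.com"),
        content_type="md",
        from_root=True,
    )
    storage = InMemoryPageStorage()

    crawler = SimpleAsyncCrawler(config=config, storage=storage)
    await crawler.run()

    # Crawled URLs are available through the same storage interface
    print(*crawler.page_storage.keys, sep="\n")

if __name__ == "__main__":
    asyncio.run(main())
```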

## Features

- [x] Sync and async requests
- [x] In-memory and disk storage
- [x] Configurable
- [x] Pydantic-based and type-annotated
- [x] Extensible
- [x] Customizable
- [x] High performance
- [x] Easy to use