https://github.com/raj457036/simple-web-crawler
A performant but simple web crawler that's easy to use and extend. It supports async and sync requests, with in-memory and disk caching for high performance.
asynchronous parallel-computing python scale web-crawler
Last synced: 10 months ago
- Host: GitHub
- URL: https://github.com/raj457036/simple-web-crawler
- Owner: raj457036
- License: mit
- Created: 2023-09-19T16:49:51.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2023-09-19T18:13:55.000Z (over 2 years ago)
- Last Synced: 2025-01-21T06:24:34.362Z (12 months ago)
- Topics: asynchronous, parallel-computing, python, scale, web-crawler
- Language: Python
- Homepage: https://github.com/raj457036/Simple-Web-Crawler
- Size: 30.3 KB
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: readme.md
README
# Simple Web Crawler
A performant but simple web crawler that's easy to use and extend. It supports async and sync requests, with in-memory and disk caching for high performance.
## Installation
```bash
pip install --upgrade git+https://github.com/raj457036/Simple-Web-Crawler.git@main
```
## Usage
```python
from pydantic import HttpUrl
from simple_crawler import SimpleSyncCrawler, CrawlerConfig, InMemoryPageStorage
# Create a crawler with a config and storage
config = CrawlerConfig(
    entrypoint=HttpUrl("https://example.com"),
    content_type="md",
    from_root=True,
)
# In-memory storage.
# `DiskPageStorage` is also available for persistent storage and heavy volume.
storage = InMemoryPageStorage()
# Run the crawler
crawler = SimpleSyncCrawler(config=config, storage=storage)
crawler.run()
# Print the results or do something else with them
print(*crawler.page_storage.keys, sep="\n")
```
- Async Example: [test/simple_async_crawler.py](test/simple_async_crawler.py)
- Sync Example: [test/simple_sync_crawler.py](test/simple_sync_crawler.py)
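
The linked async example lives in the repository and is not reproduced here. As a rough, library-independent illustration of what an async crawl loop can look like, the sketch below uses only `asyncio` from the standard library. The names `FAKE_SITE`, `fetch`, and `crawl` are hypothetical and are not part of `simple_crawler`; a dictionary stands in for real network requests.

```python
import asyncio

# Hypothetical stand-in for a website: each URL maps to the links found on that page.
FAKE_SITE = {
    "https://example.com/": ["https://example.com/a", "https://example.com/b"],
    "https://example.com/a": ["https://example.com/b", "https://example.com/c"],
    "https://example.com/b": [],
    "https://example.com/c": ["https://example.com/"],
}


async def fetch(url: str) -> list[str]:
    """Pretend to download a page and return the links it contains."""
    await asyncio.sleep(0.01)  # simulates network latency
    return FAKE_SITE.get(url, [])


async def crawl(entrypoint: str, max_concurrency: int = 2) -> set[str]:
    """Visit every page reachable from `entrypoint`, one link level at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)
    visited: set[str] = set()
    frontier = [entrypoint]

    async def bounded_fetch(url: str) -> list[str]:
        async with semaphore:  # cap the number of in-flight requests
            return await fetch(url)

    while frontier:
        visited.update(frontier)
        # Fetch the whole frontier concurrently, then keep only unseen links.
        results = await asyncio.gather(*(bounded_fetch(u) for u in frontier))
        frontier = sorted({link for links in results for link in links} - visited)
    return visited


if __name__ == "__main__":
    print(*sorted(asyncio.run(crawl("https://example.com/"))), sep="\n")
```

The semaphore is one simple way to bound concurrency so a crawl does not overwhelm the target host; the crawler in this repository may handle this differently.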
## Features
- [x] Sync and Async requests
- [x] In-memory and disk storage
- [x] Configurable
- [x] Pydantic and type annotated
- [x] Extensible
- [x] Customizable
- [x] High performance
- [x] Easy to use
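
To make the "in-memory and disk storage" item more concrete, here is a minimal sketch of how two storage backends can share one interface so the crawling code does not care which is in use. It is an assumption-laden illustration, not this project's implementation: `PageStore`, `MemoryStore`, `DiskStore`, and `save_page` are hypothetical names and do not correspond to the library's `InMemoryPageStorage` or `DiskPageStorage` classes.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Optional, Protocol


class PageStore(Protocol):
    """Hypothetical minimal interface a crawler could use to save pages."""

    def put(self, url: str, content: str) -> None: ...
    def get(self, url: str) -> Optional[str]: ...


class MemoryStore:
    """Keeps pages in a dict: fast, but lost when the process exits."""

    def __init__(self) -> None:
        self._pages: dict[str, str] = {}

    def put(self, url: str, content: str) -> None:
        self._pages[url] = content

    def get(self, url: str) -> Optional[str]:
        return self._pages.get(url)


class DiskStore:
    """Writes one file per page: slower, but persistent and not bound by RAM."""

    def __init__(self, root: Path) -> None:
        self._root = root
        self._root.mkdir(parents=True, exist_ok=True)

    def _path(self, url: str) -> Path:
        # Hash the URL so that any URL maps to a safe file name.
        return self._root / hashlib.sha256(url.encode("utf-8")).hexdigest()

    def put(self, url: str, content: str) -> None:
        self._path(url).write_text(content, encoding="utf-8")

    def get(self, url: str) -> Optional[str]:
        path = self._path(url)
        return path.read_text(encoding="utf-8") if path.exists() else None


def save_page(store: PageStore, url: str, content: str) -> None:
    """Works with either backend because both satisfy `PageStore`."""
    store.put(url, content)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        for store in (MemoryStore(), DiskStore(Path(tmp))):
            save_page(store, "https://example.com", "# Example")
            print(type(store).__name__, store.get("https://example.com"))
```

Under this design, switching from memory to disk is a one-line change at construction time, which mirrors how the usage example above passes a storage object to the crawler.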