https://github.com/datek/web-crawler
Performant, lean, highly customizable python web crawler
- Host: GitHub
- URL: https://github.com/datek/web-crawler
- Owner: DAtek
- License: mit
- Created: 2025-05-06T15:10:41.000Z (about 1 year ago)
- Default Branch: master
- Last Pushed: 2025-05-08T15:33:07.000Z (about 1 year ago)
- Last Synced: 2025-05-21T21:13:42.306Z (about 1 year ago)
- Language: Python
- Size: 40 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
[Codecov coverage report](https://codecov.io/gh/DAtek/web-crawler)
# Web Crawler
A performant, extensible, and lean web crawler that uses all available CPUs by default.
It uses an event loop for I/O and separate processes for analyzing pages.
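
The split between an event loop for I/O and processes for analysis can be sketched with the standard library alone. This is a minimal illustration of the general pattern, not the project's actual implementation: `download`, `analyze`, and `crawl` are hypothetical names, and the download is simulated with a short sleep instead of a real HTTP request.

```python
import asyncio
from concurrent.futures import ProcessPoolExecutor


def analyze(html: str) -> int:
    """CPU-bound step (hypothetical): count the anchor tags in a page."""
    return html.count("<a ")


async def download(url: str) -> str:
    """I/O-bound step, simulated with a sleep instead of a real HTTP request."""
    await asyncio.sleep(0.01)
    return f'<html><a href="{url}/next">next</a></html>'


async def crawl(urls: list[str]) -> list[int]:
    loop = asyncio.get_running_loop()
    # Without a max_workers argument, the pool size is derived from the CPU count.
    with ProcessPoolExecutor() as pool:
        # Downloads run concurrently on the event loop.
        pages = await asyncio.gather(*(download(url) for url in urls))
        # Analysis is handed off to worker processes so it does not block the loop.
        return await asyncio.gather(
            *(loop.run_in_executor(pool, analyze, page) for page in pages)
        )


if __name__ == "__main__":
    print(asyncio.run(crawl(["https://example.com/a", "https://example.com/b"])))
```

The `if __name__ == "__main__":` guard matters here: on platforms that start worker processes by re-importing the main module, unguarded top-level code would run again in every worker.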
## Batteries included
- Basic `httpx` page downloader
- `S3` page storage
- Local filesystem page storage
## Usage
- Have a look at `tests/integration/test_crawl.py`
- Implement your own `PageAnalyzer` and `PageDownloader` classes
- Optionally customize `structlog` logging, see [configuration](https://www.structlog.org/en/stable/configuration.html)
- Have fun!
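
As a rough idea of what custom downloader and analyzer classes might look like, here is a self-contained sketch. Everything in it is an assumption made for illustration: the class names `CannedPageDownloader` and `LinkPageAnalyzer`, the `AnalysisResult` type, and the `download` and `analyze` method names are hypothetical and not taken from the project. Check `tests/integration/test_crawl.py` for the real base classes and signatures.

```python
import asyncio
import re
from dataclasses import dataclass, field


@dataclass
class AnalysisResult:
    """Hypothetical result type: the links discovered on a page."""

    links: list[str] = field(default_factory=list)


class CannedPageDownloader:
    """Stand-in for a custom PageDownloader: serves pages from a dict, not HTTP."""

    def __init__(self, pages: dict[str, str]) -> None:
        self._pages = pages

    async def download(self, url: str) -> str:  # method name is an assumption
        return self._pages[url]


class LinkPageAnalyzer:
    """Stand-in for a custom PageAnalyzer: extracts absolute href targets."""

    _href = re.compile(r'href="(https?://[^"]+)"')

    def analyze(self, html: str) -> AnalysisResult:  # method name is an assumption
        return AnalysisResult(links=self._href.findall(html))


async def main() -> AnalysisResult:
    downloader = CannedPageDownloader(
        {"https://example.com": '<a href="https://example.com/about">About</a>'}
    )
    html = await downloader.download("https://example.com")
    return LinkPageAnalyzer().analyze(html)


if __name__ == "__main__":
    print(asyncio.run(main()).links)
```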
## Customization
All classes in the `modules` folder can be replaced with your own implementations.
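
One common way to make components replaceable like this is to have callers depend on an interface rather than on a concrete class. The sketch below shows the idea for page storage. The `PageStorage` protocol, its `save` method, and both implementations are hypothetical and only illustrate the pattern; the project's real interfaces may look different.

```python
from hashlib import sha256
from pathlib import Path
from tempfile import TemporaryDirectory
from typing import Protocol


class PageStorage(Protocol):
    """Hypothetical storage interface; the project's real one may differ."""

    def save(self, url: str, content: str) -> None: ...


class InMemoryPageStorage:
    """Keeps pages in a dict, which is handy in tests."""

    def __init__(self) -> None:
        self.pages: dict[str, str] = {}

    def save(self, url: str, content: str) -> None:
        self.pages[url] = content


class DirectoryPageStorage:
    """Writes each page to a file under a root directory."""

    def __init__(self, root: Path) -> None:
        self._root = root

    def save(self, url: str, content: str) -> None:
        # Hash the URL so the result is always a valid file name.
        name = sha256(url.encode()).hexdigest() + ".html"
        (self._root / name).write_text(content, encoding="utf-8")


def store_page(storage: PageStorage, url: str, content: str) -> None:
    """Depends only on the interface, so implementations are interchangeable."""
    storage.save(url, content)


if __name__ == "__main__":
    memory = InMemoryPageStorage()
    store_page(memory, "https://example.com", "<html></html>")
    with TemporaryDirectory() as tmp:
        store_page(DirectoryPageStorage(Path(tmp)), "https://example.com", "<html></html>")
```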