https://github.com/mazzasaverio/doccrawl
Simple document crawler that harvests PDFs and documents from configured web sources.
- Host: GitHub
- URL: https://github.com/mazzasaverio/doccrawl
- Owner: mazzasaverio
- License: MIT
- Created: 2024-10-27T11:23:24.000Z (over 1 year ago)
- Default Branch: master
- Last Pushed: 2024-11-12T10:00:49.000Z (over 1 year ago)
- Last Synced: 2025-08-14T13:42:11.323Z (9 months ago)
- Topics: asyncpg, data-engineering, docker, logfire, playwright, postgresql, pydantic-v2, python3, scrapegraphai
- Language: Python
- Homepage:
- Size: 502 KB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# Scrapy Frontier Crawler
A configurable web crawler built with Scrapy and Playwright for handling both static and dynamic content. The crawler processes three types of URLs and stores the results and crawl statistics in a PostgreSQL database.
## Features
- 🔍 Three types of URL processing:
  - Type 0: Direct target URL processing
  - Type 1: Static page scanning for target URLs
  - Type 2: Dynamic page scanning with depth navigation
- 🎭 Playwright integration for JavaScript-rendered content
- 📊 PostgreSQL storage for crawled URLs and stats
- 🔧 YAML-based configuration
- 📝 Structured logging with Logfire
- 🐳 Docker support
- ☁️ Azure deployment ready with Terraform
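The three URL types above could be expressed in the YAML configuration along these lines. This is a hypothetical sketch: the file path, key names, and overall schema are illustrative assumptions, not the project's actual configuration format.

```yaml
# config/crawler.yaml (illustrative path and schema)
categories:
  - name: reports
    urls:
      - url: https://example.org/annual-report.pdf
        type: 0          # direct target URL: fetch and store as-is
      - url: https://example.org/publications
        type: 1          # static page: scan the HTML for target links
      - url: https://example.org/archive
        type: 2          # dynamic page: render with Playwright and
        max_depth: 2     #   follow links up to this depth
target_extensions: [.pdf, .doc, .docx]
```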
## Prerequisites
- Python 3.11+
- PostgreSQL database
- [uv](https://github.com/astral-sh/uv) for package management
- Docker (optional)
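Given the prerequisites above, a local setup with uv might look like the following. The database URL is illustrative; adjust it to your PostgreSQL instance.

```shell
git clone https://github.com/mazzasaverio/doccrawl.git
cd doccrawl

# Install dependencies into a local virtual environment
uv sync

# Install the browser Playwright needs for dynamic (type 2) pages
uv run playwright install chromium

# Point the crawler at your PostgreSQL instance (URL is illustrative)
export DATABASE_URL="postgresql://user:password@localhost:5432/doccrawl"
```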
## Contributing
1. Fork the repository
2. Create a feature branch
3. Commit your changes
4. Push to the branch
5. Create a Pull Request
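In git terms, steps 2–4 above are roughly (branch name and commit message are placeholders):

```shell
git checkout -b feature/my-change     # create a feature branch
git commit -am "Describe the change"  # commit your work
git push origin feature/my-change     # then open a Pull Request on GitHub
```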