https://github.com/mazzasaverio/doccrawl
Simple document crawler that harvests PDFs and documents from configured web sources.
asyncpg data-engineering docker logfire playwright postgresql pydantic-v2 python3 scrapegraphai
- Host: GitHub
- URL: https://github.com/mazzasaverio/doccrawl
- Owner: mazzasaverio
- License: mit
- Created: 2024-10-27T11:23:24.000Z (3 months ago)
- Default Branch: master
- Last Pushed: 2024-11-12T10:00:49.000Z (3 months ago)
- Last Synced: 2024-11-12T11:17:57.990Z (3 months ago)
- Topics: asyncpg, data-engineering, docker, logfire, playwright, postgresql, pydantic-v2, python3, scrapegraphai
- Language: Python
- Homepage:
- Size: 502 KB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Scrapy Frontier Crawler
A configurable web crawler built with Scrapy and Playwright for handling both static and dynamic content. The crawler can process different types of URLs and store results in a PostgreSQL database.
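The three URL types listed under Features can be pictured as a small dispatch on a type code. The following is a minimal sketch of that idea, not the project's actual code: the names `UrlType`, `SourceUrl`, `max_depth`, and `choose_strategy` are hypothetical and chosen only for illustration, and the returned labels stand in for the real fetching logic.

```python
from dataclasses import dataclass
from enum import IntEnum


class UrlType(IntEnum):
    """Type codes mirroring the three URL types in the Features list."""
    DIRECT_TARGET = 0  # the URL itself is the target document
    STATIC_SCAN = 1    # scan one static page for target URLs
    DYNAMIC_SCAN = 2   # scan a JavaScript-rendered page, following links to a depth


@dataclass(frozen=True)
class SourceUrl:
    """Hypothetical description of one configured source."""
    url: str
    type: int           # 0, 1, or 2; a UrlType member also works
    max_depth: int = 0  # assumed field, only meaningful for DYNAMIC_SCAN


def choose_strategy(source: SourceUrl) -> str:
    """Return a label for the processing path a source would take."""
    kind = UrlType(source.type)  # raises ValueError for unknown type codes
    if kind is UrlType.DIRECT_TARGET:
        return "store-directly"
    if kind is UrlType.STATIC_SCAN:
        return "scan-static-page"
    return f"scan-dynamic-page(depth={source.max_depth})"
```

Coercing the raw integer through `UrlType(...)` means a source with an unsupported type code fails early with a `ValueError` instead of silently falling through to the dynamic path.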
## Features
- 🔍 Three types of URL processing:
  - Type 0: Direct target URL processing
  - Type 1: Static page scanning for target URLs
  - Type 2: Dynamic page scanning with depth navigation
- 🎭 Playwright integration for JavaScript-rendered content
- 📊 PostgreSQL storage for crawled URLs and stats
- 🔧 YAML-based configuration
- 📝 Structured logging with Logfire
- 🐳 Docker support
- ☁️ Azure deployment ready with Terraform
## Prerequisites
- Python 3.11+
- PostgreSQL database
- [uv](https://github.com/astral-sh/uv) for package management
- Docker (optional)
## Contributing
1. Fork the repository
2. Create a feature branch
3. Commit your changes
4. Push to the branch
5. Create a Pull Request