https://github.com/mazzasaverio/doccrawl
Simple document crawler that harvests PDFs and documents from configured web sources.
- Host: GitHub
- URL: https://github.com/mazzasaverio/doccrawl
- Owner: mazzasaverio
- License: MIT
- Created: 2024-10-27T11:23:24.000Z (over 1 year ago)
- Default Branch: master
- Last Pushed: 2024-11-12T10:00:49.000Z (over 1 year ago)
- Last Synced: 2025-08-14T13:42:11.323Z (9 months ago)
- Topics: asyncpg, data-engineering, docker, logfire, playwright, postgresql, pydantic-v2, python3, scrapegraphai
- Language: Python
- Homepage:
- Size: 502 KB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# Scrapy Frontier Crawler
A configurable web crawler built with Scrapy and Playwright for handling both static and dynamic content. The crawler processes three types of URLs and stores the results and crawl statistics in a PostgreSQL database.
## Features
- 🔍 Three types of URL processing:
  - Type 0: Direct target URL processing
  - Type 1: Static page scanning for target URLs
  - Type 2: Dynamic page scanning with depth navigation
- 🎭 Playwright integration for JavaScript-rendered content
- 📊 PostgreSQL storage for crawled URLs and stats
- 🔧 YAML-based configuration
- 📝 Structured logging with Logfire
- 🐳 Docker support
- ☁️ Azure deployment ready with Terraform
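The three URL types above could be expressed in the YAML configuration along these lines. This is a hypothetical sketch: the file path, key names, and overall schema are illustrative assumptions, not the project's actual configuration format.

```yaml
# config/crawler.yaml (illustrative path and schema)
categories:
  - name: reports
    urls:
      - url: https://example.org/annual-report.pdf
        type: 0          # direct target URL: fetch and store as-is
      - url: https://example.org/publications
        type: 1          # static page: scan the HTML for target links
      - url: https://example.org/archive
        type: 2          # dynamic page: render with Playwright and
        max_depth: 2     #   follow links up to this depth
target_extensions: [.pdf, .doc, .docx]
```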
## Prerequisites
- Python 3.11+
- PostgreSQL database
- [uv](https://github.com/astral-sh/uv) for package management
- Docker (optional)
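Given the prerequisites above, a local setup with uv might look like the following. The database URL is illustrative; adjust it to your PostgreSQL instance.

```shell
git clone https://github.com/mazzasaverio/doccrawl.git
cd doccrawl

# Install dependencies into a local virtual environment
uv sync

# Install the browser Playwright needs for dynamic (type 2) pages
uv run playwright install chromium

# Point the crawler at your PostgreSQL instance (URL is illustrative)
export DATABASE_URL="postgresql://user:password@localhost:5432/doccrawl"
```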
## Contributing
1. Fork the repository
2. Create a feature branch
3. Commit your changes
4. Push to the branch
5. Create a Pull Request
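In git terms, steps 2–4 above are roughly (branch name and commit message are placeholders):

```shell
git checkout -b feature/my-change     # create a feature branch
git commit -am "Describe the change"  # commit your work
git push origin feature/my-change     # then open a Pull Request on GitHub
```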