https://github.com/mazzasaverio/dagster-uv-docker-aws
Dagster pipeline for extracting text from PDFs, generating structured data via OpenAI, and storing in PostgreSQL.
aws dagster docker openai structured-output terraform unstructured uv
Last synced: 6 months ago
- Host: GitHub
- URL: https://github.com/mazzasaverio/dagster-uv-docker-aws
- Owner: mazzasaverio
- Created: 2025-04-01T08:59:44.000Z (6 months ago)
- Default Branch: master
- Last Pushed: 2025-04-02T15:40:52.000Z (6 months ago)
- Last Synced: 2025-04-02T16:31:02.789Z (6 months ago)
- Topics: aws, dagster, docker, openai, structured-output, terraform, unstructured, uv
- Language: Python
- Homepage:
- Size: 21.3 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
## **Overview**
A template for building a Dagster pipeline that extracts text from PDF files, generates structured data with OpenAI's API, and stores the results in a PostgreSQL database. The project is designed for scalability and modularity, and it supports both local development and cloud deployments.
---
## **Features**
- **PDF Text Extraction**: Reads PDF files from local storage or S3 and extracts text using the `unstructured` library.
- **Structured Data Generation**: Processes extracted text with OpenAI to produce structured JSON data.
- **PostgreSQL Storage**: Stores structured data in a PostgreSQL database for querying and analysis.
- **Dagster Integration**: Leverages Dagster's software-defined assets (SDAs) for modular pipeline orchestration.
- **Cloud-Ready**: Supports AWS RDS for PostgreSQL and S3 for storage.
- **Extensible Design**: Easily add new steps or modify existing ones without disrupting the pipeline.

---
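To illustrate the "local storage or S3" choice, here is a minimal sketch of how an input URI could be classified before reading a PDF. It is not taken from the repository: `PdfLocation` and `resolve_location` are hypothetical names, and the `s3://bucket/key` convention is an assumption about how inputs might be addressed.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlparse


@dataclass(frozen=True)
class PdfLocation:
    """Hypothetical description of where a PDF lives."""

    backend: str           # "local" or "s3"
    bucket: Optional[str]  # S3 bucket name, None for local files
    key: str               # S3 object key or local filesystem path


def resolve_location(uri: str) -> PdfLocation:
    """Classify a URI as an S3 object or a local file (assumed convention)."""
    parsed = urlparse(uri)
    if parsed.scheme == "s3":
        # s3://bucket/path/to/file.pdf -> bucket="bucket", key="path/to/file.pdf"
        return PdfLocation("s3", parsed.netloc, parsed.path.lstrip("/"))
    if parsed.scheme in ("", "file"):
        # Plain paths and file:// URIs are treated as local files.
        return PdfLocation("local", None, parsed.path)
    raise ValueError(f"Unsupported storage scheme: {parsed.scheme!r}")
```

A reader function could then branch on `backend` to open the file directly or to fetch it from S3 first. Windows drive-letter paths would need extra handling, since `urlparse` reads the drive letter as a scheme.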
## **Pipeline Workflow**
The pipeline consists of three sequential steps:
1. **PDF Text Extraction**:
   - Reads PDF files from a configurable storage backend (local filesystem or S3).
   - Extracts text using the `unstructured` library.
   - Saves the extracted text as JSON files.
2. **Structured Data Generation**:
   - Processes the extracted text with OpenAI's API.
   - Generates structured data based on a predefined schema.
   - Saves the structured data as JSON files.
3. **PostgreSQL Storage**:
   - Ingests the structured JSON files into a PostgreSQL database.
   - Creates tables dynamically based on the schema if they do not exist.

---
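The dynamic table creation in step 3 can be sketched roughly as follows. This is an illustration under stated assumptions, not the repository's implementation: `SQL_TYPES`, `create_table_sql`, and `ingest` are hypothetical names, column types are inferred from one sample record, and an in-memory SQLite database stands in for PostgreSQL so the example runs without a server.

```python
import sqlite3

# Hypothetical mapping from Python value types to SQL column types.
SQL_TYPES = {str: "TEXT", int: "INTEGER", float: "REAL", bool: "BOOLEAN"}


def create_table_sql(table: str, record: dict) -> str:
    """Build a CREATE TABLE IF NOT EXISTS statement from one sample record."""
    for name in (table, *record):
        # Names are interpolated into SQL, so reject anything unusual.
        if not name.isidentifier():
            raise ValueError(f"Unsafe SQL identifier: {name!r}")
    columns = ", ".join(
        f"{name} {SQL_TYPES.get(type(value), 'TEXT')}"
        for name, value in record.items()
    )
    return f"CREATE TABLE IF NOT EXISTS {table} ({columns})"


def ingest(conn: sqlite3.Connection, table: str, records: list) -> int:
    """Create the table if needed, then insert every record."""
    if not records:
        return 0
    conn.execute(create_table_sql(table, records[0]))
    names = list(records[0])
    # SQLite uses "?" placeholders; a PostgreSQL driver such as psycopg uses "%s".
    placeholders = ", ".join("?" for _ in names)
    conn.executemany(
        f"INSERT INTO {table} ({', '.join(names)}) VALUES ({placeholders})",
        [tuple(record.get(name) for name in names) for record in records],
    )
    conn.commit()
    return len(records)
```

Because of `IF NOT EXISTS`, calling `ingest` again with the same table name appends rows instead of failing. A real deployment would more likely derive column types from the predefined schema than from a sample record.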
## **Setup Instructions**
### Prerequisites
1. Python 3.12+ installed.
2. PostgreSQL installed locally or an AWS RDS instance configured.
3. AWS CLI configured (if using S3 or RDS).
4. Docker installed (optional, for containerized deployments).
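As a quick sanity check of the prerequisites above, a small script along these lines can report what is available on the current machine. It is a hypothetical helper, not part of the repository, and it assumes the tools are exposed on `PATH` under their usual executable names (`psql`, `aws`, `docker`).

```python
import shutil
import sys


def check_prerequisites(min_python=(3, 12)) -> dict:
    """Report which prerequisites appear to be available locally."""
    return {
        # Compare the running interpreter against the required minimum.
        "python": sys.version_info >= min_python,
        # psql only matters for a local PostgreSQL setup, not for RDS.
        "psql": shutil.which("psql") is not None,
        # The AWS CLI is needed only when using S3 or RDS.
        "aws": shutil.which("aws") is not None,
        # Docker is optional.
        "docker": shutil.which("docker") is not None,
    }


if __name__ == "__main__":
    for name, available in check_prerequisites().items():
        print(f"{name}: {'found' if available else 'missing'}")
```

A `missing` entry for an optional tool is not an error. The check also only confirms that an executable exists, not that it is configured with valid credentials.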