https://github.com/mazzasaverio/dagster-uv-docker-aws

Dagster pipeline for extracting text from PDFs, generating structured data via OpenAI, and storing in PostgreSQL.

aws dagster docker openai structured-output terraform unstructured uv

Last synced: 6 months ago

Host: GitHub
URL: https://github.com/mazzasaverio/dagster-uv-docker-aws
Owner: mazzasaverio
Created: 2025-04-01T08:59:44.000Z (6 months ago)
Default Branch: master
Last Pushed: 2025-04-02T15:40:52.000Z (6 months ago)
Last Synced: 2025-04-02T16:31:02.789Z (6 months ago)
Topics: aws, dagster, docker, openai, structured-output, terraform, unstructured, uv
Language: Python
Homepage:
Size: 21.3 MB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

README

## **Overview**

This template implements a Dagster pipeline that extracts text from PDF files, generates structured data with OpenAI's API, and stores the results in a PostgreSQL database. The project is designed for scalability and modularity, which makes it suitable for both local development and cloud deployments.

---

## **Features**

- **PDF Text Extraction**: Reads PDF files from local storage or S3 and extracts text using the `unstructured` library.
- **Structured Data Generation**: Processes extracted text with OpenAI to produce structured JSON data.
- **PostgreSQL Storage**: Stores structured data in a PostgreSQL database for querying and analysis.
- **Dagster Integration**: Leverages Dagster's software-defined assets (SDAs) for modular pipeline orchestration.
- **Cloud-Ready**: Supports AWS RDS for PostgreSQL and S3 for storage.
- **Extensible Design**: Easily add new steps or modify existing ones without disrupting the pipeline.
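
The local-or-S3 choice listed above can be pictured as a small storage interface with interchangeable implementations. The following is only a minimal sketch under stated assumptions: `StorageBackend`, `LocalStorage`, `make_storage`, and the `kind` values are hypothetical names chosen for illustration, not the repository's actual code, and the S3 branch is deliberately left unimplemented.

```python
from __future__ import annotations

from pathlib import Path
from typing import Protocol


class StorageBackend(Protocol):
    """Hypothetical interface; the repository's real abstraction may differ."""

    def list_pdfs(self) -> list[str]: ...

    def read_bytes(self, key: str) -> bytes: ...


class LocalStorage:
    """Reads PDFs from a directory on the local filesystem."""

    def __init__(self, root: str | Path) -> None:
        self.root = Path(root)

    def list_pdfs(self) -> list[str]:
        # Keys are paths relative to the root, sorted for deterministic runs.
        return sorted(
            p.relative_to(self.root).as_posix() for p in self.root.rglob("*.pdf")
        )

    def read_bytes(self, key: str) -> bytes:
        return (self.root / key).read_bytes()


def make_storage(kind: str, location: str) -> StorageBackend:
    """Pick a backend from configuration (the `kind` values are assumptions)."""
    if kind == "local":
        return LocalStorage(location)
    if kind == "s3":
        # An S3 backend would expose the same two methods via an AWS SDK client.
        raise NotImplementedError("S3 backend is omitted from this sketch")
    raise ValueError(f"unknown storage kind: {kind!r}")
```

Because the extraction step would depend only on the two interface methods, switching between local files and S3 becomes a configuration change rather than a code change.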

---

## **Pipeline Workflow**

The pipeline consists of three sequential steps:

1. **PDF Text Extraction**:
   - Reads PDF files from a configurable storage backend (local filesystem or S3).
   - Extracts text using the `unstructured` library.
   - Saves the extracted text as JSON files.
2. **Structured Data Generation**:
   - Processes the extracted text with OpenAI's API.
   - Generates structured data based on a predefined schema.
   - Saves the structured data as JSON files.
3. **PostgreSQL Storage**:
   - Ingests the structured JSON files into a PostgreSQL database.
   - Creates tables dynamically based on the schema if they do not exist.
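
To make the third step more concrete, here is one way dynamic table creation could work: derive a `CREATE TABLE IF NOT EXISTS` statement and a parameterized `INSERT` from a structured record. This is a sketch under assumptions, not the project's implementation. The type mapping in `PG_TYPES`, the helper names, and the sample record are all hypothetical, and the `%s` placeholders assume a psycopg-style driver.

```python
from __future__ import annotations

import json

# Assumed mapping from JSON value types to PostgreSQL column types.
PG_TYPES = {bool: "BOOLEAN", int: "BIGINT", float: "DOUBLE PRECISION", str: "TEXT"}


def quote_ident(name: str) -> str:
    """Quote an identifier for PostgreSQL (double quotes, embedded quotes doubled)."""
    return '"' + name.replace('"', '""') + '"'


def create_table_sql(table: str, record: dict) -> str:
    """Build CREATE TABLE IF NOT EXISTS from one sample record."""
    columns = []
    for key, value in record.items():
        # Anything outside the mapping (objects, arrays, null) falls back to JSONB.
        pg_type = PG_TYPES.get(type(value), "JSONB")
        columns.append(f"{quote_ident(key)} {pg_type}")
    return f"CREATE TABLE IF NOT EXISTS {quote_ident(table)} ({', '.join(columns)})"


def insert_sql(table: str, record: dict) -> tuple[str, list]:
    """Build a parameterized INSERT; nested values are serialized to JSON text."""
    cols = ", ".join(quote_ident(k) for k in record)
    marks = ", ".join(["%s"] * len(record))
    params = [
        json.dumps(v) if isinstance(v, (dict, list)) else v for v in record.values()
    ]
    return f"INSERT INTO {quote_ident(table)} ({cols}) VALUES ({marks})", params
```

The statements would then be passed to a database cursor, for example `cursor.execute(*insert_sql("documents", record))`. Keeping values as parameters rather than interpolating them into the SQL string avoids injection problems with text produced by the model.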

---

## **Setup Instructions**

### Prerequisites

1. Python 3.12+ installed.
2. PostgreSQL installed locally or an AWS RDS instance configured.
3. AWS CLI configured (if using S3 or RDS).
4. Docker installed (optional; needed only for containerized deployments).
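
A quick way to see which of these prerequisites are visible on a machine is a short check script. The sketch below is illustrative only: `check_prerequisites` is a hypothetical helper, and the executable names (`psql`, `aws`, `docker`) are the usual ones but may differ on your system. Finding `psql` only indicates that the PostgreSQL client is on the `PATH`, not that a server is running.

```python
from __future__ import annotations

import shutil
import sys


def check_prerequisites() -> dict[str, bool]:
    """Report which prerequisites are visible from this interpreter."""
    return {
        "python>=3.12": sys.version_info >= (3, 12),
        # Executable names are assumptions; adjust them if yours differ.
        "psql": shutil.which("psql") is not None,
        "aws": shutil.which("aws") is not None,
        "docker": shutil.which("docker") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {'found' if ok else 'missing'}")
```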
