https://github.com/reezuleanu/pdf_deconstructor

Decompose a PDF file based on its headers for RAG ingestion.
Host: GitHub
URL: https://github.com/reezuleanu/pdf_deconstructor
Owner: reezuleanu
License: mit
Created: 2025-05-02T17:55:29.000Z (about 1 year ago)
Default Branch: main
Last Pushed: 2025-05-02T18:49:23.000Z (about 1 year ago)
Last Synced: 2025-05-02T19:37:43.388Z (about 1 year ago)
Topics: pdf-document-processor, rag
Language: Python
Homepage:
Size: 13.7 KB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 3
Metadata Files:
- Readme: README.md
- License: LICENSE

README

# PDF Deconstructor

Decompose a PDF file based on its headers for RAG ingestion.

## What it does

It takes a PDF file, converts it into a hierarchy of sections and subsections based on the headers inside the file, and extracts all the links.
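To illustrate what a header-based hierarchy means, here is a minimal sketch of one common way to nest a flat, document-ordered list of headers using a stack. It is not taken from this project's source: the `Section` class, the `nest_headers` function, and the `(level, title)` input format are all hypothetical, and the library's actual approach may differ.

```python
from dataclasses import dataclass, field


@dataclass
class Section:
    """Hypothetical node type for illustration; not the library's own class."""
    level: int
    title: str
    children: list["Section"] = field(default_factory=list)


def nest_headers(headers: list[tuple[int, str]]) -> list[Section]:
    """Nest a flat, document-ordered list of (level, title) pairs into a tree."""
    roots: list[Section] = []
    stack: list[Section] = []  # path from a root down to the most recent header
    for level, title in headers:
        node = Section(level, title)
        # Pop until the top of the stack is a strictly shallower header.
        while stack and stack[-1].level >= level:
            stack.pop()
        if stack:
            stack[-1].children.append(node)
        else:
            roots.append(node)
        stack.append(node)
    return roots


# Assumed input: header levels as they might be detected in a document.
flat = [(1, "Intro"), (2, "Scope"), (2, "Terms"), (1, "Methods"), (3, "Details")]
tree = nest_headers(flat)
```

In this sketch a header that skips a level (here the level-3 "Details" under the level-1 "Methods") is simply attached to the nearest shallower header.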

## Main use

It is intended to be one stage of a larger pipeline that ingests unstructured data for RAG.

## How to use

```py
from pdf_deconstructor import Deconstructor as PDFDeconstructor

output = PDFDeconstructor.parse("file.pdf", start_page=1)

# to see the extracted tree
for header in output.content:
    print(header.tree())
```

Each header contains its raw text, its text in Markdown syntax, its extracted links, and its subheaders.
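As a rough sketch of how such a tree might feed a RAG pipeline, the example below flattens a header hierarchy into one chunk per header, each tagged with its breadcrumb path. The `Header` dataclass here is a hypothetical stand-in: the field names `title`, `raw_text`, `markdown`, `links`, and `subheaders` are assumptions for illustration, not the library's documented attributes, and `iter_chunks` is not part of the package.

```python
from dataclasses import dataclass, field
from typing import Iterator


@dataclass
class Header:
    """Hypothetical stand-in for a parsed header; field names are assumptions."""
    title: str
    raw_text: str = ""
    markdown: str = ""
    links: list[str] = field(default_factory=list)
    subheaders: list["Header"] = field(default_factory=list)


def iter_chunks(header: Header, path: tuple[str, ...] = ()) -> Iterator[dict]:
    """Yield one chunk per header, depth-first, tagged with its breadcrumb."""
    here = path + (header.title,)
    yield {"path": " > ".join(here), "text": header.markdown, "links": header.links}
    for sub in header.subheaders:
        yield from iter_chunks(sub, here)


# Hand-built example tree; real objects would come from the parser.
doc = Header("Guide", markdown="# Guide", subheaders=[
    Header("Install", markdown="## Install", links=["https://example.com"]),
])
chunks = list(iter_chunks(doc))
```

Keeping the breadcrumb alongside each chunk is one way to preserve section context after the tree is flattened for embedding.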
