https://github.com/reezuleanu/pdf_deconstructor
Decompose a PDF file based on its headers for RAG ingestion.
- Host: GitHub
- URL: https://github.com/reezuleanu/pdf_deconstructor
- Owner: reezuleanu
- License: mit
- Created: 2025-05-02T17:55:29.000Z (11 months ago)
- Default Branch: main
- Last Pushed: 2025-05-02T18:49:23.000Z (11 months ago)
- Last Synced: 2025-05-02T19:37:43.388Z (11 months ago)
- Topics: pdf-document-processor, rag
- Language: Python
- Homepage:
- Size: 13.7 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 3
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# PDF Deconstructor
Decompose a PDF file based on its headers for RAG ingestion.
## What it does
It takes a PDF file, converts it into a hierarchy of sections and subsections based on the headers inside the file, and extracts all the links.
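To illustrate the idea of a header-based hierarchy, here is a minimal sketch that nests a flat list of `(level, title)` pairs into a tree. It is not this library's implementation: `Section` and `build_tree` are hypothetical names used only for this illustration, and it assumes the heading levels have already been detected.

```python
from dataclasses import dataclass, field


@dataclass
class Section:
    """Hypothetical node type; not part of pdf_deconstructor."""
    level: int
    title: str
    children: list["Section"] = field(default_factory=list)


def build_tree(headings: list[tuple[int, str]]) -> list[Section]:
    """Nest (level, title) pairs so that deeper headings become children."""
    roots: list[Section] = []
    stack: list[Section] = []  # currently open sections, outermost first
    for level, title in headings:
        node = Section(level, title)
        # Close every open section at the same depth or deeper.
        while stack and stack[-1].level >= level:
            stack.pop()
        if stack:
            stack[-1].children.append(node)
        else:
            roots.append(node)
        stack.append(node)
    return roots


tree = build_tree([(1, "Intro"), (2, "Scope"), (2, "Terms"), (1, "Usage")])
print([s.title for s in tree])              # top-level sections
print([c.title for c in tree[0].children])  # subsections of the first one
```

The stack keeps track of which sections are still open, so each new heading attaches to the nearest shallower heading that precedes it.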
## Main use
The intended use is to be part of a larger ingestion pipeline of unstructured data for RAG purposes.
## How to use
```py
from pdf_deconstructor import Deconstructor as PDFDeconstructor

# Parse the PDF, beginning at start_page
output = PDFDeconstructor.parse("file.pdf", start_page=1)

# To see the extracted tree
for header in output.content:
    print(header.tree())
```
Each header contains its raw text, the same text in Markdown syntax, the extracted links, and its subheaders.
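As a rough sketch of how such a tree might feed a RAG pipeline, the example below flattens nested headers into one chunk per header, each tagged with its breadcrumb path. The README does not document the real attribute names, so `HeaderNode` and its fields (`title`, `markdown`, `links`, `children`) are assumptions made for illustration, and `iter_chunks` is a hypothetical helper rather than part of the library.

```python
from dataclasses import dataclass, field
from typing import Iterator


@dataclass
class HeaderNode:
    """Stand-in for a parsed header; the field names are assumptions."""
    title: str
    markdown: str
    links: list[str] = field(default_factory=list)
    children: list["HeaderNode"] = field(default_factory=list)


def iter_chunks(node: HeaderNode, path: tuple[str, ...] = ()) -> Iterator[dict]:
    """Yield one chunk per header, depth-first, with its breadcrumb path."""
    here = path + (node.title,)
    yield {"path": " > ".join(here), "text": node.markdown, "links": node.links}
    for child in node.children:
        yield from iter_chunks(child, here)


doc = HeaderNode(
    "Guide",
    "# Guide\nOverview.",
    children=[HeaderNode("Setup", "## Setup\nSee the site.", ["https://example.com"])],
)
chunks = list(iter_chunks(doc))
for chunk in chunks:
    print(chunk["path"], len(chunk["text"]), chunk["links"])
```

Keeping the breadcrumb path alongside each chunk lets a retriever show where a passage sits in the document, which is one reason to preserve the header hierarchy instead of splitting on fixed-size windows.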