https://github.com/vijdaancoding/wreck-it-rag
A tool to deconstruct unstructured data in PDFs into JSON for RAG
- Host: GitHub
- URL: https://github.com/vijdaancoding/wreck-it-rag
- Owner: vijdaancoding
- License: mit
- Created: 2024-08-20T11:54:14.000Z (3 months ago)
- Default Branch: master
- Last Pushed: 2024-08-24T13:00:56.000Z (3 months ago)
- Last Synced: 2024-08-25T14:03:20.935Z (3 months ago)
- Language: Python
- Size: 209 KB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE.md
README
# **Wreck-it-RAG**
This repo is an attempt to create an automated pipeline for extracting information from different documents and converting it into JSON.
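To make the goal concrete, the sketch below groups extracted text elements by page and serializes them to JSON. The element fields (`type`, `text`, `page_number`), the sample data, and the function name `elements_to_page_json` are assumptions for illustration only; they are not taken from this repo's code, and the actual output schema may differ.

```python
import json
from collections import defaultdict

# Hypothetical element records; the real pipeline's schema may differ.
elements = [
    {"type": "Title", "text": "Quarterly Report", "page_number": 1},
    {"type": "NarrativeText", "text": "Revenue grew in Q2.", "page_number": 1},
    {"type": "Table", "text": "Region | Sales", "page_number": 2},
]


def elements_to_page_json(elements):
    """Group element dicts by page number and return a JSON string."""
    pages = defaultdict(list)
    for el in elements:
        pages[el["page_number"]].append({"type": el["type"], "text": el["text"]})
    # Emit pages in ascending order so the output is stable.
    payload = [{"page": n, "elements": pages[n]} for n in sorted(pages)]
    return json.dumps(payload, indent=2)


print(elements_to_page_json(elements))
```

Grouping by page keeps each chunk tied to its source location, which is one plausible way to approach page-by-page chunking for retrieval.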
## **To-Do List**
📝 Add OpenAI API Key Support
📝 Switch to Django
📝 ~~Make streamlit editable to choose between OCR or LLM summaries~~
📝 Concatenate JSON blocks for page-by-page chunking
📝 ~~Use a package manager for requirements.txt~~
📝 ~~Convert Tables from HTML to JSON~~
📝 Integrate SQL database to store JSON
📝 Look into Apache Spark or Hadoop
## **Downloading UNSTRUCTURED.IO Dependencies**
Follow [UNSTRUCTURED.IO's](https://docs.unstructured.io/open-source/installation/full-installation) own installation guide to download all dependencies.
## Quick Summary of Installation Guide
## **Windows**
### 1. libmagic-dev
Use WSL to enter the following commands
```
sudo apt update
sudo apt install libmagic-dev
```
### 2. Poppler
Check out the [pdf2image docs](https://pdf2image.readthedocs.io/en/latest/installation.html) on how to install Poppler on various devices
### 3. libreoffice
Check out the official page of [libreoffice](https://www.libreoffice.org/download/download-libreoffice/) for download guides.
Once the `.msi` or `.exe` file is downloaded, run it and follow the on-screen instructions.
### 4. Tesseract
The latest installer for Tesseract on Windows can be found [here](https://github.com/UB-Mannheim/tesseract/wiki).
Make sure to add `C:\Program Files\Tesseract-OCR` to your `PATH`.
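After installing the tools above, a quick check that their executables are reachable on `PATH` can save debugging time later. The sketch below is not part of this repo. The executable names (`tesseract`, `pdftoppm` for Poppler, `soffice` for LibreOffice) are assumptions based on each tool's usual command-line binary. libmagic is a library rather than an executable, so it is not covered here.

```python
import shutil

# Executable names are assumptions based on each tool's usual CLI binary.
REQUIRED_TOOLS = {
    "Tesseract": "tesseract",
    "Poppler": "pdftoppm",
    "LibreOffice": "soffice",
}


def find_missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the names of tools whose executable is not found on PATH."""
    return [name for name, exe in tools.items() if which(exe) is None]


missing = find_missing_tools()
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("All external tools were found on PATH.")
```

If a tool is reported missing right after installation, opening a new terminal is often enough for the updated `PATH` to take effect.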
## **2. Installing pip Requirements**
Run the following command to install all Python libraries:
```
pip install -r requirements.txt
```
## **3. Create .env File**
Create a `.env` file with the following variable:
```
GEMINI_API_KEY = your-gemini-api-key-here
```
## **4. Run Streamlit App**
Run the Streamlit app using the following command:
```
streamlit run app.py
```
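Before launching, it may help to confirm that the `.env` file from step 3 actually defines `GEMINI_API_KEY`. The sketch below is a simplified, standard-library check and is not how the app itself loads the key. The helper name `read_env_value` is hypothetical, and the parser only handles plain `KEY = value` lines.

```python
from pathlib import Path


def read_env_value(text, key):
    """Return the value assigned to `key` in .env-style text, or None."""
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, _, value = line.partition("=")
        if name.strip() == key:
            # Tolerate spaces around "=" and optional surrounding quotes.
            return value.strip().strip("'\"")
    return None


env_path = Path(".env")
text = env_path.read_text(encoding="utf-8") if env_path.exists() else ""
if read_env_value(text, "GEMINI_API_KEY"):
    print("GEMINI_API_KEY found in .env")
else:
    print("GEMINI_API_KEY is missing or empty in .env")
```

This only checks that a non-empty value is present. It cannot tell whether the key is valid, and the placeholder text from step 3 would also pass.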