{"id":19700196,"url":"https://github.com/vijdaancoding/wreck-it-rag","last_synced_at":"2025-02-27T12:39:32.084Z","repository":{"id":253964922,"uuid":"845004722","full_name":"vijdaancoding/wreck-it-rag","owner":"vijdaancoding","description":"A tool to deconstruct unstructured data in PDFs into JSON for RAG ","archived":false,"fork":false,"pushed_at":"2024-08-24T13:00:56.000Z","size":214,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-01-10T10:46:01.391Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/vijdaancoding.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.md","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-08-20T11:54:14.000Z","updated_at":"2024-09-06T18:12:11.000Z","dependencies_parsed_at":"2024-08-22T23:16:35.748Z","dependency_job_id":null,"html_url":"https://github.com/vijdaancoding/wreck-it-rag","commit_stats":null,"previous_names":["vijdaancoding/wreck-it-rag"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vijdaancoding%2Fwreck-it-rag","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vijdaancoding%2Fwreck-it-rag/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vijdaancoding%2Fwreck-it-rag/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vijdaancoding%2Fwreck-it-rag/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/ho
sts/GitHub/owners/vijdaancoding","download_url":"https://codeload.github.com/vijdaancoding/wreck-it-rag/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":241014178,"owners_count":19894206,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-11T21:04:40.302Z","updated_at":"2025-02-27T12:39:32.063Z","avatar_url":"https://github.com/vijdaancoding.png","language":"Python","readme":"# **Wreck-it-RAG**\r\n\r\n\u003cimg src=\"Other/file.png\" width=\"150\" height=\"auto\" alt=\"Wreck-it-RAG Logo\"\u003e\r\n\r\n\r\nThis repo is an attempt to create an automated pipeline for extracting information from different documents and converting it into JSON.\r\n\r\n## **To-Do List**\r\n📝 Add OpenAI API Key Support\u003cbr\u003e\r\n📝 Switch to Django\u003cbr\u003e\r\n📝 ~~Make Streamlit editable to choose between OCR or LLM summaries~~\u003cbr\u003e\r\n📝 Concatenate JSON blocks for page-by-page chunking\u003cbr\u003e\r\n📝 ~~Use a package manager for requirements.txt~~\u003cbr\u003e\r\n📝 ~~Convert Tables from HTML to JSON~~\u003cbr\u003e\r\n📝 Integrate SQL database to store JSON\u003cbr\u003e\r\n📝 Look into Apache Spark or Hadoop\u003cbr\u003e\r\n\r\n## **Downloading UNSTRUCTURED.IO Dependencies**\r\n\r\nFollow [UNSTRUCTURED.IO's](https://docs.unstructured.io/open-source/installation/full-installation) own installation guide to download all dependencies.\r\n\r\n## Quick Summary of Installation Guide\r\n\r\n## **Windows**\r\n\r\n### 1. 
libmagic-dev\r\n\r\nUse WSL to run the following commands:\r\n```\r\nsudo apt update\r\nsudo apt install libmagic-dev\r\n```\r\n\r\n### 2. Poppler\r\n\r\nCheck out the [pdf2image docs](https://pdf2image.readthedocs.io/en/latest/installation.html) on how to install Poppler on various devices.\r\n\r\n### 3. LibreOffice\r\n\r\nCheck out the official page of [LibreOffice](https://www.libreoffice.org/download/download-libreoffice/) for download guides.\r\n\r\nOnce the `.msi` or `.exe` file is downloaded, follow the on-screen instructions.\r\n\r\n### 4. Tesseract\r\n\r\nThe latest installer for Tesseract on Windows can be found [here](https://github.com/UB-Mannheim/tesseract/wiki).\r\n\r\nMake sure to add `C:\\Program Files\\Tesseract-OCR` to your PATH.\r\n\r\n## **2. Installing pip Requirements**\r\n\r\nRun the following command to install all Python libraries:\r\n```\r\npip install -r requirements.txt\r\n```\r\n\r\n## **3. Create .env File**\r\n\r\nCreate a .env file with the following variable:\r\n```\r\nGEMINI_API_KEY = your-gemini-api-key-here\r\n```\r\n\r\n## **4. Run Streamlit App**\r\n\r\nRun the Streamlit app using the following command:\r\n```\r\nstreamlit run app.py\r\n```","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvijdaancoding%2Fwreck-it-rag","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fvijdaancoding%2Fwreck-it-rag","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvijdaancoding%2Fwreck-it-rag/lists"}