https://github.com/liamca/gpt4ocontentextraction
Using Azure OpenAI GPT 4o to extract information such as text, tables and charts from Documents to Markdown
https://github.com/liamca/gpt4ocontentextraction
Last synced: 10 months ago
JSON representation
Using Azure OpenAI GPT 4o to extract information such as text, tables and charts from Documents to Markdown
- Host: GitHub
- URL: https://github.com/liamca/gpt4ocontentextraction
- Owner: liamca
- License: mit
- Created: 2024-06-13T18:30:22.000Z (about 2 years ago)
- Default Branch: main
- Last Pushed: 2024-10-15T16:09:16.000Z (over 1 year ago)
- Last Synced: 2024-10-17T20:56:24.343Z (over 1 year ago)
- Language: Jupyter Notebook
- Size: 4.62 MB
- Stars: 16
- Watchers: 2
- Forks: 13
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# Azure OpenAI GPT-4o Content Extraction
Using Azure OpenAI GPT 4o to extract information such as text, tables and charts from Documents (PDF, DOC, DOCX, PPT, PPTX, XLS, XLSX, etc) to Markdown.
There is a lot if information contained within documents such as PDF's, PPT's, and Excel Spreadsheets beyond just text, such as images, tables and charts. The goal of this repo is to show how Azure OpenAI GPT 4o can be used to extract all of this information into a Markdown file to be used for downstream processes such as RAG (Chat on your Data) or Workflows.
Here is an example slide from the included [PPT](https://github.com/liamca/GPT4oContentExtraction/raw/main/MicrosoftSlidesFY24Q3.pptx).
When converted to Markdown, notice how the charts are converted to Markdown tables which are easily understandable by Azure OpenAI GPT4.
## Requirements
* Azure OpenAI with GPT 4o enabled
* Linux (Ubuntu) based Jupyter Notebook
* (Optional) Azure AI Search - To test the ability to answer questions
* (Optional) LibreOffice - IF you wish to support file types other than PDF
## Processing Pipeline
## Geting Started
1) Ensure you have installed requirements.txt
```code
pip install -r requirements.txt
```
2) Install LibreOffice by running [libreoffice.ipynb](https://github.com/liamca/GPT4oContentExtraction/blob/main/install-libreoffice.ipynb)
3) Configure [config.json](https://github.com/liamca/GPT4oContentExtraction/blob/main/config.json) with your Azure Service settings
4) Convert the included sample PPT file by running [convert-doc-to-markdown.ipynb](https://github.com/liamca/GPT4oContentExtraction/blob/main/convert-doc-to-markdown.ipynb). This will convert each page to a set of Markdown files.
***(Optional Steps)***
5) Create an Azure AI Search Index to use for RAG based Chat over this content by running [index-to-azure-ai-search.ipynb](https://github.com/liamca/GPT4oContentExtraction/blob/main/index-to-azure-ai-search.ipynb)
6) Perform a test RAG query by running [test-query.ipynb](https://github.com/liamca/GPT4oContentExtraction/blob/main/test-query.ipynb)