{"id":20944896,"url":"https://github.com/liamca/gpt4ocontentextraction","last_synced_at":"2025-08-09T03:16:50.526Z","repository":{"id":244422677,"uuid":"814803096","full_name":"liamca/GPT4oContentExtraction","owner":"liamca","description":"Using Azure OpenAI GPT 4o to extract information such as text, tables and charts from Documents to Markdown","archived":false,"fork":false,"pushed_at":"2024-10-15T16:09:16.000Z","size":4845,"stargazers_count":16,"open_issues_count":0,"forks_count":13,"subscribers_count":2,"default_branch":"main","last_synced_at":"2024-10-17T20:56:24.343Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/liamca.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-06-13T18:30:22.000Z","updated_at":"2024-10-15T16:09:20.000Z","dependencies_parsed_at":"2024-06-14T16:45:55.542Z","dependency_job_id":"b5737983-24e9-4a17-bfb6-7a1cae6f6207","html_url":"https://github.com/liamca/GPT4oContentExtraction","commit_stats":null,"previous_names":["liamca/gpt4ocontentextraction"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/liamca%2FGPT4oContentExtraction","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/liamca%2FGPT4oContentExtraction/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/liamca%2FGPT4oContentExtraction/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositorie
s/liamca%2FGPT4oContentExtraction/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/liamca","download_url":"https://codeload.github.com/liamca/GPT4oContentExtraction/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":225269247,"owners_count":17447497,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-18T23:45:51.929Z","updated_at":"2024-11-18T23:45:52.544Z","avatar_url":"https://github.com/liamca.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Azure OpenAI GPT-4o Content Extraction\nUsing Azure OpenAI GPT 4o to extract information such as text, tables and charts from Documents (PDF, DOC, DOCX, PPT, PPTX, XLS, XLSX, etc.) to Markdown.\n\nThere is a lot of information contained within documents such as PDFs, PPTs, and Excel spreadsheets beyond just text, such as images, tables and charts. 
The goal of this repo is to show how Azure OpenAI GPT 4o can be used to extract all of this information into a Markdown file to be used for downstream processes such as RAG (Chat on your Data) or Workflows.\n\nHere is an example slide from the included [PPT](https://github.com/liamca/GPT4oContentExtraction/raw/main/MicrosoftSlidesFY24Q3.pptx).\n\n\u003ckbd\u003e\n\u003cimg src= \"https://github.com/liamca/GPT4oContentExtraction/assets/3432973/8b42c1d7-3e3e-457b-b08b-ba8be8d8540e\" alt=\"Original Slide\"\u003e\n\u003c/kbd\u003e\n\nWhen converted to Markdown, notice how the charts are converted to Markdown tables which are easily understandable by Azure OpenAI GPT4.\n\u003ckbd\u003e\n\u003cimg src= \"https://github.com/liamca/GPT4oContentExtraction/assets/3432973/f7f21e21-150d-4194-a3b3-a1f499ce44b3\" alt=\"Output Markdown\"\u003e\n\u003c/kbd\u003e\n\n\n## Requirements\n\n* Azure OpenAI with GPT 4o enabled\n* Linux (Ubuntu) based Jupyter Notebook\n* (Optional) Azure AI Search - To test the ability to answer questions\n* (Optional) LibreOffice - If you wish to support file types other than PDF\n\n## Processing Pipeline\n\u003ckbd\u003e\n\u003cimg src= \"https://github.com/liamca/GPT4oContentExtraction/assets/3432973/8db4eee3-6a9a-4cdd-9c7b-07ad8effd419\" alt=\"Processing Pipeline\"\u003e\n\u003c/kbd\u003e\n\n\n## Getting Started\n\n1) Ensure you have installed requirements.txt\n```code\npip install -r requirements.txt\n```\n\n2) Install LibreOffice by running [install-libreoffice.ipynb](https://github.com/liamca/GPT4oContentExtraction/blob/main/install-libreoffice.ipynb)\n   \n3) Configure [config.json](https://github.com/liamca/GPT4oContentExtraction/blob/main/config.json) with your Azure Service settings\n   \n4) Convert the included sample PPT file by running [convert-doc-to-markdown.ipynb](https://github.com/liamca/GPT4oContentExtraction/blob/main/convert-doc-to-markdown.ipynb). 
This will convert each page to a set of Markdown files.\n\n***(Optional Steps)***\n\n5) Create an Azure AI Search Index to use for RAG based Chat over this content by running [index-to-azure-ai-search.ipynb](https://github.com/liamca/GPT4oContentExtraction/blob/main/index-to-azure-ai-search.ipynb)\n\n6) Perform a test RAG query by running [test-query.ipynb](https://github.com/liamca/GPT4oContentExtraction/blob/main/test-query.ipynb)\n\u003ckbd\u003e\n\u003cimg src= \"https://github.com/liamca/GPT4oContentExtraction/assets/3432973/39cd41d4-9257-4ec0-869b-29df558e2415\" alt=\"Test Query\"\u003e\n\u003c/kbd\u003e\n\n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fliamca%2Fgpt4ocontentextraction","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fliamca%2Fgpt4ocontentextraction","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fliamca%2Fgpt4ocontentextraction/lists"}