{"id":26896942,"url":"https://github.com/lazyFrogLOL/llmdocparser","last_synced_at":"2025-04-01T04:02:29.571Z","repository":{"id":250274557,"uuid":"833978334","full_name":"lazyFrogLOL/llmdocparser","owner":"lazyFrogLOL","description":"A package for parsing PDFs and analyzing their content using LLMs.","archived":false,"fork":false,"pushed_at":"2024-08-06T07:11:36.000Z","size":1270,"stargazers_count":252,"open_issues_count":0,"forks_count":7,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-01-09T15:46:10.729Z","etag":null,"topics":["chunking","document-analysis","llm","nlp","ocr","pdf-parser","pdfparser","rag","text-chunking"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/lazyFrogLOL.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-07-26T06:59:44.000Z","updated_at":"2025-01-09T14:12:17.000Z","dependencies_parsed_at":"2024-08-05T03:17:51.243Z","dependency_job_id":null,"html_url":"https://github.com/lazyFrogLOL/llmdocparser","commit_stats":null,"previous_names":["lazyfroglol/llmdocparser"],"tags_count":5,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lazyFrogLOL%2Fllmdocparser","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lazyFrogLOL%2Fllmdocparser/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lazyFrogLOL%2Fllmdocparser/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lazyFrogLOL%2Fllmdocparser/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/lazyFrogLOL","download_url":"https://codeload.github.com/lazyFrogLOL/llmdocparser/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":246580468,"owners_count":20800111,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["chunking","document-analysis","llm","nlp","ocr","pdf-parser","pdfparser","rag","text-chunking"],"created_at":"2025-04-01T04:02:24.496Z","updated_at":"2025-04-01T04:02:29.552Z","avatar_url":"https://github.com/lazyFrogLOL.png","language":"Python","readme":"# LLMDocParser\n\nA package for parsing PDFs and analyzing their content using LLMs.\n\nThis package is an improvement based on the concept of [gptpdf](https://github.com/CosmosShadow/gptpdf/tree/main).\n\n## Method\ngptpdf uses PyMuPDF to parse PDFs, identifying both text and non-text regions. It then merges or filters the text regions based on certain rules, and inputs the final results into a multimodal model for parsing. This method is particularly effective.\n\nBased on this concept, I made some minor improvements.\n\n### Main Process\nUsing a layout analysis model, each page of the PDF is parsed to identify the type of each region, which includes Text, Title, Figure, Figure caption, Table, Table caption, Header, Footer, Reference, and Equation. The coordinates of each region are also obtained.\n\nLayout Analysis Result Example:\n```\n[{'header': ((101, 66, 436, 102), 0)},\n {'header': ((1038, 81, 1088, 95), 1)},\n {'title': ((106, 215, 947, 284), 2)},\n {'text': ((101, 319, 835, 390), 3)},\n {'text': ((100, 565, 579, 933), 4)},\n {'text': ((100, 967, 573, 1025), 5)},\n {'text': ((121, 1055, 276, 1091), 6)},\n {'reference': ((101, 1124, 562, 1429), 7)},\n {'text': ((610, 565, 1089, 930), 8)},\n {'text': ((613, 976, 1006, 1045), 9)},\n {'title': ((612, 1114, 726, 1129), 10)},\n {'text': ((611, 1165, 1089, 1431), 11)},\n {'title': ((1011, 1471, 1084, 1492), 12)}]\n```\nThis result includes the type, coordinates, and reading order of each region. By using this result, more precise rules can be set to parse the PDF.\n\nFinally, input the images of the corresponding regions into a multimodal model, such as GPT-4o or Qwen-VL, to directly obtain text blocks that are friendly to RAG solutions.\n\n| img_path                                  | type            | page_no | filename                  | content               |      filepath                 |\n|-------------------------------------------|-----------------|---------|---------------------------|-----------------------|-----------------------|\n| {absolute_path}/page_1_title.png                   | Title           | 1       | attention is all you need | [Text Block 1]        |     {file_absolute_path}         |\n| {absolute_path}/page_1_text.png                    | Text            | 1       | attention is all you need | [Text Block 2]        |        {file_absolute_path}        |\n| {absolute_path}/page_2_figure.png                  | Figure          | 2       | attention is all you need | [Text Block 3]        |         {file_absolute_path}       |\n| {absolute_path}/page_2_figure_caption.png          | Figure caption  | 2       | attention is all you need | [Text Block 4]        |         {file_absolute_path}       |\n| {absolute_path}/page_3_table.png                   | Table           | 3       | attention is all you need | [Text Block 5]        |        {file_absolute_path}        |\n| {absolute_path}/page_3_table_caption.png           | Table caption   | 3       | attention is all you need | [Text Block 6]        |        {file_absolute_path}        |\n| {absolute_path}/page_1_header.png                  | Header          | 1       | attention is all you need | [Text Block 7]        |        {file_absolute_path}        |\n| {absolute_path}/page_2_footer.png                  | Footer          | 2       | attention is all you need | [Text Block 8]        |          {file_absolute_path}      | \n| {absolute_path}/page_3_reference.png               | Reference       | 3       | attention is all you need | [Text Block 9]        |         {file_absolute_path}       |\n| {absolute_path}/page_1_equation.png                | Equation        | 1       | attention is all you need | [Text Block 10]       |         {file_absolute_path}       |\n\nSee more in llm_parser.py main function.\n\n## Installation\n\n```commandline\npip install llmdocparser\n```\n\n### Installation from Source\n\nTo install this project from source, follow these steps:\n\n1. **Clone the Repository:**\n\n   First, clone the repository to your local machine. Open your terminal and run the following commands:\n\n   ```bash\n   git clone https://github.com/lazyFrogLOL/llmdocparser.git\n   cd llmdocparser\n   ```\n\n2. **Install Dependencies:**\n\n   This project uses Poetry for dependency management. Make sure you have Poetry installed. If not, you can follow the instructions in the [Poetry Installation Guide](https://python-poetry.org/docs/#installation).\n\n   Once Poetry is installed, run the following command in the project's root directory to install the dependencies:\n\n   ```bash\n   poetry install\n   ```\n\n   This will read the `pyproject.toml` file and install all the required dependencies for the project.\n\n\n## Usage\n\n```python\nfrom llmdocparser.llm_parser import get_image_content\n\ncontent, cost = get_image_content(\n    llm_type=\"azure\",\n    pdf_path=\"path/to/your/pdf\",\n    output_dir=\"path/to/output/directory\",\n    max_concurrency=5,\n    azure_deployment=\"azure-gpt-4o\",\n    azure_endpoint=\"your_azure_endpoint\",\n    api_key=\"your_api_key\",\n    api_version=\"your_api_version\"\n)\nprint(content)\nprint(cost)\n```\n\n**Parameters**\n\n* llm_type: str\n  \n  The options are azure, openai, dashscope.\n* pdf_path: str\n  \n  Path to the PDF file.\n* output_dir: str\n  \n  Output directory to store all parsed images.\n\n* max_concurrency: int\n  \n  Number of GPT parsing worker threads. Batch calling details: [Batch Support](https://python.langchain.com/v0.2/docs/integrations/llms/#features-natively-supported)\n\nIf using Azure, the azure_deployment and azure_endpoint parameters need to be passed; otherwise, only the API key needs to be provided.\n\n* base_url: str\n  \n  OpenAI Compatible Server url. Detail: [OpenAI-Compatible Server](https://python.langchain.com/v0.2/docs/integrations/llms/vllm/#openai-compatible-server)\n\n## Cost\n\nUsing the 'Attention Is All You Need' paper for analysis, the model chosen is GPT-4o, costing as follows:\n```\nTotal Tokens: 44063\nPrompt Tokens: 33812\nCompletion Tokens: 10251\nTotal Cost (USD): $0.322825\n```\nAverage cost per page: $0.0215\n\n\n## Star History\n\n[![Star History Chart](https://api.star-history.com/svg?repos=lazyFrogLOL/llmdocparser\u0026type=Date)](https://star-history.com/#lazyFrogLOL/llmdocparser\u0026Date)","funding_links":[],"categories":["Python"],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FlazyFrogLOL%2Fllmdocparser","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FlazyFrogLOL%2Fllmdocparser","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FlazyFrogLOL%2Fllmdocparser/lists"}