{"id":19056953,"url":"https://github.com/decisionfacts/df-extract","last_synced_at":"2025-04-24T05:20:33.206Z","repository":{"id":199448920,"uuid":"670121068","full_name":"decisionfacts/df-extract","owner":"decisionfacts","description":"DF Extract Lib","archived":false,"fork":false,"pushed_at":"2024-04-03T16:19:51.000Z","size":30,"stargazers_count":14,"open_issues_count":1,"forks_count":0,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-04-18T13:09:54.207Z","etag":null,"topics":["asyncio","document-parser","docx","extraction","jpeg","jpg","pdf","png","pptx","python3"],"latest_commit_sha":null,"homepage":"https://github.com/decisionfacts/df-extract","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/decisionfacts.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-07-24T10:42:26.000Z","updated_at":"2024-08-12T16:30:34.000Z","dependencies_parsed_at":null,"dependency_job_id":"806e3c28-cc1a-4c7f-9b09-ee8e6e63084b","html_url":"https://github.com/decisionfacts/df-extract","commit_stats":null,"previous_names":["decisionfacts/df-extract"],"tags_count":3,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/decisionfacts%2Fdf-extract","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/decisionfacts%2Fdf-extract/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/decisionfacts%2Fdf-extract/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/deci
sionfacts%2Fdf-extract/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/decisionfacts","download_url":"https://codeload.github.com/decisionfacts/df-extract/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":250567510,"owners_count":21451448,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["asyncio","document-parser","docx","extraction","jpeg","jpg","pdf","png","pptx","python3"],"created_at":"2024-11-08T23:52:51.960Z","updated_at":"2025-04-24T05:20:33.181Z","avatar_url":"https://github.com/decisionfacts.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\n# DF Extract Lib\n\n[![PyPI version](https://badge.fury.io/py/df-extract.svg)](https://badge.fury.io/py/df-extract) [![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)\n\n## Requirements\n\nPython 3.10+ asyncio\n\n## Installation\n\n```shell\n# Using pip\n$ python -m pip install df-extract\n\n# Manual install\n$ python -m pip install .\n```\n\n### 1. 
To extract content from `PDF`\n\n```python\nfrom df_extract.pdf import ExtractPDF\n\n\npath = \"/home/test/ABC.pdf\"\n\nextract_pdf = ExtractPDF(file_path=path)\n\n# `await` must run inside a coroutine, e.g. one driven by `asyncio.run()`\n# By default, output as text\nawait extract_pdf.extract()  # Output is written to `/home/test/ABC.pdf.txt`\n\n# Output as JSON\nawait extract_pdf.extract(as_json=True)  # Output is written to `/home/test/ABC.pdf.json`\n```\n\n\u003e You can change the output directory by passing the `output_dir` parameter\n\n```python\nfrom df_extract.pdf import ExtractPDF\n\n\npath = \"/home/test/ABC.pdf\"\n\nextract_pdf = ExtractPDF(file_path=path, output_dir=\"/home/test/output\")\nawait extract_pdf.extract()\n```\n\n#### Extract content from `PDF` with image data\n\u003e This requires [`easyocr`](https://github.com/jaidedai/easyocr)\n\n```python\nfrom df_extract.base import ImageExtract\nfrom df_extract.pdf import ExtractPDF\n\n\npath = \"/home/test/ABC.pdf\"\n\nimage_extract = ImageExtract(model_download_enabled=True)\nextract_pdf = ExtractPDF(file_path=path, image_extract=image_extract)\nawait extract_pdf.extract()\n```\n\n### 2. To extract content from `PPT` and `PPTx`\n\n```python\nfrom df_extract.pptx import ExtractPPTx\n\n\npath = \"/home/test/DEF.pptx\"\n\nextract_pptx = ExtractPPTx(file_path=path)\n\n# By default, output as text\nawait extract_pptx.extract()  # Output is written to `/home/test/DEF.pptx.txt`\n\n# Output as JSON\nawait extract_pptx.extract(as_json=True)  # Output is written to `/home/test/DEF.pptx.json`\n```\n\n### 3. To extract content from `Doc` and `Docx`\n\n```python\nfrom df_extract.docx import ExtractDocx\n\n\npath = \"/home/test/GHI.docx\"\n\nextract_docx = ExtractDocx(file_path=path)\n\n# By default, output as text\nawait extract_docx.extract()  # Output is written to `/home/test/GHI.docx.txt`\n\n# Output as JSON\nawait extract_docx.extract(as_json=True)  # Output is written to `/home/test/GHI.docx.json`\n```\n\n### 4. 
To extract content from `PNG`, `JPEG` and `JPG`\n\n```python\nfrom df_extract.image import ExtractImage\n\n\npath = \"/home/test/JKL.png\"\n\nextract_png = ExtractImage(file_path=path)\n\n# By default, output as text\nawait extract_png.extract()  # Output is written to `/home/test/JKL.png.txt`\n\n# Output as JSON\nawait extract_png.extract(as_json=True)  # Output is written to `/home/test/JKL.png.json`\n```","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdecisionfacts%2Fdf-extract","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdecisionfacts%2Fdf-extract","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdecisionfacts%2Fdf-extract/lists"}