https://github.com/decisionfacts/df-extract
DF Extract Lib
DF Extract Lib
- Host: GitHub
- URL: https://github.com/decisionfacts/df-extract
- Owner: decisionfacts
- License: apache-2.0
- Created: 2023-07-24T10:42:26.000Z (over 2 years ago)
- Default Branch: master
- Last Pushed: 2024-04-03T16:19:51.000Z (almost 2 years ago)
- Last Synced: 2025-04-18T13:09:54.207Z (11 months ago)
- Topics: asyncio, document-parser, docx, extraction, jpeg, jpg, pdf, png, pptx, python3
- Language: Python
- Homepage: https://github.com/decisionfacts/df-extract
- Size: 29.3 KB
- Stars: 14
- Watchers: 1
- Forks: 0
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE.txt
README
# DF Extract Lib
[PyPI version](https://badge.fury.io/py/df-extract) [License: Apache 2.0](https://opensource.org/licenses/Apache-2.0)
## Requirements
Python 3.10+ (the API is asynchronous and built on `asyncio`)
## Installation
```shell
# Using pip
$ python -m pip install df-extract
# Manual install (run from the root of a cloned repository)
$ python -m pip install .
```
### 1. To extract content from `PDF`
```python
from df_extract.pdf import ExtractPDF
path = "/home/test/ABC.pdf"
extract_pdf = ExtractPDF(file_path=path)
# By default, the output is plain text
await extract_pdf.extract()  # Output is written to `/home/test/ABC.pdf.txt`
# Output as JSON
await extract_pdf.extract(as_json=True)  # Output is written to `/home/test/ABC.pdf.json`
```
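The snippets in this README use a bare `await`, which only works inside a coroutine or an asyncio-aware REPL (for example `python -m asyncio`). The following is a minimal sketch of running the same PDF extraction from a regular script. The helper names `expected_output_path` and `extract_pdf_to_file` are hypothetical and not part of `df-extract`; the output location is derived only from the comments in the example above.

```python
import asyncio


def expected_output_path(file_path: str, as_json: bool = False) -> str:
    """Return the output location described above: input path plus .txt or .json.

    Hypothetical helper; it only mirrors the README comments and does not
    query the library.
    """
    return file_path + (".json" if as_json else ".txt")


async def extract_pdf_to_file(file_path: str, as_json: bool = False) -> str:
    """Run the extraction and return the path where the output is expected."""
    # Imported lazily so the helper above works without df-extract installed.
    from df_extract.pdf import ExtractPDF

    extractor = ExtractPDF(file_path=file_path)
    # Only the two call forms shown in the README are used here.
    if as_json:
        await extractor.extract(as_json=True)
    else:
        await extractor.extract()
    return expected_output_path(file_path, as_json)
```

From a script, this could be driven with `asyncio.run(extract_pdf_to_file("/home/test/ABC.pdf"))`, which starts an event loop, runs the coroutine to completion, and returns the expected output path.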
> You can change the output directory by passing the `output_dir` parameter.
```python
from df_extract.pdf import ExtractPDF
path = "/home/test/ABC.pdf"
extract_pdf = ExtractPDF(file_path=path, output_dir="/home/test/output")
await extract_pdf.extract()
```
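The README does not say whether the library creates `output_dir` when it is missing. As a defensive sketch under that uncertainty, the directory can be created up front. `ensure_output_dir` and `extract_pdf_into` are hypothetical helper names, not part of `df-extract`.

```python
from pathlib import Path


def ensure_output_dir(output_dir: str) -> str:
    """Create the directory (and any parents) if needed, then return it unchanged."""
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    return output_dir


async def extract_pdf_into(file_path: str, output_dir: str) -> None:
    """Extract a PDF into output_dir, creating the directory first."""
    # Imported lazily so ensure_output_dir works without df-extract installed.
    from df_extract.pdf import ExtractPDF

    extractor = ExtractPDF(
        file_path=file_path,
        output_dir=ensure_output_dir(output_dir),
    )
    await extractor.extract()
```

Creating the directory is idempotent thanks to `exist_ok=True`, so the helper is safe to call before every extraction.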
#### Extract content from `PDF` with image data
> This requires [`easyocr`](https://github.com/jaidedai/easyocr)
```python
from df_extract.base import ImageExtract
from df_extract.pdf import ExtractPDF
path = "/home/test/ABC.pdf"
image_extract = ImageExtract(model_download_enabled=True)
extract_pdf = ExtractPDF(file_path=path, image_extract=image_extract)
await extract_pdf.extract()
```
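Because image extraction depends on `easyocr`, a caller may want to enable it only when that package is importable. The sketch below assumes `easyocr` is an optional dependency and that `ExtractPDF` works without `image_extract`, as in the earlier examples. `easyocr_available` and `build_pdf_extractor` are hypothetical names.

```python
import importlib.util


def easyocr_available() -> bool:
    """Report whether the easyocr package can be found, without importing it."""
    return importlib.util.find_spec("easyocr") is not None


def build_pdf_extractor(file_path: str):
    """Build an ExtractPDF, adding image extraction only if easyocr is present."""
    # Imported lazily so easyocr_available works without df-extract installed.
    from df_extract.pdf import ExtractPDF

    if easyocr_available():
        from df_extract.base import ImageExtract

        # Same constructor arguments as in the README example above.
        image_extract = ImageExtract(model_download_enabled=True)
        return ExtractPDF(file_path=file_path, image_extract=image_extract)
    # Fall back to text-only extraction.
    return ExtractPDF(file_path=file_path)
```

The returned object is used exactly as before, for example `await build_pdf_extractor(path).extract()`.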
### 2. To extract content from `PPT` and `PPTx`
```python
from df_extract.pptx import ExtractPPTx
path = "/home/test/DEF.pptx"
extract_pptx = ExtractPPTx(file_path=path)
# By default, the output is plain text
await extract_pptx.extract()  # Output is written to `/home/test/DEF.pptx.txt`
# Output as JSON
await extract_pptx.extract(as_json=True)  # Output is written to `/home/test/DEF.pptx.json`
```
### 3. To extract content from `Doc` and `Docx`
```python
from df_extract.docx import ExtractDocx
path = "/home/test/GHI.docx"
extract_docx = ExtractDocx(file_path=path)
# By default, the output is plain text
await extract_docx.extract()  # Output is written to `/home/test/GHI.docx.txt`
# Output as JSON
await extract_docx.extract(as_json=True)  # Output is written to `/home/test/GHI.docx.json`
```
### 4. To extract content from `PNG`, `JPEG` and `JPG`
```python
from df_extract.image import ExtractImage
path = "/home/test/JKL.png"
extract_png = ExtractImage(file_path=path)
# By default, the output is plain text
await extract_png.extract()  # Output is written to `/home/test/JKL.png.txt`
# Output as JSON
await extract_png.extract(as_json=True)  # Output is written to `/home/test/JKL.png.json`
```
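All four extractors share the same shape: construct with `file_path`, then `await extract()`. That makes it straightforward to pick one by file extension and process several files in one go. The sketch below is an illustration, not part of the library: `EXTRACTORS`, `extractor_for`, `extract_file`, and `extract_many` are hypothetical names, and mapping `.ppt` and `.doc` to the PPTx and Docx classes is an assumption based on the section headings above.

```python
import asyncio
import importlib
from pathlib import Path

# Extension -> (module, class), taken from the import lines in the sections above.
# The .ppt and .doc entries are assumptions based on the section headings.
EXTRACTORS = {
    ".pdf": ("df_extract.pdf", "ExtractPDF"),
    ".ppt": ("df_extract.pptx", "ExtractPPTx"),
    ".pptx": ("df_extract.pptx", "ExtractPPTx"),
    ".doc": ("df_extract.docx", "ExtractDocx"),
    ".docx": ("df_extract.docx", "ExtractDocx"),
    ".png": ("df_extract.image", "ExtractImage"),
    ".jpg": ("df_extract.image", "ExtractImage"),
    ".jpeg": ("df_extract.image", "ExtractImage"),
}


def extractor_for(file_path: str) -> tuple[str, str]:
    """Return the (module, class) names for a path, matched case-insensitively."""
    suffix = Path(file_path).suffix.lower()
    if suffix not in EXTRACTORS:
        raise ValueError(f"unsupported file type: {suffix!r}")
    return EXTRACTORS[suffix]


async def extract_file(file_path: str, as_json: bool = False) -> None:
    """Extract one file with the extractor that matches its extension."""
    module_name, class_name = extractor_for(file_path)
    # Imported on demand, so only the needed submodule is loaded.
    extractor_cls = getattr(importlib.import_module(module_name), class_name)
    extractor = extractor_cls(file_path=file_path)
    if as_json:
        await extractor.extract(as_json=True)
    else:
        await extractor.extract()


async def extract_many(paths: list[str], as_json: bool = False) -> list:
    """Schedule all extractions together; failures are returned, not raised."""
    tasks = (extract_file(path, as_json) for path in paths)
    return await asyncio.gather(*tasks, return_exceptions=True)
```

With `return_exceptions=True`, one unsupported or unreadable file does not abort the rest of the batch; each entry in the result list is either `None` or the exception raised for that path. How much the extractions actually overlap depends on how the library's coroutines are implemented, which this README does not describe.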