https://github.com/decisionfacts/df-extract

DF Extract Lib

asyncio document-parser docx extraction jpeg jpg pdf png pptx python3

Last synced: about 1 year ago

Host: GitHub
URL: https://github.com/decisionfacts/df-extract
Owner: decisionfacts
License: apache-2.0
Created: 2023-07-24T10:42:26.000Z (almost 3 years ago)
Default Branch: master
Last Pushed: 2024-04-03T16:19:51.000Z (about 2 years ago)
Last Synced: 2025-04-18T13:09:54.207Z (about 1 year ago)
Topics: asyncio, document-parser, docx, extraction, jpeg, jpg, pdf, png, pptx, python3
Language: Python
Homepage: https://github.com/decisionfacts/df-extract
Size: 29.3 KB
Stars: 14
Watchers: 1
Forks: 0
Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE.txt

# DF Extract Lib

[![PyPI version](https://badge.fury.io/py/df-extract.svg)](https://badge.fury.io/py/df-extract) [![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)

## Requirements

Python 3.10 or later. The API is asynchronous and built on `asyncio`.

## Installation

```shell
# Using pip
$ python -m pip install df-extract

# Manual install, from the root of a cloned copy of the repository
$ python -m pip install .
```

### 1. To extract content from `PDF`

```python
import asyncio

from df_extract.pdf import ExtractPDF


async def main():
    path = "/home/test/ABC.pdf"
    extract_pdf = ExtractPDF(file_path=path)

    # By default, the output is written as text to `/home/test/ABC.pdf.txt`
    await extract_pdf.extract()

    # Write JSON instead; the output goes to `/home/test/ABC.pdf.json`
    await extract_pdf.extract(as_json=True)


asyncio.run(main())
```

> To change the output directory, pass the `output_dir` parameter.

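```python
import asyncio

from df_extract.pdf import ExtractPDF


async def main():
    path = "/home/test/ABC.pdf"
    # Output files are written to `/home/test/output` instead of next to the source
    extract_pdf = ExtractPDF(file_path=path, output_dir="/home/test/output")
    await extract_pdf.extract()


asyncio.run(main())
```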
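
The naming rule implied by the comments above (source file name plus `.txt` or `.json`) can be sketched as a small helper for locating results afterwards. `expected_output_path` is a hypothetical function, not part of `df-extract`, and the assumption that files placed in `output_dir` keep the same name is not confirmed by this README.

```python
from pathlib import PurePosixPath


def expected_output_path(file_path, output_dir=None, as_json=False):
    """Guess where the extracted output lands (hypothetical helper).

    Assumption: the output keeps the source file name and appends
    `.txt` or `.json`, including when `output_dir` is given.
    """
    src = PurePosixPath(file_path)
    suffix = ".json" if as_json else ".txt"
    target_dir = PurePosixPath(output_dir) if output_dir else src.parent
    return target_dir / (src.name + suffix)


print(expected_output_path("/home/test/ABC.pdf"))                # /home/test/ABC.pdf.txt
print(expected_output_path("/home/test/ABC.pdf", as_json=True))  # /home/test/ABC.pdf.json
```

If the library names files differently inside `output_dir`, only the last line of the helper needs to change.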

#### Extract content from `PDF` with image data

> This requires [`easyocr`](https://github.com/jaidedai/easyocr)

```python
import asyncio

from df_extract.base import ImageExtract
from df_extract.pdf import ExtractPDF


async def main():
    path = "/home/test/ABC.pdf"

    # ImageExtract handles the image data (requires easyocr)
    image_extract = ImageExtract(model_download_enabled=True)
    extract_pdf = ExtractPDF(file_path=path, image_extract=image_extract)
    await extract_pdf.extract()


asyncio.run(main())
```
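
Since image extraction depends on `easyocr`, it can help to check for the package before enabling it. The following is a minimal sketch: `ocr_available` and `make_pdf_extractor` are hypothetical helpers, and falling back to text-only extraction is an assumption about what a caller might want, not documented library behaviour.

```python
from importlib.util import find_spec


def ocr_available() -> bool:
    """Return True if the `easyocr` package can be imported."""
    return find_spec("easyocr") is not None


def make_pdf_extractor(path):
    """Build an ExtractPDF, adding image extraction only when easyocr is present."""
    # Imported lazily so this module loads even without df-extract installed.
    from df_extract.pdf import ExtractPDF

    if ocr_available():
        from df_extract.base import ImageExtract

        image_extract = ImageExtract(model_download_enabled=True)
        return ExtractPDF(file_path=path, image_extract=image_extract)
    return ExtractPDF(file_path=path)
```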

### 2. To extract content from `PPT` and `PPTX`

```python
import asyncio

from df_extract.pptx import ExtractPPTx


async def main():
    path = "/home/test/DEF.pptx"
    extract_pptx = ExtractPPTx(file_path=path)

    # By default, the output is written as text to `/home/test/DEF.pptx.txt`
    await extract_pptx.extract()

    # Write JSON instead; the output goes to `/home/test/DEF.pptx.json`
    await extract_pptx.extract(as_json=True)


asyncio.run(main())
```
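
Because the extractors shown so far expose the same awaitable `extract()` method, several files can be handled in one event loop. The sketch below uses `asyncio.gather`; `extract_all` is a hypothetical helper, and whether the extractions actually overlap depends on how much blocking work the library does internally, which this README does not say.

```python
import asyncio


async def extract_all(extractors, as_json=False):
    """Await `extract()` on every extractor within a single event loop."""
    await asyncio.gather(*(ex.extract(as_json=as_json) for ex in extractors))


# Usage, assuming df-extract is installed:
#
#   from df_extract.pdf import ExtractPDF
#   from df_extract.pptx import ExtractPPTx
#
#   asyncio.run(extract_all([
#       ExtractPDF(file_path="/home/test/ABC.pdf"),
#       ExtractPPTx(file_path="/home/test/DEF.pptx"),
#   ]))
```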

### 3. To extract content from `DOC` and `DOCX`

```python
import asyncio

from df_extract.docx import ExtractDocx


async def main():
    path = "/home/test/GHI.docx"
    extract_docx = ExtractDocx(file_path=path)

    # By default, the output is written as text to `/home/test/GHI.docx.txt`
    await extract_docx.extract()

    # Write JSON instead; the output goes to `/home/test/GHI.docx.json`
    await extract_docx.extract(as_json=True)


asyncio.run(main())
```

### 4. To extract content from `PNG`, `JPEG` and `JPG`

```python
import asyncio

from df_extract.image import ExtractImage


async def main():
    path = "/home/test/JKL.png"
    extract_png = ExtractImage(file_path=path)

    # By default, the output is written as text to `/home/test/JKL.png.txt`
    await extract_png.extract()

    # Write JSON instead; the output goes to `/home/test/JKL.png.json`
    await extract_png.extract(as_json=True)


asyncio.run(main())
```
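
The four sections above pair each file type with one extractor class, so a caller handling mixed inputs could pick the class from the file extension. This is a sketch only: `EXTRACTORS`, `extractor_for`, and `load_extractor` are hypothetical names, and the module and class names are copied from the examples in this README.

```python
import importlib
from pathlib import PurePosixPath

# Extension -> (module, class), taken from the sections above.
EXTRACTORS = {
    ".pdf": ("df_extract.pdf", "ExtractPDF"),
    ".ppt": ("df_extract.pptx", "ExtractPPTx"),
    ".pptx": ("df_extract.pptx", "ExtractPPTx"),
    ".doc": ("df_extract.docx", "ExtractDocx"),
    ".docx": ("df_extract.docx", "ExtractDocx"),
    ".png": ("df_extract.image", "ExtractImage"),
    ".jpeg": ("df_extract.image", "ExtractImage"),
    ".jpg": ("df_extract.image", "ExtractImage"),
}


def extractor_for(path):
    """Return the (module, class) names for a path, based on its extension."""
    suffix = PurePosixPath(path).suffix.lower()
    try:
        return EXTRACTORS[suffix]
    except KeyError:
        raise ValueError(f"Unsupported file type: {suffix!r}") from None


def load_extractor(path, **kwargs):
    """Import the matching class lazily and instantiate it for `path`."""
    module_name, class_name = extractor_for(path)
    cls = getattr(importlib.import_module(module_name), class_name)
    return cls(file_path=path, **kwargs)
```

With `df-extract` installed, `await load_extractor("/home/test/ABC.pdf").extract()` would then behave like the PDF example in section 1.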
