https://github.com/stabrise/scaledp
ScaleDP is an Open-Source extension of Apache Spark for Document Processing
- Host: GitHub
- URL: https://github.com/stabrise/scaledp
- Owner: StabRise
- License: agpl-3.0
- Created: 2024-10-25T11:01:37.000Z (over 1 year ago)
- Default Branch: master
- Last Pushed: 2024-12-02T10:30:12.000Z (over 1 year ago)
- Last Synced: 2024-12-02T11:35:14.128Z (over 1 year ago)
- Topics: doctrocr, easyocr, huggingface-models, machine-learning, nlp, nlp-machine-learning, ocr, ocr-python, ocr-recognition, pdf, pdf-document-processor, spark, suryaocr
- Language: Python
- Homepage: https://stabrise.com/scaledp/
- Size: 5.64 MB
- Stars: 1
- Watchers: 0
- Forks: 0
- Open Issues: 5
Metadata Files:
- Readme: README.md
- License: LICENSE
An Open-Source Library for Processing Documents using AI/ML in Apache Spark.
---
**Source Code**: https://github.com/StabRise/ScaleDP
**Quickstart**: 1.QuickStart.ipynb
**Tutorials**: https://github.com/StabRise/ScaleDP-Tutorials
---
# Welcome to the ScaleDP library
ScaleDP is a library that lets you process documents with AI/ML capabilities and scale that processing using Apache Spark.
**LLMs** (Large Language Models) and **VLMs** (Vision Language Models) are used in combination with **OCR** engines to extract data from text and images.
Discover pre-trained models for your projects or play with the thousands of models hosted on the [Hugging Face Hub](https://huggingface.co/).
## Key features
### Document processing:
- ✅ Load PDF documents/images into a Spark DataFrame (using the [Spark PDF Datasource](https://github.com/stabrise/spark-pdf) or as `binaryFile`)
- ✅ Extract text/images from PDF documents/images
- ✅ Zero-shot extraction of **structured data** from text/images using LLM and ML models
- ✅ Can run as a REST API service without a Spark session, for minimal processing latency
- ✅ Supports streaming mode for processing documents in real time
### LLM:
Supports OpenAI-compatible APIs for calling LLM/VLM models (GPT, Gemini, GROQ, etc.):
- OCR Images/PDF documents using Vision LLM models
- Extract data from the image using Vision LLM models
- Extract data from the text/images using LLM models
- Extract data using DSPy framework
- NER using LLMs
- Visualize results
### NLP:
- Extract data from the text/images using NLP models from the Hugging Face Hub
- NER using classical ML models
### OCR:
Supports various open-source OCR engines:
- [Tesseract OCR](https://github.com/tesseract-ocr/tesseract)
- [Easy OCR](https://github.com/JaidedAI/EasyOCR)
- [Surya OCR](https://github.com/VikParuchuri/surya)
- [DocTR](https://github.com/mindee/doctr)
- Vision LLM models
### CV:
- Object detection on images using YOLO models
- Text detection on images
## Installation
### Prerequisites
- Python 3.10 or higher
- Apache Spark 3.5 or higher
- Java 8
### Installation using pip
Install the `ScaleDP` package with [pip](https://pypi.org/project/scaledp/):
```bash
pip install scaledp
```
### Installation using Docker
Build image:
```bash
docker build -t scaledp .
```
Run container:
```bash
docker run -p 8888:8888 scaledp:latest
```
Open Jupyter Notebook in your browser:
```text
http://localhost:8888
```
## Quickstart
Start a Spark session with ScaleDP:
```python
from scaledp import *
spark = ScaleDPSession()
spark
```
Read example image file:
```python
image_example = files('resources/images/Invoice.png')
df = spark.read.format("binaryFile") \
    .load(image_example)
df.show_image("content")
```
Output:

## Zero-shot data extraction from the image
```python
from pydantic import BaseModel
import json

class Items(BaseModel):
    date: str
    item: str
    note: str
    debit: str

class InvoiceSchema(BaseModel):
    hospital: str
    tax_id: str
    address: str
    email: str
    phone: str
    items: list[Items]
    total: str

pipeline = PipelineModel(stages=[
    DataToImage(
        inputCol="content",
        outputCol="image"
    ),
    LLMVisualExtractor(
        inputCol="image",
        outputCol="invoice",
        model="gemini-1.5-flash",
        apiKey="",
        apiBase="https://generativelanguage.googleapis.com/v1beta/",
        schema=json.dumps(InvoiceSchema.model_json_schema())
    )
])

result = pipeline.transform(df).cache()
```
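The `schema` parameter is a plain JSON Schema string; `InvoiceSchema.model_json_schema()` generates it from the Pydantic classes above. To see roughly what gets passed to the extractor, here is a simplified hand-written sketch of that schema (the real Pydantic output additionally contains `$defs`, `title`, and `required` entries):

```python
import json

# Simplified, hand-written approximation of the JSON Schema that
# InvoiceSchema.model_json_schema() would produce for the classes above.
invoice_schema = {
    "type": "object",
    "properties": {
        "hospital": {"type": "string"},
        "tax_id": {"type": "string"},
        "address": {"type": "string"},
        "email": {"type": "string"},
        "phone": {"type": "string"},
        "items": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "date": {"type": "string"},
                    "item": {"type": "string"},
                    "note": {"type": "string"},
                    "debit": {"type": "string"},
                },
            },
        },
        "total": {"type": "string"},
    },
}

# This string is what a schema argument like the one above boils down to.
schema_str = json.dumps(invoice_schema)
```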
Show the extracted JSON:
```python
result.show_json("invoice")
```

Let's show the invoice as structured data in a DataFrame:
```python
result.select("invoice.data.*").show()
```
Output:
```text
+-------------------+---------+--------------------+--------------------+--------------+--------------------+-------+
| hospital| tax_id| address| email| phone| items| total|
+-------------------+---------+--------------------+--------------------+--------------+--------------------+-------+
|Hope Haven Hospital|26-123123|855 Howard Street...|hopedutton@hopeha...|(123) 456-1238|[{10/21/2022, App...|1024.50|
+-------------------+---------+--------------------+--------------------+--------------+--------------------+-------+
```
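Besides the typed `data` struct, the `invoice` column carries the raw model response as a string in `json_data` (see the schema below), so individual rows can also be post-processed outside Spark with the standard library. A sketch with made-up illustrative values shaped like `InvoiceSchema`, not actual output from the run above:

```python
import json

# Illustrative json_data value for one row (made-up, InvoiceSchema-shaped)
row_json = (
    '{"hospital": "Hope Haven Hospital", "tax_id": "26-123123", '
    '"items": [{"date": "10/21/2022", "item": "Service", '
    '"note": "", "debit": "500.00"}], "total": "1024.50"}'
)

invoice = json.loads(row_json)
total = float(invoice["total"])   # numeric total for downstream arithmetic
n_items = len(invoice["items"])   # number of extracted line items
```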
Schema:
```python
result.printSchema()
```
```text
root
|-- path: string (nullable = true)
|-- modificationTime: timestamp (nullable = true)
|-- length: long (nullable = true)
|-- image: struct (nullable = true)
| |-- path: string (nullable = false)
| |-- resolution: integer (nullable = false)
| |-- data: binary (nullable = false)
| |-- imageType: string (nullable = false)
| |-- exception: string (nullable = false)
| |-- height: integer (nullable = false)
| |-- width: integer (nullable = false)
|-- invoice: struct (nullable = true)
| |-- path: string (nullable = false)
| |-- json_data: string (nullable = true)
| |-- type: string (nullable = false)
| |-- exception: string (nullable = false)
| |-- processing_time: double (nullable = false)
| |-- data: struct (nullable = true)
| | |-- hospital: string (nullable = false)
| | |-- tax_id: string (nullable = false)
| | |-- address: string (nullable = false)
| | |-- email: string (nullable = false)
| | |-- phone: string (nullable = false)
| | |-- items: array (nullable = false)
| | | |-- element: struct (containsNull = false)
| | | | |-- date: string (nullable = false)
| | | | |-- item: string (nullable = false)
| | | | |-- note: string (nullable = false)
| | | | |-- debit: string (nullable = false)
| | |-- total: string (nullable = false)
```
## NER using a model from the Hugging Face Hub
Define a pipeline to extract text from the image and run NER:
```python
pipeline = PipelineModel(stages=[
    DataToImage(inputCol="content", outputCol="image"),
    TesseractOcr(inputCol="image", outputCol="text", psm=PSM.AUTO, keepInputData=True),
    Ner(model="obi/deid_bert_i2b2", inputCol="text", outputCol="ner", keepInputData=True),
    ImageDrawBoxes(inputCols=["image", "ner"], outputCol="image_with_boxes", lineWidth=3,
                   padding=5, displayDataList=['entity_group'])
])

result = pipeline.transform(df).cache()
result.show_text("text")
```
Output:

Show NER results:
```python
result.show_ner(limit=20)
```
Output:
```text
+------------+-------------------+----------+-----+---+--------------------+
|entity_group| score| word|start|end| boxes|
+------------+-------------------+----------+-----+---+--------------------+
| HOSP| 0.991257905960083| Hospital| 0| 8|[{Hospital:, 0.94...|
| LOC| 0.999171257019043| Dutton| 10| 16|[{Dutton,, 0.9609...|
| LOC| 0.9992585778236389| MI| 18| 20|[{MI, 0.93335297,...|
| ID| 0.6838774085044861| 26| 29| 31|[{26-123123, 0.90...|
| PHONE| 0.4669836759567261| -| 31| 32|[{26-123123, 0.90...|
| PHONE| 0.7790696024894714| 123123| 32| 38|[{26-123123, 0.90...|
| HOSP|0.37445762753486633| HOPE| 39| 43|[{HOPE, 0.9525460...|
| HOSP| 0.9503226280212402| HAVEN| 44| 49|[{HAVEN, 0.952546...|
| LOC| 0.9975488185882568|855 Howard| 59| 69|[{855, 0.94682700...|
| LOC| 0.9984399676322937| Street| 70| 76|[{Street, 0.95823...|
| HOSP| 0.3670221269130707| HOSPITAL| 77| 85|[{HOSPITAL, 0.959...|
| LOC| 0.9990363121032715| Dutton| 86| 92|[{Dutton,, 0.9647...|
| LOC| 0.999313473701477| MI 49316| 94|102|[{MI, 0.94589012,...|
| PHONE| 0.9830010533332825| ( 123 )| 110|115|[{(123), 0.595334...|
| PHONE| 0.9080978035926819| 456| 116|119|[{456-1238, 0.955...|
| PHONE| 0.9378324151039124| -| 119|120|[{456-1238, 0.955...|
| PHONE| 0.8746233582496643| 1238| 120|124|[{456-1238, 0.955...|
| PATIENT|0.45354968309402466|hopedutton| 132|142|[{hopedutton@hope...|
| EMAIL|0.17805588245391846| hopehaven| 143|152|[{hopedutton@hope...|
| HOSP| 0.505658745765686| INVOICE| 157|164|[{INVOICE, 0.9661...|
+------------+-------------------+----------+-----+---+--------------------+
```
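For downstream analysis you often want simple aggregates over the recognized entities, such as counts per `entity_group` or a confidence cut on `score`. A standard-library sketch over rows shaped like the table above (a small illustrative subset, not the full output):

```python
from collections import Counter

# A few rows shaped like the NER output above (illustrative subset)
entities = [
    {"entity_group": "HOSP", "word": "Hospital", "score": 0.99},
    {"entity_group": "LOC", "word": "Dutton", "score": 0.99},
    {"entity_group": "LOC", "word": "MI", "score": 0.99},
    {"entity_group": "PHONE", "word": "123123", "score": 0.78},
]

# Count entities per type
counts = Counter(e["entity_group"] for e in entities)

# Keep only confident predictions
confident = [e for e in entities if e["score"] >= 0.9]
```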
Visualize NER results:
```python
result.visualize_ner(labels_list=["DATE", "LOC"])
```

Original image with NER results:
```python
result.show_image("image_with_boxes")
```

## OCR engines
| | Bbox level | GPU support | Separate text-detection model | Processing time per page, secs (CPU/GPU) | Handwritten text support |
|-------------------|-------------|-------------|------------------------------------|---------------------------------------|--------------------------|
| [Tesseract OCR](https://github.com/tesseract-ocr/tesseract) | character | no | no | 0.2/no | not good |
| Tesseract OCR CLI | character | no | no | 0.2/no | not good |
| [Easy OCR](https://github.com/JaidedAI/EasyOCR) | word | yes | yes | | |
| [Surya OCR](https://github.com/VikParuchuri/surya) | line | yes | yes | | |
| [DocTR](https://github.com/mindee/doctr) | word | yes | yes | | |
## Projects based on the ScaleDP
- [PDF Redaction](https://pdf-redaction.com/) - Free AI-powered tool for redacting PDF files (removing sensitive information) online.
## Disclaimer
This project is not affiliated with, endorsed by, or connected to the Apache Software Foundation or Apache Spark.
