https://github.com/stabrise/scaledp
ScaleDP is an Open-Source extension of Apache Spark for Document Processing
- Host: GitHub
- URL: https://github.com/stabrise/scaledp
- Owner: StabRise
- License: agpl-3.0
- Created: 2024-10-25T11:01:37.000Z (over 1 year ago)
- Default Branch: master
- Last Pushed: 2024-12-02T10:30:12.000Z (over 1 year ago)
- Last Synced: 2024-12-02T11:35:14.128Z (over 1 year ago)
- Topics: doctrocr, easyocr, huggingface-models, machine-learning, nlp, nlp-machine-learning, ocr, ocr-python, ocr-recognition, pdf, pdf-document-processor, spark, suryaocr
- Language: Python
- Homepage: https://stabrise.com/scaledp/
- Size: 5.64 MB
- Stars: 1
- Watchers: 0
- Forks: 0
- Open Issues: 5
Metadata Files:
- Readme: README.md
- License: LICENSE
An Open-Source Library for Processing Documents using AI/ML in Apache Spark.
---
**Source Code**: https://github.com/StabRise/ScaleDP
**Quickstart**: 1.QuickStart.ipynb
**Tutorials**: https://github.com/StabRise/ScaleDP-Tutorials
---
# Welcome to the ScaleDP library
ScaleDP is a library that lets you process documents with AI/ML capabilities and scale that processing using Apache Spark.
**LLMs** (Large Language Models) and **VLMs** (Vision Language Models) are used in combination with **OCR** engines to extract data from text and images.
Discover pre-trained models for your projects or play with the thousands of models hosted on the [Hugging Face Hub](https://huggingface.co/).
## Key features
### Document processing:
- ✅ Load PDF documents/images into a Spark DataFrame (using the [Spark PDF Datasource](https://github.com/stabrise/spark-pdf) or as `binaryFile`)
- ✅ Extract text/images from PDF documents/images
- ✅ Zero-shot extraction of **structured data** from text/images using LLM and ML models
- ✅ Can run as a REST API service without a Spark session, for minimal processing latency
- ✅ Supports streaming mode for processing documents in real time
### LLM:
Supports OpenAI-compatible APIs for calling LLM/VLM models (GPT, Gemini, GROQ, etc.):
- OCR Images/PDF documents using Vision LLM models
- Extract data from the image using Vision LLM models
- Extract data from the text/images using LLM models
- Extract data using DSPy framework
- NER using LLMs
- Visualize results
### NLP:
- Extract data from the text/images using NLP models from the Hugging Face Hub
- NER using classical ML models
### OCR:
Supports various open-source OCR engines:
- [Tesseract OCR](https://github.com/tesseract-ocr/tesseract)
- [Easy OCR](https://github.com/JaidedAI/EasyOCR)
- [Surya OCR](https://github.com/VikParuchuri/surya)
- [DocTR](https://github.com/mindee/doctr)
- Vision LLM models
### CV:
- Object detection on images using YOLO models
- Text detection on images
## Installation
### Prerequisites
- Python 3.10 or higher
- Apache Spark 3.5 or higher
- Java 8
### Installation using pip
Install the `ScaleDP` package with [pip](https://pypi.org/project/scaledp/):
```bash
pip install scaledp
```
### Installation using Docker
Build image:
```bash
docker build -t scaledp .
```
Run container:
```bash
docker run -p 8888:8888 scaledp:latest
```
Open Jupyter Notebook in your browser:
```text
http://localhost:8888
```
## Quickstart
Start a Spark session with ScaleDP:
```python
from scaledp import *
spark = ScaleDPSession()
spark
```
Read example image file:
```python
image_example = files('resources/images/Invoice.png')
df = spark.read.format("binaryFile") \
    .load(image_example)
df.show_image("content")
```
Output:

## Zero-shot data extraction from the image
```python
from pydantic import BaseModel
import json

class Items(BaseModel):
    date: str
    item: str
    note: str
    debit: str

class InvoiceSchema(BaseModel):
    hospital: str
    tax_id: str
    address: str
    email: str
    phone: str
    items: list[Items]
    total: str

pipeline = PipelineModel(stages=[
    DataToImage(
        inputCol="content",
        outputCol="image"
    ),
    LLMVisualExtractor(
        inputCol="image",
        outputCol="invoice",
        model="gemini-1.5-flash",
        apiKey="",
        apiBase="https://generativelanguage.googleapis.com/v1beta/",
        schema=json.dumps(InvoiceSchema.model_json_schema())
    )
])

result = pipeline.transform(df).cache()
```
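The `schema` parameter is a plain JSON Schema string; `InvoiceSchema.model_json_schema()` generates it from the Pydantic classes above. To see roughly what gets passed to the extractor, here is a simplified hand-written sketch of that schema (the real Pydantic output additionally contains `$defs`, `title`, and `required` entries):

```python
import json

# Simplified, hand-written approximation of the JSON Schema that
# InvoiceSchema.model_json_schema() would produce for the classes above.
invoice_schema = {
    "type": "object",
    "properties": {
        "hospital": {"type": "string"},
        "tax_id": {"type": "string"},
        "address": {"type": "string"},
        "email": {"type": "string"},
        "phone": {"type": "string"},
        "items": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "date": {"type": "string"},
                    "item": {"type": "string"},
                    "note": {"type": "string"},
                    "debit": {"type": "string"},
                },
            },
        },
        "total": {"type": "string"},
    },
}

# This string is what a schema argument like the one above boils down to.
schema_str = json.dumps(invoice_schema)
```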
Show the extracted JSON:
```python
result.show_json("invoice")
```

Let's show the invoice as structured data in a DataFrame:
```python
result.select("invoice.data.*").show()
```
Output:
```text
+-------------------+---------+--------------------+--------------------+--------------+--------------------+-------+
| hospital| tax_id| address| email| phone| items| total|
+-------------------+---------+--------------------+--------------------+--------------+--------------------+-------+
|Hope Haven Hospital|26-123123|855 Howard Street...|hopedutton@hopeha...|(123) 456-1238|[{10/21/2022, App...|1024.50|
+-------------------+---------+--------------------+--------------------+--------------+--------------------+-------+
```
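Besides the typed `data` struct, the `invoice` column carries the raw model response as a string in `json_data` (see the schema below), so individual rows can also be post-processed outside Spark with the standard library. A sketch with made-up illustrative values shaped like `InvoiceSchema`, not actual output from the run above:

```python
import json

# Illustrative json_data value for one row (made-up, InvoiceSchema-shaped)
row_json = (
    '{"hospital": "Hope Haven Hospital", "tax_id": "26-123123", '
    '"items": [{"date": "10/21/2022", "item": "Service", '
    '"note": "", "debit": "500.00"}], "total": "1024.50"}'
)

invoice = json.loads(row_json)
total = float(invoice["total"])   # numeric total for downstream arithmetic
n_items = len(invoice["items"])   # number of extracted line items
```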
Schema:
```python
result.printSchema()
```
```text
root
|-- path: string (nullable = true)
|-- modificationTime: timestamp (nullable = true)
|-- length: long (nullable = true)
|-- image: struct (nullable = true)
| |-- path: string (nullable = false)
| |-- resolution: integer (nullable = false)
| |-- data: binary (nullable = false)
| |-- imageType: string (nullable = false)
| |-- exception: string (nullable = false)
| |-- height: integer (nullable = false)
| |-- width: integer (nullable = false)
|-- invoice: struct (nullable = true)
| |-- path: string (nullable = false)
| |-- json_data: string (nullable = true)
| |-- type: string (nullable = false)
| |-- exception: string (nullable = false)
| |-- processing_time: double (nullable = false)
| |-- data: struct (nullable = true)
| | |-- hospital: string (nullable = false)
| | |-- tax_id: string (nullable = false)
| | |-- address: string (nullable = false)
| | |-- email: string (nullable = false)
| | |-- phone: string (nullable = false)
| | |-- items: array (nullable = false)
| | | |-- element: struct (containsNull = false)
| | | | |-- date: string (nullable = false)
| | | | |-- item: string (nullable = false)
| | | | |-- note: string (nullable = false)
| | | | |-- debit: string (nullable = false)
| | |-- total: string (nullable = false)
```
## NER using a model from the Hugging Face Hub
Define a pipeline to extract text from the image and run NER:
```python
pipeline = PipelineModel(stages=[
    DataToImage(inputCol="content", outputCol="image"),
    TesseractOcr(inputCol="image", outputCol="text", psm=PSM.AUTO, keepInputData=True),
    Ner(model="obi/deid_bert_i2b2", inputCol="text", outputCol="ner", keepInputData=True),
    ImageDrawBoxes(inputCols=["image", "ner"], outputCol="image_with_boxes", lineWidth=3,
                   padding=5, displayDataList=['entity_group'])
])

result = pipeline.transform(df).cache()
result.show_text("text")
```
Output:

Show NER results:
```python
result.show_ner(limit=20)
```
Output:
```text
+------------+-------------------+----------+-----+---+--------------------+
|entity_group| score| word|start|end| boxes|
+------------+-------------------+----------+-----+---+--------------------+
| HOSP| 0.991257905960083| Hospital| 0| 8|[{Hospital:, 0.94...|
| LOC| 0.999171257019043| Dutton| 10| 16|[{Dutton,, 0.9609...|
| LOC| 0.9992585778236389| MI| 18| 20|[{MI, 0.93335297,...|
| ID| 0.6838774085044861| 26| 29| 31|[{26-123123, 0.90...|
| PHONE| 0.4669836759567261| -| 31| 32|[{26-123123, 0.90...|
| PHONE| 0.7790696024894714| 123123| 32| 38|[{26-123123, 0.90...|
| HOSP|0.37445762753486633| HOPE| 39| 43|[{HOPE, 0.9525460...|
| HOSP| 0.9503226280212402| HAVEN| 44| 49|[{HAVEN, 0.952546...|
| LOC| 0.9975488185882568|855 Howard| 59| 69|[{855, 0.94682700...|
| LOC| 0.9984399676322937| Street| 70| 76|[{Street, 0.95823...|
| HOSP| 0.3670221269130707| HOSPITAL| 77| 85|[{HOSPITAL, 0.959...|
| LOC| 0.9990363121032715| Dutton| 86| 92|[{Dutton,, 0.9647...|
| LOC| 0.999313473701477| MI 49316| 94|102|[{MI, 0.94589012,...|
| PHONE| 0.9830010533332825| ( 123 )| 110|115|[{(123), 0.595334...|
| PHONE| 0.9080978035926819| 456| 116|119|[{456-1238, 0.955...|
| PHONE| 0.9378324151039124| -| 119|120|[{456-1238, 0.955...|
| PHONE| 0.8746233582496643| 1238| 120|124|[{456-1238, 0.955...|
| PATIENT|0.45354968309402466|hopedutton| 132|142|[{hopedutton@hope...|
| EMAIL|0.17805588245391846| hopehaven| 143|152|[{hopedutton@hope...|
| HOSP| 0.505658745765686| INVOICE| 157|164|[{INVOICE, 0.9661...|
+------------+-------------------+----------+-----+---+--------------------+
```
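For downstream analysis you often want simple aggregates over the recognized entities, such as counts per `entity_group` or a confidence cut on `score`. A standard-library sketch over rows shaped like the table above (a small illustrative subset, not the full output):

```python
from collections import Counter

# A few rows shaped like the NER output above (illustrative subset)
entities = [
    {"entity_group": "HOSP", "word": "Hospital", "score": 0.99},
    {"entity_group": "LOC", "word": "Dutton", "score": 0.99},
    {"entity_group": "LOC", "word": "MI", "score": 0.99},
    {"entity_group": "PHONE", "word": "123123", "score": 0.78},
]

# Count entities per type
counts = Counter(e["entity_group"] for e in entities)

# Keep only confident predictions
confident = [e for e in entities if e["score"] >= 0.9]
```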
Visualize NER results:
```python
result.visualize_ner(labels_list=["DATE", "LOC"])
```

Original image with NER results:
```python
result.show_image("image_with_boxes")
```

## OCR engines
| | Bbox level | GPU support | Separate text-detection model | Processing time per page, secs (CPU/GPU) | Handwritten text support |
|-------------------|-------------|-------------|------------------------------------|---------------------------------------|--------------------------|
| [Tesseract OCR](https://github.com/tesseract-ocr/tesseract) | character | no | no | 0.2/no | not good |
| Tesseract OCR CLI | character | no | no | 0.2/no | not good |
| [Easy OCR](https://github.com/JaidedAI/EasyOCR) | word | yes | yes | | |
| [Surya OCR](https://github.com/VikParuchuri/surya) | line | yes | yes | | |
| [DocTR](https://github.com/mindee/doctr) | word | yes | yes | | |
## Projects based on the ScaleDP
- [PDF Redaction](https://pdf-redaction.com/) - Free AI-powered tool for redacting PDF files (removing sensitive information) online.
## Disclaimer
This project is not affiliated with, endorsed by, or connected to the Apache Software Foundation or Apache Spark.
