{"id":22301793,"url":"https://github.com/stabrise/scaledp","last_synced_at":"2025-07-29T03:32:36.950Z","repository":{"id":266061922,"uuid":"878417574","full_name":"StabRise/ScaleDP","owner":"StabRise","description":"ScaleDP is an Open-Source extension of Apache Spark for Document Processing","archived":false,"fork":false,"pushed_at":"2024-12-02T10:30:12.000Z","size":5917,"stargazers_count":1,"open_issues_count":5,"forks_count":0,"subscribers_count":0,"default_branch":"master","last_synced_at":"2024-12-02T11:35:14.128Z","etag":null,"topics":["doctrocr","easyocr","huggingface-models","machine-learning","nlp","nlp-machine-learning","ocr","ocr-python","ocr-recognition","pdf","pdf-document-processor","spark","suryaocr"],"latest_commit_sha":null,"homepage":"https://stabrise.com/scaledp/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"agpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/StabRise.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-10-25T11:01:37.000Z","updated_at":"2024-12-02T10:30:17.000Z","dependencies_parsed_at":"2024-12-02T18:49:54.504Z","dependency_job_id":null,"html_url":"https://github.com/StabRise/ScaleDP","commit_stats":null,"previous_names":["stabrise/scaledp"],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/StabRise%2FScaleDP","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/StabRise%2FScaleDP/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/StabRise%2FScaleDP/releases","manifests_url":"https://repo
s.ecosyste.ms/api/v1/hosts/GitHub/repositories/StabRise%2FScaleDP/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/StabRise","download_url":"https://codeload.github.com/StabRise/ScaleDP/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":227976219,"owners_count":17850175,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["doctrocr","easyocr","huggingface-models","machine-learning","nlp","nlp-machine-learning","ocr","ocr-python","ocr-recognition","pdf","pdf-document-processor","spark","suryaocr"],"created_at":"2024-12-03T18:31:05.696Z","updated_at":"2025-07-29T03:32:36.930Z","avatar_url":"https://github.com/StabRise.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\n\u003cp align=\"center\"\u003e\n  \u003cbr/\u003e\n    \u003ca href=\"https://stabrise.com/scaledp/\" target=\"_blank\"\u003e\u003cimg alt=\"ScaleDP\" src=\"https://stabrise.com/static/images/projects/scaledp.webp\" width=\"450\" style=\"max-width: 100%;\"\u003e\u003c/a\u003e\n  \u003cbr/\u003e\n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n    \u003ci\u003eAn Open-Source Library for Processing Documents using AI/ML in Apache Spark.\u003c/i\u003e\n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n    \u003ca href=\"https://pypi.org/project/scaledp/\" alt=\"Package on PyPI\"\u003e\u003cimg src=\"https://img.shields.io/pypi/v/scaledp.svg\" /\u003e\u003c/a\u003e\n    \u003ca href=\"https://github.com/stabrise/spark-pdf/blob/main/LICENSE\"\u003e\u003cimg alt=\"GitHub\" 
src=\"https://img.shields.io/github/license/stabrise/spark-pdf.svg?color=blue\"\u003e\u003c/a\u003e\n    \u003ca href=\"https://stabrise.com\"\u003e\u003cimg alt=\"StabRise\" src=\"https://img.shields.io/badge/powered%20by-StabRise-orange.svg?style=flat\u0026colorA=E1523D\u0026colorB=007D8A\"\u003e\u003c/a\u003e\n    \u003ca href=\"https://app.codacy.com/gh/StabRise/ScaleDP/dashboard?utm_source=gh\u0026utm_medium=referral\u0026utm_content=\u0026utm_campaign=Badge_grade\"\u003e\n    \u003cimg src=\"https://app.codacy.com/project/badge/Grade/98570508281140c2a33e616a4f749c20\" alt=\"Codacy Badge\" /\u003e\n\u003c/a\u003e\u003c/p\u003e\n\n---\n\n**Source Code**: \u003ca href=\"https://github.com/StabRise/ScaleDP/\" target=\"_blank\"\u003ehttps://github.com/StabRise/ScaleDP\u003c/a\u003e\n\n**Quickstart**: \u003ca href=\"https://colab.research.google.com/github/StabRise/scaledp-tutorials/blob/master/1.QuickStart.ipynb\" target=\"_blank\"\u003e1.QuickStart.ipynb\u003c/a\u003e\n\n**Tutorials**: \u003ca href=\"https://github.com/StabRise/ScaleDP-Tutorials/\" target=\"_blank\"\u003ehttps://github.com/StabRise/ScaleDP-Tutorials\u003c/a\u003e\n\n---\n\n# Welcome to the ScaleDP library\n\nScaleDP is a library that lets you process documents using AI/ML capabilities and scale that processing with Apache Spark.\n\n**LLMs** (Large Language Models) and **VLMs** (Vision Language Models) are used to extract data from text and images in combination with **OCR** engines.\n\nDiscover pre-trained models for your projects or play with the thousands of models hosted on the [Hugging Face Hub](https://huggingface.co/).\n\n## Key features\n\n### Document processing:\n\n- ✅ Loading PDF documents and images into a Spark DataFrame (using [Spark PDF Datasource](https://github.com/stabrise/spark-pdf) and as `binaryFile`)\n- ✅ Extracting text and images from PDF documents and images\n- ✅ Zero-shot extraction of **structured data** from text and images using LLM and ML models\n- ✅ Can run as a REST API service without 
a Spark Session to minimize processing latency\n- ✅ Streaming mode support for processing documents in real time\n\n### LLM:\n\nSupports OpenAI-compatible APIs for calling LLM/VLM models (GPT, Gemini, GROQ, etc.)\n\n- OCR images/PDF documents using Vision LLM models\n- Extract data from the image using Vision LLM models\n- Extract data from the text/images using LLM models\n- Extract data using the DSPy framework\n- NER using LLMs\n- Visualize results\n\n### NLP:\n\n- Extract data from the text/images using NLP models from the Hugging Face Hub\n- NER using classical ML models\n\n### OCR:\n\nSupports various open-source OCR engines:\n\n - [Tesseract OCR](https://github.com/tesseract-ocr/tesseract)\n - [Easy OCR](https://github.com/JaidedAI/EasyOCR)\n - [Surya OCR](https://github.com/VikParuchuri/surya)\n - [DocTR](https://github.com/mindee/doctr)\n - Vision LLM models\n\n### CV:\n- Object detection on images using YOLO models\n- Text detection on images\n\n\n## Installation\n\n### Prerequisites\n\n- Python 3.10 or higher\n- Apache Spark 3.5 or higher\n- Java 8\n\n### Installation using pip\n\nInstall the `ScaleDP` package with [pip](https://pypi.org/project/scaledp/):\n\n```bash\npip install scaledp\n```\n\n### Installation using Docker\n\nBuild the image:\n\n```bash\n  docker build -t scaledp .\n```\n\nRun the container:\n```bash\n  docker run -p 8888:8888 scaledp:latest\n```\n\nOpen Jupyter Notebook in your browser:\n```bash\n  http://localhost:8888\n```\n\n## Quickstart\n\nStart a Spark session with ScaleDP:\n\n```python\nfrom scaledp import *\nspark = ScaleDPSession()\nspark\n```\n\nRead an example image file:\n\n```python\nimage_example = files('resources/images/Invoice.png')\ndf = spark.read.format(\"binaryFile\") \\\n    .load(image_example)\n\ndf.show_image(\"content\")\n```\nOutput:\n\n\u003cimg src=\"https://github.com/StabRise/ScaleDP/blob/master/images/ImageOutput.png?raw=true\" width=\"400\"\u003e\n\n\n## Zero-Shot data Extraction from the Image:\n\n```python\nfrom 
pydantic import BaseModel\nimport json\n\nclass Items(BaseModel):\n    date: str\n    item: str\n    note: str\n    debit: str\n\nclass InvoiceSchema(BaseModel):\n    hospital: str\n    tax_id: str\n    address: str\n    email: str\n    phone: str\n    items: list[Items]\n    total: str\n    \n\npipeline = PipelineModel(stages=[\n    DataToImage(\n        inputCol=\"content\",\n        outputCol=\"image\"\n    ),\n    LLMVisualExtractor(\n        inputCol=\"image\",\n        outputCol=\"invoice\",\n        model=\"gemini-1.5-flash\",\n        apiKey=\"\",\n        apiBase=\"https://generativelanguage.googleapis.com/v1beta/\",\n        schema=json.dumps(InvoiceSchema.model_json_schema())\n    )\n])\n\nresult = pipeline.transform(df).cache()\n```\n\nShow the extracted JSON:\n\n```python\nresult.show_json(\"invoice\")\n```\n\n\u003cimg src=\"https://github.com/StabRise/ScaleDP/blob/master/images/LLMVisualExtractorJson.png?raw=true\" width=\"400\"\u003e\n\nShow the invoice as structured data in a DataFrame:\n\n```python\nresult.select(\"invoice.data.*\").show()\n```\n\nOutput:\n\n```text\n+-------------------+---------+--------------------+--------------------+--------------+--------------------+-------+\n|           hospital|   tax_id|             address|               email|         phone|               items|  total|\n+-------------------+---------+--------------------+--------------------+--------------+--------------------+-------+\n|Hope Haven Hospital|26-123123|855 Howard Street...|hopedutton@hopeha...|(123) 456-1238|[{10/21/2022, App...|1024.50|\n+-------------------+---------+--------------------+--------------------+--------------+--------------------+-------+\n```\n\nSchema:\n\n```python\nresult.printSchema()\n```\n\n```text\nroot\n |-- path: string (nullable = true)\n |-- modificationTime: timestamp (nullable = true)\n |-- length: long (nullable = true)\n |-- image: struct (nullable = true)\n |    |-- path: string (nullable = false)\n |    |-- resolution: 
integer (nullable = false)\n |    |-- data: binary (nullable = false)\n |    |-- imageType: string (nullable = false)\n |    |-- exception: string (nullable = false)\n |    |-- height: integer (nullable = false)\n |    |-- width: integer (nullable = false)\n |-- invoice: struct (nullable = true)\n |    |-- path: string (nullable = false)\n |    |-- json_data: string (nullable = true)\n |    |-- type: string (nullable = false)\n |    |-- exception: string (nullable = false)\n |    |-- processing_time: double (nullable = false)\n |    |-- data: struct (nullable = true)\n |    |    |-- hospital: string (nullable = false)\n |    |    |-- tax_id: string (nullable = false)\n |    |    |-- address: string (nullable = false)\n |    |    |-- email: string (nullable = false)\n |    |    |-- phone: string (nullable = false)\n |    |    |-- items: array (nullable = false)\n |    |    |    |-- element: struct (containsNull = false)\n |    |    |    |    |-- date: string (nullable = false)\n |    |    |    |    |-- item: string (nullable = false)\n |    |    |    |    |-- note: string (nullable = false)\n |    |    |    |    |-- debit: string (nullable = false)\n |    |    |-- total: string (nullable = false)\n```\n\n## NER using a model from the Hugging Face Hub\n\nDefine a pipeline to extract text from the image and run NER:\n\n```python\npipeline = PipelineModel(stages=[\n    DataToImage(inputCol=\"content\", outputCol=\"image\"),\n    TesseractOcr(inputCol=\"image\", outputCol=\"text\", psm=PSM.AUTO, keepInputData=True),\n    Ner(model=\"obi/deid_bert_i2b2\", inputCol=\"text\", outputCol=\"ner\", keepInputData=True),\n    ImageDrawBoxes(inputCols=[\"image\", \"ner\"], outputCol=\"image_with_boxes\", lineWidth=3, \n                   padding=5, displayDataList=['entity_group'])\n])\n\nresult = pipeline.transform(df).cache()\n\nresult.show_text(\"text\")\n```\n\nOutput:\n\n\u003cimg src=\"https://github.com/StabRise/ScaleDP/blob/master/images/TextOutput.png?raw=true\" 
width=\"400\"\u003e\n\nShow NER results:\n\n```python\nresult.show_ner(limit=20)\n```\n\nOutput:\n```text\n+------------+-------------------+----------+-----+---+--------------------+\n|entity_group|              score|      word|start|end|               boxes|\n+------------+-------------------+----------+-----+---+--------------------+\n|        HOSP|  0.991257905960083|  Hospital|    0|  8|[{Hospital:, 0.94...|\n|         LOC|  0.999171257019043|    Dutton|   10| 16|[{Dutton,, 0.9609...|\n|         LOC| 0.9992585778236389|        MI|   18| 20|[{MI, 0.93335297,...|\n|          ID| 0.6838774085044861|        26|   29| 31|[{26-123123, 0.90...|\n|       PHONE| 0.4669836759567261|         -|   31| 32|[{26-123123, 0.90...|\n|       PHONE| 0.7790696024894714|    123123|   32| 38|[{26-123123, 0.90...|\n|        HOSP|0.37445762753486633|      HOPE|   39| 43|[{HOPE, 0.9525460...|\n|        HOSP| 0.9503226280212402|     HAVEN|   44| 49|[{HAVEN, 0.952546...|\n|         LOC| 0.9975488185882568|855 Howard|   59| 69|[{855, 0.94682700...|\n|         LOC| 0.9984399676322937|    Street|   70| 76|[{Street, 0.95823...|\n|        HOSP| 0.3670221269130707|  HOSPITAL|   77| 85|[{HOSPITAL, 0.959...|\n|         LOC| 0.9990363121032715|    Dutton|   86| 92|[{Dutton,, 0.9647...|\n|         LOC|  0.999313473701477|  MI 49316|   94|102|[{MI, 0.94589012,...|\n|       PHONE| 0.9830010533332825|   ( 123 )|  110|115|[{(123), 0.595334...|\n|       PHONE| 0.9080978035926819|       456|  116|119|[{456-1238, 0.955...|\n|       PHONE| 0.9378324151039124|         -|  119|120|[{456-1238, 0.955...|\n|       PHONE| 0.8746233582496643|      1238|  120|124|[{456-1238, 0.955...|\n|     PATIENT|0.45354968309402466|hopedutton|  132|142|[{hopedutton@hope...|\n|       EMAIL|0.17805588245391846| hopehaven|  143|152|[{hopedutton@hope...|\n|        HOSP|  0.505658745765686|   INVOICE|  157|164|[{INVOICE, 0.9661...|\n+------------+-------------------+----------+-----+---+--------------------+\n```\n\nVisualize NER 
results:\n\n```python\nresult.visualize_ner(labels_list=[\"DATE\", \"LOC\"])\n```\n\u003cimg src=\"https://github.com/StabRise/ScaleDP/blob/master/images/NerVisual.png?raw=true\" width=\"400\"\u003e\n\nOriginal image with NER results:\n\n```python\nresult.show_image(\"image_with_boxes\")\n```\n\u003cimg src=\"https://github.com/StabRise/ScaleDP/blob/master/images/NerVisualOnImage.png?raw=true\" width=\"400\"\u003e\n\n## OCR engines\n\n|                   | Bbox level  | GPU support | Separate model for text detection  | Processing time per page (CPU/GPU), seconds | Handwritten text support |\n|-------------------|-------------|-------------|------------------------------------|---------------------------------------------|--------------------------|\n| [Tesseract OCR](https://github.com/tesseract-ocr/tesseract)     | character   | no          | no                                 | 0.2/no                                      | not good                 |\n| Tesseract OCR CLI | character   | no          | no                                 | 0.2/no                                      | not good                 |\n| [Easy OCR](https://github.com/JaidedAI/EasyOCR)          | word        | yes         | yes                                |                                             |                          |\n| [Surya OCR](https://github.com/VikParuchuri/surya)         | line        | yes         | yes                                |                                             |                          |\n| [DocTR](https://github.com/mindee/doctr)       | word        | yes         | yes                                |                                             |                          |\n\n\n## Projects based on ScaleDP\n\n - [PDF Redaction](https://pdf-redaction.com/) - Free AI-powered online tool for redacting PDF files (removing sensitive information).\n\n\n\u003ca href=\"https://pdf-redaction.com/\"\u003e\u003cimg alt=\"pdf-redaction\" 
src=\"https://media.licdn.com/dms/image/v2/D4D22AQGhRpexOnAbyA/feedshare-shrink_800/B4DZVmbKWPHIAg-/0/1741180153002?e=1744243200\u0026v=beta\u0026t=lRQXyJ5nHYvdU4uF6LJuq69oKs72yPBs1xts2IrJgxc\"/\u003e\u003c/a\u003e\n\n\n## Disclaimer\n\nThis project is not affiliated with, endorsed by, or connected to the Apache Software Foundation or Apache Spark.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fstabrise%2Fscaledp","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fstabrise%2Fscaledp","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fstabrise%2Fscaledp/lists"}