{"id":22301776,"url":"https://github.com/stabrise/spark-pdf","last_synced_at":"2025-04-09T16:04:30.827Z","repository":{"id":264587758,"uuid":"892965977","full_name":"StabRise/spark-pdf","owner":"StabRise","description":"PDF DataSource for Apache Spark","archived":false,"fork":false,"pushed_at":"2025-03-19T11:24:33.000Z","size":7925,"stargazers_count":45,"open_issues_count":5,"forks_count":4,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-04-09T16:04:22.465Z","etag":null,"topics":["big-data","data-engineering","data-extraction","data-science","ocr","ocr-recognition","pdf","pdf-document","pdf-document-processor","spark","spark-datasource","tesseract","tesseract-ocr"],"latest_commit_sha":null,"homepage":"https://stabrise.com/spark-pdf/","language":"Scala","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"agpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/StabRise.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-11-23T07:00:06.000Z","updated_at":"2025-03-28T22:00:48.000Z","dependencies_parsed_at":"2025-02-22T07:20:36.868Z","dependency_job_id":"009ebb48-0e57-4738-9bff-ba948db6173d","html_url":"https://github.com/StabRise/spark-pdf","commit_stats":null,"previous_names":["stabrise/spark-pdf"],"tags_count":3,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/StabRise%2Fspark-pdf","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/StabRise%2Fspark-pdf/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/StabRise%2Fspark-pdf/releases","manifests_url
":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/StabRise%2Fspark-pdf/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/StabRise","download_url":"https://codeload.github.com/StabRise/spark-pdf/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248065290,"owners_count":21041871,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["big-data","data-engineering","data-extraction","data-science","ocr","ocr-recognition","pdf","pdf-document","pdf-document-processor","spark","spark-datasource","tesseract","tesseract-ocr"],"created_at":"2024-12-03T18:30:57.396Z","updated_at":"2025-04-09T16:04:30.802Z","avatar_url":"https://github.com/StabRise.png","language":"Scala","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cp align=\"center\"\u003e\n  \u003cbr/\u003e\n    \u003ca hreh=\"https://stabrise.com/spark-pdf/\"\u003e\u003cimg alt=\"Spark Pdf\" src=\"https://stabrise.com/static/images/projects/sparkpdf.webp\" width=\"450\" style=\"max-width: 100%;\"\u003e\u003c/a\u003e\n  \u003cbr/\u003e\n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n    \u003ca target=\"_blank\" href=\"https://colab.research.google.com/github/StabRise/spark-pdf/blob/main/examples/PdfDataSource.ipynb\"\u003e\n      \u003cimg src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab Qick Start\"/\u003e\n    \u003c/a\u003e\n    \u003ca href=\"https://github.com/StabRise/spark-pdf/actions/\"\u003e\n        \u003cimg alt=\"Test\" 
src=\"https://github.com/StabRise/spark-pdf/actions/workflows/scala.yml/badge.svg\"\u003e\n    \u003c/a\u003e\n    \u003ca href=\"https://search.maven.org/artifact/com.stabrise/spark-pdf-spark35_2.12\"\u003e\n        \u003cimg alt=\"Maven Central Version\" src=\"https://img.shields.io/maven-central/v/com.stabrise/spark-pdf-spark35_2.12\"\u003e\n    \u003c/a\u003e\n    \u003ca href=\"https://github.com/StabRise/spark-pdf/blob/master/LICENSE\" \u003e\n        \u003cimg src=\"https://img.shields.io/badge/License-AGPL%203-blue.svg\" alt=\"License\"/\u003e\n    \u003c/a\u003e\n    \u003ca href=\"https://app.codacy.com/gh/StabRise/spark-pdf/dashboard?utm_source=gh\u0026utm_medium=referral\u0026utm_content=\u0026utm_campaign=Badge_grade\" target=\"_blank\"\u003e\n        \u003cimg src=\"https://app.codacy.com/project/badge/Grade/2fde782d0c754df1b60b389799f46f0f\" alt=\"Codacy Badge\"\u003e\n    \u003c/a\u003e\n\u003c/p\u003e\n\n\n\u003cp align=\"center\"\u003e\n    \u003ca href=\"https://x.com/intent/tweet?text=Check%20out%20this%20project%20on%20GitHub:%20https://github.com/StabRise/spark-pdf%20%23OpenIDConnect%20%23Security%20%23Authentication\" target=\"_blank\"\u003e\n        \u003cimg src=\"https://img.shields.io/badge/share-000000?logo=x\u0026logoColor=white\" alt=\"Share on X\"\u003e\n    \u003c/a\u003e\n    \u003ca href=\"https://www.linkedin.com/sharing/share-offsite/?url=https://github.com/StabRise/spark-pdf\" target=\"_blank\"\u003e\n        \u003cimg src=\"https://img.shields.io/badge/share-0A66C2?logo=linkedin\u0026logoColor=white\" alt=\"Share on LinkedIn\"\u003e\n    \u003c/a\u003e\n    \u003ca href=\"https://www.reddit.com/submit?title=Check%20out%20this%20project%20on%20GitHub:%20https://github.com/StabRise/spark-pdf\" target=\"_blank\"\u003e\n        \u003cimg src=\"https://img.shields.io/badge/share-FF4500?logo=reddit\u0026logoColor=white\" alt=\"Share on Reddit\"\u003e\n    \u003c/a\u003e\n\u003c/p\u003e\n\n---\n\n⭐ Star us on GitHub — it motivates us 
a lot!\n\n**Source Code**: [https://github.com/StabRise/spark-pdf](https://github.com/StabRise/spark-pdf)\n\n**Quick Start Jupyter Notebook Spark 3.5.x on Databricks**: [PdfDataSourceDatabricks.ipynb](https://github.com/StabRise/spark-pdf/blob/main/examples/PdfDataSourceDatabricks.ipynb)\n\n**Quick Start Jupyter Notebook Spark 3.x.x**: [PdfDataSource.ipynb](https://github.com/StabRise/spark-pdf/blob/main/examples/PdfDataSource.ipynb)\n\n**Quick Start Jupyter Notebook Spark 4.0.x**: [PdfDataSourceSpark4.ipynb](https://github.com/StabRise/spark-pdf/blob/main/examples/PdfDataSourceSpark4.ipynb)\n\n**With Spark Connect**: [PdfDataSourceSparkConnect.ipynb](https://github.com/StabRise/spark-pdf/blob/main/examples/PdfDataSourceSparkConnect.ipynb)\n\n---\n\n## Welcome to Spark PDF\n\nThis project provides a custom data source for [Apache Spark](https://spark.apache.org/) that lets you read PDF files into a Spark DataFrame.\n\nIf you find this project useful, please star the repository.\n\n👉 Works on Databricks now. 
See the [Databricks example](https://github.com/StabRise/spark-pdf/blob/main/examples/PdfDataSourceDatabricks.ipynb).\n\n## Key features:\n\n- Reads PDF documents into a Spark DataFrame\n- Reads PDF files efficiently and lazily, page by page\n- Supports large files, up to 10k pages\n- Supports scanned PDF files (runs OCR to recognize text from the page images)\n- No need to install Tesseract OCR; it is included in the package\n- 👉 Compatible with [ScaleDP](https://github.com/StabRise/ScaleDP), an Open-Source Library for Processing Documents using AI/ML in Apache Spark.\n- Works with Spark Connect\n\n\n## Requirements\n\n- Java 8, 11, 17\n- Apache Spark 3.3.2, 3.4.1, 3.5.0, 4.0.0\n- Ghostscript 9.50 or later (only for the Ghostscript reader)\n\nSpark 4.0.0 is supported in version `0.1.11` and later (requires Java 17 and Scala 2.13).\n\n## Installation\n\nThe binary package is available in the Maven Central Repository.\n\n\n- **Spark 3.5.***: `com.stabrise:spark-pdf-spark35_2.12:0.1.15`\n- **Spark 3.4.***: `com.stabrise:spark-pdf-spark34_2.12:0.1.11` (there is an issue with publishing a newer version)\n- **Spark 3.3.***: `com.stabrise:spark-pdf-spark33_2.12:0.1.15`\n- **Spark 4.0.***: `com.stabrise:spark-pdf-spark40_2.13:0.1.15`\n\n## Options for the data source:\n\n- `imageType`: Output image type. Can be \"BINARY\", \"GREY\", or \"RGB\". Default: \"RGB\".\n- `resolution`: Resolution for rendering a PDF page to an image, in dpi. Default: \"300\".\n- `pagePerPartition`: Number of pages per partition in the Spark DataFrame. Default: \"5\".\n- `reader`: PDF reader to use. Supports `pdfBox` (based on the PDFBox Java library) and `gs` (based on Ghostscript; requires Ghostscript to be installed on the system).\n- `ocrConfig`: Tesseract OCR configuration. Default: \"psm=3\". 
For more information, see [Tesseract OCR Params](TesseractParams.md).\n\n## Output Columns in the DataFrame:\n\nThe DataFrame contains the following columns:\n\n- `path`: path to the file\n- `filename`: name of the file\n- `page_number`: page number of the document\n- `partition_number`: partition number\n- `text`: extracted text from the text layer of the PDF page\n- `image`: image representation of the page\n- `document`: the OCR-extracted text from the rendered image (calls Tesseract OCR)\n\nOutput Schema:\n\n```agsl\nroot\n |-- path: string (nullable = true)\n |-- filename: string (nullable = true)\n |-- page_number: integer (nullable = true)\n |-- partition_number: integer (nullable = true)\n |-- text: string (nullable = true)\n |-- image: struct (nullable = true)\n |    |-- path: string (nullable = true)\n |    |-- resolution: integer (nullable = true)\n |    |-- data: binary (nullable = true)\n |    |-- imageType: string (nullable = true)\n |    |-- exception: string (nullable = true)\n |    |-- height: integer (nullable = true)\n |    |-- width: integer (nullable = true)\n |-- document: struct (nullable = true)\n |    |-- path: string (nullable = true)\n |    |-- text: string (nullable = true)\n |    |-- outputType: string (nullable = true)\n |    |-- bBoxes: array (nullable = true)\n |    |    |-- element: struct (containsNull = true)\n |    |    |    |-- text: string (nullable = true)\n |    |    |    |-- score: float (nullable = true)\n |    |    |    |-- x: integer (nullable = true)\n |    |    |    |-- y: integer (nullable = true)\n |    |    |    |-- width: integer (nullable = true)\n |    |    |    |-- height: integer (nullable = true)\n |    |-- exception: string (nullable = true)\n```\n\n## Example of usage\n\n### Scala\n\n```scala\nimport org.apache.spark.sql.SparkSession\n\nval spark = SparkSession.builder()\n  .appName(\"Spark PDF Example\")\n  .master(\"local[*]\")\n  .config(\"spark.jars.packages\", \"com.stabrise:spark-pdf-spark35_2.12:0.1.15\")\n  .getOrCreate()\n  \nval df = 
spark.read.format(\"pdf\")\n  .option(\"imageType\", \"BINARY\")\n  .option(\"resolution\", \"200\")\n  .option(\"pagePerPartition\", \"2\")\n  .option(\"reader\", \"pdfBox\")\n  .option(\"ocrConfig\", \"psm=11\")\n  .load(\"path to the pdf file(s)\")\n\ndf.select(\"path\", \"document\").show()\n```\n\n### Python\n\n```python\nfrom pyspark.sql import SparkSession\n\nspark = SparkSession.builder \\\n    .master(\"local[*]\") \\\n    .appName(\"SparkPdf\") \\\n    .config(\"spark.jars.packages\", \"com.stabrise:spark-pdf-spark35_2.12:0.1.15\") \\\n    .getOrCreate()\n\ndf = spark.read.format(\"pdf\") \\\n    .option(\"imageType\", \"BINARY\") \\\n    .option(\"resolution\", \"200\") \\\n    .option(\"pagePerPartition\", \"2\") \\\n    .option(\"reader\", \"pdfBox\") \\\n    .option(\"ocrConfig\", \"psm=11\") \\\n    .load(\"path to the pdf file(s)\")\n\ndf.select(\"path\", \"document\").show()\n```\n\n## Disclaimer\n\nThis project is not affiliated with, endorsed by, or connected to the Apache Software Foundation or Apache Spark.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fstabrise%2Fspark-pdf","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fstabrise%2Fspark-pdf","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fstabrise%2Fspark-pdf/lists"}