https://github.com/stabrise/spark-pdf
PDF DataSource for Apache Spark
- Host: GitHub
- URL: https://github.com/stabrise/spark-pdf
- Owner: StabRise
- License: agpl-3.0
- Created: 2024-11-23T07:00:06.000Z (11 months ago)
- Default Branch: main
- Last Pushed: 2025-03-19T11:24:33.000Z (7 months ago)
- Last Synced: 2025-04-09T16:04:22.465Z (6 months ago)
- Topics: big-data, data-engineering, data-extraction, data-science, ocr, ocr-recognition, pdf, pdf-document, pdf-document-processor, spark, spark-datasource, tesseract, tesseract-ocr
- Language: Scala
- Homepage: https://stabrise.com/spark-pdf/
- Size: 7.56 MB
- Stars: 45
- Watchers: 2
- Forks: 4
- Open Issues: 5
Metadata Files:
- Readme: README.md
- License: LICENSE
---
⭐ Star us on GitHub — it motivates us a lot!
**Source Code**: [https://github.com/StabRise/spark-pdf](https://github.com/StabRise/spark-pdf)
**Quick Start Jupyter Notebook Spark 3.5.x on Databricks**: [PdfDataSourceDatabricks.ipynb](https://github.com/StabRise/spark-pdf/blob/main/examples/PdfDataSourceDatabricks.ipynb)
**Quick Start Jupyter Notebook Spark 3.x.x**: [PdfDataSource.ipynb](https://github.com/StabRise/spark-pdf/blob/main/examples/PdfDataSource.ipynb)
**Quick Start Jupyter Notebook Spark 4.0.x**: [PdfDataSourceSpark4.ipynb](https://github.com/StabRise/spark-pdf/blob/main/examples/PdfDataSourceSpark4.ipynb)
**With Spark Connect**: [PdfDataSourceSparkConnect.ipynb](https://github.com/StabRise/spark-pdf/blob/main/examples/PdfDataSourceSparkConnect.ipynb)
---
## Welcome to Spark PDF
The project provides a custom data source for [Apache Spark](https://spark.apache.org/) that lets you read PDF files into a Spark DataFrame.
If you find this project useful, please give the repository a star.
👉 Works on Databricks now. See the [Databricks example](https://github.com/StabRise/spark-pdf/blob/main/examples/PdfDataSourceDatabricks.ipynb).
## Key features:
- Read PDF documents into a Spark DataFrame
- Read PDF files efficiently, loading pages lazily
- Support large files, up to 10k pages
- Support scanned PDF files (runs OCR to recognize text in the page images)
- No need to install Tesseract OCR; it is included in the package
- 👉 Compatible with [ScaleDP](https://github.com/StabRise/ScaleDP), an Open-Source Library for Processing Documents using AI/ML in Apache Spark.
- Works with Spark Connect
## Requirements
- Java 8, 11, 17
- Apache Spark 3.3.2, 3.4.1, 3.5.0, 4.0.0
- Ghostscript 9.50 or later (only for the Ghostscript reader)

Spark 4.0.0 is supported in version `0.1.11` and later (requires Java 17 and Scala 2.13).
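The version constraints above can be encoded as a small pre-flight check. The sketch below is illustrative only: `is_supported_combo` is a hypothetical helper, not part of spark-pdf. It assumes that any patch release within a listed Spark minor line (for example 3.5.x) behaves like the listed version, and that the Spark 3.x lines work with any of the listed Java versions.

```python
# Hypothetical pre-flight check derived from the Requirements list above.
# Not part of spark-pdf; matching on the minor line is an assumption.

SUPPORTED_JAVA = {8, 11, 17}
SUPPORTED_SPARK_MINORS = {"3.3", "3.4", "3.5", "4.0"}


def is_supported_combo(spark_version: str, java_major: int) -> bool:
    """Return True if the Spark/Java pair matches the listed requirements."""
    parts = spark_version.split(".")
    if len(parts) < 2:
        return False
    minor_line = ".".join(parts[:2])
    if minor_line not in SUPPORTED_SPARK_MINORS:
        return False
    if java_major not in SUPPORTED_JAVA:
        return False
    # Spark 4.0.x needs Java 17 according to the note above.
    if minor_line == "4.0":
        return java_major == 17
    return True
```

In a PySpark environment, the Spark version string could be taken from `pyspark.__version__`.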
## Installation
Binary packages are available in the Maven Central Repository.
- **Spark 3.5.***: com.stabrise:spark-pdf-spark35_2.12:0.1.15
- **Spark 3.4.***: com.stabrise:spark-pdf-spark34_2.12:0.1.11 (there is an issue with publishing a newer version)
- **Spark 3.3.***: com.stabrise:spark-pdf-spark33_2.12:0.1.15
- **Spark 4.0.***: com.stabrise:spark-pdf-spark40_2.13:0.1.15
## Options for the data source:
- `imageType`: Output image type. Can be: "BINARY", "GREY", "RGB". Default: "RGB".
- `resolution`: Resolution, in dpi, for rendering a PDF page to an image. Default: "300".
- `pagePerPartition`: Number of pages per partition in the Spark DataFrame. Default: "5".
- `reader`: PDF reader backend. Supports `pdfBox` (based on the PDFBox Java library) and `gs` (based on Ghostscript; requires Ghostscript to be installed on the system).
- `ocrConfig`: Tesseract OCR configuration. Default: "psm=3". For more information see [Tesseract OCR Params](TesseractParams.md)
## Output Columns in the DataFrame:
The DataFrame contains the following columns:
- `path`: path to the file
- `filename`: name of the file
- `page_number`: page number of the document
- `text`: extracted text from the text layer of the PDF page
- `image`: image representation of the page
- `document`: the OCR-extracted text from the rendered image (calls Tesseract OCR)
- `partition_number`: partition number

Output Schema:
```text
root
 |-- path: string (nullable = true)
 |-- filename: string (nullable = true)
 |-- page_number: integer (nullable = true)
 |-- partition_number: integer (nullable = true)
 |-- text: string (nullable = true)
 |-- image: struct (nullable = true)
 |    |-- path: string (nullable = true)
 |    |-- resolution: integer (nullable = true)
 |    |-- data: binary (nullable = true)
 |    |-- imageType: string (nullable = true)
 |    |-- exception: string (nullable = true)
 |    |-- height: integer (nullable = true)
 |    |-- width: integer (nullable = true)
 |-- document: struct (nullable = true)
 |    |-- path: string (nullable = true)
 |    |-- text: string (nullable = true)
 |    |-- outputType: string (nullable = true)
 |    |-- bBoxes: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- text: string (nullable = true)
 |    |    |    |-- score: float (nullable = true)
 |    |    |    |-- x: integer (nullable = true)
 |    |    |    |-- y: integer (nullable = true)
 |    |    |    |-- width: integer (nullable = true)
 |    |    |    |-- height: integer (nullable = true)
 |    |-- exception: string (nullable = true)
```
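The nested `document` struct can be unpacked once rows reach the driver. The following is a minimal sketch in plain Python, assuming each row has already been converted to a dictionary (for example with PySpark's `Row.asDict(recursive=True)`). The helper `words_above_score`, the sample row, and the threshold are made up for illustration. The README does not state the scale of `score`, so pick a threshold only after inspecting real output.

```python
from typing import Any, Dict, List


def words_above_score(row: Dict[str, Any], min_score: float) -> List[str]:
    """Collect texts of OCR boxes in row['document']['bBoxes'] with score >= min_score."""
    document = row.get("document") or {}
    boxes = document.get("bBoxes") or []
    return [
        box["text"]
        for box in boxes
        if box.get("score") is not None and box["score"] >= min_score
    ]


# Made-up row shaped like the schema above (only the fields used here).
sample_row = {
    "path": "file:/tmp/example.pdf",
    "document": {
        "text": "Hello wrld",
        "bBoxes": [
            {"text": "Hello", "score": 0.95, "x": 10, "y": 12, "width": 40, "height": 14},
            {"text": "wrld", "score": 0.40, "x": 55, "y": 12, "width": 38, "height": 14},
        ],
        "exception": "",
    },
}

print(words_above_score(sample_row, 0.5))  # prints ['Hello']
```

Rows whose `document` is null or has no boxes yield an empty list, so the helper can be applied to every page without extra checks.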
## Example of usage
### Scala
```scala
import org.apache.spark.sql.SparkSession

// Start a local session and pull the spark-pdf package from Maven Central
val spark = SparkSession.builder()
  .appName("Spark PDF Example")
  .master("local[*]")
  .config("spark.jars.packages", "com.stabrise:spark-pdf-spark35_2.12:0.1.15")
  .getOrCreate()

// Read the PDF file(s) with the "pdf" data source
val df = spark.read.format("pdf")
  .option("imageType", "BINARY")
  .option("resolution", "200")
  .option("pagePerPartition", "2")
  .option("reader", "pdfBox")
  .option("ocrConfig", "psm=11")
  .load("path to the pdf file(s)")

// Show the file path and the OCR result for each page
df.select("path", "document").show()
```
### Python
```python
from pyspark.sql import SparkSession

# Start a local session and pull the spark-pdf package from Maven Central
spark = SparkSession.builder \
    .master("local[*]") \
    .appName("SparkPdf") \
    .config("spark.jars.packages", "com.stabrise:spark-pdf-spark35_2.12:0.1.15") \
    .getOrCreate()

# Read the PDF file(s) with the "pdf" data source
df = spark.read.format("pdf") \
    .option("imageType", "BINARY") \
    .option("resolution", "200") \
    .option("pagePerPartition", "2") \
    .option("reader", "pdfBox") \
    .option("ocrConfig", "psm=11") \
    .load("path to the pdf file(s)")

# Show the file path and the OCR result for each page
df.select("path", "document").show()
```
## Disclaimer
This project is not affiliated with, endorsed by, or connected to the Apache Software Foundation or Apache Spark.