PDF DataSource for Apache Spark
- Host: GitHub
- URL: https://github.com/stabrise/spark-pdf
- Owner: StabRise
- License: agpl-3.0
- Created: 2024-11-23T07:00:06.000Z (2 months ago)
- Default Branch: main
- Last Pushed: 2024-12-03T08:22:32.000Z (about 2 months ago)
- Last Synced: 2024-12-03T08:25:07.101Z (about 2 months ago)
- Topics: big-data, data-engineering, data-extraction, data-science, ocr, ocr-recognition, pdf, pdf-document, pdf-document-processor, spark, spark-datasource, tesseract, tesseract-ocr
- Language: Scala
- Homepage: https://stabrise.com/spark-pdf/
- Size: 16.5 MB
- Stars: 20
- Watchers: 2
- Forks: 0
- Open Issues: 6
Metadata Files:
- Readme: README.md
- License: LICENSE
---
⭐ Star us on GitHub — it motivates us a lot!
**Source Code**: [https://github.com/StabRise/spark-pdf](https://github.com/StabRise/spark-pdf)
**Quick Start Jupyter Notebook**: [PdfDataSource.ipynb](https://github.com/StabRise/spark-pdf/blob/main/examples/PdfDataSource.ipynb)
---
## Welcome to the Spark PDF
The project provides a custom data source for [Apache Spark](https://spark.apache.org/) that lets you read PDF files into a Spark DataFrame.
If you find this project useful, please give the repository a star.
## Key features:
- Reads PDF documents into a Spark DataFrame
- Reads PDF files lazily, page by page
- Supports big files, up to 10k pages
- Supports scanned PDF files (via OCR)
- No need to install Tesseract OCR; it's included in the package
## Requirements
- Java 8 or 11
- Apache Spark 3.4.0 (to request a build for another version, please file an issue)
- Ghostscript 9.50 or later (only for the Ghostscript reader)
## Installation
The binary package is available in the Maven Central Repository.
```
groupId: com.stabrise
artifactId: spark-pdf_2.12
```
## Options for the data source:
- `imageType`: Output image type. Can be: "BINARY", "GREY", "RGB". Default: "RGB".
- `resolution`: Resolution for rendering a PDF page to an image, in dpi. Default: "300".
- `pagePerPartition`: Number of pages per partition in the Spark DataFrame. Default: "5".
- `reader`: PDF reader to use. Supported values: `pdfBox` (based on the PDFBox Java library) and `gs` (based on Ghostscript, which must be installed on the system).
- `ocrConfig`: Tesseract OCR configuration. Default: "psm=3". For more information, see [Tesseract OCR Params](TesseractParams.md).
## Output Columns in the DataFrame:
The DataFrame contains the following columns:
- `path`: path to the file
- `filename`: name of the file
- `page_number`: page number of the document
- `text`: extracted text from the text layer of the PDF page
- `image`: image representation of the page
- `document`: the OCR-extracted text from the rendered image (calls Tesseract OCR)
- `partition_number`: partition number

Output Schema:
```agsl
root
 |-- path: string (nullable = true)
 |-- filename: string (nullable = true)
 |-- page_number: integer (nullable = true)
 |-- partition_number: integer (nullable = true)
 |-- text: string (nullable = true)
 |-- image: struct (nullable = true)
 |    |-- path: string (nullable = true)
 |    |-- resolution: integer (nullable = true)
 |    |-- data: binary (nullable = true)
 |    |-- imageType: string (nullable = true)
 |    |-- exception: string (nullable = true)
 |    |-- height: integer (nullable = true)
 |    |-- width: integer (nullable = true)
 |-- document: struct (nullable = true)
 |    |-- path: string (nullable = true)
 |    |-- text: string (nullable = true)
 |    |-- outputType: string (nullable = true)
 |    |-- bBoxes: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- text: string (nullable = true)
 |    |    |    |-- score: float (nullable = true)
 |    |    |    |-- x: integer (nullable = true)
 |    |    |    |-- y: integer (nullable = true)
 |    |    |    |-- width: integer (nullable = true)
 |    |    |    |-- height: integer (nullable = true)
 |    |-- exception: string (nullable = true)
```
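As a rough illustration of how the documented options and output columns fit together, here is a minimal sketch in plain Python. The helpers `pdf_reader_options` and `page_text` are hypothetical and not part of spark-pdf. The option names, allowed values, and defaults are taken from the lists above; because no default is documented for `reader`, the sketch makes it a required argument. The "prefer the text layer, fall back to OCR" rule in `page_text` is an assumption about how you might want to consume the `text` and `document` columns, not library behavior.

```python
# Hypothetical helpers (not part of spark-pdf), using only plain Python data.

ALLOWED_IMAGE_TYPES = {"BINARY", "GREY", "RGB"}   # documented imageType values
ALLOWED_READERS = {"pdfBox", "gs"}                # documented reader values


def pdf_reader_options(reader, image_type="RGB", resolution=300,
                       page_per_partition=5, ocr_config="psm=3"):
    """Validate the values and return the options as a dict of strings."""
    if reader not in ALLOWED_READERS:
        raise ValueError(f"reader must be one of {sorted(ALLOWED_READERS)}")
    if image_type not in ALLOWED_IMAGE_TYPES:
        raise ValueError(f"imageType must be one of {sorted(ALLOWED_IMAGE_TYPES)}")
    if resolution <= 0 or page_per_partition <= 0:
        raise ValueError("resolution and pagePerPartition must be positive")
    return {
        "imageType": image_type,
        "resolution": str(resolution),
        "pagePerPartition": str(page_per_partition),
        "reader": reader,
        "ocrConfig": ocr_config,
    }


def page_text(row):
    """Return the text-layer content of a page, or the OCR text if it is empty.

    `row` is a dict shaped like one record of the output schema.
    """
    text = (row.get("text") or "").strip()
    if text:
        return text
    document = row.get("document") or {}
    return (document.get("text") or "").strip()
```

With a real Spark session, the dictionary could be passed to the reader with `spark.read.format("pdf").options(**opts)`, and `page_text` could be applied to `row.asDict(recursive=True)` for each collected row.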
## Example of usage
### Scala
```scala
import org.apache.spark.sql.SparkSession

// Start a local session and resolve the spark-pdf package by its Maven coordinates
val spark = SparkSession.builder()
  .appName("Spark PDF Example")
  .master("local[*]")
  .config("spark.jars.packages", "com.stabrise:spark-pdf_2.12:0.1.7")
  .getOrCreate()

// Read the PDF file(s) with the PDFBox-based reader
val df = spark.read.format("pdf")
  .option("imageType", "BINARY")
  .option("resolution", "200")
  .option("pagePerPartition", "2")
  .option("reader", "pdfBox")
  .load("path to the pdf file(s)")

// Show the file path and the OCR result for each page
df.select("path", "document").show()
```
### Python
```python
from pyspark.sql import SparkSession

# Start a local session and resolve the spark-pdf package by its Maven coordinates
spark = SparkSession.builder \
    .master("local[*]") \
    .appName("SparkPdf") \
    .config("spark.jars.packages", "com.stabrise:spark-pdf_2.12:0.1.7") \
    .getOrCreate()

# Read the PDF file(s) with the PDFBox-based reader
df = spark.read.format("pdf") \
    .option("imageType", "BINARY") \
    .option("resolution", "200") \
    .option("pagePerPartition", "2") \
    .option("reader", "pdfBox") \
    .load("path to the pdf file(s)")

# Show the file path and the OCR result for each page
df.select("path", "document").show()
```
## Disclaimer
This project is not affiliated with, endorsed by, or connected to the Apache Software Foundation or Apache Spark.