https://github.com/stabrise/spark-pdf

PDF DataSource for Apache Spark
big-data data-engineering data-extraction data-science ocr ocr-recognition pdf pdf-document pdf-document-processor spark spark-datasource tesseract tesseract-ocr

Host: GitHub
URL: https://github.com/stabrise/spark-pdf
Owner: StabRise
License: agpl-3.0
Created: 2024-11-23T07:00:06.000Z (2 months ago)
Default Branch: main
Last Pushed: 2024-12-03T08:22:32.000Z (about 2 months ago)
Last Synced: 2024-12-03T08:25:07.101Z (about 2 months ago)
Topics: big-data, data-engineering, data-extraction, data-science, ocr, ocr-recognition, pdf, pdf-document, pdf-document-processor, spark, spark-datasource, tesseract, tesseract-ocr
Language: Scala
Homepage: https://stabrise.com/spark-pdf/
Size: 16.5 MB
Stars: 20
Watchers: 2
Forks: 0
Open Issues: 6
Metadata Files:
- Readme: README.md
- License: LICENSE

---

⭐ Star us on GitHub — it motivates us a lot!

**Source Code**: [https://github.com/StabRise/spark-pdf](https://github.com/StabRise/spark-pdf)

**Quick Start Jupyter Notebook**: [PdfDataSource.ipynb](https://github.com/StabRise/spark-pdf/blob/main/examples/PdfDataSource.ipynb)

---

## Welcome to Spark PDF

This project provides a custom data source for [Apache Spark](https://spark.apache.org/) that lets you read PDF files into a Spark DataFrame.

If you find this project useful, please star the repository.

## Key features:

- Read PDF documents into a Spark DataFrame

- Read PDF files lazily, page by page

- Support for large files, up to 10k pages

- Support for scanned PDF files (via OCR)

- No need to install Tesseract OCR; it is included in the package

## Requirements

- Java 8 or 11

- Apache Spark 3.4.0 (to request a build for another version, please file an issue)

- Ghostscript 9.50 or later (only for the Ghostscript reader)

## Installation

The binary package is available in the Maven Central Repository.

```
groupId: com.stabrise
artifactId: spark-pdf_2.12
```
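
Spark resolves Maven packages from a single `groupId:artifactId:version` string. The following is a minimal sketch of assembling that string for the `spark.jars.packages` setting. The helper name `maven_coordinate` is hypothetical, and the version `0.1.7` is an assumption taken from the usage examples in this README, so check Maven Central for the current release.

```python
def maven_coordinate(group_id: str, artifact_id: str, version: str) -> str:
    """Join Maven coordinates into the form expected by spark.jars.packages."""
    parts = (group_id, artifact_id, version)
    # Each part must be present and must not itself contain the separator.
    if any(not part or ":" in part for part in parts):
        raise ValueError("each coordinate part must be non-empty and contain no ':'")
    return ":".join(parts)


# Hypothetical helper; the version is an assumption, not a pinned recommendation.
SPARK_PDF_PACKAGE = maven_coordinate("com.stabrise", "spark-pdf_2.12", "0.1.7")
print(SPARK_PDF_PACKAGE)
```

The resulting string is what the examples pass as the value of `spark.jars.packages`.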

## Options for the data source:

- `imageType`: Output image type. One of "BINARY", "GREY", "RGB". Default: "RGB".

- `resolution`: Resolution, in dpi, for rendering a PDF page to an image. Default: "300".

- `pagePerPartition`: Number of pages per partition in the Spark DataFrame. Default: "5".

- `reader`: PDF reader to use. Supported values: `pdfBox` (based on the PdfBox Java library) and `gs` (based on Ghostscript; requires Ghostscript to be installed on the system).

- `ocrConfig`: Tesseract OCR configuration. Default: "psm=3". For more information, see [Tesseract OCR Params](TesseractParams.md).
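
Since every option is passed as a string, it can help to validate the values before handing them to Spark. The sketch below is one way to do that. The function `pdf_options` and its keyword names are hypothetical and not part of the package. Because no default for `reader` is documented above, the sketch leaves that option out unless it is set explicitly.

```python
ALLOWED_IMAGE_TYPES = {"BINARY", "GREY", "RGB"}
ALLOWED_READERS = {"pdfBox", "gs"}


def pdf_options(image_type="RGB", resolution=300, page_per_partition=5,
                reader=None, ocr_config="psm=3"):
    """Validate settings and return them as string options for the data source."""
    if image_type not in ALLOWED_IMAGE_TYPES:
        raise ValueError(f"imageType must be one of {sorted(ALLOWED_IMAGE_TYPES)}")
    if reader is not None and reader not in ALLOWED_READERS:
        raise ValueError(f"reader must be one of {sorted(ALLOWED_READERS)}")
    if int(resolution) <= 0:
        raise ValueError("resolution must be a positive number of dpi")
    if int(page_per_partition) <= 0:
        raise ValueError("pagePerPartition must be positive")

    options = {
        "imageType": image_type,
        "resolution": str(resolution),
        "pagePerPartition": str(page_per_partition),
        "ocrConfig": ocr_config,
    }
    # Only set the reader when requested; its default is not documented here.
    if reader is not None:
        options["reader"] = reader
    return options


opts = pdf_options(image_type="BINARY", resolution=200,
                   page_per_partition=2, reader="pdfBox")
print(opts)
```

In PySpark, a dictionary like this can be applied in one call with `spark.read.format("pdf").options(**opts)` before `load(...)`.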

## Output Columns in the DataFrame:

The DataFrame contains the following columns:

- `path`: path to the file

- `filename`: name of the file

- `page_number`: page number within the document

- `text`: text extracted from the text layer of the PDF page

- `image`: image representation of the page

- `document`: OCR result for the rendered page image (calls Tesseract OCR)

- `partition_number`: partition number
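
To get a feel for how `pagePerPartition` relates to the number of partitions, here is a rough back-of-the-envelope sketch. It assumes that the pages of a single file are split into consecutive groups of at most `pagePerPartition` pages. That is an assumption about the reader's behavior, not something this README states, and the helper name `expected_partitions` is hypothetical.

```python
def expected_partitions(page_count: int, page_per_partition: int = 5) -> int:
    """Estimate the partitions for one file: ceil(page_count / page_per_partition)."""
    if page_count < 0:
        raise ValueError("page_count must not be negative")
    if page_per_partition <= 0:
        raise ValueError("page_per_partition must be positive")
    # Integer ceiling division avoids floating-point rounding.
    return (page_count + page_per_partition - 1) // page_per_partition


print(expected_partitions(10_000))  # 2000 with the default of 5 pages per partition
print(expected_partitions(12, 5))   # 3: two full groups plus one group of 2 pages
```

Under that assumption, a larger `pagePerPartition` gives fewer, heavier partitions, and a smaller value gives more, lighter ones.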

Output Schema:

```
root
 |-- path: string (nullable = true)
 |-- filename: string (nullable = true)
 |-- page_number: integer (nullable = true)
 |-- partition_number: integer (nullable = true)
 |-- text: string (nullable = true)
 |-- image: struct (nullable = true)
 |    |-- path: string (nullable = true)
 |    |-- resolution: integer (nullable = true)
 |    |-- data: binary (nullable = true)
 |    |-- imageType: string (nullable = true)
 |    |-- exception: string (nullable = true)
 |    |-- height: integer (nullable = true)
 |    |-- width: integer (nullable = true)
 |-- document: struct (nullable = true)
 |    |-- path: string (nullable = true)
 |    |-- text: string (nullable = true)
 |    |-- outputType: string (nullable = true)
 |    |-- bBoxes: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- text: string (nullable = true)
 |    |    |    |-- score: float (nullable = true)
 |    |    |    |-- x: integer (nullable = true)
 |    |    |    |-- y: integer (nullable = true)
 |    |    |    |-- width: integer (nullable = true)
 |    |    |    |-- height: integer (nullable = true)
 |    |-- exception: string (nullable = true)
```
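
Because each row carries both the text layer (`text`) and the OCR result (`document.text`), a common follow-up step is to pick whichever one is usable. The sketch below works on a plain nested dictionary shaped like the schema above. The helper `best_text` is hypothetical, and it assumes that a missing or empty `exception` field means OCR succeeded.

```python
from typing import Any, Mapping, Optional


def best_text(row: Mapping[str, Any]) -> Optional[str]:
    """Prefer the embedded text layer; fall back to OCR text for scanned pages."""
    text = (row.get("text") or "").strip()
    if text:
        return text

    document = row.get("document") or {}
    # Assumption: a non-empty exception string signals that OCR failed.
    if document.get("exception"):
        return None
    ocr_text = (document.get("text") or "").strip()
    return ocr_text or None


digital_page = {"text": "Invoice 42", "document": {"text": "lnvoice 42", "exception": None}}
scanned_page = {"text": "", "document": {"text": "Scanned letter", "exception": None}}

print(best_text(digital_page))  # Invoice 42
print(best_text(scanned_page))  # Scanned letter
```

For a PySpark `Row`, `row.asDict(recursive=True)` yields a nested dictionary of this shape.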

## Example of usage

### Scala

```scala
import org.apache.spark.sql.SparkSession

// Start a local session and pull the spark-pdf package from Maven Central.
val spark = SparkSession.builder()
  .appName("Spark PDF Example")
  .master("local[*]")
  .config("spark.jars.packages", "com.stabrise:spark-pdf_2.12:0.1.7")
  .getOrCreate()

// Read the PDF file(s) with the "pdf" data source.
val df = spark.read.format("pdf")
  .option("imageType", "BINARY")
  .option("resolution", "200")
  .option("pagePerPartition", "2")
  .option("reader", "pdfBox")
  .load("path to the pdf file(s)")

// Show the source path and the OCR result for each page.
df.select("path", "document").show()
```

### Python

```python
from pyspark.sql import SparkSession

# Start a local session and pull the spark-pdf package from Maven Central.
spark = SparkSession.builder \
    .master("local[*]") \
    .appName("SparkPdf") \
    .config("spark.jars.packages", "com.stabrise:spark-pdf_2.12:0.1.7") \
    .getOrCreate()

# Read the PDF file(s) with the "pdf" data source.
df = spark.read.format("pdf") \
    .option("imageType", "BINARY") \
    .option("resolution", "200") \
    .option("pagePerPartition", "2") \
    .option("reader", "pdfBox") \
    .load("path to the pdf file(s)")

# Show the source path and the OCR result for each page.
df.select("path", "document").show()
```

## Disclaimer

This project is not affiliated with, endorsed by, or connected to the Apache Software Foundation or Apache Spark.