https://github.com/stabrise/.github

Document processing solutions
anonymization deidentification deidentify ocr pdf


# Hi there 👋

StabRise - Document Processing Solutions

# Our projects

## PDF DataSource for Apache Spark

---

**Source Code**: [https://github.com/StabRise/spark-pdf](https://github.com/StabRise/spark-pdf)

**Home page**: [https://stabrise.com/spark-pdf/](https://stabrise.com/spark-pdf/)

**Quick Start Jupyter Notebook**: [https://github.com/StabRise/spark-pdf/blob/main/examples/PdfDataSource.ipynb](https://github.com/StabRise/spark-pdf/blob/main/examples/PdfDataSource.ipynb)

---

The project provides a custom data source for Apache Spark that lets you read PDF files into a Spark DataFrame.

### Key features:

- Reads PDF documents into a Spark DataFrame
- Reads PDF files lazily, page by page
- Supports large files, up to 10k pages
- Supports scanned PDF files (via OCR)
- No need to install Tesseract OCR separately; it is included in the package
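
As a rough illustration of the features above, reading PDFs through the data source from PySpark might look like the sketch below. The helper name `read_pdfs` is hypothetical, and the format name `pdf` and the option names (`imageType`, `resolution`, `pagePerPartition`, `reader`) are assumptions; the Quick Start notebook linked above is the authoritative reference for the supported options.

```python
def read_pdfs(spark, path, resolution=200, pages_per_partition=2):
    """Build a DataFrame from PDF files via the PDF data source.

    `spark` is an existing SparkSession with the spark-pdf package
    available. The format name and all option names below are
    assumptions, not confirmed API.
    """
    return (
        spark.read.format("pdf")                               # assumed data source name
        .option("imageType", "BINARY")                         # assumed: type of the rendered page image
        .option("resolution", str(resolution))                 # assumed: rendering DPI for scanned pages
        .option("pagePerPartition", str(pages_per_partition))  # assumed: pages handled per partition
        .option("reader", "pdfBox")                            # assumed: backend used to parse the PDF
        .load(path)
    )


# Usage (requires pyspark and the spark-pdf package on the classpath):
# from pyspark.sql import SparkSession
# spark = SparkSession.builder.getOrCreate()
# df = read_pdfs(spark, "path/to/pdfs/*.pdf")
# df.printSchema()
```

Because Spark evaluates lazily, no pages should be read until an action such as `df.show()` or `df.count()` runs on the resulting DataFrame.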

## ScaleDP

ScaleDP is an open-source library for processing documents using Apache Spark.

### Key features:

- Load PDF documents/Images
- Extract text from PDF documents/Images
- Extract images from PDF documents
- OCR Images/PDF documents
- Run NER on text extracted from PDF documents/Images
- Visualize NER results

## De-Identify

De-Identify is a tool for de-identifying and anonymizing data.

### Supported formats

- Text
- Images
- PDF documents
- DICOM files