https://github.com/stabrise/.github

Document processing solutions
anonymization deidentification deidentify ocr pdf

Host: GitHub
URL: https://github.com/stabrise/.github
Owner: StabRise
Created: 2024-11-16T11:44:26.000Z (about 1 year ago)
Default Branch: main
Last Pushed: 2025-03-19T11:30:29.000Z (10 months ago)
Last Synced: 2025-03-30T10:15:48.303Z (9 months ago)
Topics: anonymization, deidentification, deidentify, ocr, pdf
Homepage: https://stabrise.com
Size: 27.3 KB
Stars: 1
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

README

# Hi there 👋

StabRise - Document Processing Solutions

# Our projects

## PDF DataSource for Apache Spark

---

**Source Code**: [https://github.com/StabRise/spark-pdf](https://github.com/StabRise/spark-pdf)

**Home page**: [https://stabrise.com/spark-pdf/](https://stabrise.com/spark-pdf/)

**Quick Start Jupyter Notebook**: [https://github.com/StabRise/spark-pdf/blob/main/examples/PdfDataSource.ipynb](https://github.com/StabRise/spark-pdf/blob/main/examples/PdfDataSource.ipynb)

---

The project provides a custom data source for Apache Spark that reads PDF files into a Spark DataFrame.

### Key features:

- Reads PDF documents into a Spark DataFrame
- Supports lazy, page-by-page reading of PDF files
- Supports large files, up to 10k pages
- Supports scanned PDF files (via OCR)
- No need to install Tesseract OCR; it is included in the package
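
As a rough sketch of how a data source like this is typically called from PySpark: the format name `pdf` and the option names `resolution` and `pagePerPartition` below are assumptions for illustration and are not confirmed by this README, so check the project home page for the actual names. The helper functions `pdf_reader_options` and `read_pdfs` are hypothetical.

```python
from typing import Any, Dict


def pdf_reader_options(resolution: int = 200, pages_per_partition: int = 2) -> Dict[str, str]:
    """Build reader options as strings, which is how Spark stores them.

    The option names here are assumptions, not confirmed by this README.
    """
    return {
        "resolution": str(resolution),
        "pagePerPartition": str(pages_per_partition),
    }


def read_pdfs(spark: Any, path: str, options: Dict[str, str]) -> Any:
    """Load PDF files into a DataFrame through the (assumed) "pdf" format name.

    `spark` is expected to be a SparkSession started with the spark-pdf
    package available on its classpath.
    """
    return spark.read.format("pdf").options(**options).load(path)
```

With such a session in hand, `read_pdfs(spark, "docs/*.pdf", pdf_reader_options())` would return a DataFrame. Because Spark evaluates lazily, pages are only processed once an action such as `show()` or `count()` runs.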

## ScaleDP

ScaleDP is an open-source library for processing documents with Apache Spark.

### Key features:

- Load PDF documents/Images
- Extract text from PDF documents/Images
- Extract images from PDF documents
- OCR Images/PDF documents
- Run NER on text extracted from PDF documents/Images
- Visualize NER results

## De-Identify

De-Identify is a tool for de-identifying and anonymizing data.

### Supported formats:

- Text
- Images
- PDF documents
- DICOM files
