https://github.com/stabrise/.github
Document processing solutions
anonymization deidentification deidentify ocr pdf
- Host: GitHub
- URL: https://github.com/stabrise/.github
- Owner: StabRise
- Created: 2024-11-16T11:44:26.000Z
- Default Branch: main
- Last Pushed: 2025-03-19T11:30:29.000Z
- Last Synced: 2025-03-30T10:15:48.303Z
- Topics: anonymization, deidentification, deidentify, ocr, pdf
- Homepage: https://stabrise.com
- Size: 27.3 KB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Hi there 👋
StabRise - Document Processing Solutions
# Our projects
## PDF DataSource for Apache Spark
---
**Source Code**: [https://github.com/StabRise/spark-pdf](https://github.com/StabRise/spark-pdf)
**Home page**: [https://stabrise.com/spark-pdf/](https://stabrise.com/spark-pdf/)
**Quick Start Jupyter Notebook**: [https://github.com/StabRise/spark-pdf/blob/main/examples/PdfDataSource.ipynb](https://github.com/StabRise/spark-pdf/blob/main/examples/PdfDataSource.ipynb)
---
The project provides a custom data source for Apache Spark that lets you read PDF files into a Spark DataFrame.
### Key features:
- Reads PDF documents into a Spark DataFrame
- Supports lazy, per-page reading of PDF files
- Supports large files, up to 10k pages
- Supports scanned PDF files (via OCR)
- No need to install Tesseract OCR; it is included in the package
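
A minimal PySpark sketch of how reading PDFs through such a data source might look. The format name `"pdf"` and the option names `resolution` and `pagePerPartition` are assumptions here and should be checked against the spark-pdf documentation; the helper names `pdf_reader_options`, `read_pdfs`, and `demo` are hypothetical and not part of the package.

```python
from typing import Dict


def pdf_reader_options(resolution: int = 200, pages_per_partition: int = 2) -> Dict[str, str]:
    """Build reader options as strings.

    The option names are assumptions; verify them against the spark-pdf docs.
    """
    if resolution <= 0 or pages_per_partition <= 0:
        raise ValueError("resolution and pages_per_partition must be positive")
    return {
        "resolution": str(resolution),                  # DPI used when rendering pages
        "pagePerPartition": str(pages_per_partition),   # pages handled per Spark partition
    }


def read_pdfs(spark, path: str, options: Dict[str, str]):
    """Return a DataFrame over the PDFs at `path`.

    Spark evaluates lazily, so no page is processed until an action runs.
    """
    # "pdf" as the data source short name is an assumption.
    return spark.read.format("pdf").options(**options).load(path)


def demo() -> None:
    """Run only where PySpark and the spark-pdf package are available."""
    from pyspark.sql import SparkSession  # imported lazily so the helpers work without Spark

    spark = SparkSession.builder.appName("pdf-example").getOrCreate()
    df = read_pdfs(spark, "path/to/pdfs/*.pdf", pdf_reader_options())
    df.printSchema()
```

The spark-pdf package itself has to be on the Spark classpath (for example through `spark.jars.packages`) before `demo()` will work; the Quick Start notebook linked above shows the project's own setup.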
## ScaleDP
ScaleDP is an open-source library for processing documents using Apache Spark.
### Key features:
- Load PDF documents and images
- Extract text from PDF documents and images
- Extract images from PDF documents
- Run OCR on images and PDF documents
- Run named entity recognition (NER) on text extracted from PDF documents and images
- Visualize NER results
## De-Identify
De-Identify is a tool for de-identifying and anonymizing data.
### Supported formats
- text
- images
- PDF documents
- DICOM files
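
To illustrate what de-identification means for the plain-text case, here is a small Python sketch that masks e-mail addresses and phone numbers with regular expressions. It is not De-Identify's API: the `PATTERNS` table and the `mask_text` function are hypothetical, and real de-identification needs far broader coverage (names, addresses, identifiers) than two patterns.

```python
import re

# Illustrative patterns only; they do not cover all real-world formats.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}


def mask_text(text: str) -> str:
    """Replace every match of each pattern with a bracketed label."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


print(mask_text("Contact jane.doe@example.com or +1 555-123-4567."))
```

Replacing matches with a label rather than deleting them keeps the sentence readable and shows which kind of value was removed.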