https://github.com/jcaperella29/document-cleaning-pipeline
A python script for cleaning documents using a mix of machine learning and rules.
- Host: GitHub
- URL: https://github.com/jcaperella29/document-cleaning-pipeline
- Owner: jcaperella29
- Created: 2025-01-18T16:19:52.000Z (4 months ago)
- Default Branch: main
- Last Pushed: 2025-01-18T16:29:20.000Z (4 months ago)
- Last Synced: 2025-01-18T17:25:36.415Z (4 months ago)
- Language: Python
- Size: 3.94 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Document Cleaning Pipeline 🧹
This repository contains a Python-based pipeline for cleaning scanned document images. The pipeline leverages a **DnCNN-based convolutional neural network** for denoising, coupled with adaptive thresholding and post-processing, to generate clean, readable outputs that are ideal for both **human readability** and **text mining**.
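To make the adaptive thresholding step more concrete, here is a minimal pure-Python sketch of mean-based adaptive thresholding. It is an illustration only, not the repository's implementation: the function name `adaptive_threshold_mean` and the parameters `block` and `c` are assumptions, and a real pipeline would normally use an image-processing library rather than nested lists.

```python
def adaptive_threshold_mean(image, block=3, c=2):
    """Binarize a grayscale image given as a list of rows of 0-255 ints.

    Each pixel is compared with the mean of its block x block neighborhood
    (clipped at the image border) minus the constant c. Hypothetical helper
    for illustration; not taken from the repository.
    """
    if block < 1 or block % 2 == 0:
        raise ValueError("block must be a positive odd integer")
    height, width = len(image), len(image[0])
    half = block // 2
    result = []
    for y in range(height):
        row = []
        for x in range(width):
            # Collect the neighborhood, clipped to the image bounds.
            y0, y1 = max(0, y - half), min(height, y + half + 1)
            x0, x1 = max(0, x - half), min(width, x + half + 1)
            window = [image[j][i] for j in range(y0, y1) for i in range(x0, x1)]
            local_mean = sum(window) / len(window)
            # Pixels brighter than the local threshold become white (255),
            # everything else becomes black (0).
            row.append(255 if image[y][x] > local_mean - c else 0)
        result.append(row)
    return result


# A dark "ink" pixel on a bright background is kept as black.
page = [
    [200, 200, 200],
    [200, 50, 200],
    [200, 200, 200],
]
binary = adaptive_threshold_mean(page)
```

Because the threshold is computed per neighborhood rather than once for the whole page, uneven lighting or stains shift the local mean along with the text, which is what lets this kind of thresholding produce a uniform background.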
---
## **Features** ✨
- **Denoising with DnCNN**:
  Uses a pre-trained DnCNN (Denoising Convolutional Neural Network) to remove noise while preserving important text details.
- **Adaptive Thresholding**:
  Sharpens text, enhances contrast, and creates uniform backgrounds for better readability and machine processing.
- **PDF Conversion**:
  Converts cleaned images into grayscale, high-resolution PDFs for archival and text mining.
- **Batch Processing**:
  Processes all images in a folder and generates cleaned images and PDFs in bulk.

---
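The batch-processing feature can be sketched with the standard library alone. The example below is a hedged illustration, not the repository's code: `find_images`, `process_folder`, the `_cleaned` filename suffix, and the `clean_fn(src, dst)` callback signature are all assumptions introduced here. The callback stands in for whatever denoising and thresholding routine the pipeline actually applies.

```python
from pathlib import Path

# Assumed set of image extensions; adjust to match the real inputs.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".tif", ".tiff", ".bmp"}


def find_images(input_dir):
    """Return image files directly inside input_dir, sorted by path."""
    return sorted(
        p for p in Path(input_dir).iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )


def process_folder(input_dir, output_dir, clean_fn):
    """Call clean_fn(src_path, dst_path) for every image in input_dir.

    clean_fn is a hypothetical callback that reads src_path, cleans the
    image, and writes the result to dst_path. Returns the output paths.
    """
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for src in find_images(input_dir):
        dst = out / f"{src.stem}_cleaned{src.suffix}"
        clean_fn(src, dst)
        written.append(dst)
    return written
```

Passing the cleaning step in as a callback keeps the folder traversal independent of the model and image library, so the same loop could drive image output, PDF output, or both.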
## **Example Output**
### **Input (Noisy Image)**
### **Output (Cleaned Image)**
### **Output (Thresholded Binary)**
---
## **Installation** 🛠️
### 1. **Clone the Repository**
```bash
git clone https://github.com/jcaperella29/document-cleaning-pipeline.git
cd document-cleaning-pipeline