https://github.com/PrincySinghal/Document-classification-and-Data-extraction
Splitting and classifying documents from a PDF or image containing five classes of documents (Aadhaar card, PAN, etc.), followed by information retrieval from each document.
cnn deep-neural-networks ocr python sequential-models
- Host: GitHub
- URL: https://github.com/PrincySinghal/Document-classification-and-Data-extraction
- Owner: PrincySinghal
- Created: 2023-03-13T20:01:49.000Z (about 2 years ago)
- Default Branch: main
- Last Pushed: 2023-07-28T13:57:24.000Z (almost 2 years ago)
- Last Synced: 2024-11-06T00:39:31.307Z (6 months ago)
- Topics: cnn, deep-neural-networks, ocr, python, sequential-models
- Language: Jupyter Notebook
- Homepage:
- Size: 11.7 MB
- Stars: 8
- Watchers: 1
- Forks: 3
- Open Issues: 1
Metadata Files:
- Readme: README.md
## Document-Classification-and-Data-Extraction
Table of Contents
About The Project
Salient Features
Description
Data Preprocessing
Document Classification Model
Results
Information extraction model
Team
## About the project
We propose a pipeline that can recognise the set of documents contained in a PDF or image made up of multiple documents. The input PDF is first split into individual pages, and a CNN model classifies each page into the appropriate document category. The data in each document is then extracted using OCR (optical character recognition). The pipeline is proposed for five document classes, including voter ID, driving licence, PAN, and Aadhaar.
Each page of the input PDF must contain a single document, with the exception of the front and back sides of the same document.
Our document classification model achieved an accuracy of 0.7342 on the training set and 0.7736 on the validation set, with corresponding losses of 0.6923 and 0.8340.

### Salient Features
- Hyperparameter tuning
- Regularization (early stopping)
- Document splitting
### Tech stack used
* Models: CNN and OCR
* Framework: Keras

### Methodology
### Data Description
When we began searching for an appropriate dataset, we observed that there is no publicly available dataset of identity documents, as they hold sensitive and personal information. However, we came across a dataset on Kaggle consisting of six folders: Aadhaar Card, PAN Card, Voter ID, single-page Gas Bill, Passport, and Driver's License. We added a few more images to each folder; these were our own documents that we manually scanned, with the rest coming from Google Images.
Thus, these are the five documents we are classifying and extracting information from.

### Data Preprocessing
Before model training, we augmented the data with random horizontal and vertical flips, which further increased the size and diversity of the dataset. The categorical values of the labels column were converted to numerical values using one-hot encoding.

### Document Classification Model
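The two preprocessing steps above can be sketched as follows. This is an illustration in plain NumPy; the notebook itself presumably uses Keras utilities (such as `ImageDataGenerator` and `to_categorical`) to the same effect, and the function names here are hypothetical.

```python
# Illustrative sketch of the preprocessing steps using NumPy. The notebook
# presumably uses Keras utilities for the same effect; these helpers are
# hypothetical stand-ins.
import numpy as np


def augment_with_flips(images, rng):
    """Randomly flip each image horizontally and/or vertically."""
    out = []
    for img in images:
        if rng.random() < 0.5:
            img = np.fliplr(img)   # horizontal flip
        if rng.random() < 0.5:
            img = np.flipud(img)   # vertical flip
        out.append(img)
    return np.stack(out)


def one_hot(labels, class_names):
    """Map string labels to one-hot vectors in the order of class_names."""
    index = {name: i for i, name in enumerate(class_names)}
    encoded = np.zeros((len(labels), len(class_names)), dtype=np.float32)
    for row, label in enumerate(labels):
        encoded[row, index[label]] = 1.0
    return encoded
```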
Various hyperparameters (number of layers, neurons per layer, number of filters, kernel size, the dropout probability p, number of epochs, batch size, etc.) were tuned until satisfactory training and validation accuracy was achieved.
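The early-stopping regularization listed under salient features amounts to halting training once the validation loss stops improving for a set number of epochs. Keras provides this via its `EarlyStopping` callback; the sketch below restates the same patience logic in plain Python for clarity (the function name is hypothetical).

```python
# Minimal early-stopping logic: stop training once validation loss has not
# improved for `patience` consecutive epochs. Keras offers this behaviour
# via keras.callbacks.EarlyStopping; this is a plain-Python illustration.
def early_stopping_epoch(val_losses, patience=3):
    """Return the epoch index at which training would stop, or
    len(val_losses) if the patience budget is never exhausted."""
    best = float("inf")
    waited = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            waited = 0
        else:
            waited += 1
            if waited >= patience:
                return epoch
    return len(val_losses)
```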
### The final Model and results
### Information extraction model
The following OCR steps are applied to each image:
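As an illustration of the extraction stage, the sketch below pulls typical ID-number patterns out of OCR text with regular expressions, under the assumption that an OCR engine (e.g. Tesseract) has already produced raw text. The regex formats are general knowledge (a PAN is five uppercase letters, four digits, and one letter; an Aadhaar number is 12 digits, commonly printed in groups of four), not details taken from the repository, and `extract_ids` is a hypothetical helper.

```python
# Hypothetical post-OCR field extraction: once an OCR engine has produced
# raw text, document numbers can be pulled out with regular expressions.
import re

# PAN format: 5 uppercase letters, 4 digits, 1 uppercase letter.
PAN_RE = re.compile(r"\b[A-Z]{5}[0-9]{4}[A-Z]\b")
# Aadhaar format: 12 digits, usually printed as three groups of four.
AADHAAR_RE = re.compile(r"\b\d{4}\s?\d{4}\s?\d{4}\b")


def extract_ids(ocr_text):
    """Return any PAN and Aadhaar numbers found in OCR output."""
    return {
        "pan": PAN_RE.findall(ocr_text),
        "aadhaar": AADHAAR_RE.findall(ocr_text),
    }
```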
### Team
- Kanika Kanojia [GitHub](https://github.com) [LinkedIn](https://www.linkedin.com/in/kanika-kanojia-348620207/)
- Deepali Thakur [GitHub](https://github.com/deepalii05) [LinkedIn](https://www.linkedin.com/in/deepali-thakur/)
- Princy Singhal [GitHub](https://github.com/PrincySinghal) [LinkedIn](https://www.linkedin.com/in/princy-singhal-047414224/)