Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/definitelynotchirag/invoice-doppelganger
InvoiceDoppelganger is an advanced tool designed to compare and identify similarities between invoices
- Host: GitHub
- URL: https://github.com/definitelynotchirag/invoice-doppelganger
- Owner: definitelynotchirag
- Created: 2024-07-30T19:53:02.000Z (5 months ago)
- Default Branch: main
- Last Pushed: 2024-07-31T15:34:43.000Z (5 months ago)
- Last Synced: 2024-09-07T02:32:41.597Z (4 months ago)
- Language: Python
- Size: 1.2 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Invoice Doppelganger
**InvoiceDoppelganger** is a tool for comparing invoices and finding similarities between them. It uses text processing and image analysis to identify duplicate or closely related invoices, which helps keep document management accurate and efficient.
The program takes an input invoice as a PDF and compares it against a database of existing invoices based on content and similarity.
### Similarity Metrics Used:
1. Structure or Style of Tables in Invoices
2. PDF Metadata
3. Invoice Number
4. PDF Name
5. Image Similarity
6. Cosine Similarity of all Metrics
### Key Features:
- Highly Accurate
- Creation of Models to improve Performance
## 3-Step Process:
### Step-1: Feature Extraction
- Extract Text from PDF using PyPDF2
- Features extracted: **text, metadata, table styles from HTML, invoice number, date, PDF name**
- Analyze layout and structure using table styles
- Add the training data to the database (a list of feature vectors)
#### Libraries Used:
1. PyPDF2
2. pdfminer
3. sklearn
4. re
5. io
### Step-2: Calculate Similarity
- Compute the cosine similarity between the two extracted feature vectors.
- Compute the image similarity by converting each PDF into an image and comparing the images.
- Combine both similarities and return the result
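
The similarity step above can be sketched in a few lines. This is a minimal illustration, not the project's actual code: it computes cosine similarity with the standard library instead of sklearn, and the helper `combined_similarity` with its equal weighting is an assumption, since the README does not say how the two scores are combined.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here: a zero vector matches nothing
    return dot / (norm_a * norm_b)


def combined_similarity(feature_sim, image_sim, feature_weight=0.5):
    """Weighted average of the two scores (hypothetical; the weight is an assumption)."""
    return feature_weight * feature_sim + (1 - feature_weight) * image_sim


# Parallel vectors score close to 1, orthogonal vectors score 0.
same = cosine_similarity([1.0, 2.0, 0.0], [2.0, 4.0, 0.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])

# Combine a feature score with an image score assumed to come from elsewhere.
score = combined_similarity(same, 0.8)
```

A weighted average keeps the combined score in the 0 to 1 range as long as both inputs are in that range, which matches the score range the frontend reports.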
#### Libraries Used:
1. sklearn
2. imagehash
3. numpy
4. pdf2image
### Step-3: Compare with training data
- Compare the invoice with each entry in the training data, find the most similar invoice, and return its similarity score
### Final Step:
- Create a frontend using Streamlit
## Steps to Run Locally:
1. Clone the repository to your local machine using git
```bash
git clone https://github.com/definitelynotchirag/Invoice-Doppelganger
```
2. Install the required dependencies using
```bash
pip3 install -r requirements.txt
```
3. You are ready to run the program
4. **Before running, make sure to add the test data (the PDF invoices to test) and the training data to their respective folders**
**There are two ways of running this program:**
### 1. GUI(Streamlit)
- Run '**frontend.py**' through Streamlit
```bash
streamlit run frontend.py
```
- Select the input invoice you want to find a match for.
- Click the '**Find Most Similar Invoice**' button
- You will get the most similar invoice along with a similarity score from 0 to 1
### 2. Command-Line(Python)
- Run '**main.py**' using
```bash
python3 main.py
```
- It will display the input invoice, the most similar invoice, and the similarity score in the shell
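
Conceptually, the comparison against the training data is a nearest-neighbour lookup over stored feature vectors. The sketch below is a simplified illustration under stated assumptions, not the contents of `main.py`: the function `most_similar`, the dictionary layout, the file names, and the numeric vectors are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length numeric vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0


def most_similar(query, database):
    """Return (name, score) of the stored vector closest to `query`.

    `database` maps an invoice name to its feature vector (hypothetical layout).
    """
    if not database:
        raise ValueError("the training database is empty")
    scores = {name: cosine_similarity(query, vec) for name, vec in database.items()}
    best_name = max(scores, key=scores.get)
    return best_name, scores[best_name]


# Made-up feature vectors standing in for extracted invoice features.
training = {
    "invoice_a.pdf": [1.0, 0.0, 1.0],
    "invoice_b.pdf": [0.0, 1.0, 0.0],
}
name, score = most_similar([1.0, 0.0, 0.9], training)
print(f"Most similar invoice: {name} (similarity {score:.3f})")
```

A linear scan like this is simple and adequate for a small folder of training invoices; a larger collection would likely need an index structure instead.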