https://github.com/dashroshan/data-extractor

Extract and download key-value pairs, tables, and paragraphs from your scanned pdf, jpg, and png documents as CSV files.
document-extraction form-analysis key-value-pairs ocr-python table-extraction

Host: GitHub
URL: https://github.com/dashroshan/data-extractor
Owner: dashroshan
Created: 2023-06-17T14:16:45.000Z (over 2 years ago)
Default Branch: main
Last Pushed: 2023-06-17T14:39:33.000Z (over 2 years ago)
Last Synced: 2025-02-12T21:17:32.763Z (11 months ago)
Topics: document-extraction, form-analysis, key-value-pairs, ocr-python, table-extraction
Language: JavaScript
Homepage:
Size: 503 KB
Stars: 2
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

README

# Data Extractor

Extract and download key-value pairs, tables, and paragraphs from your scanned pdf, jpg, and png documents as CSV files.

## Tech stack

| Technology | Used for |
| -------------------------- | ----------------------------- |
| Flask | Backend |
| React + Tailwind + DaisyUI | Frontend |
| Azure FormRecognizer | Extracting data from document |
| Azure BlobStorage | Storing uploaded documents |

## Usage (as a webapp)

1. Run `npm i` in the frontend folder, followed by `npm run build`
2. Run `pip install -r requirements.txt` in the root folder
3. Create a `.env` file with the content below:

Create an Azure FormRecognizer service and copy the `Endpoint` and `KEY1` values from `Keys and Endpoint`. These become `ENDPOINT` and `KEY` respectively. Next, create an Azure storage account and create a container in it. Go to `Shared access tokens` and click `Generate SAS token and URL`. Copy the `Blob SAS URL`. The part to the left of `?` goes in `BLOB_ENDPOINT`, and the part to the right goes in `BLOB_QUERY` (the sample below keeps the leading `?` in `BLOB_QUERY` and ends `BLOB_ENDPOINT` with `/`).

```
ENDPOINT = "https://xyz.cognitiveservices.azure.com"
KEY = "12345something"
BLOB_ENDPOINT = "https://xyz.blob.core.windows.net/containerName/"
BLOB_QUERY = "?xyz=xyz&xyz=xyz..."
```

4. Run with `py main.py`
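
For step 3, splitting the `Blob SAS URL` by hand is easy to get wrong, so here is a minimal sketch that does it with the standard library. `split_sas_url` is a hypothetical helper, not part of this repo. It assumes the app wants the trailing `/` on `BLOB_ENDPOINT` and the leading `?` on `BLOB_QUERY`, as the sample `.env` above suggests.

```python
from urllib.parse import urlsplit


def split_sas_url(sas_url: str) -> tuple[str, str]:
    """Split a Blob SAS URL into (BLOB_ENDPOINT, BLOB_QUERY) values.

    Hypothetical helper: the trailing '/' and leading '?' mirror the
    sample .env and are assumptions about what the app expects.
    """
    parts = urlsplit(sas_url)
    if not parts.query:
        raise ValueError("expected a SAS URL containing a '?' query string")
    # Everything left of '?', normalised to end with a single '/'.
    endpoint = f"{parts.scheme}://{parts.netloc}{parts.path.rstrip('/')}/"
    # Everything right of '?', with the '?' put back in front.
    query = "?" + parts.query
    return endpoint, query


if __name__ == "__main__":
    # Placeholder URL, not a real token.
    endpoint, query = split_sas_url(
        "https://xyz.blob.core.windows.net/containerName?xyz=xyz&abc=abc"
    )
    print(f'BLOB_ENDPOINT = "{endpoint}"')
    print(f'BLOB_QUERY = "{query}"')
```

The two printed lines can be pasted into `.env` next to `ENDPOINT` and `KEY`.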

## Usage (as a script)

Run `py extract.py -i "input/file/path.pdf" -o "output/file/path.csv"`
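
To process a whole folder, the command above can be repeated once per file. The sketch below only builds those commands. `build_commands` and the folder names are hypothetical, and it assumes `extract.py` takes exactly the `-i` and `-o` flags shown above and is run from the repo root.

```python
import subprocess
import sys
from pathlib import Path

# File types the README lists as supported inputs.
SUPPORTED = {".pdf", ".jpg", ".png"}


def build_commands(input_dir: str, output_dir: str) -> list[list[str]]:
    """Return one extract.py command per supported file in input_dir."""
    commands = []
    for path in sorted(Path(input_dir).iterdir()):
        if path.suffix.lower() in SUPPORTED:
            # Files sharing a stem (a.pdf, a.png) would map to the same CSV.
            out = Path(output_dir) / (path.stem + ".csv")
            commands.append(
                [sys.executable, "extract.py", "-i", str(path), "-o", str(out)]
            )
    return commands


def run_all(input_dir: str, output_dir: str) -> None:
    """Run the commands one by one, stopping at the first failure."""
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    for cmd in build_commands(input_dir, output_dir):
        subprocess.run(cmd, check=True)
```

Calling `run_all("scans", "csv")` would write one CSV per scan into `csv/`. `sys.executable` stands in for `py` so the same interpreter that runs the sketch also runs `extract.py`.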
