https://github.com/olk/ki-ocr-spreadsheet
convert a spreadsheet exported as JPEG into a CSV file containing spreadsheet's data by using a DNN (Tensorflow)
https://github.com/olk/ki-ocr-spreadsheet
artificial-intelligence cpp csv ctc-loss deep-learning dnn dnn-model jpeg keras neural-network opencv python spreadsheet tensorflow
Last synced: 3 months ago
JSON representation
convert a spreadsheet exported as JPEG into a CSV file containing spreadsheet's data by using a DNN (Tensorflow)
- Host: GitHub
- URL: https://github.com/olk/ki-ocr-spreadsheet
- Owner: olk
- License: gpl-3.0
- Created: 2019-11-03T14:48:09.000Z (over 5 years ago)
- Default Branch: master
- Last Pushed: 2019-12-01T10:01:22.000Z (over 5 years ago)
- Last Synced: 2025-01-31T23:29:28.194Z (4 months ago)
- Topics: artificial-intelligence, cpp, csv, ctc-loss, deep-learning, dnn, dnn-model, jpeg, keras, neural-network, opencv, python, spreadsheet, tensorflow
- Language: Python
- Homepage:
- Size: 52.6 MB
- Stars: 1
- Watchers: 2
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
ki-ocr-spreadsheet
==============================The projects aims to convert a spreadsheet exported as JPEG into a CSV file containing spreadsheet's data by using a DNN.
Project Structure
------------├── LICENSE
├── Makefile <- Makefile with commands like `make data` or `make train`
├── README.md <- Top-level README for developers using this project
├── data
│ ├── processed <- Final, canonical data sets for modeling
| | ├── train <- data used for training
| | └── test <- data used for testing
| | └── val <- data used for validation
| |
│ └── raw <- Original, immutable data dump
│
├── models <- Trained and serialized models, model predictions, or model summaries
│
└── src <- Source code for use in this project.
├── __init__.py <- Makes src a Python module
│
├── data <- Scripts and binaries to generate data
│ ├── generate.cpp <- generate spreadsheets (PDF + CSV)
│ ├── generate_data.py <- generate raw data (JPEG + CSV)
| ├── offapi.rdb <- db required by LibreOffice runtime
| ├── template.ods <- LibreOffice template required for data generation
| └── trim.cpp <- trim JPEG
│
├── features <- Scripts and binaries to turn raw data into features for modeling
│ ├── build_features.py <- extract features from data - generate training, validation and testing data sets
│ └── split.cpp <- split JPEG containing spreadsheet
│
└── models <- Scripts to train models and then use trained models to make
│ predictions
|── find_lr.py <- find best learning rate
|── generator.py <- data generator iterates over data set
|── lrf.py <- learning rate finder
|── model.py <- DNN model
|── test.py <- test model with testing data set
└── train.py <- train DNN model; find best learning rateProject based on the cookiecutter data science project template. #cookiecutterdatascience
Parameters
----------
- package `dotenv` is used to manage parameters
- file `.env` contains parameter:| symbol | value | description |
|-------------------|-------------------------|--------------------------------------------------------------|
| PATH_RAW | "data/raw" | relative ptah to generated raw data (PDF+CSV) |
| PATH_PROCESSED | "data/processed" | relative path to features (JPEGs+CSVs) |
| PATH_MODELS | "models" | relative path to DNN models + data |
| PATH_OFFAPI | "src/data/offapi.rdb" | relative path to LibreOffice RDB |
| PATH_OFFTMPL | "src/data/template.ods" | relative path to LibreOffice template file |
| BATCH_SIZE | 64 | |
| CHUNK_SIZE | 100 | amount of parallel jobs used for feature generation |
| DATA_SIZE | 10000 | amount of generated spreadsheets |
| DOWNSAMPLE_FACTOR | 4 | |
| EPOCHS | 150 | |
| IMAGE_HEIGHT | 32 | |
| IMAGE_WIDTH | 128 | |
| LEARNING_RATE | 1e-6 | learning rate found by learning rate finder |
| MAX_LEARNING_RATE | 1e+1 | maximal learning rate used by learning rate finder |
| MIN_LEARNING_RATE | 1e-10 | minimal learning rate used by learning rate finder |
| NGPUS | 2 | GPUs used (1 < NGPUS ? MirroredStrategy : OneDeviceStrategy) |
| TRAIN_FRAC | 0.8 | fraction of data used for training |
| VAL_FRAC | 0.1 | fraction oof data used for validation |Data Generation
---------------
- `make data` instruments LibreOffice to generate spreadsheets (parameter `DATASETS`)
- each spreadsheet is stored as PDF and CSV
- the spreadsheets contains random data that mimics financial data
- each PDF is converted to JPEG via `convert` (part of `ImageMagic`)
- each JPEG is trimmed to the size of the table
- the raw data are stored in `PATH_RAWA`Feature Extraction
------------------
- `make features` generates the features
- each JPEG is splitted into multiple JPEGs, each containing one spreadsheet cell
- the program above are forked
- in order to prevent resource exhaustion, the feature extraction is executed in batches of `CHUNK_SIZE` size
- each splitted JPEG is resized to 32x128 pixel by finding the contours of the text with `cv2.findContours()`
and resizing according to `cv2.copyMakeBoarder()` nd `cvs.resize()`
- the JPEGs are stored in directories `train`, `test` and `val` (below diretory `PATH_PROCESSED`)
- each of these folders contains a `labels.csv` file containg the mapping between the image (spreadsheet cell) and its content (text)DNN Model
---------
- `make train` trains the model; if 1 < `NCORS` the model is trained on multiple GPUs
- the model is serialized to file `model.h5` in `PATH_MODLES`- `make test` tests the accuracy of the model using data from directory `test`
- `make find-lr` finds the optimal learning rate using class `LRFinder`
- `LRFinder` measures the loss from `MIN_LEARNING_RATE` till `MAX_LEARNING_RATE`
- two plots are generated and stored in directory `PATH_MODELS`
- the plot shows shows that the steepest descent is around `1e-6`

- the Minimum (strongest change of loss/descent) achieves a learning rate at `1e-6`
Converting
----------
- `python predict.py --file --outdir ` converts a spreadsheet given as JPEG into a CSV file containing spreasheet's dataOpenOffice spreadsheed exported as JPEG:

CSV converted from JPEG via DNN:
| Symbol | High | Low | Now | Sell Target Potential | Worst-Case Drawdowns | Range Index | Win Odds/100 | % Payoff | Days Held | Annual Rate of Return | Sample Size | | Creadible Ratio | Rwd~Rsk Ratio | Wghted |
|--------|--------|--------|--------|-----------------------|----------------------|-------------|--------------|----------|-----------|-----------------------|-------------|------|-----------------|---------------|--------|
| CJOP | $12.65 | $9.74 | $11.19 | 13.05% | -13.30% | 85 | 17 | 5.58% | 42 | 606.63% | 530 | 461 | 0.01 | 2.2 | 63.7 |
| RXKX | $17.49 | $14.31 | $15.90 | 10.00% | -7.88% | 32 | 53 | 1.44% | 63 | 444.62% | 110 | 418 | 1.76 | 0.2 | 28.6 |
| GSWS | $29.20 | $26.95 | $28.08 | 3.99% | -14.88% | 24 | 91 | 48.27% | 73 | 327.94% | 144 | 998 | 0.85 | 2.7 | 11.7 |
| KUZ | $1.61 | $.90 | $1.26 | 27.78% | -11.74% | -12 | 95 | 19.28% | 13 | 135.83% | 224 | 1233 | 0.63 | 3.8 | 23.8 |
| FAG | $23.62 | $16.41 | $20.01 | 18.04% | -14.08% | 78 | 27 | 7.14% | 89 | 102.13% | 207 | 311 | 4.33 | 2.9 | 9.0 |
| SFRX | $26.04 | $18.09 | $22.07 | 17.99% | -13.56% | 54 | 62 | 11.58% | 18 | 330.45% | 397 | 1215 | 3.99 | 4.4 | 27.6 |
| AID | $9.67 | $6.86 | $8.27 | 16.93% | -19.57% | 86 | 74 | 44.92% | 23 | 463.69% | 194 | 599 | 3.85 | 4.3 | 98.1 |
| DHL | $17.02 | $11.11 | $14.07 | 20.97% | -18.08% | 25 | 47 | 39.44% | 47 | 206.28% | 519 | 483 | 2.19 | 0.4 | 40.4 |
| KLS | $6.46 | $4.67 | $5.56 | 16.19% | -17.76% | 43 | 82 | 3.88% | 49 | 458.18% | 190 | 829 | 2.26 | 0.9 | 58.9 |
| FEN | $33.32 | $19.56 | $26.44 | 26.02% | -13.87% | -12 | 68 | 46.09% | 64 | 576.61% | 314 | 1166 | 3.64 | 3.5 | 8.8 |
| QIO | $4.47 | $3.30 | $3.88 | 15.21% | -8.78% | -11 | 1 | 16.07% | 88 | 686.48% | 70 | 1066 | 0.15 | 1.2 | 31.5 |
| FBZH | $4.45 | $3.50 | $3.98 | 11.81% | -15.38% | 54 | 44 | 46.87% | 1 | 167.75% | 323 | 489 | 4.43 | 0.5 | 75.8 |
| XKRP | $31.23 | $24.04 | $27.64 | 12.99% | -6.62% | 53 | 76 | 30.83% | 99 | 458.96% | 392 | 945 | 0.57 | 1.6 | 15.9 |
| PDFI | $7.46 | $6.10 | $6.78 | 10.03% | -8.08% | -14 | 23 | 28.14% | 74 | 280.57% | 569 | 947 | 3.62 | 3.7 | 75.2 |
| IWR | $31.74 | $24.43 | $28.08 | 13.03% | -6.64% | 82 | 50 | 42.10% | 13 | 156.21% | 50 | 743 | 4.55 | 4.9 | 50.2 |
| HTVM | $12.78 | $8.88 | $10.83 | 18.01% | -7.10% | 75 | 92 | 17.29% | 85 | 275.33% | 254 | 547 | 4.27 | 3.7 | 89.7 |
| IFBC | $29.39 | $27.12 | $28.25 | 4.04% | -8.72% | 0 | 5 | 7.31% | 7 | 235.80% | 661 | 726 | 2.18 | 2.2 | 66.3 |
| DRN | $11.24 | $7.18 | $9.21 | 22.04% | -16.98% | -1 | 67 | 17.96% | 85 | 611.87% | 249 | 1213 | 4.88 | 4.7 | 18.2 |
| BHV | $7.51 | $4.50 | $6.01 | 24.96% | -11.35% | 27 | 87 | 39.28% | 21 | 630.71% | 593 | 1058 | 2.97 | 1.3 | 51.7 |
| TMJ | $32.56 | $23.58 | $28.07 | 16.00% | -11.96% | 14 | 83 | 22.54% | 53 | 372.78% | 516 | 308 | 1.68 | 1.9 | 97.0 |