{"id":17289683,"url":"https://github.com/olk/ki-ocr-spreadsheet","last_synced_at":"2026-04-09T11:41:51.037Z","repository":{"id":150515147,"uuid":"219316092","full_name":"olk/ki-ocr-spreadsheet","owner":"olk","description":"convert a spreadsheet exported as JPEG into a CSV file containing spreadsheet's data by using a DNN (Tensorflow)","archived":false,"fork":false,"pushed_at":"2019-12-01T10:01:22.000Z","size":55148,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-01-31T23:29:28.194Z","etag":null,"topics":["artificial-intelligence","cpp","csv","ctc-loss","deep-learning","dnn","dnn-model","jpeg","keras","neural-network","opencv","python","spreadsheet","tensorflow"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/olk.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2019-11-03T14:48:09.000Z","updated_at":"2020-06-19T15:45:41.000Z","dependencies_parsed_at":"2023-06-25T23:07:53.544Z","dependency_job_id":null,"html_url":"https://github.com/olk/ki-ocr-spreadsheet","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/olk%2Fki-ocr-spreadsheet","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/olk%2Fki-ocr-spreadsheet/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/olk%2Fki-ocr-spreadsheet/releases","manifests_url":"https://repos
.ecosyste.ms/api/v1/hosts/GitHub/repositories/olk%2Fki-ocr-spreadsheet/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/olk","download_url":"https://codeload.github.com/olk/ki-ocr-spreadsheet/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":245708994,"owners_count":20659626,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["artificial-intelligence","cpp","csv","ctc-loss","deep-learning","dnn","dnn-model","jpeg","keras","neural-network","opencv","python","spreadsheet","tensorflow"],"created_at":"2024-10-15T10:35:30.006Z","updated_at":"2025-12-30T23:23:58.872Z","avatar_url":"https://github.com/olk.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"ki-ocr-spreadsheet\n==============================\n\nThe project aims to convert a spreadsheet exported as JPEG into a CSV file containing the spreadsheet's data by using a DNN.\n\n\nProject Structure\n------------\n\n    ├── LICENSE\n    ├── Makefile                 \u003c- Makefile with commands like `make data` or `make train`\n    ├── README.md                \u003c- Top-level README for developers using this project\n    ├── data\n    │   ├── processed            \u003c- Final, canonical data sets for modeling\n    |   |   ├── train            \u003c- data used for training\n    |   |   ├── test             \u003c- data used for testing\n    |   |   └── val              \u003c- data used for validation\n    |   |\n    │   └── raw                  \u003c- Original, immutable data dump\n    │\n    ├── 
models                   \u003c- Trained and serialized models, model predictions, or model summaries\n    │\n    └── src                      \u003c- Source code for use in this project.\n       ├── __init__.py           \u003c- Makes src a Python module\n       │\n       ├── data                  \u003c- Scripts and binaries to generate data\n       │   ├── generate.cpp      \u003c- generate spreadsheets (PDF + CSV)\n       │   ├── generate_data.py  \u003c- generate raw data (JPEG + CSV)\n       |   ├── offapi.rdb        \u003c- db required by LibreOffice runtime\n       |   ├── template.ods      \u003c- LibreOffice template required for data generation\n       |   └── trim.cpp          \u003c- trim JPEG\n       │\n       ├── features              \u003c- Scripts and binaries to turn raw data into features for modeling\n       │   ├── build_features.py \u003c- extract features from data - generate training, validation and testing data sets\n       │   └── split.cpp         \u003c- split JPEG containing spreadsheet\n       │\n       └── models                \u003c- Scripts to train models and then use trained models to make\n           │                        predictions\n           |── find_lr.py        \u003c- find best learning rate\n           |── generator.py      \u003c- data generator iterates over data set\n           |── lrf.py            \u003c- learning rate finder\n           |── model.py          \u003c- DNN model\n           |── test.py           \u003c- test model with testing data set\n           └── train.py          \u003c- train DNN model; find best learning rate\n\n\n\u003cp\u003e\u003csmall\u003eProject based on the \u003ca target=\"_blank\" href=\"https://drivendata.github.io/cookiecutter-data-science/\"\u003ecookiecutter data science project template\u003c/a\u003e. 
#cookiecutterdatascience\u003c/small\u003e\u003c/p\u003e\n\n\nParameters\n----------\n- package `dotenv` is used to manage parameters\n- file `.env` contains the parameters:\n\n | symbol            | value                   | description                                                  |\n |-------------------|-------------------------|--------------------------------------------------------------|\n | PATH_RAW          | \"data/raw\"              | relative path to generated raw data (PDF+CSV)                |\n | PATH_PROCESSED    | \"data/processed\"        | relative path to features (JPEGs+CSVs)                       |\n | PATH_MODELS       | \"models\"                | relative path to DNN models + data                           |\n | PATH_OFFAPI       | \"src/data/offapi.rdb\"   | relative path to LibreOffice RDB                             |\n | PATH_OFFTMPL      | \"src/data/template.ods\" | relative path to LibreOffice template file                   |\n | BATCH_SIZE        | 64                      |                                                              |\n | CHUNK_SIZE        | 100                     | number of parallel jobs used for feature generation          |\n | DATA_SIZE         | 10000                   | number of generated spreadsheets                             |\n | DOWNSAMPLE_FACTOR | 4                       |                                                              |\n | EPOCHS            | 150                     |                                                              |\n | IMAGE_HEIGHT      | 32                      |                                                              |\n | IMAGE_WIDTH       | 128                     |                                                              |\n | LEARNING_RATE     | 1e-6                    | learning rate found by learning rate finder                  |\n | MAX_LEARNING_RATE | 1e+1                    | maximum learning rate used by learning rate finder           |\n | 
MIN_LEARNING_RATE | 1e-10                   | minimum learning rate used by learning rate finder           |\n | NGPUS             | 2                       | GPUs used (1 \u003c NGPUS ? MirroredStrategy : OneDeviceStrategy) |\n | TRAIN_FRAC        | 0.8                     | fraction of data used for training                           |\n | VAL_FRAC          | 0.1                     | fraction of data used for validation                         |\n\n\nData Generation\n---------------\n- `make data` uses LibreOffice to generate spreadsheets (parameter `DATASETS`)\n- each spreadsheet is stored as PDF and CSV\n- the spreadsheets contain random data that mimics financial data\n- each PDF is converted to JPEG via `convert` (part of `ImageMagick`)\n- each JPEG is trimmed to the size of the table\n- the raw data are stored in `PATH_RAW`\n\n\nFeature Extraction\n------------------\n- `make features` generates the features\n- each JPEG is split into multiple JPEGs, each containing one spreadsheet cell\n- the program above is forked into parallel jobs\n- in order to prevent resource exhaustion, the feature extraction is executed in batches of size `CHUNK_SIZE`\n- each split JPEG is resized to 32x128 pixels by finding the contours of the text with `cv2.findContours()`\n  and resizing with `cv2.copyMakeBorder()` and `cv2.resize()`\n- the JPEGs are stored in directories `train`, `test` and `val` (below directory `PATH_PROCESSED`)\n- each of these folders contains a `labels.csv` file containing the mapping between the image (spreadsheet cell) and its content (text)\n\n\nDNN Model\n---------\n![Model](doc/model.png  \"DNN model\")\n\n- `make train` trains the model; if 1 \u003c `NGPUS` the model is trained on multiple GPUs\n- the model is serialized to file `model.h5` in `PATH_MODELS`\n\n- `make test` tests the accuracy of the model using data from directory `test`\n\n- `make find-lr` finds the optimal learning rate using class `LRFinder`\n- `LRFinder` measures the loss from 
`MIN_LEARNING_RATE` to `MAX_LEARNING_RATE`\n- two plots are generated and stored in directory `PATH_MODELS`\n\n![LR](doc/loss_plot.png  \"Learning Rate\")\n\n- the plot shows that the steepest descent is around `1e-6`\n\n![LRC](doc/loss_change_plot.png  \"Learning Rate Change\")\n\n- the minimum (the strongest decrease in loss) is reached at a learning rate of `1e-6`\n\n\nConverting\n----------\n- `python predict.py --file \u003cpath-to-jpeg\u003e --outdir \u003cpath-to-output\u003e` converts a spreadsheet given as JPEG into a CSV file containing the spreadsheet's data\n\nOpenOffice spreadsheet exported as JPEG:\n\n![JPEG](doc/1.jpg  \"spreadsheet as JPEG\")\n\n\nCSV converted from JPEG via DNN:\n\n| Symbol | High   | Low    | Now    | Sell Target Potential | Worst-Case Drawdowns | Range Index | Win Odds/100 | % Payoff | Days Held | Annual Rate of Return | Sample Size |      | Creadible Ratio | Rwd~Rsk Ratio | Wghted |\n|--------|--------|--------|--------|-----------------------|----------------------|-------------|--------------|----------|-----------|-----------------------|-------------|------|-----------------|---------------|--------|\n| CJOP   | $12.65 | $9.74  | $11.19 | 13.05%                | -13.30%              | 85          | 17           | 5.58%    | 42        | 606.63%               | 530         | 461  | 0.01            | 2.2           | 63.7   |\n| RXKX   | $17.49 | $14.31 | $15.90 | 10.00%                | -7.88%               | 32          | 53           | 1.44%    | 63        | 444.62%               | 110         | 418  | 1.76            | 0.2           | 28.6   |\n| GSWS   | $29.20 | $26.95 | $28.08 | 3.99%                 | -14.88%              | 24          | 91           | 48.27%   | 73        | 327.94%               | 144         | 998  | 0.85            | 2.7           | 11.7   |\n| KUZ    | $1.61  | $.90   | $1.26  | 27.78%                | -11.74%              | -12         | 95           | 19.28%   | 13        | 135.83%               | 
224         | 1233 | 0.63            | 3.8           | 23.8   |\n| FAG    | $23.62 | $16.41 | $20.01 | 18.04%                | -14.08%              | 78          | 27           | 7.14%    | 89        | 102.13%               | 207         | 311  | 4.33            | 2.9           | 9.0    |\n| SFRX   | $26.04 | $18.09 | $22.07 | 17.99%                | -13.56%              | 54          | 62           | 11.58%   | 18        | 330.45%               | 397         | 1215 | 3.99            | 4.4           | 27.6   |\n| AID    | $9.67  | $6.86  | $8.27  | 16.93%                | -19.57%              | 86          | 74           | 44.92%   | 23        | 463.69%               | 194         | 599  | 3.85            | 4.3           | 98.1   |\n| DHL    | $17.02 | $11.11 | $14.07 | 20.97%                | -18.08%              | 25          | 47           | 39.44%   | 47        | 206.28%               | 519         | 483  | 2.19            | 0.4           | 40.4   |\n| KLS    | $6.46  | $4.67  | $5.56  | 16.19%                | -17.76%              | 43          | 82           | 3.88%    | 49        | 458.18%               | 190         | 829  | 2.26            | 0.9           | 58.9   |\n| FEN    | $33.32 | $19.56 | $26.44 | 26.02%                | -13.87%              | -12         | 68           | 46.09%   | 64        | 576.61%               | 314         | 1166 | 3.64            | 3.5           | 8.8    |\n| QIO    | $4.47  | $3.30  | $3.88  | 15.21%                | -8.78%               | -11         | 1            | 16.07%   | 88        | 686.48%               | 70          | 1066 | 0.15            | 1.2           | 31.5   |\n| FBZH   | $4.45  | $3.50  | $3.98  | 11.81%                | -15.38%              | 54          | 44           | 46.87%   | 1         | 167.75%               | 323         | 489  | 4.43            | 0.5           | 75.8   |\n| XKRP   | $31.23 | $24.04 | $27.64 | 12.99%                | -6.62%               | 53          | 76           | 30.83%   | 
99        | 458.96%               | 392         | 945  | 0.57            | 1.6           | 15.9   |\n| PDFI   | $7.46  | $6.10  | $6.78  | 10.03%                | -8.08%               | -14         | 23           | 28.14%   | 74        | 280.57%               | 569         | 947  | 3.62            | 3.7           | 75.2   |\n| IWR    | $31.74 | $24.43 | $28.08 | 13.03%                | -6.64%               | 82          | 50           | 42.10%   | 13        | 156.21%               | 50          | 743  | 4.55            | 4.9           | 50.2   |\n| HTVM   | $12.78 | $8.88  | $10.83 | 18.01%                | -7.10%               | 75          | 92           | 17.29%   | 85        | 275.33%               | 254         | 547  | 4.27            | 3.7           | 89.7   |\n| IFBC   | $29.39 | $27.12 | $28.25 | 4.04%                 | -8.72%               | 0           | 5            | 7.31%    | 7         | 235.80%               | 661         | 726  | 2.18            | 2.2           | 66.3   |\n| DRN    | $11.24 | $7.18  | $9.21  | 22.04%                | -16.98%              | -1          | 67           | 17.96%   | 85        | 611.87%               | 249         | 1213 | 4.88            | 4.7           | 18.2   |\n| BHV    | $7.51  | $4.50  | $6.01  | 24.96%                | -11.35%              | 27          | 87           | 39.28%   | 21        | 630.71%               | 593         | 1058 | 2.97            | 1.3           | 51.7   |\n| TMJ    | $32.56 | $23.58 | $28.07 | 16.00%                | -11.96%              | 14          | 83           | 22.54%   | 53        | 372.78%               | 516         | 308  | 1.68            | 1.9           | 97.0   |\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Folk%2Fki-ocr-spreadsheet","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Folk%2Fki-ocr-spreadsheet","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Folk%2Fki-ocr-spreadsheet/lists"}