{"id":25522270,"url":"https://github.com/stuartemiddleton/glosat_table_dataset","last_synced_at":"2025-04-11T01:32:23.452Z","repository":{"id":196068184,"uuid":"345948218","full_name":"stuartemiddleton/glosat_table_dataset","owner":"stuartemiddleton","description":"GloSAT Historical Measurement Table Dataset","archived":false,"fork":false,"pushed_at":"2025-02-14T14:08:45.000Z","size":14582,"stargazers_count":9,"open_issues_count":0,"forks_count":0,"subscribers_count":3,"default_branch":"main","last_synced_at":"2025-03-24T22:41:32.418Z","etag":null,"topics":["artificial-intelligence","dataset","document-layout-analysis","machine-learning","table-detection","table-structure-recognition"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/stuartemiddleton.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-03-09T09:16:44.000Z","updated_at":"2025-02-14T14:08:49.000Z","dependencies_parsed_at":null,"dependency_job_id":"9a38da59-1547-4312-882f-d8f29e48c0a2","html_url":"https://github.com/stuartemiddleton/glosat_table_dataset","commit_stats":null,"previous_names":["stuartemiddleton/glosat_table_dataset"],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/stuartemiddleton%2Fglosat_table_dataset","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/stuartemiddleton%2Fglosat_table_dataset/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/stuartemiddleton
%2Fglosat_table_dataset/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/stuartemiddleton%2Fglosat_table_dataset/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/stuartemiddleton","download_url":"https://codeload.github.com/stuartemiddleton/glosat_table_dataset/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248325117,"owners_count":21084870,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["artificial-intelligence","dataset","document-layout-analysis","machine-learning","table-detection","table-structure-recognition"],"created_at":"2025-02-19T18:19:05.794Z","updated_at":"2025-04-11T01:32:23.433Z","avatar_url":"https://github.com/stuartemiddleton.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"## GloSAT Historical Measurement Table Dataset\nDataset containing scanned historical measurement table documents from ship logs and land measurement stations. Annotations provided in this dataset are designed to allow fine-grained table detection and table structure recognition models to be trained and tested. Annotations are region boundaries for tables, cells, headings, headers and captions.\n\nThis dataset release includes code to train models on a training split, to use trained model checkpoints for inference and to evaluate inferred results on a test split. 
Pretrained models used in the published HIP-2021 paper are included in the dataset so results can be easily reproduced without training the model checkpoints yourself.\n\nInstructions and code can be found in this GitHub repository. Examples of some processed scanned pages from different model types can be seen in the examples folder.\n\nDataset and model checkpoint files can be downloaded from Zenodo. Dataset files should be checked into the datasets dir. Model checkpoint files should be checked into the models dir. Zenodo dataset https://doi.org/10.5281/zenodo.5363456\n\nA snapshot release of the GitHub site can also be downloaded from Zenodo.\n\nData sourced for a total of 500 annotated images. Original images sourced with permission from UK Met Office, US NOAA and weatherrescue.org (University of Reading).\n\n| Source ID; Region; Timeframe | Images / Tables / Headers | Page Style; Table Style |\n| ---------------------------- | ------------------------- | ----------------------- |\n| 20cr_DWR_MO; India; 1970s | 24 / 31 / 31 | Printed; Borderless |\n| 20cr_DWR_NOAA; India; 1930s | 24 / 24 / 24 | Printed; Semi-bordered |\n| 20cr_Kubota; Philippines; 1900s | 24 / 28 / 28 | Printed; Semi-bordered |\n| 20cr_Natal_Witness; Africa; 1870s | 26 / 26 / 26 | Printed; Semi-bordered |\n| Ben Nevis; UK; 1890s | 97 / 137 / 82 | Printed; Semi-bordered |\n| DWR; UK and world; 1900s | 93 / 139 / 139 | Mixed; Semi-bordered |\n| WesTech Rodgers; Arctic; 1880s | 82 / 164 / 82 | Mixed; Semi-bordered |\n| WR_10_years; UK; 1830s to 1930s | 97 / 129 / 129 | Mixed; Bordered |\n| WR_Devon_Extern; UK; 1890s to 1940s | 33 / 33 / 33 | Mixed; Bordered |\n| Total | 500 / 710 / 573 | |\n\nThis work can be cited as:\n\nZiomek, J. Middleton, S.E. 
GloSAT Historical Measurement Table Dataset: Enhanced Table Structure Recognition Annotation for Downstream Historical Data Rescue, 6th International Workshop on Historical Document Imaging and Processing (HIP-2021), Sept 5-6, 2021, Lausanne, Switzerland\n\nA pre-print of the HIP-2021 paper can be found on the authors website https://www.southampton.ac.uk/~sem03/HIP_2021.pdf\n\nThis work is part of the GloSAT project https://www.glosat.org/ and supported by the Natural Environment Research Council (NE/S015604/1). The authors acknowledge the use of the IRIDIS High Performance Computing Facility, and associated support services at the University of Southampton, in the completion of this work.\n\n# Installation under Ubuntu 20.04LTS\n\n```\ncd /data/glosat_table_dataset\n\n# install conda see https://docs.conda.io/projects/conda/en/latest/user-guide/install/linux.html\nwget https://repo.anaconda.com/archive/Anaconda3-2020.07-Linux-x86_64.sh\nchmod +x Anaconda3-2020.07-Linux-x86_64.sh\n./Anaconda3-2020.07-Linux-x86_64.sh\nconda list\n\n# create open-mmlab env in conda\nconda create --yes --use-local -n open-mmlab python=3.7 -y\nconda init bash\nconda config --set auto_activate_base false\nconda activate open-mmlab\n\n# check you have python 3.7 via conda\npython3 -V\n\n# install cuda and torch\nconda install --yes -c anaconda cudatoolkit=10.0\npython3 -m pip install --user torch==1.4.0 torchvision==0.5.0\n\n# install mmdetection\n# note: mmdetection tutorials are located at https://mmdetection.readthedocs.io/en/latest/index.html\n# note: delete the ./build dir if re-installing\npython3 -m pip install --user mmcv terminaltables\ngit clone --branch v1.2.0 https://github.com/open-mmlab/mmdetection.git\n\ncd /data/glosat_table_dataset/mmdetection\npython3 -m pip install --user -r requirements/optional.txt\nrm -rf build\npython3 -m pip install --user pillow==6.2.1\npython3 setup.py install --user\npython3 setup.py develop --user\npython3 -m pip install --user -r 
\"requirements.txt\"\npython3 -m pip install --user mmcv==0.4.3\n\n# In case, sklearn is not able to install use the following command\n# export SKLEARN_ALLOW_DEPRECATED_SKLEARN_PACKAGE_INSTALL=True\npython3 -m pip install --user sklearn #scikit-learn\n\npython3 -m pip install --user pycocotools\n\n# manually install the mmcv model file hrnetv2_w32-dc9eeb4f.pth as later versions of mmcv have changed the download link from AWS to aliyun cloud provider (so it breaks on download from old AWS link)\ncp /data/glosat_table_dataset/models/hrnetv2_w32-dc9eeb4f.pth /home/sem03/.cache/torch/checkpoints/hrnetv2_w32-dc9eeb4f.pth\n\n# install glosat dataset files into mmdetection\ncp -R /data/glosat_table_dataset/dla /data/glosat_table_dataset/mmdetection\ncd /data/glosat_table_dataset/mmdetection\npython3 dla/src/install.py\n\n# change default classes for voc to GloSAT table classes\nnano /data/glosat_table_dataset/mmdetection/mmdet/datasets/voc.py\n\tCLASSES = ('table_body','cell','full_table','header','heading')\n\n# prepare the VOC training data files for GloSAT dataset (course and fine)\ncd /data/glosat_table_dataset/datasets/GloSAT_dataset_coarse\nunzip 20cr_DWR_MO.zip\nunzip 20cr_DWR_NOAA.zip\nunzip 20cr_Kubota.zip\nunzip 20cr_Natal_Witnes.zip\nunzip Ben_Nevis.zip\nunzip DWR.zip\nunzip WesTech_Rodgers.zip\nunzip WR_10_years.zip\nunzip WR_Devon_Extern.zip\n\ncd /data/glosat_table_dataset/datasets/GloSAT_dataset_fine\nunzip 20cr_DWR_MO.zip\nunzip 20cr_DWR_NOAA.zip\nunzip 20cr_Kubota.zip\nunzip 20cr_Natal_Witnes.zip\nunzip Ben_Nevis.zip\nunzip DWR.zip\nunzip WesTech_Rodgers.zip\nunzip WR_10_years.zip\nunzip WR_Devon_Extern.zip\n```\n\n# Train models\n\nThis is only needed if you are not using the available pretrained model checkpoints.\n\n```\ncd /data/glosat_table_dataset/mmdetection\n\n#\n# Train \u003e\u003e Table Detection Model\n#\n\n\n# Table Detection Model \u003e\u003e full_table\ncd /data/glosat_table_dataset/mmdetection\nconda activate open-mmlab\n\nmkdir 
/data/glosat_table_dataset/dla_models\nnano dla/src/construct_VOC.py\n\tdataset_dir = '/data/glosat_table_dataset/datasets/GloSAT_dataset_coarse'\n\tmodel_dir = '/data/glosat_table_dataset/dla_models/model_table_det_full_table'\npython3 dla/src/construct_VOC.py\n\nls -la /data/glosat_table_dataset/dla_models/model_table_det_full_table_train/VOC2007\nls -la /data/glosat_table_dataset/dla_models/model_table_det_full_table_test/VOC2007\n\nnano dla/config/cascadeRCNN_full_table_only.py\n\tmodel_dir='/data/glosat_table_dataset/dla_models/model_table_det_full_table_train'\n\tresume_from = None\n\ttotal_epochs = 601\n\t# use fewer epochs for testing\nnohup python3 tools/train.py dla/config/cascadeRCNN_full_table_only.py --work_dir /data/glosat_table_dataset/dla_models/model_table_det_full_table_train \u003e /data/glosat_table_dataset/mmdetection/dla_train.log 2\u003e\u00261 \u0026\n\nls -la /data/glosat_table_dataset/dla_models/model_table_det_full_table_train/*.pth\ncp /data/glosat_table_dataset/dla_models/model_table_det_full_table_train/epoch_601.pth /data/glosat_table_dataset/dla_models/model_table_det_full_table_train/best_model.pth\nrm /data/glosat_table_dataset/dla_models/model_table_det_full_table_train/epoch_*.pth\n\n# resume training if needed from a specific epoch\npython3 tools/train.py dla/config/cascadeRCNN_full_table_only.py --work_dir /data/glosat_table_dataset/dla_models/model_table_det_full_table_train --resume_from /data/glosat_table_dataset/dla_models/model_table_det_full_table_train/epoch_50.pth\n\n# Table Detection Model \u003e\u003e full table, header, caption (enhanced)\ncd /data/glosat_table_dataset/mmdetection\nconda activate open-mmlab\n\nmkdir -p /data/glosat_table_dataset/dla_models\nnano dla/src/construct_VOC.py\n\tdataset_dir = '/data/glosat_table_dataset/datasets/GloSAT_dataset_coarse'\n\tmodel_dir = '/data/glosat_table_dataset/dla_models/model_table_det_enhanced'\npython3 dla/src/construct_VOC.py\n\nls -la 
/data/glosat_table_dataset/dla_models/model_table_det_enhanced_train/VOC2007\nls -la /data/glosat_table_dataset/dla_models/model_table_det_enhanced_test/VOC2007\n\nnano dla/config/cascadeRCNN_ignore_cells.py\n\tmodel_dir='/data/glosat_table_dataset/dla_models/model_table_det_enhanced_train'\n\tresume_from = None\n\ttotal_epochs = 601\n\t# use fewer epochs for testing\nnohup python3 tools/train.py dla/config/cascadeRCNN_ignore_cells.py --work_dir /data/glosat_table_dataset/dla_models/model_table_det_enhanced_train \u003e /data/glosat_table_dataset/mmdetection/dla_train.log 2\u003e\u00261 \u0026\n\nls -la /data/glosat_table_dataset/dla_models/model_table_det_enhanced_train/*.pth\ncp /data/glosat_table_dataset/dla_models/model_table_det_enhanced_train/epoch_601.pth /data/glosat_table_dataset/dla_models/model_table_det_enhanced_train/best_model.pth\nrm /data/glosat_table_dataset/dla_models/model_table_det_enhanced_train/epoch_*.pth\n\n#\n# Train \u003e\u003e Table Structure Recognition Model\n#\n\n# Table Structure Recognition Model \u003e\u003e coarse segmentation cells\ncd /data/glosat_table_dataset/mmdetection\nconda activate open-mmlab\n\nnano dla/src/construct_VOC.py\n\tdataset_dir = '/data/glosat_table_dataset/datasets/GloSAT_dataset_coarse'\n\tmodel_dir = '/data/glosat_table_dataset/dla_models/model_table_struct_coarse'\npython3 dla/src/construct_VOC.py\n\nls -la /data/glosat_table_dataset/dla_models/model_table_struct_coarse_train/VOC2007\nls -la /data/glosat_table_dataset/dla_models/model_table_struct_coarse_test/VOC2007\n\nnano dla/config/cascadeRCNN_ignore_all_but_cells.py\n\tmodel_dir='/data/glosat_table_dataset/dla_models/model_table_struct_coarse_train'\n\tresume_from = None\n\ttotal_epochs = 601\n\t# use fewer epochs for testing\nnohup python3 tools/train.py dla/config/cascadeRCNN_ignore_all_but_cells.py --work_dir /data/glosat_table_dataset/dla_models/model_table_struct_coarse_train \u003e /data/glosat_table_dataset/mmdetection/dla_train.log 2\u003e\u00261 
\u0026\n\nls -la /data/glosat_table_dataset/dla_models/model_table_struct_coarse_train/*.pth\ncp /data/glosat_table_dataset/dla_models/model_table_struct_coarse_train/epoch_601.pth /data/glosat_table_dataset/dla_models/model_table_struct_coarse_train/best_model.pth\nrm /data/glosat_table_dataset/dla_models/model_table_struct_coarse_train/epoch_*.pth\n\n\n# Table Structure Recognition Model \u003e\u003e individual cells (needs reduced memory model)\ncd /data/glosat_table_dataset/mmdetection\nconda activate open-mmlab\n\nnano dla/src/construct_VOC.py\n\tdataset_dir = '/data/glosat_table_dataset/datasets/GloSAT_dataset_fine'\n\tmodel_dir = '/data/glosat_table_dataset/dla_models/model_table_struct_fine'\npython3 dla/src/construct_VOC.py\n\nls -la /data/glosat_table_dataset/dla_models/model_table_struct_fine_train/VOC2007\nls -la /data/glosat_table_dataset/dla_models/model_table_struct_fine_test/VOC2007\n\nnano dla/config/cascadeRCNN_ignore_all_but_cells.py\n\tmodel_dir='/data/glosat_table_dataset/dla_models/model_table_struct_fine_train'\n\tresume_from = None\n\ttotal_epochs = 601\n\t# use fewer epochs for testing\n\ttype='CascadeRCNNFrozenRPN'\nnohup python3 tools/train.py dla/config/cascadeRCNN_ignore_all_but_cells.py --work_dir /data/glosat_table_dataset/dla_models/model_table_struct_fine_train \u003e /data/glosat_table_dataset/mmdetection/dla_train.log 2\u003e\u00261 \u0026\n\nls -la /data/glosat_table_dataset/dla_models/model_table_struct_fine_train/*.pth\ncp /data/glosat_table_dataset/dla_models/model_table_struct_fine_train/epoch_601.pth /data/glosat_table_dataset/dla_models/model_table_struct_fine_train/best_model.pth\nrm /data/glosat_table_dataset/dla_models/model_table_struct_fine_train/epoch_*.pth\n\n```\n\nTo change more advanced training settings, edit the cascadeRCNN.py config file.\nTo change the total number of epochs, edit the total_epochs variable (line 247).\nTo change the learning rate and optimiser settings, edit the optimizer 
dictionary (line 288).\nFull documentation of the training options can be found here: https://mmdetection.readthedocs.io/en/latest/getting_started.html#train-a-model.\n\nTo ignore a class, set the dataset type (line 192) to \"IgnoringVOCDataset\". This gives the option to add an ignore keyword in data pipelines, e.g. ignore = (\"cell\").\n\nTo reduce the memory footprint, use CascadeRCNNFrozen or CascadeRCNNFrozenRPN.\nThe first has all backbone layers frozen. The second also has the RPN network frozen.\nTo use them, change the 'type' key value in the model dictionary ('model = dict(...)') in the config file.\nThe type string should be changed to either 'CascadeRCNNFrozen' or 'CascadeRCNNFrozenRPN'.\n\n\n# Infer and evaluate using models\n\nCommands are provided for using both the available pretrained models and models trained using the previous section.\n\n```\ncd /data/glosat_table_dataset/mmdetection\nmkdir /data/glosat_table_dataset/dla_results\n\n#\n# Infer \u003e\u003e Pretrained Model \u003e\u003e Table Detection (GloSAT dataset Test split)\n#\n\n\n# Pretrained Model \u003e\u003e CascadeTabNet original model ( not reported in HIP 2021 paper, downloaded from https://github.com/DevashishPrasad/CascadeTabNet )\n# note: inference_original.py is needed not inference.py as the original CascadeTabNet model has different classes (Borderless, bordered etc.), so this separate script is needed to have correct labels.\ncd /data/glosat_table_dataset/mmdetection\nconda activate open-mmlab\nmkdir /data/glosat_table_dataset/dla_results/cascadetabnet_original_table_det\npython3 dla/src/inference_original.py /data/glosat_table_dataset/models/cascadetabnet_epoch_14.pth --load_from /data/glosat_table_dataset/datasets/Test/JPEGImages --out /data/glosat_table_dataset/dla_results/cascadetabnet_original_table_det/ --visual True\n\npython3 dla/src/eval_ICDAR.py /data/glosat_table_dataset/dla_results/cascadetabnet_original_table_det /data/glosat_table_dataset/datasets/Test/Coarse/ICDAR 
--IoU_threshold 0.1\n\n\n# Pretrained Model \u003e\u003e CascadeTabNet original model fine-tuned on GloSAT dataset Train split (fulltables only)\ncd /data/glosat_table_dataset/mmdetection\nconda activate open-mmlab\nmkdir /data/glosat_table_dataset/dla_results/cascadetabnet_GloSAT_fulltables_table_det\npython3 dla/src/inference.py /data/glosat_table_dataset/models/model_fulltables_only_GloSAT.pth --load_from /data/glosat_table_dataset/datasets/Test/JPEGImages --out /data/glosat_table_dataset/dla_results/cascadetabnet_GloSAT_fulltables_table_det/ --visual True\n\npython3 dla/src/eval_ICDAR.py /data/glosat_table_dataset/dla_results/cascadetabnet_GloSAT_fulltables_table_det /data/glosat_table_dataset/datasets/Test/Coarse/ICDAR --IoU_threshold 0.1\n\n# Pretrained Model \u003e\u003e CascadeTabNet original model fine-tuned on GloSAT dataset Train split (full table, header, caption)\ncd /data/glosat_table_dataset/mmdetection\nconda activate open-mmlab\nmkdir /data/glosat_table_dataset/dla_results/cascadetabnet_GloSAT_table_det\npython3 dla/src/inference.py /data/glosat_table_dataset/models/model_tables_enchanced_GloSAT.pth --load_from /data/glosat_table_dataset/datasets/Test/JPEGImages --out /data/glosat_table_dataset/dla_results/cascadetabnet_GloSAT_table_det/ --visual True\n\npython3 dla/src/eval_ICDAR.py /data/glosat_table_dataset/dla_results/cascadetabnet_GloSAT_table_det /data/glosat_table_dataset/datasets/Test/Coarse/ICDAR --IoU_threshold 0.1\n\n\n#\n# Infer \u003e\u003e Pretrained Model \u003e\u003e Table Detection (ICDAR dataset and aggregated dataset)\n#\n\n\n# Pretrained Model \u003e\u003e CascadeTabNet original model fine-tuned on cTDaR19 dataset Train split\n# model checkpoint = /data/glosat_table_dataset/models/model_tables_ICDAR.pth\n# download test data from ICDAR website ( https://zenodo.org/record/2649217#.YSjA2YhKiUk ) and follow the same procedure as for GloSAT models\n\n# Pretrained Model \u003e\u003e CascadeTabNet original model fine-tuned on 
aggregated dataset Train split\n# model checkpoint = /data/glosat_table_dataset/models/model_tables_both.pth\n# download image test data from ICDAR website ( https://zenodo.org/record/2649217#.YSjA2YhKiUk ), aggregate with GloSAT dataset Test split and follow the same procedure as for GloSAT models\n\n\n#\n# Infer \u003e\u003e Trained Model \u003e\u003e Table Detection\n#\n\n\n# Trained Model \u003e\u003e fulltable model\ncd /data/glosat_table_dataset/mmdetection\nconda activate open-mmlab\nmkdir /data/glosat_table_dataset/dla_results/trained_model_fulltables_table_det\npython3 dla/src/inference.py /data/glosat_table_dataset/dla_models/model_table_det_full_table_train/best_model.pth --load_from /data/glosat_table_dataset/datasets/Test/JPEGImages --out /data/glosat_table_dataset/dla_results/trained_model_fulltables_table_det/ --visual True\n\npython3 dla/src/eval_ICDAR.py /data/glosat_table_dataset/dla_results/trained_model_fulltables_table_det /data/glosat_table_dataset/datasets/Test/Coarse/ICDAR --IoU_threshold 0.1\n\n\n# Trained Model \u003e\u003e enhanced model\ncd /data/glosat_table_dataset/mmdetection\nconda activate open-mmlab\nmkdir /data/glosat_table_dataset/dla_results/trained_model_enhanced_table_det\npython3 dla/src/inference.py /data/glosat_table_dataset/dla_models/model_table_det_enhanced_train/best_model.pth --load_from /data/glosat_table_dataset/datasets/Test/JPEGImages --out /data/glosat_table_dataset/dla_results/trained_model_enhanced_table_det/ --visual True\n\npython3 dla/src/eval_ICDAR.py /data/glosat_table_dataset/dla_results/trained_model_enhanced_table_det /data/glosat_table_dataset/datasets/Test/Coarse/ICDAR --IoU_threshold 0.1\n\n\n#\n# Infer \u003e\u003e Pretrained Model \u003e\u003e Table Structure Recognition (GloSAT dataset Test split)\n#\n\n\n# Pretrained Model \u003e\u003e GloSAT (coarse segmentation cells) table detected \u003e\u003e CascadeTabNet original model fine-tuned on GloSAT dataset Train split (fulltables only, coarse cells, 
post-processing, table det)\ncd /data/glosat_table_dataset/mmdetection\nconda activate open-mmlab\nmkdir /data/glosat_table_dataset/dla_results/cascadetabnet_GloSAT_table_struct_coarse_cells\npython3 dla/src/inference.py /data/glosat_table_dataset/models/model_fulltables_only_GloSAT.pth --cell_checkpoint /data/glosat_table_dataset/models/model_coarsecell_GloSAT.pth --load_from /data/glosat_table_dataset/datasets/Test/JPEGImages --out /data/glosat_table_dataset/dla_results/cascadetabnet_GloSAT_table_struct_coarse_cells/ --visual True\n\npython3 dla/src/eval_ICDAR_wF1.py /data/glosat_table_dataset/dla_results/cascadetabnet_GloSAT_table_struct_coarse_cells /data/glosat_table_dataset/datasets/Test/Coarse/ICDAR\npython3 dla/src/eval_ICDAR.py /data/glosat_table_dataset/dla_results/cascadetabnet_GloSAT_table_struct_coarse_cells /data/glosat_table_dataset/datasets/Test/Coarse/ICDAR --IoU_threshold 0.5\npython3 dla/src/eval_rows_n_cols_only.py /data/glosat_table_dataset/dla_results/cascadetabnet_GloSAT_table_struct_coarse_cells /data/glosat_table_dataset/datasets/Test/Coarse/ICDAR --IoU_threshold 0.5\n\n\n# Pretrained Model \u003e\u003e GloSAT (individual cells) table detected \u003e\u003e CascadeTabNet original model fine-tuned on GloSAT dataset Train split (fulltables only, individual cells, post-processing, table det)\ncd /data/glosat_table_dataset/mmdetection\nconda activate open-mmlab\nmkdir /data/glosat_table_dataset/dla_results/cascadetabnet_GloSAT_table_struct\npython3 dla/src/inference.py /data/glosat_table_dataset/models/model_fulltables_only_GloSAT.pth --cell_checkpoint /data/glosat_table_dataset/models/model_finecell_GloSAT.pth --coarse_cell_checkpoint /data/glosat_table_dataset/models/model_coarsecell_GloSAT.pth --load_from /data/glosat_table_dataset/datasets/Test/JPEGImages --out /data/glosat_table_dataset/dla_results/cascadetabnet_GloSAT_table_struct/ --visual True\n\npython3 dla/src/eval_ICDAR_wF1.py 
/data/glosat_table_dataset/dla_results/cascadetabnet_GloSAT_table_struct /data/glosat_table_dataset/datasets/Test/Fine/ICDAR\npython3 dla/src/eval_ICDAR.py /data/glosat_table_dataset/dla_results/cascadetabnet_GloSAT_table_struct /data/glosat_table_dataset/datasets/Test/Fine/ICDAR --IoU_threshold 0.5\npython3 dla/src/eval_rows_n_cols_only.py /data/glosat_table_dataset/dla_results/cascadetabnet_GloSAT_table_struct /data/glosat_table_dataset/datasets/Test/Fine/ICDAR --IoU_threshold 0.5\n\n\n# Pretrained Model \u003e\u003e GloSAT (coarse segmentation cells) table provided \u003e\u003e CascadeTabNet original model fine-tuned on GloSAT dataset Train split (fulltables only, coarse cells, post-processing, table provided)\ncd /data/glosat_table_dataset/mmdetection\nconda activate open-mmlab\nmkdir /data/glosat_table_dataset/dla_results/cascadetabnet_GloSAT_table_struct_coarse_cells_table_provided\npython3 dla/src/inference_regiongiven.py /data/glosat_table_dataset/datasets/Test/Coarse/VOC_without_headercells --cell_checkpoint /data/glosat_table_dataset/models/model_coarsecell_GloSAT.pth --load_from /data/glosat_table_dataset/datasets/Test/JPEGImages --out /data/glosat_table_dataset/dla_results/cascadetabnet_GloSAT_table_struct_coarse_cells_table_provided/ --visual True\n\npython3 dla/src/eval_ICDAR_wF1.py /data/glosat_table_dataset/dla_results/cascadetabnet_GloSAT_table_struct_coarse_cells_table_provided /data/glosat_table_dataset/datasets/Test/Coarse/ICDAR\npython3 dla/src/eval_ICDAR.py /data/glosat_table_dataset/dla_results/cascadetabnet_GloSAT_table_struct_coarse_cells_table_provided /data/glosat_table_dataset/datasets/Test/Coarse/ICDAR --IoU_threshold 0.5\npython3 dla/src/eval_rows_n_cols_only.py /data/glosat_table_dataset/dla_results/cascadetabnet_GloSAT_table_struct_coarse_cells_table_provided /data/glosat_table_dataset/datasets/Test/Coarse/ICDAR --IoU_threshold 0.5\n\n\n# Pretrained Model \u003e\u003e GloSAT (individual cells) table provided \u003e\u003e 
CascadeTabNet original model fine-tuned on GloSAT dataset Train split (fulltables only, individual cells, post-processing, table provided)\ncd /data/glosat_table_dataset/mmdetection\nconda activate open-mmlab\nmkdir /data/glosat_table_dataset/dla_results/cascadetabnet_GloSAT_table_struct_table_provided\npython3 dla/src/inference_regiongiven.py /data/glosat_table_dataset/datasets/Test/Coarse/VOC_without_headercells --cell_checkpoint /data/glosat_table_dataset/models/model_finecell_GloSAT.pth --coarse_cell_checkpoint /data/glosat_table_dataset/models/model_coarsecell_GloSAT.pth --load_from /data/glosat_table_dataset/datasets/Test/JPEGImages --out /data/glosat_table_dataset/dla_results/cascadetabnet_GloSAT_table_struct_table_provided/ --visual True\n\npython3 dla/src/eval_ICDAR_wF1.py /data/glosat_table_dataset/dla_results/cascadetabnet_GloSAT_table_struct_table_provided /data/glosat_table_dataset/datasets/Test/Fine/ICDAR\npython3 dla/src/eval_ICDAR.py /data/glosat_table_dataset/dla_results/cascadetabnet_GloSAT_table_struct_table_provided /data/glosat_table_dataset/datasets/Test/Fine/ICDAR --IoU_threshold 0.5\npython3 dla/src/eval_rows_n_cols_only.py /data/glosat_table_dataset/dla_results/cascadetabnet_GloSAT_table_struct_table_provided /data/glosat_table_dataset/datasets/Test/Fine/ICDAR --IoU_threshold 0.5\n\n\n# Pretrained Model without any post processing \u003e\u003e GloSAT (coarse segmentation cells) table detected\n# note: no post processing means VOC format is output so needs a different eval script\ncd /data/glosat_table_dataset/mmdetection\nconda activate open-mmlab\nmkdir /data/glosat_table_dataset/dla_results/cascadetabnet_GloSAT_table_struct_coarse_cells_no_post\npython3 dla/src/inference.py /data/glosat_table_dataset/models/model_fulltables_only_GloSAT.pth --cell_checkpoint /data/glosat_table_dataset/models/model_coarsecell_GloSAT.pth --load_from /data/glosat_table_dataset/datasets/Test/JPEGImages --out 
/data/glosat_table_dataset/dla_results/cascadetabnet_GloSAT_table_struct_coarse_cells_no_post/ --visual True --raw_cells True\n\npython3 dla/src/eval_VOC_wF1.py /data/glosat_table_dataset/datasets/Test/Coarse/VOC_without_headercells ../dla_results/cascadetabnet_GloSAT_table_struct_coarse_cells_no_post\npython3 dla/src/eval_VOC.py /data/glosat_table_dataset/datasets/Test/Coarse/VOC_without_headercells ../dla_results/cascadetabnet_GloSAT_table_struct_coarse_cells_no_post --IoU_threshold 0.5\n\n\n# Pretrained Model without any post processing \u003e\u003e GloSAT (individual cells) table detected\n# note: no post processing means VOC format is output so needs a different eval script\ncd /data/glosat_table_dataset/mmdetection\nconda activate open-mmlab\nmkdir /data/glosat_table_dataset/dla_results/cascadetabnet_GloSAT_table_struct_no_post\npython3 dla/src/inference.py /data/glosat_table_dataset/models/model_fulltables_only_GloSAT.pth --cell_checkpoint /data/glosat_table_dataset/models/model_finecell_GloSAT.pth --coarse_cell_checkpoint /data/glosat_table_dataset/models/model_coarsecell_GloSAT.pth --load_from /data/glosat_table_dataset/datasets/Test/JPEGImages --out /data/glosat_table_dataset/dla_results/cascadetabnet_GloSAT_table_struct_no_post/ --visual True --raw_cells True\n\npython3 dla/src/eval_VOC_wF1.py /data/glosat_table_dataset/datasets/Test/Fine/VOC_without_headercells ../dla_results/cascadetabnet_GloSAT_table_struct_no_post\npython3 dla/src/eval_VOC.py /data/glosat_table_dataset/datasets/Test/Fine/VOC_without_headercells ../dla_results/cascadetabnet_GloSAT_table_struct_no_post --IoU_threshold 0.5\n\n\n# Pretrained Model without any post processing \u003e\u003e GloSAT (coarse segmentation cells) table provided\n# note: no post processing means VOC format is output so needs a different eval script\ncd /data/glosat_table_dataset/mmdetection\nconda activate open-mmlab\nmkdir 
/data/glosat_table_dataset/dla_results/cascadetabnet_GloSAT_table_struct_coarse_cells_table_provided_no_post\npython3 dla/src/inference_regiongiven.py /data/glosat_table_dataset/datasets/Test/Coarse/VOC_without_headercells --cell_checkpoint /data/glosat_table_dataset/models/model_coarsecell_GloSAT.pth --load_from /data/glosat_table_dataset/datasets/Test/JPEGImages --out /data/glosat_table_dataset/dla_results/cascadetabnet_GloSAT_table_struct_coarse_cells_table_provided_no_post/ --visual True --raw_cells True\n\npython3 dla/src/eval_VOC_wF1.py /data/glosat_table_dataset/datasets/Test/Coarse/VOC_without_headercells ../dla_results/cascadetabnet_GloSAT_table_struct_coarse_cells_table_provided_no_post\npython3 dla/src/eval_VOC.py /data/glosat_table_dataset/datasets/Test/Coarse/VOC_without_headercells ../dla_results/cascadetabnet_GloSAT_table_struct_coarse_cells_table_provided_no_post --IoU_threshold 0.5\n\n\n# Pretrained Model without any post processing \u003e\u003e GloSAT (individual cells) table provided\n# note: no post processing means VOC format is output so needs a different eval script\ncd /data/glosat_table_dataset/mmdetection\nconda activate open-mmlab\nmkdir /data/glosat_table_dataset/dla_results/cascadetabnet_GloSAT_table_struct_table_provided_no_post\npython3 dla/src/inference_regiongiven.py /data/glosat_table_dataset/datasets/Test/Coarse/VOC_without_headercells --cell_checkpoint /data/glosat_table_dataset/models/model_finecell_GloSAT.pth --coarse_cell_checkpoint /data/glosat_table_dataset/models/model_coarsecell_GloSAT.pth --load_from /data/glosat_table_dataset/datasets/Test/JPEGImages --out /data/glosat_table_dataset/dla_results/cascadetabnet_GloSAT_table_struct_table_provided_no_post/ --visual True --raw_cells True\n\npython3 dla/src/eval_VOC_wF1.py /data/glosat_table_dataset/datasets/Test/Fine/VOC_without_headercells ../dla_results/cascadetabnet_GloSAT_table_struct_table_provided_no_post\npython3 dla/src/eval_VOC.py 
/data/glosat_table_dataset/datasets/Test/Fine/VOC_without_headercells ../dla_results/cascadetabnet_GloSAT_table_struct_table_provided_no_post --IoU_threshold 0.5\n\n\n#\n# Infer \u003e\u003e Pretrained Model \u003e\u003e Table Structure Recognition (ICDAR dataset and aggregated dataset)\n#\n\n\n# Pretrained Model \u003e\u003e CascadeTabNet original model fine-tuned on cTDaR19 dataset Train split\n# model checkpoint = /data/glosat_table_dataset/models/model_cell_ICDAR.pth\n# download image test data from ICDAR website ( https://zenodo.org/record/2649217#.YSjA2YhKiUk ) and follow the same procedure as for GloSAT models\n\n# Pretrained Model \u003e\u003e CascadeTabNet original model fine-tuned on aggregated dataset Train split\n# model checkpoint (coarse) = /data/glosat_table_dataset/models/model_coarsecell_both.pth\n# model checkpoint (fine) = /data/glosat_table_dataset/models/model_finecell_both.pth\n# download image test data from ICDAR website ( https://zenodo.org/record/2649217#.YSjA2YhKiUk ) and aggregate with GloSAT dataset Test split and follow the same procedure as for GloSAT models\n\n\n#\n# Infer \u003e\u003e Trained Model \u003e\u003e Table Structure Recognition\n#\n\n\n# Trained Model \u003e\u003e coarse segmentation cells\ncd /data/glosat_table_dataset/mmdetection\nconda activate open-mmlab\nmkdir /data/glosat_table_dataset/dla_results/trained_model_coarse_table_struct\npython3 dla/src/inference.py /data/glosat_table_dataset/dla_models/model_table_det_full_table_train/best_model.pth --cell_checkpoint /data/glosat_table_dataset/dla_models/model_table_struct_coarse_train/best_model.pth --load_from /data/glosat_table_dataset/datasets/Test/JPEGImages --out /data/glosat_table_dataset/dla_results/trained_model_coarse_table_struct/ --visual True\n\npython3 dla/src/eval_ICDAR_wF1.py /data/glosat_table_dataset/dla_results/trained_model_coarse_table_struct /data/glosat_table_dataset/datasets/Test/Coarse/ICDAR\npython3 dla/src/eval_ICDAR.py 
/data/glosat_table_dataset/dla_results/trained_model_coarse_table_struct /data/glosat_table_dataset/datasets/Test/Coarse/ICDAR --IoU_threshold 0.5\npython3 dla/src/eval_rows_n_cols_only.py /data/glosat_table_dataset/dla_results/trained_model_coarse_table_struct /data/glosat_table_dataset/datasets/Test/Coarse/ICDAR --IoU_threshold 0.5\n\n\n# Trained Model \u003e\u003e individual cells\ncd /data/glosat_table_dataset/mmdetection\nconda activate open-mmlab\nmkdir /data/glosat_table_dataset/dla_results/trained_model_fine_table_struct\npython3 dla/src/inference.py /data/glosat_table_dataset/dla_models/model_table_det_full_table_train/best_model.pth --cell_checkpoint /data/glosat_table_dataset/dla_models/model_table_struct_fine_train/best_model.pth --coarse_cell_checkpoint /data/glosat_table_dataset/dla_models/model_table_struct_coarse_train/best_model.pth --load_from /data/glosat_table_dataset/datasets/Test/JPEGImages --out /data/glosat_table_dataset/dla_results/trained_model_fine_table_struct/ --visual True\n\npython3 dla/src/eval_ICDAR_wF1.py /data/glosat_table_dataset/dla_results/trained_model_fine_table_struct /data/glosat_table_dataset/datasets/Test/Fine/ICDAR\npython3 dla/src/eval_ICDAR.py /data/glosat_table_dataset/dla_results/trained_model_fine_table_struct /data/glosat_table_dataset/datasets/Test/Fine/ICDAR --IoU_threshold 0.5\npython3 dla/src/eval_rows_n_cols_only.py /data/glosat_table_dataset/dla_results/trained_model_fine_table_struct /data/glosat_table_dataset/datasets/Test/Fine/ICDAR --IoU_threshold 0.5\n\n\n\n```\n\nThree scripts exist for inference:\n- inference_original.py - for the original model (original)\n- inference.py - for the pretrained model (B2)\n- inference_regiongiven.py - for the pretrained model with the table region given explicitly (B1)\n\nFor (original) and (B2), the first argument is the checkpoint file for the model used to predict tables.\nFor region-given (B1), the first argument is the path to the folder with VOC annotations to read table regions from.\n\nBesides 
that, optional arguments can be passed:\n\nFor all:\n--visual will generate images with predicted bounding boxes\n--voc will generate VOC formatted xml as opposed to ICDAR-19 formatted xml\n\nFor original:\n--use_cells will use the cells provided by the original model (no post-processing)\n\nFor pretrained models (B1 and B2):\n--cell_checkpoint specifies the model which will be used for cell prediction; cell prediction is skipped if not specified\n--coarse_cell_checkpoint specifies the model to use for coarse-assist cell prediction\n--raw_cells will skip all post-processing on cells (by default, post-processing is applied)\n--skip_headers will only segment table bodies (by default, whole tables including headers are segmented)\n\n# Latest eval results\n\nThese results use the same model checkpoints as the HIP 2021 paper, but with the latest small improvements to the post-processing code.\n\nTable Detection\n\n| Model | Precision | Recall | F1 |\n| ----- | --------- | ------ | -- |\n| CascadeTabNet original (no fine tuning) | 0.97 | 1.0 | 0.98 |\n| CascadeTabNet + fine tuning on (full table) | 1.0 | 1.0 | 1.0 |\n| CascadeTabNet + fine tuning on (full table, header, caption) | 1.0 | 1.0 | 1.0 |\n\nTable Structure Recognition\n\n| Model | Automated Table Detection | Weighted Average F1 Score | Row F1 Score | Col F1 Score |\n| ----- | ------------------------- | ------------------------- | ------------ | ------------ |\n| GloSAT (coarse segmentation cells) CascadeTabNet + post-processing | Yes | 0.74 | 0.87 | 0.95 |\n| GloSAT (coarse segmentation cells) CascadeTabNet + post-processing | No | 0.75 | 0.87 | 0.95 |\n| GloSAT (coarse segmentation cells) CascadeTabNet | Yes | 0.37 | | |\n| GloSAT (coarse segmentation cells) CascadeTabNet | No | 0.37 | | |\n| GloSAT (individual cells) CascadeTabNet + post-processing | Yes | 0.38 | 0.57 | 0.92 |\n| GloSAT (individual cells) CascadeTabNet + post-processing | No | 0.39 | 0.58 | 0.92 
|\n| GloSAT (individual cells) CascadeTabNet | Yes | 0.05 | | |\n| GloSAT (individual cells) CascadeTabNet | No | 0.05 | | |\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fstuartemiddleton%2Fglosat_table_dataset","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fstuartemiddleton%2Fglosat_table_dataset","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fstuartemiddleton%2Fglosat_table_dataset/lists"}