https://github.com/swhl/tablerecognitionmetric

Compute benchmark of table structure recognition.

ocr s-teds table-recognition teds

Last synced: about 1 year ago

Host: GitHub
URL: https://github.com/swhl/tablerecognitionmetric
Owner: SWHL
Created: 2023-07-11T02:18:11.000Z (almost 3 years ago)
Default Branch: main
Last Pushed: 2024-04-23T14:01:34.000Z (about 2 years ago)
Last Synced: 2025-03-18T09:37:39.365Z (over 1 year ago)
Topics: ocr, s-teds, table-recognition, teds
Language: Python
Homepage:
Size: 48.8 KB
Stars: 18
Watchers: 1
Forks: 0
Open Issues: 1
Metadata Files:
- Readme: README.md

# Table Recognition Metric

### Introduction

This library computes the TEDS metric, which is used to evaluate table recognition algorithms. It can be used together with [table_rec_test_dataset](https://huggingface.co/datasets/SWHL/table_rec_test_dataset).

The TEDS implementation is based on [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.6/ppstructure/table/table_metric/table_metric.py) and [DAVAR-Lab-OCR](https://github.com/hikopensource/DAVAR-Lab-OCR/blob/main/davarocr/davar_table/utils/metric.py).

### Installation

```bash
pip install table_recognition_metric
```

### Usage

#### Run from the command line

- Usage:

    ```bash
    $ table_recognition_metric -h
    usage: table_recognition_metric [-h] [-steds] [-gt GT_HTML] [-pred PRED_HTML]

    options:
      -h, --help            show this help message and exit
      -steds, --structure_only
      -gt GT_HTML, --gt_html GT_HTML
      -pred PRED_HTML, --pred_html PRED_HTML
    ```

- Example:

    ```bash
    $ table_recognition_metric -gt '购买方纳税人识别号地址、电记开户行及账号密码区货物或应税劳务、服务名称理肤泉清痘旅行装控油祛痘调节水油平衡理肤泉特安舒缓修护乳40ml合计规格型号单位11税率17%17%价税合计（大写）销售方纳税人识别号地址、电话开户行及账号备注' -pred ''
    # 0.0
    ```

#### Run from a script

> [!NOTE]
> To compute only Struct-TEDS, pass `structure_only=True` when creating the `TEDS` instance. The parameter defaults to `False`, which computes the full TEDS. For example:
>
> `teds = TEDS(structure_only=True)`

```python
from table_recognition_metric import TEDS

teds = TEDS()

# Ground-truth and predicted tables, passed as strings
gt_html = '购买方纳税人识别号地址、电记开户行及账号密码区货物或应税劳务、服务名称理肤泉清痘旅行装控油祛痘调节水油平衡理肤泉特安舒缓修护乳40ml合计规格型号单位11税率17%17%价税合计（大写）销售方纳税人识别号地址、电话开户行及账号备注'
pred_html = '购买方纳税人识别号地址、电记开户行及账号密码区货物或应税劳务、服务名称理肤泉清痘旅行装控油祛痘调节水油平衡理肤泉特安舒缓修护乳40ml合计规格型号单位11税率17%17%价税合计（大写）销售方纳税人识别号地址、电话开户行及账号备注'

score = teds(gt_html, pred_html)
print(score)
# 1.0
```
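
To illustrate what a structure-only score leaves out, the sketch below strips the cell text from two HTML tables and compares what remains. It is not the library's implementation: it assumes that Struct-TEDS looks only at tags and span attributes, and `StructureOnly` and `strip_cell_text` are hypothetical helper names built on the standard-library `html.parser`.

```python
from html.parser import HTMLParser


class StructureOnly(HTMLParser):
    """Collect only the tags of an HTML table, discarding cell text."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Assumption: rowspan/colspan belong to the structure, other attributes do not.
        kept = "".join(
            f' {name}="{value}"'
            for name, value in attrs
            if name in ("rowspan", "colspan")
        )
        self.parts.append(f"<{tag}{kept}>")

    def handle_endtag(self, tag):
        self.parts.append(f"</{tag}>")

    # handle_data keeps its default no-op behaviour, so cell text is dropped.


def strip_cell_text(html: str) -> str:
    """Return the table markup with all text content removed."""
    parser = StructureOnly()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


# Two hypothetical tables: same layout, different (misrecognized) cell text.
gt = '<table><tr><td>Name</td><td colspan="2">Price</td></tr></table>'
pred = '<table><tr><td>Nmae</td><td colspan="2">Prise</td></tr></table>'

# The structures match even though the recognized text does not.
print(strip_cell_text(gt) == strip_cell_text(pred))
```

With the library itself, no such preprocessing should be needed: passing `structure_only=True` to `TEDS` is the documented way to request the structure-only score.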

#### Evaluate on a dataset

- The example below evaluates [`rapid-table`](https://github.com/RapidAI/RapidStructure/blob/main/docs/README_Table.md) on the table dataset [table_rec_test_dataset](https://huggingface.co/datasets/SWHL/table_rec_test_dataset); other models can be evaluated in the same way.

- Install the required packages

    ```bash
    pip install datasets
    pip install rapid_table
    pip install rapidocr_onnxruntime
    pip install table_recognition_metric
    ```

- Run the evaluation

    ```python
    import numpy as np
    from datasets import load_dataset
    from rapid_table import RapidTable
    from tqdm import tqdm

    from table_recognition_metric import TEDS

    # Load the test split of the table recognition dataset
    dataset = load_dataset("SWHL/table_rec_test_dataset")
    test_data = dataset["test"]

    table_engine = RapidTable()
    # structure_only=True scores the table structure only (S-TEDS)
    teds = TEDS(structure_only=True)

    content = []
    for one_data in tqdm(test_data):
        img = one_data.get("image")
        gt = one_data.get("html")

        # The first returned value is the predicted table HTML
        pred_str, _, _ = table_engine(np.array(img))
        score = teds(gt, pred_str)
        content.append(score)

    # The final result is the mean score over all samples
    avg = sum(content) / len(content)
    print(f"TEDS: {avg:.5f}")
    ```

### Tree-EditDistance-based Similarity (TEDS)

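- TEDS was proposed by IBM in the paper [Image-based table recognition: data, model, and evaluation](https://arxiv.org/pdf/1911.10683).

- [The previously proposed metric](https://ieeexplore.ieee.org/document/1227792) flattens the `ground truth` and the `recognition result` of a table each into a list of pairwise adjacency relations between non-empty cells, then compares the two lists to compute precision, recall and F1-score. This metric has two obvious problems:

    1. Because it only checks direct adjacency relations between non-empty cells, it cannot detect errors caused by empty cells or by misalignment of cells beyond immediate neighbors;

    2. Because it checks relations by exact match, it has no mechanism for measuring fine-grained cell content recognition performance.

- TEDS addresses these problems as follows:

    1. It examines the recognition result at the level of the global tree structure, which lets it identify all types of structural errors and addresses problem 1;

    2. When the **tree-edit** operation is a node substitution, it computes the string edit distance between the corresponding cell contents, which addresses problem 2.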

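- Formula:

    $$TEDS(T_{a}, T_{b}) = 1 - \frac{EditDist(T_{a}, T_{b})}{\max(|T_{a}|, |T_{b}|)}$$

    where $EditDist$ is the **tree-edit distance** and $|T|$ is the number of nodes in $T$. The performance of a table recognition method on a test set is defined as follows: compute the TEDS between the **ground truth** and the **predicted result** for each sample, then take the mean over all samples as the final score.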
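
The formula and the substitution cost can be sketched in a few lines. This is a minimal illustration and not the library's code: it assumes the substitution cost is the string edit distance scaled to the range 0 to 1, it takes the tree-edit distance as a given number rather than computing it, and `levenshtein`, `substitution_cost` and `teds_score` are hypothetical names.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming string edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete a character from a
                curr[j - 1] + 1,           # insert a character into a
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]


def substitution_cost(text_a: str, text_b: str) -> float:
    """Assumed cost of replacing one cell by another, scaled to [0, 1]."""
    longest = max(len(text_a), len(text_b))
    return levenshtein(text_a, text_b) / longest if longest else 0.0


def teds_score(edit_dist: float, n_nodes_a: int, n_nodes_b: int) -> float:
    """TEDS = 1 - EditDist / max(|Ta|, |Tb|); both trees must be non-empty."""
    return 1.0 - edit_dist / max(n_nodes_a, n_nodes_b)


# Hypothetical case: two 10-node trees that differ only in one cell's text,
# so the tree-edit distance equals the cost of that single substitution.
cost = substitution_cost("17%", "11%")
score = teds_score(cost, 10, 10)
print(round(cost, 4), round(score, 4))

# Dataset-level result: the mean of the per-sample scores (made-up values).
per_sample = [score, 1.0, 0.5]
print(round(sum(per_sample) / len(per_sample), 4))
```

Because the distance is divided by the size of the larger tree, a single wrong cell costs less in a large table than in a small one, and two identical tables always score 1.0.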

### Changelog

#### 2023-12-27 v0.0.4 update:

- Explicitly add a parameter for computing the S-TEDS metric
