{"id":18032475,"url":"https://github.com/zeninglin/peneo","last_synced_at":"2025-03-27T05:31:15.301Z","repository":{"id":251870275,"uuid":"798666652","full_name":"ZeningLin/PEneo","owner":"ZeningLin","description":"[MM'2024] PEneo, an effective algorithm for key-value pair extraction from form-like documents, designed for real-world applications.","archived":false,"fork":false,"pushed_at":"2024-12-17T06:45:24.000Z","size":10589,"stargazers_count":28,"open_issues_count":0,"forks_count":7,"subscribers_count":4,"default_branch":"main","last_synced_at":"2025-03-23T04:51:13.527Z","etag":null,"topics":["document-ai","document-understanding","key-information-extraction","ocr","visual-information-extraction"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ZeningLin.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-05-10T08:34:00.000Z","updated_at":"2025-02-17T09:30:17.000Z","dependencies_parsed_at":"2024-09-08T08:30:06.229Z","dependency_job_id":"754d5e9b-36ff-4664-8d6e-8fdab1c8916c","html_url":"https://github.com/ZeningLin/PEneo","commit_stats":null,"previous_names":["zeninglin/peneo"],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ZeningLin%2FPEneo","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ZeningLin%2FPEneo/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ZeningLin%2FPEneo/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hos
ts/GitHub/repositories/ZeningLin%2FPEneo/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ZeningLin","download_url":"https://codeload.github.com/ZeningLin/PEneo/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":245791404,"owners_count":20672665,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["document-ai","document-understanding","key-information-extraction","ocr","visual-information-extraction"],"created_at":"2024-10-30T10:13:33.518Z","updated_at":"2025-03-27T05:31:15.294Z","avatar_url":"https://github.com/ZeningLin.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003ch1\u003ePEneo\u003c/h1\u003e\n\n\u003cdiv\u003e\n\u003ca href=\"https://dl.acm.org/doi/10.1145/3664647.3680931\"\u003e\n    \u003cimg alt=\"acm-link\" src=\"https://img.shields.io/badge/Paper-ACM Digital Library-blue\"\u003e\u003c/img\u003e\n\u003c/a\u003e\n\u003ca href=\"https://arxiv.org/abs/2401.03472\"\u003e\n    \u003cimg alt=\"arxiv-link\" src=\"https://img.shields.io/badge/Paper-arXiv-B31B1B\"\u003e\u003c/img\u003e\n\u003c/a\u003e\n\u003cbr\u003e\n\u003ca href=\"https://github.com/SCUT-DLVCLab/RFUND\"\u003e\n    \u003cimg alt=\"dataset-rfund-link\" src=\"https://img.shields.io/badge/RFUND Annotation-GitHub-black\"\u003e\u003c/img\u003e\n\u003c/a\u003e\n\u003ca href=\"https://github.com/ZeningLin/PEneo/releases/tag/SIBR-revised-v1.0\"\u003e\n    \u003cimg alt=\"dataset-sibr-link\" src=\"https://img.shields.io/badge/SIBR Revised 
Annotation-GitHub-black\"\u003e\u003c/img\u003e\n\u003c/a\u003e\n\u003c/div\u003e\n\n\nOfficial implementation of PEneo, introduced in the MM'2024 paper *PEneo: Unifying Line Extraction, Line Grouping, and Entity Linking for End-to-end Document Pair Extraction*.\n\nPEneo is designed for key-value pair extraction from form-like documents in real-world applications. Unlike previous approaches that rely on well-annotated entity/paragraph/block-level OCR results, **PEneo is capable of handling line-level predictions provided by real-world OCR engines**. Entities that contain multiple text lines can be accurately extracted, and the linking relationships between key and value entities can be effectively identified. \n\n\u003cdiv align=\"center\"\u003e\n\u003cimg src=\"figures/pred_orig.png\" width=400\u003e\n\u003cimg src=\"figures/pred_result.png\" width=400 \u003e\n\u003c/div\u003e\n\n\n\u003cbr\u003e\n\n\u003e The code in this repository has been modified from our original implementation to improve its flexibility and usability. 
As a result, the model performance may vary slightly from the original implementation.\n\n\n\u003ch2\u003eTable of Contents\u003c/h2\u003e\n\n- [Introduction](#introduction)\n  - [Motivation](#motivation)\n  - [Methodology](#methodology)\n- [Setup](#setup)\n  - [Installation](#installation)\n  - [Dataset Preparation](#dataset-preparation)\n    - [RFUND](#rfund)\n    - [SIBR](#sibr)\n  - [Backbone Preparation](#backbone-preparation)\n    - [Supported Document-AI backbones](#supported-document-ai-backbones)\n    - [Pre-trained Utils Generation](#pre-trained-utils-generation)\n- [Fine-tuning](#fine-tuning)\n  - [Pair Extraction on RFUND](#pair-extraction-on-rfund)\n  - [Pair Extraction on SIBR](#pair-extraction-on-sibr)\n- [Citation](#citation)\n- [Acknowledgement](#acknowledgement)\n- [Copyright](#copyright)\n\n\n\n## Introduction\n\n### Motivation\n\nDocument pair extraction is a vital step in analyzing form-like documents containing information organized as key-value pairs. It involves identifying the key and value entities, as well as their linking relationships from document images. Previous research has generally divided it into two document understanding tasks: semantic entity recognition (SER) and relation extraction (RE). The SER task involves extracting contents that belong to predefined categories. Most of the existing methods implement SER using BIO tagging, where tokens in the input text sequence are tagged as the beginning (B), inside (I), or outside (O) element for each entity. On the other hand, the RE task aims to identify relations between given entities. Previous works have typically employed a linking classification network for relation extraction: given the entities in the document, it first generates representations for all possible entity pairs and then applies binary classification to filter out the valid ones. 
Document pair extraction is usually achieved by serially concatenating the above two tasks (SER+RE).\n\nAlthough achievements have been made in SER and RE, the existing SER+RE approach overlooks several issues. In previous settings, SER and RE are viewed as two distinct tasks that have inconsistent input/output forms and employ simplified evaluation metrics. For the SER task, entity-level OCR results are usually given, where text lines belonging to the same entity are aggregated and serialized in human reading order. The model categorizes each token based on the well-organized sequence, neglecting the impact of improper OCR outputs. In the RE task, models take the ground truths of the SER task as input, using prior knowledge of entity content and category. The model simply needs to predict the linkings based on the provided key and value entities, and the linking-level F1 score is taken as the evaluation metric. In real-world applications, however, the situation is considerably more complex. Commonly used OCR engines typically generate results at the line level. For entities with multiple lines, an extra line grouping step is required before BIO tagging, which is hard to realize for complex layout documents. Additionally, errors in SER predictions can significantly impact the RE step, resulting in unsatisfactory pair extraction results.\n\n\n### Methodology\n\nTo address the above issues, we propose **PEneo** (**P**air **E**xtraction **n**ew d**e**coder **o**ption), integrating three downstream subtasks. The following figure depicts the architecture of PEneo:\n\n\u003cdiv align=\"center\"\u003e\n\u003cimg src=\"figures/modelarch.png\" width=800\u003e\n\u003c/div\u003e\n\nThe line extraction head identifies the position of the desired key/value lines from the unordered input token sequence. The line grouping head aggregates text lines that belong to the same key/value entity. The entity linking head predicts the relations between key and value entities. 
The three heads are jointly optimized during the training phase. At the inference step, key-value pairs can be parsed by unifying the outputs of the three heads. \n\n\n## Setup\n\nFor a quick start, you can follow the instructions below to set up the environment, prepare the dataset, generate pre-trained weights and configurations, and fine-tune the model on the RFUND and SIBR datasets.\n\nFor more detailed usage, please refer to the [documentation](docs/documentation.md).\n\n\n### Installation\n\n```bash\nconda create -n vie python=3.10\nconda activate vie\npip install torch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2\npip install -r requirements.txt\n```\n\nIf you want to use the LayoutLMv2/LayoutXLM backbone, please additionally install detectron2:\n\n```bash\npip install 'git+https://github.com/facebookresearch/detectron2.git'\n```\n\n\n### Dataset Preparation\n\n#### RFUND\n\nThe RFUND annotations can be downloaded from [here](https://github.com/SCUT-DLVCLab/RFUND). Images of the dataset are available from the original releases of [FUNSD](https://guillaumejaume.github.io/FUNSD/) and [XFUND](https://github.com/doc-analysis/XFUND/releases/tag/v1.0). The downloaded dataset should be organized as follows:\n\n```bash\nprivate_data\n└── rfund\n    ├── images\n    │   ├── de\n    │   ├── en\n    │   ├── es\n    │   ├── fr\n    │   ├── it\n    │   ├── ja\n    │   ├── pt\n    │   └── zh\n    ├── de.train.json\n    ├── de.val.json\n    ├── en.train.json\n    ├── en.val.json\n    ├── es.train.json\n    ├── es.val.json\n    ├── fr.train.json\n    ├── fr.val.json\n    ├── it.train.json\n    ├── it.val.json\n    ├── ja.train.json\n    ├── ja.val.json\n    ├── pt.train.json\n    ├── pt.val.json\n    ├── zh.train.json\n    └── zh.val.json\n```\n\n#### SIBR\n\nWe noticed that some annotation errors exist in the original SIBR dataset (mainly due to the failure of data masking rules). 
To avoid potential issues, we made manual corrections and made the revised labels available [here](https://github.com/ZeningLin/PEneo/releases/tag/SIBR-revised-v1.0). Images of the dataset are available at the original release of [SIBR](https://www.modelscope.cn/datasets/iic/SIBR).\n\nAfter downloading the original SIBR dataset and our revised labels, you should extract and place the revised `converted_label` folder under the root of the original SIBR directory. The dataset should be organized as follows:\n\n```bash\nprivate_data\n└── sibr\n    ├── converted_label # revised labels\n    ├── images          # original images\n    ├── label           # original labels\n    ├── train.txt       # train split file\n    └── test.txt        # test split file\n```\n\n\n### Backbone Preparation\n\n#### Supported Document-AI backbones\n\n\u003cdiv align=\"center\"\u003e\n\n| Model Name              | Link                                                                                            |\n| ----------------------- | ----------------------------------------------------------------------------------------------- |\n| lilt-infoxlm-base       | 🤗 [SCUT-DLVCLab/lilt-infoxlm-base](https://huggingface.co/SCUT-DLVCLab/lilt-infoxlm-base)       |\n| lilt-roberta-en-base    | 🤗 [SCUT-DLVCLab/lilt-roberta-en-base](https://huggingface.co/SCUT-DLVCLab/lilt-roberta-en-base) |\n| layoutxlm-base          | 🤗 [microsoft/layoutxlm-base](https://huggingface.co/microsoft/layoutxlm-base)                   |\n| layoutlmv2-base-uncased | 🤗 [microsoft/layoutlmv2-base-uncased](https://huggingface.co/microsoft/layoutlmv2-base-uncased) |\n| layoutlmv3-base         | 🤗 [microsoft/layoutlmv3-base](https://huggingface.co/microsoft/layoutlmv3-base)                 |\n| layoutlmv3-base-chinese | 🤗 [microsoft/layoutlmv3-base-chinese](https://huggingface.co/microsoft/layoutlmv3-base-chinese) |\n\n\u003c/div\u003e\n\n#### Pre-trained Utils Generation\n\nThe pre-trained contents will be stored in the 
`private_pretrained` directory. Please create this folder before running the utils-generation script.\n\n```bash\nmkdir private_pretrained\n```\n\nIf you want to use layoutlmv3-base as the model backbone, you can generate the required files by running the following command:\n\n\n```bash\npython tools/generate_peneo_weights.py \\\n  --backbone_name_or_path microsoft/layoutlmv3-base \\\n  --output_dir private_pretrained/layoutlmv3-base\n```\n\nThe script will automatically download the pre-trained weights, tokenizer, and config files from the 🤗 Hugging Face Hub and convert them to the required format. Results will be stored in the `private_pretrained` directory. If you want to use other backbones, you can change the `--backbone_name_or_path` parameter to the corresponding HF model ID.\n\nIf the script fails to download the pre-trained files, you may manually download them through the links in the above table, and set the `--backbone_name_or_path` parameter to the local directory of the downloaded files.\n\n\n## Fine-tuning\n\nCheckpoints, terminal outputs, and TensorBoard logs will be saved in `private_output/weights`, `private_output/logs`, and `private_output/runs`, respectively. 
Please create these directories before running the fine-tuning scripts.\n\n```bash\nmkdir -p private_output/weights\nmkdir -p private_output/logs\nmkdir -p private_output/runs\n```\n\n### Pair Extraction on RFUND\n\n```bash\nexport PYTHONPATH=./\nexport CUDA_VISIBLE_DEVICES=0,1\nexport TRANSFORMERS_NO_ADVISORY_WARNINGS='true'\nPROC_PER_NODE=$(python -c \"import torch; print(torch.cuda.device_count())\")\nMASTER_PORT=11451\n\nLANGUAGE=en\nTASK_NAME=layoutlmv3-base_rfund_${LANGUAGE}\nPRETRAINED_PATH=private_pretrained/layoutlmv3-base\nDATA_DIR=private_data/rfund\nOUTPUT_DIR=private_output/weights/$TASK_NAME\nRUNS_DIR=private_output/runs/$TASK_NAME\nLOG_DIR=private_output/logs/$TASK_NAME.log\ntorchrun --nproc_per_node $PROC_PER_NODE --master_port $MASTER_PORT start/run_rfund.py \\\n    --model_name_or_path $PRETRAINED_PATH \\\n    --data_dir $DATA_DIR \\\n    --language $LANGUAGE \\\n    --output_dir $OUTPUT_DIR \\\n    --do_train \\\n    --do_eval \\\n    --fp16 \\\n    --per_device_train_batch_size 4 \\\n    --per_device_eval_batch_size 16 \\\n    --dataloader_num_workers 8 \\\n    --warmup_ratio 0.1 \\\n    --learning_rate 5e-5 \\\n    --max_steps 25000 \\\n    --evaluation_strategy steps \\\n    --eval_steps 1000 \\\n    --save_strategy steps \\\n    --save_steps 1000 \\\n    --save_total_limit 1 \\\n    --logging_strategy epoch \\\n    --logging_dir $RUNS_DIR \\\n    --detail_eval True \\\n    --save_eval_detail True \\\n    2\u003e\u00261 | tee -a $LOG_DIR\n```\n\nThe above script uses `layoutlmv3-base` as the model backbone and fine-tunes the model on `RFUND-EN`. You may try different backbones and language subsets by changing the `PRETRAINED_PATH` and `LANGUAGE` accordingly. 
Different backbones require different hyper-parameters, so you may need to adjust the training arguments accordingly.\n\n\n### Pair Extraction on SIBR\n\nSimilarly, you can fine-tune the model on the SIBR dataset by running the following script:\n\n```bash\nexport PYTHONPATH=./\nexport CUDA_VISIBLE_DEVICES=0,1\nexport TRANSFORMERS_NO_ADVISORY_WARNINGS='true'\nPROC_PER_NODE=$(python -c \"import torch; print(torch.cuda.device_count())\")\nMASTER_PORT=11451\n\nTASK_NAME=layoutlmv3-base-chinese_sibr\nPRETRAINED_PATH=private_pretrained/layoutlmv3-base-chinese\nDATA_DIR=private_data/sibr\nOUTPUT_DIR=private_output/weights/$TASK_NAME\nRUNS_DIR=private_output/runs/$TASK_NAME\nLOG_DIR=private_output/logs/$TASK_NAME.log\ntorchrun --nproc_per_node $PROC_PER_NODE --master_port $MASTER_PORT start/run_sibr.py \\\n    --model_name_or_path $PRETRAINED_PATH \\\n    --data_dir $DATA_DIR \\\n    --output_dir $OUTPUT_DIR \\\n    --do_train \\\n    --do_eval \\\n    --fp16 \\\n    --per_device_train_batch_size 4 \\\n    --per_device_eval_batch_size 16 \\\n    --dataloader_num_workers 8 \\\n    --warmup_ratio 0.1 \\\n    --learning_rate 5e-5 \\\n    --max_steps 25000 \\\n    --load_best_model_at_end True \\\n    --metric_for_best_model f1 \\\n    --evaluation_strategy steps \\\n    --eval_steps 1000 \\\n    --save_strategy steps \\\n    --save_steps 1000 \\\n    --save_total_limit 3 \\\n    --logging_strategy epoch \\\n    --logging_dir $RUNS_DIR \\\n    --detail_eval True \\\n    --save_eval_detail True \\\n    2\u003e\u00261 | tee -a $LOG_DIR\n```\n\nIt is worth noting that the SIBR dataset is a Chinese-English bilingual dataset. 
You should use multi-lingual backbones like `layoutlmv3-base-chinese`, `lilt-infoxlm-base` and `layoutxlm-base` for fine-tuning.\n\n\n## Citation\n\nIf you find PEneo helpful, please consider citing our paper:\n\n```\n@inproceedings{lin2024peneo,\n  title={PEneo: Unifying Line Extraction, Line Grouping, and Entity Linking for End-to-end Document Pair Extraction},\n  author={Lin, Zening and Wang, Jiapeng and Li, Teng and Liao, Wenhui and Huang, Dayi and Xiong, Longfei and Jin, Lianwen},\n  booktitle={Proceedings of the 32nd ACM International Conference on Multimedia},\n  pages={5171--5180},\n  year={2024}\n}\n```\n\n## Acknowledgement\n\nPart of the code is adapted from [LiLT](https://github.com/jpWang/LiLT), [LayoutLMv2/XLM](https://github.com/microsoft/unilm/tree/master/layoutlmft), [LayoutLMv3](https://github.com/microsoft/unilm/tree/master/layoutlmv3), and [TPLinker](https://github.com/131250208/TPlinker-joint-extraction). We sincerely thank the authors for their great work.\n\n\n## Copyright\n\nThis repository can only be used for non-commercial research purposes. \n\nFor commercial use, please contact Prof. Lianwen Jin (eelwjin@scut.edu.cn).\n\nIf you encounter any problems, please open an issue or contact Zening Lin (zening.lin@outlook.com).\n\nCopyright 2024, Deep Learning and Vision Computing Lab (DLVC Lab, [HomePage](http://www.dlvc-lab.net/), [GitHub](https://github.com/SCUT-DLVCLab)), South China University of Technology.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fzeninglin%2Fpeneo","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fzeninglin%2Fpeneo","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fzeninglin%2Fpeneo/lists"}