{"id":16881486,"url":"https://github.com/swhl/trocr-formula-rec","last_synced_at":"2025-07-27T10:37:01.101Z","repository":{"id":248712518,"uuid":"829469626","full_name":"SWHL/TrOCR-Formula-Rec","owner":"SWHL","description":"Training a small yet capable formula recognition model based on TrOCR and the UniMER-1M dataset","archived":false,"fork":false,"pushed_at":"2024-11-15T11:07:50.000Z","size":2000,"stargazers_count":20,"open_issues_count":1,"forks_count":2,"subscribers_count":2,"default_branch":"Exp8","last_synced_at":"2025-03-18T09:37:45.083Z","etag":null,"topics":["formula-rec","latex-ocr"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/SWHL.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-07-16T13:44:24.000Z","updated_at":"2025-03-12T05:54:46.000Z","dependencies_parsed_at":"2024-10-28T12:31:16.637Z","dependency_job_id":"58f28988-e236-4f46-98dc-7c6ef6870d30","html_url":"https://github.com/SWHL/TrOCR-Formula-Rec","commit_stats":{"total_commits":42,"total_committers":1,"mean_commits":42.0,"dds":0.0,"last_synced_commit":"6ce30f128f1f183822d92d43c6d4391449da1707"},"previous_names":["swhl/trocr-formula-rec"],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SWHL%2FTrOCR-Formula-Rec","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SWHL%2FTrOCR-Formula-Rec/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SWHL%2FTrOCR-Formula-Rec/releases","manifests_url":"https://repos.ecosyste.ms/api/
v1/hosts/GitHub/repositories/SWHL%2FTrOCR-Formula-Rec/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/SWHL","download_url":"https://codeload.github.com/SWHL/TrOCR-Formula-Rec/tar.gz/refs/heads/Exp8","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":244925175,"owners_count":20532873,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["formula-rec","latex-ocr"],"created_at":"2024-10-13T16:02:48.244Z","updated_at":"2025-07-27T10:37:01.094Z","avatar_url":"https://github.com/SWHL.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# √ TrOCR Formula Recognition\n\n❓ Motivation: judging by its released checkpoint size (4.91 GB), the [UniMERNet](https://github.com/opendatalab/UniMERNet) model is simply too heavy. Its authors also released a large, comprehensive formula recognition dataset: UniMER_Dataset.\n\n🎯 Hence the idea: train a small yet capable formula recognition model based on TrOCR and the UniMER-1M dataset.\n\nThis repository uses UniMERNet as the baseline. The goal is to outperform UniMERNet with a much smaller model.\n\nThe `dataset` directory in this repository holds a tiny subset of UniMER-1M, intended only for testing the code.\n\n### ⚠️ Notes\n\n- When training with transformers, set `CUDA_VISIBLE_DEVICES` before `import torch`; otherwise training hangs.\n- For now, none of the experiments below except **Exp1_1** include the HME100K dataset.\n- All experiments start from the `microsoft/trocr-small-stage1` pretrained model.\n\n#### TODO\n\n- [ ] Provide speed benchmarks\n- [ ] Accelerate inference with Flash Attention (not supported by VisionEncoderDecoderModel in transformers==4.44.2)\n- [ ] Convert to an ONNX model and compare inference speed\n- [ ] Try xformers to optimize inference speed\n\n### 🔬 Experiment Log\n\nThe table below is based on Table 5 of [UniMERNet](https://arxiv.org/abs/2404.15254).\n\n| Method   | SPE-BLEU↑ | SPE-EditDis↓ | CPE-BLEU↑ | CPE-EditDis↓ | SCE-BLEU↑ | SCE-EditDis↓ | HWE-BLEU↑ | HWE-EditDis↓ |\n| :---- | :-------: | :----------: | :-------: | :----------: | :-------: | :----------: | :-------: | 
:----------: |\n| [Pix2tex](https://github.com/lukas-blecher/LaTeX-OCR) |   0.873   |    0.088     |   0.655   |    0.408     |   0.092   |    0.817     |   0.012   |    0.920     |\n| [Texify](https://github.com/VikParuchuri/texify)      |   0.906   |    0.061     |   0.690   |    0.230     |   0.420   |    0.390     |   0.341   |    0.522     |\n| [UniMERNet](https://github.com/opendatalab/UniMERNet) |   0.917   |    0.058     |   0.916   |    0.060     |   0.616   |    0.229     |   0.921   |    0.055     |\n|||||||||\n| Exp1   |   0.815   |    0.121     |   0.677   |    0.259     |   0.589   |    0.227     |   0.150   |    0.520     |\n| Exp1_1 |   0.883   |    0.07     |   0.810   |    0.122     |   0.489   |    0.262     |   0.900   |    0.06     |\n| Exp2    |   0.798   |    0.132     |   0.677   |    0.259     |   0.589   |    0.227     |   0.150   |    0.520     |\n| Exp3 |   0.813   |    0.127     |   0.682   |    0.263     |   0.302   |   0.231     |   0.166   |   0.540      |\n| Exp4 |   0.873   |   0.077    |  0.801   |   0.130     |   0.550  |   0.238    |  0.092   |   0.469     |\n| Exp5 |  0.846  |   0.201  | 0.823  |  0.134     | 0.418  |  0.553   | 0.05  | 0.6724  |\n| Exp5_1 |  0.819  |   0.119  | 0.682  |  0.249     | 0.595  |  0.230   | 0.179  | 0.512  |\n| Exp6 |  0.812  |   0.116  | 0.676  |  0.253     | 0.657  |  0.210   | 0.342  | 0.404  |\n| Exp7 |  0.817  |   0.117  | 0.679  |  0.251     | 0.817  |  0.117   | 0.781  | 0.148  |\n| Exp8 |  **0.886**  |   **0.07**  | **0.822**  |  **0.108**     | **0.633**  |  **0.217**   | **0.897**  | **0.07**  |\n| Exp9 |  0.862  |   0.10  | 0.740  |  0.180     | 0.639  |  0.211   | 0.826  | 0.119  |\n| Exp10 |  0.900  |   0.07  | 0.841  |  0.09     | 0.594  |  0.227   | 0.912  | 0.06  |\n\n|  Exp  | Description |\n| :--- | 
:----------------------------------------------------------------------------------------------------- |\n| Exp1  | First training run on UniMER-1M, using the pretrained model `microsoft/trocr-small-stage1` \u003cbr/\u003e with the default TrOCR tokenizer |\n| Exp1_1 | Same as Exp1, single change: trained for 30 epochs, by [limaopeng1](https://github.com/limaopeng1) |\n| Exp2  | Switched to the BPE tokenizer used by LaTeX-OCR |\n| Exp3  | Fixed the model configuration bug in Exp2 |\n| Exp4  | Same as Exp3, single change: epoch=1 → epoch=5 |\n| Exp5  | Same as Exp1, single change: epoch=1 → epoch=10 |\n| Exp5_1  | Supplementary run fixing Exp5: removed the BOS and EOS tokens that had been added around the text; trained for one epoch only |\n| Exp6  | Same as Exp5_1, single change: added data augmentation, following the UniMERNet source code |\n| Exp7  | Same as Exp6, single change: added the HME100K dataset |\n| Exp8  | Same as Exp7, single change: epoch=1 → epoch=10 |\n| Exp9  | Same as Exp8, single change: added the fusion-image-to-latex-datasets dataset (3069505), Epoch=1 |\n| Exp10  | Same as Exp9, single change: epoch=1 → epoch=10 (fusion-image-to-latex-dataset 3467214) |\n\n### 🦩 Checkpoint\n\n- [Exp5_1](https://huggingface.co/SWHL/TrOCR-Formula-Rec/tree/main/Exp5_1)\n- [Exp8](https://huggingface.co/SWHL/TrOCR-Formula-Rec/tree/main/Exp8)\n- [Exp10](https://huggingface.co/SWHL/TrOCR-Formula-Rec/tree/main/Exp10)\n\n### 🔢 Dataset\n\n⚠️ Note: the `dataset` directory in this repository contains only samples; download the full dataset yourself and place it there.\n\n[UniMER_Dataset](https://huggingface.co/datasets/wanderkid/UniMER_Dataset)\nThe full UniMER directory layout is as follows:\n\n```text\ndataset\n├── UniMER-1M\n│   ├── images\n│   └── train.txt\n└── UniMER-Test\n    ├── cpe\n    ├── hwe\n    ├── sce\n    ├── spe\n    ├── cpe.txt\n    ├── hwe.txt\n    ├── sce.txt\n    └── spe.txt\n```\n\nThe training set contains 1,061,791 LaTeX-image pairs in total.\n\nThe test set covers four types of formulas, 23,757 images in total:\n\n- Simple Printed Expressions (SPE): 6,762 samples\n- Complex Printed Expressions 
(CPE): 5,921 samples\n- Screen Capture Expressions (SCE): 4,742 samples\n- Handwritten Expressions (HWE): 6,332 samples\n\nSample images for each type:\n\n\u003cdiv align=\"center\"\u003e\n    \u003cimg src=\"https://github.com/SWHL/TrOCR-Formula-Rec/releases/download/v0.0.0/dataset_deom.png\"\u003e\n\u003c/div\u003e\n\n### 🔢 Other Datasets and Projects\n\n- [fusion-image-to-latex-datasets](https://huggingface.co/datasets/hoang-quoc-trung/fusion-image-to-latex-datasets)\n- [TexTeller](https://github.com/OleehyO/TexTeller)\n\n### 📞 WeChat Group\n\nFollow the WeChat official account RapidAI and reply “公式识别” (formula recognition) in its chat to join the group.\n\n### 📚 Reference\n\n- [UniMERNet](https://github.com/opendatalab/UniMERNet)\n- [TrOCR-Handwritten-Mathematical-Expression-Recognition](https://github.com/win5923/TrOCR-Handwritten-Mathematical-Expression-Recognition.git)\n- [Transformers-Tutorials](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/TrOCR/Fine_tune_TrOCR_on_IAM_Handwriting_Database_using_Seq2SeqTrainer.ipynb)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fswhl%2Ftrocr-formula-rec","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fswhl%2Ftrocr-formula-rec","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fswhl%2Ftrocr-formula-rec/lists"}