{"id":15044279,"url":"https://github.com/rqluo/mixtex-datahub","last_synced_at":"2025-10-24T07:31:00.265Z","repository":{"id":250807669,"uuid":"835379712","full_name":"RQLuo/MixTeX-DataHub","owner":"RQLuo","description":"LaTeXDataHub is an open-source platform dedicated to the sharing and contribution of real-world LaTeX image datasets and their annotations, allows users to upload, download, and contribute to a growing collection of high-quality LaTeX datasets.","archived":false,"fork":false,"pushed_at":"2024-08-13T03:10:56.000Z","size":19,"stargazers_count":10,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-01-31T00:25:37.559Z","etag":null,"topics":["data","deep-learning","latex","machine-learning","ocr"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/RQLuo.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-07-29T18:00:22.000Z","updated_at":"2024-08-25T02:43:01.000Z","dependencies_parsed_at":"2024-08-13T04:27:51.699Z","dependency_job_id":"823cd446-cc70-4a2e-8348-f6c318b6feb6","html_url":"https://github.com/RQLuo/MixTeX-DataHub","commit_stats":null,"previous_names":["rqluo/latexdatahub","rqluo/mixtex-datahub"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/RQLuo%2FMixTeX-DataHub","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/RQLuo%2FMixTeX-DataHub/tags","releases_url":"https://repos.ecosyste.
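ms/api/v1/hosts/GitHub/repositories/RQLuo%2FMixTeX-DataHub/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/RQLuo%2FMixTeX-DataHub/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/RQLuo","download_url":"https://codeload.github.com/RQLuo/MixTeX-DataHub/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":237932070,"owners_count":19389560,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data","deep-learning","latex","machine-learning","ocr"],"created_at":"2024-09-24T20:50:22.906Z","updated_at":"2025-10-24T07:30:59.952Z","avatar_url":"https://github.com/RQLuo.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# LaTeXDataHub\nLaTeXDataHub is an open-source platform dedicated to sharing and contributing real-world LaTeX image datasets and their annotations. It allows users to upload, download, and contribute to a growing collection of high-quality LaTeX datasets (datasets in **any language** are accepted). To ensure that the data does not depend on third-party platforms and can be shared anywhere, we recommend using magnet links to deliver datasets.\n\n## Recommended annotation methods\n\nFor images of reasonably standard, modern printed LaTeX documents, you can use MixTeX directly. It is already fairly accurate, so you only need to correct a small number of errors.\n\nFor handwritten LaTeX or LaTeX from older textbooks, MixTeX has not yet been trained on such data and does not perform well.\n\nYou can use ChatGPT or Claude to assist with annotation. You can use the following prompt as a reference: _latex ocr, output directly, put all formulas in align*, keep text outside, write inline formulas as \\( .. 
\\), no filler, do not carry over prior context, output the OCR result directly:_\n\n## Common dataset collection project 1: modern printed documents on which MixTeX performs poorly (corresponding model size \u003c100M parameters); unannotated images accepted\nThese images are easy to obtain by taking screenshots on a computer. For this dataset you can usually use MixTeX directly. It is already fairly accurate, so you only need to correct a small number of errors.\nIn a future version of the MixTeX app we will provide four annotation options: full corrected annotation submission, minor error feedback, formula compilation failure feedback, and duplicate feedback.\nFull corrected annotation submissions will be used to continue the previous training; for the feedback options we will explore RLHF training methods.\n![85bc606db5bb0fba07acd2656cbf777](https://github.com/user-attachments/assets/6e9bca0b-017a-40e2-be81-2c65d931e552)\n\n### LaTeX pseudocode\n![image](https://github.com/user-attachments/assets/b3a19765-66c8-4888-81b8-d7184f7347e0)\n\n![image](https://github.com/user-attachments/assets/3fa05540-aa42-4436-b40c-0cc88754a4e0)\n\nReference dataset: https://huggingface.co/datasets/stanford-crfm/i2s-latex?row=0\n\n## Special dataset collection project 1: handwritten LaTeX drafts (corresponding model size 150-200M parameters); unannotated images accepted\nHandwritten LaTeX can come from handwritten homework or from scratch work produced while working through a derivation. **It must be split into two categories (neat, draft).**\n![屏幕截图 2024-08-05 025839](https://github.com/user-attachments/assets/893d395d-60e4-4c78-a6b7-fc0f97f02528)\n\n## Special dataset collection project 2: blackboard LaTeX (corresponding model size 150-200M parameters); unannotated images accepted\nPlease declare here the link to the video from which you plan to capture blackboard writing, or state that you took the photos in class with the teacher's permission.\n![305fe6afff65edd8015bb24509d74b6](https://github.com/user-attachments/assets/3dbc950c-2594-4ab9-9dcd-fb7a6826a61d)\n\n## Complex instruction dataset 1: recognize LaTeX and translate it into [language] (corresponding model size 300-600M parameters)\n\n## Complex instruction dataset 2: recognize LaTeX and restate it in your own words (corresponding model size 300-600M parameters)\n\n## Very complex instruction dataset 1: recognize blackboard writing and write it up as lecture notes (corresponding model size \u003e 2B parameters)\n\n## Very complex instruction dataset 2: recognize LaTeX and explain it using prerequisite knowledge (corresponding model size \u003e 2B parameters)\n\n## Very complex instruction dataset 3: graded exam papers together with the correct answers (corresponding model size \u003e 2B parameters)\n\n## Datasets that even Claude cannot handle 1: proofs and reasoning (it is not yet known which model could achieve this; this dataset must have a source, ideally proofs and derivations from classic textbooks)\n\n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frqluo%2Fmixtex-datahub","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Frqluo%2Fmixtex-datahub","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frqluo%2Fmixtex-datahub/lists"}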