{"id":37711452,"url":"https://github.com/qurator-spk/page2tsv","last_synced_at":"2026-01-16T13:19:43.230Z","repository":{"id":43723249,"uuid":"228418795","full_name":"qurator-spk/page2tsv","owner":"qurator-spk","description":"PAGE-XML to TSV","archived":false,"fork":false,"pushed_at":"2025-04-23T13:29:40.000Z","size":350,"stargazers_count":4,"open_issues_count":5,"forks_count":7,"subscribers_count":4,"default_branch":"master","last_synced_at":"2025-04-23T14:02:09.759Z","etag":null,"topics":["ocr-d","qurator"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/qurator-spk.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2019-12-16T15:36:34.000Z","updated_at":"2025-04-23T13:29:43.000Z","dependencies_parsed_at":"2023-01-22T10:30:10.642Z","dependency_job_id":"70d320c0-9a72-420f-900c-c6c0c324c82b","html_url":"https://github.com/qurator-spk/page2tsv","commit_stats":null,"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/qurator-spk/page2tsv","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/qurator-spk%2Fpage2tsv","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/qurator-spk%2Fpage2tsv/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/qurator-spk%2Fpage2tsv/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/qurator-spk%2Fpage2tsv/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitH
ub/owners/qurator-spk","download_url":"https://codeload.github.com/qurator-spk/page2tsv/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/qurator-spk%2Fpage2tsv/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28479026,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-16T11:59:17.896Z","status":"ssl_error","status_checked_at":"2026-01-16T11:55:55.838Z","response_time":107,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ocr-d","qurator"],"created_at":"2026-01-16T13:19:42.556Z","updated_at":"2026-01-16T13:19:43.216Z","avatar_url":"https://github.com/qurator-spk.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# TSV - Processing Tools\n\nCreate .tsv files that can be viewed and edited with [neat](https://github.com/qurator-spk/neat).\n\n## Installation:\n\nThe required Python version is 3.11. \nConsider using [pyenv](https://github.com/pyenv/pyenv) if that Python version is not available on your system. 
\n\nActivate virtual environment (virtualenv):\n```\nsource venv/bin/activate\n```\nor (pyenv):\n```\npyenv activate my-python-3.11-virtualenv\n```\n\nUpdate pip:\n```\npip install -U pip\n```\nInstall tsvtools:\n```\npip install git+https://github.com/qurator-spk/page2tsv.git\n```\n\n## PAGE-XML to TSV Transformation:\n\nCreate a TSV file from OCR in PAGE-XML format (with word segmentation):\n\n```\npage2tsv PAGE1.xml PAGE.tsv --image-url=http://link-to-corresponding-image-1\n```\n\nIn order to create a TSV file for multiple PAGE XML files just perform successive calls\nof the tool using the same TSV file:\n\n```\npage2tsv PAGE1.xml PAGE.tsv --image-url=http://link-to-corresponding-image-1\npage2tsv PAGE2.xml PAGE.tsv --image-url=http://link-to-corresponding-image-2\npage2tsv PAGE3.xml PAGE.tsv --image-url=http://link-to-corresponding-image-3\npage2tsv PAGE4.xml PAGE.tsv --image-url=http://link-to-corresponding-image-4\npage2tsv PAGE5.xml PAGE.tsv --image-url=http://link-to-corresponding-image-5\n...\n...\n...\n```\n\nFor instance, for the file [example.xml](https://github.com/qurator-spk/page2tsv/blob/master/example.xml):\n\n```\npage2tsv example.xml example.tsv --image-url=http://content.staatsbibliothek-berlin.de/zefys/SNP27646518-18800101-0-3-0-0/left,top,width,height/full/0/default.jpg\n```\n\n---\n\n## Processing of already existing TSV files:\n\nCreate a URL-annotated TSV file from an existing TSV file:\n\n```\nannotate-tsv enp_DE.tsv enp_DE-annotated.tsv\n```\n\n# Command-line interface:\n\n```\npage2tsv --help\nUsage: page2tsv [OPTIONS] PAGE_XML_FILE TSV_OUT_FILE\n\n  Converts a page-XML file into a TSV file that can be edited with neat.\n  Optionally the tool also accepts NER and Entitiy Linking API-Endpoints as\n  parameters and performs NER and EL and the document if these are provided.\n\n  PAGE_XML_FILE: The source page-XML file. 
TSV_OUT_FILE: Resulting TSV file.\n\nOptions:\n  --purpose [NERD|OCR]       Purpose of output tsv file.\n                             \n                             NERD: NER/NED application/ground-truth creation.\n                             \n                             OCR: OCR application/ground-truth creation.\n                             \n                             default: NERD.\n  --image-url TEXT           An image retrieval link that enables neat to show\n                             the scan images corresponding to the text tokens.\n                             Example: https://content.staatsbibliothek-berlin.\n                             de/zefys/SNP26824620-18371109-0-1-0-0/left,top,wi\n                             dth,height/full/0/default.jpg\n  --ner-rest-endpoint TEXT   REST endpoint of sbb_ner service. See\n                             https://github.com/qurator-spk/sbb_ner for\n                             details. Only applicable in case of NERD.\n  --ned-rest-endpoint TEXT   REST endpoint of sbb_ned service. See\n                             https://github.com/qurator-spk/sbb_ned for\n                             details. Only applicable in case of NERD.\n  --noproxy                  disable proxy. default: enabled.\n  --scale-factor FLOAT       default: 1.0\n  --ned-threshold FLOAT\n  --min-confidence FLOAT\n  --max-confidence FLOAT\n  --ned-priority INTEGER\n  --normalization-file PATH\n  --help                     Show this message and exit.\n```\n\n```\ntsv2tsv --help\nUsage: tsv2tsv [OPTIONS] TSV_IN_FILE\n\nOptions:\n  --tsv-out-file PATH          Write modified TSV to this file.\n  --ner-rest-endpoint TEXT     REST endpoint of sbb_ner service. See\n                               https://github.com/qurator-spk/sbb_ner for\n                               details.\n  --noproxy                    disable proxy. 
default: enabled.\n  --num-tokens                 Print number of tokens in input/output file.\n  --sentence-count             Print sentence count in input/output file.\n  --max-sentence-len           Print maximum sentence len for input/output\n                               file.\n  --keep-tokenization          Keep the word tokenization exactly as it is.\n  --sentence-split-only        Do only sentence splitting.\n  --show-urls                  Print contained visualization URLs.\n  --just-zero                  Process only files that have max sentence\n                               length zero,i.e., that do not have sentence\n                               splitting.\n  --sanitize-sentence-numbers  Sanitize sentence numbering.\n  --show-columns               Show TSV columns.\n  --drop-column TEXT           Drop column\n  --help                       Show this message and exit.\n```\n\n```\nalto2tsv --help\nUsage: alto2tsv [OPTIONS] ALTO_XML_FILE TSV_OUT_FILE\n\n  Converts a ALTO-XML file into a TSV file that can be edited with neat.\n  Optionally the tool also accepts NER and Entitiy Linking API-Endpoints as\n  parameters and performs NER and EL and the document if these are provided.\n\n  ALTO_XML_FILE: The source ALTO-XML file. 
\n  TSV_OUT_FILE: Resulting TSV file.\n\nOptions:\n  --purpose [NERD|OCR]      Purpose of output tsv file.\n                            \n                            NERD: NER/NED application/ground-truth creation.\n                            \n                            OCR: OCR application/ground-truth creation.\n                            \n                            default: NERD.\n  --image-url TEXT          An image retrieval link that enables neat to show\n                            the scan images corresponding to the text tokens.\n                            Example: https://content.staatsbibliothek-berlin.d\n                            e/zefys/SNP26824620-18371109-0-1-0-0/left,top,widt\n                            h,height/full/0/default.jpg\n  --ner-rest-endpoint TEXT  REST endpoint of sbb_ner service. See\n                            https://github.com/qurator-spk/sbb_ner for\n                            details. Only applicable in case of NERD.\n  --ned-rest-endpoint TEXT  REST endpoint of sbb_ned service. See\n                            https://github.com/qurator-spk/sbb_ned for\n                            details. Only applicable in case of NERD.\n  --noproxy                 disable proxy. default: enabled.\n  --scale-factor FLOAT      default: 1.0\n  --ned-threshold FLOAT\n  --ned-priority INTEGER\n  --help                    Show this message and exit.\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fqurator-spk%2Fpage2tsv","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fqurator-spk%2Fpage2tsv","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fqurator-spk%2Fpage2tsv/lists"}