{"id":17182421,"url":"https://github.com/bertsky/ocrd_publaynet","last_synced_at":"2025-04-13T17:52:33.324Z","repository":{"id":57447723,"uuid":"241478542","full_name":"bertsky/ocrd_publaynet","owner":"bertsky","description":"convert PubLayNet data into METS/PAGE-XML","archived":false,"fork":false,"pushed_at":"2020-03-17T21:51:36.000Z","size":6,"stargazers_count":10,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-03-27T08:48:28.016Z","etag":null,"topics":["ocr-d"],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/bertsky.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2020-02-18T22:11:09.000Z","updated_at":"2021-01-02T16:51:04.000Z","dependencies_parsed_at":"2022-09-15T22:12:59.601Z","dependency_job_id":null,"html_url":"https://github.com/bertsky/ocrd_publaynet","commit_stats":null,"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bertsky%2Focrd_publaynet","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bertsky%2Focrd_publaynet/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bertsky%2Focrd_publaynet/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bertsky%2Focrd_publaynet/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/bertsky","download_url":"https://codeload.github.com/bertsky/ocrd_publaynet/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","re
positories_count":248501884,"owners_count":21114681,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ocr-d"],"created_at":"2024-10-15T00:37:02.934Z","updated_at":"2025-04-13T17:52:33.303Z","avatar_url":"https://github.com/bertsky.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# ocrd_publaynet\n\n    convert PubLayNet data into METS/PAGE-XML\n    \n## Introduction\n\nThis offers [OCR-D](https://ocr-d.github.io) compliant (i.e. [METS-XML](https://ocr-d.github.io/en/spec/mets)/[PAGE-XML](https://ocr-d.github.io/en/spec/page) based) conversion for [PubLayNet](https://github.com/ibm-aur-nlp/PubLayNet) or similar, [MS-COCO](http://cocodataset.org/#format-data)-based, ground-truth data.\n\n## Installation\n\n### System packages\n\nInstall GNU `make` and `wget` if you wish to use the Makefile.\n\n    # on Debian / Ubuntu:\n    sudo apt install make wget\n\nInstall Python3 regardless:\n\n    # on Debian / Ubuntu:\n    sudo apt install python3 python3-pip python3-venv\n\nEquivalently:\n\n    # on Debian / Ubuntu:\n    sudo make deps-ubuntu\n\n### Python packages\n\nIt is strongly recommended to use [venv](https://packaging.python.org/guides/installing-using-pip-and-virtual-environments/). 
You can create and activate a virtual environment of your own (which the Makefile will then re-use), or have the Makefile do that for you.\n\n    pip install -r requirements.txt\n    pip install .\n    \nEquivalently:\n\n    make install\n\n## Usage\n\n### command-line interface `ocrd-import-mscoco`\n\nOnce installed, the following executable becomes available:\n\n```\nUsage: ocrd-import-mscoco [OPTIONS] COCOFILE DIRECTORY\n\n  Convert MS-COCO JSON to METS/PAGE XML files.\n\n  Load JSON ``cocofile`` (in MS-COCO format) and chdir to ``directory``\n  (which it refers to).\n\n  Start a METS file mets.xml with references to the image files (under\n  fileGrp ``OCR-D-IMG``) and their corresponding PAGE-XML annotations (under\n  fileGrp ``OCR-D-GT-SEG-BLOCK``), as parsed from ``cocofile`` and written\n  using the same basename.\n\nOptions:\n  --help  Show this message and exit.\n```\n\n### apply on `PubLayNet`\n\nTo apply it to the _validation_ subset:\n\n    ocrd-import-mscoco publaynet/val.json publaynet/val\n\nThis will create a METS file `publaynet/val/mets.xml` and PAGE files `publaynet/val/*.xml` for all image files.\n\nTo apply it to the _training_ subset:\n\n    ocrd-import-mscoco publaynet/train.json publaynet/train\n\nThis will create a METS file `publaynet/train/mets.xml` and PAGE files `publaynet/train/*.xml` for all image files.\n\nEquivalently (including download/extraction if necessary):\n\n    make convert\n\n\u003e **Note**: PubLayNet itself requires approximately 103 GB of disk space. If you already have it (elsewhere), but still wish to use the Makefile to convert the files, make sure to symlink it here, so it does not get downloaded twice: `ln -s your/path/to/publaynet publaynet`\n\n\u003e **Note**: PubLayNet's `train.json` is 1.6 GB on disk and takes about 10 GB of (resident!) memory to load. Any incremental/stream-based method would be orders of magnitude slower than plain `json.load()`. 
Also, the MS-COCO file cannot be split, because it basically defines a (humongous) `annotations` list with pointers into a (large) `images` list, both stored sequentially. Another problem is that we cannot parallelize this, since everything needs to be in one final METS file. So this may take days. Just grin and bear it!\n\n### all Makefile targets\n\n```\nRules to install ocrd-import-mscoco, and to use it on\nPubLayNet (by downloading, extracting and converting).\n\nTargets:\n\thelp: this message\n\tdeps-ubuntu: install system dependencies for Ubuntu\n\tall: alias for `install download convert`\n\tinstall: alias for `pip install .`\n\tdownload: alias for `publaynet.tar.gz`\n\tconvert: alias for `publaynet/val/mets.xml publaynet/train/mets.xml`\n\tuninstall: alias for `pip uninstall ocrd_publaynet`\n\tclean-xml: remove results of conversion (METS and PAGE files in `publaynet`)\n\tclean: remove `publaynet` altogether\n\nVariables:\n\tVIRTUAL_ENV: absolute path to (re-)use for the virtual environment\n\tPYTHON: name of the Python binary\n\tPIP: name of the Python packaging binary\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbertsky%2Focrd_publaynet","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbertsky%2Focrd_publaynet","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbertsky%2Focrd_publaynet/lists"}