{"id":37078378,"url":"https://github.com/ram02z/grobid","last_synced_at":"2026-01-14T09:10:11.079Z","repository":{"id":48911665,"uuid":"517045270","full_name":"ram02z/grobid","owner":"ram02z","description":"Python library for serializing GROBID TEI XML to dataclass","archived":false,"fork":false,"pushed_at":"2022-07-23T20:09:41.000Z","size":110,"stargazers_count":8,"open_issues_count":1,"forks_count":3,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-09-02T10:42:25.201Z","etag":null,"topics":["client-library","dataclasses","grobid","json","orjson","python","xml-parser"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ram02z.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2022-07-23T11:59:00.000Z","updated_at":"2025-08-01T23:22:01.000Z","dependencies_parsed_at":"2022-09-02T22:11:06.315Z","dependency_job_id":null,"html_url":"https://github.com/ram02z/grobid","commit_stats":null,"previous_names":[],"tags_count":5,"template":false,"template_full_name":null,"purl":"pkg:github/ram02z/grobid","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ram02z%2Fgrobid","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ram02z%2Fgrobid/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ram02z%2Fgrobid/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ram02z%2Fgrobid/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ram02z","download_url":"https://codeload.github.com/ram02z/grobid/tar.gz/refs/heads/master","sbom_url":"https:/
/repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ram02z%2Fgrobid/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28414777,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-14T08:38:59.149Z","status":"ssl_error","status_checked_at":"2026-01-14T08:38:43.588Z","response_time":107,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["client-library","dataclasses","grobid","json","orjson","python","xml-parser"],"created_at":"2026-01-14T09:10:10.486Z","updated_at":"2026-01-14T09:10:11.070Z","avatar_url":"https://github.com/ram02z.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# grobid\n\u003e Python library for serializing GROBID TEI XML to [dataclasses](https://docs.python.org/3/library/dataclasses.html)\n\n[![Build Status](https://github.com/ram02z/grobid/workflows/tests/badge.svg)](https://github.com/ram02z/grobid/actions)\n[![Coverage Status](https://coveralls.io/repos/github/ram02z/grobid/badge.svg)](https://coveralls.io/github/ram02z/grobid)\n[![Latest Version](https://img.shields.io/pypi/v/grobid.svg)](https://pypi.python.org/pypi/grobid)\n[![Python 
Version](https://img.shields.io/pypi/pyversions/grobid.svg)](https://pypi.python.org/pypi/grobid)\n[![License](https://img.shields.io/badge/MIT-blue.svg)](https://opensource.org/licenses/MIT)\n\n## Installation\n\nUse `pip` to install:\n\n```shell\n$ pip install grobid\n$ pip install grobid[json] # for JSON serializable dataclass objects\n```\n\n\nYou can also download the `.whl` file from the release section:\n\n```shell\n$ pip install *.whl\n```\n\n## Usage\n\n### Client\n\nTo convert an academic PDF to a TEI XML file, we use GROBID's REST\nservices, specifically the [processFulltextDocument](https://grobid.readthedocs.io/en/latest/Grobid-service/#apiprocessfulltextdocument) endpoint.\n\n\n```python\nfrom pathlib import Path\nfrom grobid.models.form import Form, File\nfrom grobid.models.response import Response\n\npdf_file = Path(\"\u003cyour-academic-article\u003e.pdf\")\nwith open(pdf_file, \"rb\") as file:\n    form = Form(\n        file=File(\n            payload=file.read(),\n            file_name=pdf_file.name,\n            mime_type=\"application/pdf\",\n        )\n    )\n    c = Client(base_url=\"\u003cbase-url\u003e\", form=form)\n    try:\n        xml_content = c.sync_request().content  # TEI XML file in bytes\n    except GrobidClientError as e:\n        print(e)\n```\n\nwhere `base-url` is the URL of the GROBID REST service.\n\n\u003e You can use `https://cloud.science-miner.com/grobid/` to test\n\n#### [Form](https://github.com/ram02z/grobid/blob/master/src/grobid/models/form.py#L20)\n\nThe `Form` class supports most of the optional parameters of the processFulltextDocument\nendpoint.\n\n\n### Parser\n\nTo serialize the XML content, use the `Parser` class to\ncreate [dataclass](https://docs.python.org/3/library/dataclasses.html)\nobjects.\n\nNot all of the GROBID annotation guidelines are met yet, but compliance is a goal.\nSee [#1](https://github.com/ram02z/grobid/issues/1).\n\n```python\nfrom grobid.tei import 
Parser\n\nxml_content: bytes\nparser = Parser(xml_content)\narticle = parser.parse()\narticle.to_json()  # raises RuntimeError if the 'json' extra is not installed\n```\n\nwhere `xml_content` is the same as in the [Client section](#client).\n\nAlternatively, you can load the XML from a file:\n\n```python\nfrom grobid.tei import Parser\n\nwith open(\"\u003cyour-academic-article\u003e.xml\", \"rb\") as xml_file:\n  xml_content = xml_file.read()\n  parser = Parser(xml_content)\n  article = parser.parse()\n  article.to_json()  # raises RuntimeError if the 'json' extra is not installed\n```\n\nWe use [orjson](https://github.com/ijl/orjson) to provide a `to_json` method that\nserializes the dataclasses into JSON. orjson isn't installed by default; to get it, use\n`pip install grobid[json]`.\n\n## License\n\nMIT\n\n## Contributing\n\nYou are welcome to add missing features by submitting a PR; however, I won't be\naccepting any requests other than GROBID annotation compliance.\n\n## Disclaimer\n\nThis module was originally part of a group university project; however, all the\ncode and tests were authored by me.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fram02z%2Fgrobid","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fram02z%2Fgrobid","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fram02z%2Fgrobid/lists"}