https://github.com/ram02z/grobid

Python library for serializing GROBID TEI XML to dataclass

client-library dataclasses grobid json orjson python xml-parser

Host: GitHub
URL: https://github.com/ram02z/grobid
Owner: ram02z
License: mit
Created: 2022-07-23T11:59:00.000Z (almost 4 years ago)
Default Branch: master
Last Pushed: 2022-07-23T20:09:41.000Z (almost 4 years ago)
Last Synced: 2025-09-02T10:42:25.201Z (11 months ago)
Topics: client-library, dataclasses, grobid, json, orjson, python, xml-parser
Language: Python
Homepage:
Size: 107 KB
Stars: 8
Watchers: 1
Forks: 3
Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE

# grobid

> Python library for serializing GROBID TEI XML to [dataclasses](https://docs.python.org/3/library/dataclasses.html)

[![Build Status](https://github.com/ram02z/grobid/workflows/tests/badge.svg)](https://github.com/ram02z/grobid/actions)

[![Coverage Status](https://coveralls.io/repos/github/ram02z/grobid/badge.svg)](https://coveralls.io/github/ram02z/grobid)

[![Latest Version](https://img.shields.io/pypi/v/grobid.svg)](https://pypi.python.org/pypi/grobid)

[![Python Version](https://img.shields.io/pypi/pyversions/grobid.svg)](https://pypi.python.org/pypi/grobid)

[![License](https://img.shields.io/badge/MIT-blue.svg)](https://opensource.org/licenses/MIT)

## Installation

Use `pip` to install:

```shell
$ pip install grobid
$ pip install grobid[json] # for JSON serializable dataclass objects
```

You can also download the `.whl` file from the release section:

```shell
$ pip install *.whl
```

## Usage

### Client

To convert an academic PDF to a TEI XML file, we use GROBID's REST services, specifically the [processFulltextDocument](https://grobid.readthedocs.io/en/latest/Grobid-service/#apiprocessfulltextdocument) endpoint.

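```python
from pathlib import Path

# Import path for Client and GrobidClientError is assumed;
# the original snippet uses both names without importing them.
from grobid.client import Client, GrobidClientError
from grobid.models.form import Form, File
from grobid.models.response import Response

pdf_file = Path(".pdf")  # replace with the path to your PDF

with open(pdf_file, "rb") as file:
    # Wrap the raw PDF bytes in the multipart form sent to GROBID
    form = Form(
        file=File(
            payload=file.read(),
            file_name=pdf_file.name,
            mime_type="application/pdf",
        )
    )
    c = Client(base_url="<base-url>", form=form)  # URL of the GROBID service
    try:
        xml_content = c.sync_request().content  # TEI XML file in bytes
    except GrobidClientError as e:
        print(e)
```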

where `base_url` is the URL of the GROBID REST service.

> You can use `https://cloud.science-miner.com/grobid/` to test

#### [Form](https://github.com/ram02z/grobid/blob/master/src/grobid/models/form.py#L20)

The `Form` class supports most of the optional parameters of the processFulltextDocument endpoint.
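For orientation, the sketch below collects a few of the optional form fields documented for GROBID's processFulltextDocument endpoint into a plain dictionary. The helper `build_fulltext_fields` and its argument names are hypothetical and not part of this library. Only the field names (`consolidateHeader`, `consolidateCitations`, `includeRawCitations`, `segmentSentences`) come from the GROBID service documentation. Check the linked `form.py` for the attributes `Form` actually exposes.

```python
from typing import Dict, Optional


def build_fulltext_fields(
    consolidate_header: Optional[int] = None,
    consolidate_citations: Optional[int] = None,
    include_raw_citations: bool = False,
    segment_sentences: bool = False,
) -> Dict[str, str]:
    """Hypothetical helper: map Python arguments to GROBID form field names."""
    fields: Dict[str, str] = {}
    # Consolidation levels are sent as small integers encoded as strings
    if consolidate_header is not None:
        fields["consolidateHeader"] = str(consolidate_header)
    if consolidate_citations is not None:
        fields["consolidateCitations"] = str(consolidate_citations)
    # Boolean switches are sent as "1"; omitting a field keeps the server default
    if include_raw_citations:
        fields["includeRawCitations"] = "1"
    if segment_sentences:
        fields["segmentSentences"] = "1"
    return fields


fields = build_fulltext_fields(consolidate_header=1, segment_sentences=True)
print(fields)
```

Leaving a field out of the dictionary, rather than sending an explicit "0", lets the GROBID server apply its own default for that parameter.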

### Parser

To serialize the XML content, use the `Parser` class to create [dataclass](https://docs.python.org/3/library/dataclasses.html) objects.

Not all of the GROBID annotation guidelines are met yet, but compliance is a goal. See [#1](https://github.com/ram02z/grobid/issues/1).

```python
from grobid.tei import Parser

xml_content: bytes  # TEI XML bytes, e.g. as returned by the client

parser = Parser(xml_content)
article = parser.parse()
article.to_json()  # raises RuntimeError if the 'json' extra is not installed
```

where `xml_content` is the TEI XML content, in bytes, obtained in the [Client](#client) section.
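To illustrate the general idea of turning TEI XML into dataclasses, here is a minimal standard-library sketch. It is not the library's `Parser` implementation: the `MiniArticle` class, the `parse_mini_article` function, and the sample document are all hypothetical, and only the title and keywords are extracted.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import List

# TEI documents produced by GROBID use this default XML namespace
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


@dataclass
class MiniArticle:
    """Hypothetical stand-in for a parsed article."""

    title: str
    keywords: List[str]


def parse_mini_article(xml_content: bytes) -> MiniArticle:
    root = ET.fromstring(xml_content)
    # The main title lives under teiHeader/fileDesc/titleStmt
    title_el = root.find(".//tei:titleStmt/tei:title", TEI_NS)
    title = (title_el.text or "").strip() if title_el is not None else ""
    # Keywords are <term> elements inside <keywords>
    keywords = [
        (term.text or "").strip()
        for term in root.findall(".//tei:keywords/tei:term", TEI_NS)
    ]
    return MiniArticle(title=title, keywords=keywords)


sample = b"""<?xml version="1.0" encoding="UTF-8"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title level="a" type="main">A Sample Paper</title></titleStmt>
    </fileDesc>
    <profileDesc>
      <textClass><keywords><term>parsing</term><term>TEI</term></keywords></textClass>
    </profileDesc>
  </teiHeader>
</TEI>"""

article = parse_mini_article(sample)
print(article)
```

A real GROBID response carries far more structure (authors, affiliations, sections, citations), which is what the library's dataclasses are meant to cover.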

Alternatively, you can load the XML from a file:

```python
from grobid.tei import Parser

with open(".xml", "rb") as xml_file:  # replace with the path to your TEI XML file
    xml_content = xml_file.read()
    parser = Parser(xml_content)
    article = parser.parse()
    article.to_json()  # raises RuntimeError if the 'json' extra is not installed
```

We use [orjson](https://github.com/ijl/orjson) to provide a `to_json` method that serializes the dataclasses into JSON. orjson isn't installed by default; to get it, use `pip install grobid[json]`.
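The following sketch shows how an optional orjson dependency can be handled when serializing a dataclass. It is an illustration only: the `Note` class and the `dumps` helper are hypothetical, and unlike `to_json`, which raises `RuntimeError` when the extra is missing, this version falls back to the standard library's `json` module.

```python
import dataclasses
import json
from dataclasses import dataclass
from typing import List

try:
    import orjson  # optional dependency, installed by the 'json' extra
except ImportError:
    orjson = None


@dataclass
class Note:
    """Hypothetical dataclass used only for this example."""

    title: str
    tags: List[str]


def dumps(obj) -> bytes:
    """Serialize a dataclass instance to JSON bytes."""
    if orjson is not None:
        # orjson serializes dataclass instances natively and returns bytes
        return orjson.dumps(obj)
    # Fallback: convert to a dict first, then encode with the standard library
    return json.dumps(dataclasses.asdict(obj)).encode("utf-8")


payload = dumps(Note(title="A Sample Paper", tags=["parsing", "TEI"]))
print(json.loads(payload))
```

Both branches return bytes, so callers do not need to know which serializer was used. The exact whitespace in the output may differ between the two.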

## License

MIT

## Contributing

You are welcome to add missing features by submitting a PR; however, I won't be accepting any requests other than GROBID annotation compliance.

## Disclaimer

This module was originally part of a group university project; however, all the code and tests were authored by me.
