https://github.com/cboulanger/tei-annotator

Python library for annotating text with TEI XML tags using a two-stage LLM + GLiNER pipeline

annotations llm-inference tei-xml


Host: GitHub
URL: https://github.com/cboulanger/tei-annotator
Owner: cboulanger
License: mit
Created: 2026-02-28T19:00:48.000Z (4 months ago)
Default Branch: main
Last Pushed: 2026-03-10T09:39:59.000Z (3 months ago)
Last Synced: 2026-03-10T17:00:35.890Z (3 months ago)
Topics: annotations, llm-inference, tei-xml
Language: Python
Homepage:
Size: 424 KB
Stars: 0
Watchers: 0
Forks: 1
Open Issues: 1
Metadata Files:
- Readme: README.md

README

---

title: Tei Annotator

emoji: 🦀

colorFrom: green

colorTo: pink

sdk: gradio

sdk_version: 6.9.0

python_version: '3.12'

app_file: app.py

hardware: cpu-basic

pinned: false

license: mit

short_description: Demo for cboulanger/tei-annotator

---

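A Python library for annotating plain text with [TEI XML](https://tei-c.org/) tags. It uses a two-stage model pipeline (optional GLiNER pre-detection, then LLM annotation) followed by deterministic post-processing: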

1. **(Optional) GLiNER pre-detection** — fast CPU-based span labelling generates candidates for the LLM to verify and extend.

2. **LLM annotation** — a prompted language model identifies entities and returns structured spans (element, verbatim text, surrounding context, attributes).

3. **Deterministic post-processing** — spans are resolved to character offsets, validated against the schema, and injected as XML tags. The source text is **never modified** by any model call.

---

## Pipeline stages

```text

  Input text

       │

       ▼  strip existing XML tags

       ▼  (optional) GLiNER pre-detection  ──→  tei_annotator/detection/

       ▼  chunk text                        ──→  tei_annotator/chunking/

       ▼  build LLM prompt                  ──→  tei_annotator/prompting/

       ▼  LLM inference                     ──→  tei_annotator/inference/

       ▼  parse JSON response               ──→  tei_annotator/postprocessing/

       ▼  resolve spans → char offsets

       ▼  validate against schema

       ▼  inject XML tags

       │

       ▼

  Annotated XML output

```

Stage documentation:

[Data models](tei_annotator/models/README.md) ·

[GLiNER detection](tei_annotator/detection/README.md) ·

[Chunking](tei_annotator/chunking/README.md) ·

[Prompt building](tei_annotator/prompting/README.md) ·

[Inference configuration](tei_annotator/inference/README.md) ·

[Post-processing](tei_annotator/postprocessing/README.md) ·

[Evaluation](tei_annotator/evaluation/README.md)
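
As an illustration of the "chunk text" stage in the diagram above, here is a minimal sketch of splitting a long input into windows while remembering each window's offset in the full text, so that spans found inside a chunk can be mapped back to global character offsets. The names `Chunk`, `chunk_text`, and `to_global`, the size defaults, and the use of overlapping windows are all assumptions; see the chunking README for what the library actually does.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    start: int  # offset of this chunk within the full text


def chunk_text(text: str, max_chars: int = 1500, overlap: int = 200) -> list[Chunk]:
    """Split `text` into overlapping windows, preferring to cut at a space."""
    if not 0 <= overlap < max_chars:
        raise ValueError("overlap must satisfy 0 <= overlap < max_chars")
    chunks: list[Chunk] = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        if end < len(text):
            # Back up to the last space so that words are not cut in half,
            # but only if the chunk still advances past the overlap region.
            cut = text.rfind(" ", start + 1, end)
            if cut > start + overlap:
                end = cut
        chunks.append(Chunk(text[start:end], start))
        if end == len(text):
            break
        start = end - overlap
    return chunks


def to_global(chunk: Chunk, local_start: int, local_end: int) -> tuple[int, int]:
    """Convert offsets relative to a chunk into offsets in the full text."""
    return chunk.start + local_start, chunk.start + local_end
```

The overlap gives an entity that straddles a boundary a chance to appear whole in at least one chunk; duplicates found in the overlap region would then need to be merged after conversion to global offsets.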

---

> **Disclaimer:** The code in this repository was generated by [Claude](https://claude.ai) (Anthropic) based on prompts and direction provided by [@cboulanger](https://github.com/cboulanger).

---

## Installation

Requires Python ≥ 3.12 and [uv](https://docs.astral.sh/uv/).

```bash

git clone https://github.com/cboulanger/tei-annotator

cd tei-annotator

uv sync                    # runtime deps: jinja2, lxml, rapidfuzz

uv sync --extra gliner     # also installs gliner for optional pre-detection

```

API keys for LLM endpoints go in `.env` (copy from `.env.template`).

---

## Quick start

```python

from tei_annotator import annotate, TEISchema, TEIElement, TEIAttribute

from tei_annotator import EndpointConfig, EndpointCapability

schema = TEISchema(

    rules=[

        "Emit a 'surname' span within every enclosing 'persName' span.",

    ],

    elements=[

        TEIElement(

            tag="persName",

            description="a person's name",

            attributes=[TEIAttribute(name="ref", description="authority URI")],

        ),

        TEIElement(tag="placeName", description="a geographical place name"),

    ],

)

def my_call_fn(prompt: str) -> str:

    ...  # any LLM: Anthropic, OpenAI, Gemini, Ollama, …

endpoint = EndpointConfig(

    capability=EndpointCapability.TEXT_GENERATION,

    call_fn=my_call_fn,

)

result = annotate(

    text="Marie Curie was born in Warsaw and later worked in Paris.",

    schema=schema,

    endpoint=endpoint,

    gliner_model=None,   # pass e.g. "numind/NuNER_Zero" to enable pre-detection

)

print(result.xml)

# <persName>Marie Curie</persName> was born in <placeName>Warsaw</placeName>

# and later worked in <placeName>Paris</placeName>.

```

For provider setup examples (Anthropic, OpenAI, Gemini, Ollama, vLLM) see [tei_annotator/inference/README.md](tei_annotator/inference/README.md).
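
The `my_call_fn` placeholder in the quick start only needs to map a prompt string to a response string. As one possible sketch, the factory below builds such a function for a local Ollama server using only the standard library. `make_ollama_call_fn` and `_http_post_json` are hypothetical helpers, not part of tei-annotator, and the default URL assumes Ollama's `/api/generate` endpoint on its default port.

```python
import json
import urllib.request
from typing import Callable


def _http_post_json(url: str, payload: dict) -> dict:
    """POST a JSON payload and decode the JSON reply."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=120) as response:
        return json.loads(response.read().decode("utf-8"))


def make_ollama_call_fn(
    model: str,
    url: str = "http://localhost:11434/api/generate",
    post: Callable[[str, dict], dict] = _http_post_json,
) -> Callable[[str], str]:
    """Return a `prompt -> text` function suitable as `call_fn`."""

    def call_fn(prompt: str) -> str:
        # stream=False asks Ollama for a single JSON object instead of a stream.
        reply = post(url, {"model": model, "prompt": prompt, "stream": False})
        return reply["response"]

    return call_fn
```

The result can be passed as `call_fn=make_ollama_call_fn("<model name>")` to `EndpointConfig`. The `post` parameter exists so the HTTP call can be replaced with a stub in tests.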

---

## Built-in providers

Five connectors live in [`tei_annotator/providers/`](tei_annotator/providers/), enabled by setting the corresponding env var:

| Provider | Env var | ID |

| --- | --- | --- |

| HuggingFace Inference Router | `HF_TOKEN` | `hf` |

| Google Gemini | `GEMINI_API_KEY` | `gemini` |

| KISSKI academic cloud | `KISSKI_API_KEY` | `kisski` |

| OpenAI | `OPENAI_API_KEY` | `openai` |

| Anthropic Claude | `ANTHROPIC_API_KEY` | `claude` |

Adding a new provider: create a module in `tei_annotator/providers/`, subclass `Connector`, and add an instance to `_ALL_CONNECTORS` in `__init__.py`. See [tei_annotator/providers/README.md](tei_annotator/providers/README.md).
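
To check which providers a given environment would enable, the table above can be turned into a small lookup. This is a standalone sketch, not the library's own detection logic: `PROVIDER_ENV_VARS` and `available_providers` are hypothetical names, and treating a blank value as "not set" is an assumption.

```python
import os
from collections.abc import Mapping

# Env var -> provider ID, copied from the table above.
PROVIDER_ENV_VARS = {
    "HF_TOKEN": "hf",
    "GEMINI_API_KEY": "gemini",
    "KISSKI_API_KEY": "kisski",
    "OPENAI_API_KEY": "openai",
    "ANTHROPIC_API_KEY": "claude",
}


def available_providers(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the IDs of providers whose env var is set to a non-blank value."""
    return [
        provider_id
        for var, provider_id in PROVIDER_ENV_VARS.items()
        if env.get(var, "").strip()
    ]


if __name__ == "__main__":
    print("enabled providers:", ", ".join(available_providers()) or "none")
```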

---

## Built-in schemas

Two annotation schemas are registered in [`tei_annotator/schemas/registry.py`](tei_annotator/schemas/registry.py):

| Key | Task |

| --- | --- |

| `bibl` | Tag internal fields of a bibliographic reference (author, title, date, …) |

| `bibl-reference-segmenter` | Segment a reference list into `` spans with optional `` |

Each schema ships with at least one gold-standard corpus file in `data/corpus/.default.tei.xml` used by the evaluator and webservice.

Adding a new schema: register it in `SCHEMA_REGISTRY`. See [tei_annotator/schemas/README.md](tei_annotator/schemas/README.md).

---

## Evaluation and iterative improvement

`scripts/evaluate_llm.py` runs any available provider against a gold-standard TEI file:

```bash

# quick run: 5 records, gemini, bibl-reference-segmenter schema

uv run scripts/evaluate_llm.py \

    --provider gemini --schema bibl-reference-segmenter --max-items 5 --verbose

# all available providers, all records, output to file

uv run scripts/evaluate_llm.py --schema bibl --output-file results.txt

```

Key flags: `--provider`, `--model`, `--schema`, `--gold-file`, `--max-items`, `--batch-size`, `--match-mode`, `--verbose`, `--grep`, `--shuffle`.

`scripts/collect_hard_examples.py` builds a gold fixture of challenging examples by evaluating items in mini-batches and retaining those the model handles poorly:

```bash

# collect 30 hard bibl-reference-segmenter examples using KISSKI gemma-4-31b-it

uv run scripts/collect_hard_examples.py \

    --provider kisski --model gemma-4-31b-it \

    --limit 30 --batch-size 10 --f1-threshold 0.95 \

    --output data/hard-bibl-refseg-gemma.tei.xml

```

Key flags: `--schema`, `--provider`, `--model`, `--limit`, `--batch-size`, `--f1-threshold`, `--max-per-batch`, `--context`, `--shuffle`.

For the iterative schema-improvement workflow see [docs/tei-element-descriptions.md](docs/tei-element-descriptions.md). For metrics details see [tei_annotator/evaluation/README.md](tei_annotator/evaluation/README.md).
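
The idea behind `--f1-threshold` can be sketched as follows: score each item by the F1 between its gold and predicted spans, then keep the items that fall below the threshold. This is an assumption-laden illustration, not the scripts' code. `span_f1` and `hard_examples` are hypothetical names, spans are compared by exact `(element, text)` equality (the real `--match-mode` options may differ), and whether the script uses a strict `<` comparison is a guess.

```python
from collections import Counter

SpanKey = tuple[str, str]  # (element, text)


def span_f1(gold: list[SpanKey], predicted: list[SpanKey]) -> float:
    """Exact-match F1 between two span lists, counting duplicates."""
    if not gold and not predicted:
        return 1.0  # nothing to find and nothing predicted
    true_positives = sum((Counter(gold) & Counter(predicted)).values())
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


def hard_examples(
    items: list[tuple[str, list[SpanKey], list[SpanKey]]],
    threshold: float = 0.95,
) -> list[str]:
    """Return IDs of items whose F1 falls below `threshold`."""
    return [
        item_id
        for item_id, gold, predicted in items
        if span_f1(gold, predicted) < threshold
    ]
```

With a threshold of 0.95, an item with even one missed or spurious span is usually retained, since a single error among a handful of spans already pushes F1 well below that value.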

---

## Demo and webservice

- **HuggingFace demo:** 

- **`app.py`** — Gradio app for HuggingFace Spaces. See [docs/huggingface-deployment.md](docs/huggingface-deployment.md).

- **`webservice/`** — FastAPI JSON API + browser UI, all five providers. See [webservice/README.md](webservice/README.md).

---

## Testing

```bash

# Unit tests (fully mocked, < 0.5 s)

uv run pytest

# Integration tests (no model download needed)

uv run pytest --override-ini="addopts=" -m integration \

    tests/integration/test_pipeline_e2e.py -k "not real_gliner"

# Integration tests with real GLiNER model (~400 MB on first run)

uv run pytest --override-ini="addopts=" -m integration \

    tests/integration/test_gliner_detector.py \

    tests/integration/test_pipeline_e2e.py::test_pipeline_with_real_gliner

```
