https://github.com/cboulanger/tei-annotator
Python library for annotating text with TEI XML tags using a two-stage LLM + GLiNER pipeline
- Host: GitHub
- URL: https://github.com/cboulanger/tei-annotator
- Owner: cboulanger
- License: mit
- Created: 2026-02-28T19:00:48.000Z (4 months ago)
- Default Branch: main
- Last Pushed: 2026-03-10T09:39:59.000Z (3 months ago)
- Last Synced: 2026-03-10T17:00:35.890Z (3 months ago)
- Topics: annotations, llm-inference, tei-xml
- Language: Python
- Homepage:
- Size: 424 KB
- Stars: 0
- Watchers: 0
- Forks: 1
- Open Issues: 1
Metadata Files:
- Readme: README.md
---
title: Tei Annotator
emoji: 🦀
colorFrom: green
colorTo: pink
sdk: gradio
sdk_version: 6.9.0
python_version: '3.12'
app_file: app.py
hardware: cpu-basic
pinned: false
license: mit
short_description: Demo for cboulanger/tei-annotator
---
A Python library for annotating plain text with [TEI XML](https://tei-c.org/) tags. An optional GLiNER pre-detection pass and an LLM annotation pass are followed by deterministic post-processing:
1. **(Optional) GLiNER pre-detection** — fast CPU-based span labelling generates candidates for the LLM to verify and extend.
2. **LLM annotation** — a prompted language model identifies entities and returns structured spans (element, verbatim text, surrounding context, attributes).
3. **Deterministic post-processing** — spans are resolved to character offsets, validated against the schema, and injected as XML tags. The source text is **never modified** by any model call.
---
## Pipeline stages
```text
Input text
│
▼ strip existing XML tags
▼ (optional) GLiNER pre-detection ──→ tei_annotator/detection/
▼ chunk text ──→ tei_annotator/chunking/
▼ build LLM prompt ──→ tei_annotator/prompting/
▼ LLM inference ──→ tei_annotator/inference/
▼ parse JSON response ──→ tei_annotator/postprocessing/
▼ resolve spans → char offsets
▼ validate against schema
▼ inject XML tags
│
▼
Annotated XML output
```
Stage documentation:
[Data models](tei_annotator/models/README.md) ·
[GLiNER detection](tei_annotator/detection/README.md) ·
[Chunking](tei_annotator/chunking/README.md) ·
[Prompt building](tei_annotator/prompting/README.md) ·
[Inference configuration](tei_annotator/inference/README.md) ·
[Post-processing](tei_annotator/postprocessing/README.md) ·
[Evaluation](tei_annotator/evaluation/README.md)
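As an illustration of the chunking stage in the diagram above, here is a minimal sketch of overlapping, offset-preserving chunking. It assumes character-based windows and a fixed overlap; `chunk_text`, `max_chars`, and `overlap` are hypothetical names, and the package's own strategy (documented in the chunking README) may differ.

```python
def chunk_text(text: str, max_chars: int = 2000, overlap: int = 200) -> list[tuple[int, str]]:
    """Split text into overlapping windows, returning (start_offset, chunk) pairs."""
    if max_chars <= 0 or not 0 <= overlap < max_chars:
        raise ValueError("need max_chars > 0 and 0 <= overlap < max_chars")
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        if end < len(text):
            # Prefer to cut at whitespace so words are not split; the search
            # starts past the overlap so every chunk makes forward progress.
            cut = text.rfind(" ", start + overlap + 1, end)
            if cut != -1:
                end = cut
        chunks.append((start, text[start:end]))
        if end == len(text):
            break
        start = end - overlap  # re-read the tail so boundary entities are seen whole
    return chunks
```

Keeping each chunk's start offset is what would let spans found inside a chunk be mapped back to positions in the full text.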
---
> **Disclaimer:** The code in this repository was generated by [Claude](https://claude.ai) (Anthropic) based on prompts and direction provided by [@cboulanger](https://github.com/cboulanger).
---
## Installation
Requires Python ≥ 3.12 and [uv](https://docs.astral.sh/uv/).
```bash
git clone https://github.com/cboulanger/tei-annotator
cd tei-annotator
uv sync # runtime deps: jinja2, lxml, rapidfuzz
uv sync --extra gliner # also installs gliner for optional pre-detection
```
API keys for LLM endpoints go in `.env` (copy from `.env.template`).
---
## Quick start
```python
from tei_annotator import annotate, TEISchema, TEIElement, TEIAttribute
from tei_annotator import EndpointConfig, EndpointCapability

schema = TEISchema(
    rules=[
        "Emit a 'surname' span within every enclosing 'persName' span.",
    ],
    elements=[
        TEIElement(
            tag="persName",
            description="a person's name",
            attributes=[TEIAttribute(name="ref", description="authority URI")],
        ),
        TEIElement(tag="placeName", description="a geographical place name"),
    ],
)


def my_call_fn(prompt: str) -> str:
    ...  # any LLM: Anthropic, OpenAI, Gemini, Ollama, …


endpoint = EndpointConfig(
    capability=EndpointCapability.TEXT_GENERATION,
    call_fn=my_call_fn,
)

result = annotate(
    text="Marie Curie was born in Warsaw and later worked in Paris.",
    schema=schema,
    endpoint=endpoint,
    gliner_model=None,  # pass e.g. "numind/NuNER_Zero" to enable pre-detection
)
print(result.xml)
# e.g. <persName>Marie Curie</persName> was born in <placeName>Warsaw</placeName>
# and later worked in <placeName>Paris</placeName>.
```
For provider setup examples (Anthropic, OpenAI, Gemini, Ollama, vLLM) see [tei_annotator/inference/README.md](tei_annotator/inference/README.md).
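Because `call_fn` is just a function from prompt string to response string, cross-cutting concerns can be layered on by wrapping it. The following is a sketch of a retry wrapper with exponential backoff; `with_retries` and its parameters are hypothetical and not part of the package, and whether the library already retries failed calls internally is not something this example assumes either way.

```python
import time
from collections.abc import Callable


def with_retries(
    call_fn: Callable[[str], str],
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Callable[[str], str]:
    """Wrap a prompt -> completion function with exponential-backoff retries."""
    if attempts < 1:
        raise ValueError("attempts must be >= 1")

    def wrapped(prompt: str) -> str:
        for attempt in range(attempts):
            try:
                return call_fn(prompt)
            except Exception:
                if attempt == attempts - 1:
                    raise  # out of retries: surface the last error
                sleep(base_delay * 2 ** attempt)  # 1x, 2x, 4x, ... base_delay
        raise AssertionError("unreachable")

    return wrapped
```

The wrapped function has the same signature as the original, so it could be passed as `call_fn=with_retries(my_call_fn)` when building the `EndpointConfig` shown in the quick start. In practice it is usually worth narrowing the `except` clause to the transient errors of the provider client in use.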
---
## Built-in providers
Five connectors live in [`tei_annotator/providers/`](tei_annotator/providers/), enabled by setting the corresponding env var:
| Provider | Env var | ID |
| --- | --- | --- |
| HuggingFace Inference Router | `HF_TOKEN` | `hf` |
| Google Gemini | `GEMINI_API_KEY` | `gemini` |
| KISSKI academic cloud | `KISSKI_API_KEY` | `kisski` |
| OpenAI | `OPENAI_API_KEY` | `openai` |
| Anthropic Claude | `ANTHROPIC_API_KEY` | `claude` |
Adding a new provider: create a module in `tei_annotator/providers/`, subclass `Connector`, add an instance to `_ALL_CONNECTORS` in `__init__.py`. See [tei_annotator/providers/README.md](tei_annotator/providers/README.md).
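The env-var convention in the table can be checked without importing the package. The sketch below is a standalone illustration that assumes a provider counts as enabled when its variable is set to a non-empty value; `PROVIDER_ENV_VARS` and `available_providers` are hypothetical names, not the package's API.

```python
import os
from collections.abc import Mapping

# Provider ID -> environment variable, copied from the table above.
PROVIDER_ENV_VARS = {
    "hf": "HF_TOKEN",
    "gemini": "GEMINI_API_KEY",
    "kisski": "KISSKI_API_KEY",
    "openai": "OPENAI_API_KEY",
    "claude": "ANTHROPIC_API_KEY",
}


def available_providers(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return IDs of providers whose API key variable is set and non-empty."""
    return [
        provider_id
        for provider_id, var in PROVIDER_ENV_VARS.items()
        if env.get(var, "").strip()
    ]
```

Calling `available_providers()` after loading `.env` gives a quick view of which `--provider` values are likely to work in the evaluation scripts.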
---
## Built-in schemas
Two annotation schemas are registered in [`tei_annotator/schemas/registry.py`](tei_annotator/schemas/registry.py):
| Key | Task |
| --- | --- |
| `bibl` | Tag internal fields of a bibliographic reference (author, title, date, …) |
| `bibl-reference-segmenter` | Segment a reference list into `` spans with optional `` |
Each schema ships with at least one gold-standard corpus file in `data/corpus/.default.tei.xml` used by the evaluator and webservice.
Adding a new schema: register it in `SCHEMA_REGISTRY`. See [tei_annotator/schemas/README.md](tei_annotator/schemas/README.md).
---
## Evaluation and iterative improvement
`scripts/evaluate_llm.py` runs any available provider against a gold-standard TEI file:
```bash
# quick run: 5 records, gemini, bibl-reference-segmenter schema
uv run scripts/evaluate_llm.py \
--provider gemini --schema bibl-reference-segmenter --max-items 5 --verbose
# all available providers, all records, output to file
uv run scripts/evaluate_llm.py --schema bibl --output-file results.txt
```
Key flags: `--provider`, `--model`, `--schema`, `--gold-file`, `--max-items`, `--batch-size`, `--match-mode`, `--verbose`, `--grep`, `--shuffle`.
`scripts/collect_hard_examples.py` builds a gold fixture of challenging examples by evaluating items in mini-batches and retaining those the model handles poorly:
```bash
# collect 30 hard bibl-reference-segmenter examples using KISSKI gemma-4-31b-it
uv run scripts/collect_hard_examples.py \
--provider kisski --model gemma-4-31b-it \
--limit 30 --batch-size 10 --f1-threshold 0.95 \
--output data/hard-bibl-refseg-gemma.tei.xml
```
Key flags: `--schema`, `--provider`, `--model`, `--limit`, `--batch-size`, `--f1-threshold`, `--max-per-batch`, `--context`, `--shuffle`.
For the iterative schema-improvement workflow see [docs/tei-element-descriptions.md](docs/tei-element-descriptions.md). For metrics details see [tei_annotator/evaluation/README.md](tei_annotator/evaluation/README.md).
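To make the role of `--f1-threshold` concrete, here is a minimal sketch of span-level F1 under exact matching, together with a retention rule based on it. It is an assumption-laden illustration: spans are modelled as `(element, start, end)` tuples, only exact matches count, and an item is treated as hard when its F1 falls strictly below the threshold. The evaluator's actual match modes and comparison rule are described in the evaluation README and may differ; `span_f1` and `is_hard` are hypothetical names.

```python
SpanKey = tuple[str, int, int]  # (element, start offset, end offset)


def span_f1(predicted: set[SpanKey], gold: set[SpanKey]) -> float:
    """Exact-match F1 between predicted and gold span sets."""
    if not predicted and not gold:
        return 1.0  # nothing to find and nothing found
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


def is_hard(predicted: set[SpanKey], gold: set[SpanKey], f1_threshold: float = 0.95) -> bool:
    """Flag an item as a hard example when its F1 falls below the threshold."""
    return span_f1(predicted, gold) < f1_threshold
```

Under exact matching, a single off-by-one boundary costs both a false positive and a false negative, which is one reason a more lenient match mode can be useful when comparing models.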
---
## Demo and webservice
- **HuggingFace demo:**
- **`app.py`** — Gradio app for HuggingFace Spaces. See [docs/huggingface-deployment.md](docs/huggingface-deployment.md).
- **`webservice/`** — FastAPI JSON API + browser UI, all five providers. See [webservice/README.md](webservice/README.md).
---
## Testing
```bash
# Unit tests (fully mocked, < 0.5 s)
uv run pytest
# Integration tests (no model download needed)
uv run pytest --override-ini="addopts=" -m integration \
tests/integration/test_pipeline_e2e.py -k "not real_gliner"
# Integration tests with real GLiNER model (~400 MB on first run)
uv run pytest --override-ini="addopts=" -m integration \
tests/integration/test_gliner_detector.py \
tests/integration/test_pipeline_e2e.py::test_pipeline_with_real_gliner
```