{"id":47919692,"url":"https://github.com/cboulanger/tei-annotator","last_synced_at":"2026-05-15T00:13:16.414Z","repository":{"id":341249154,"uuid":"1169486694","full_name":"cboulanger/tei-annotator","owner":"cboulanger","description":"Python library for annotating text with TEI XML tags using a two-stage LLM + GLiNER pipeline","archived":false,"fork":false,"pushed_at":"2026-03-10T09:39:59.000Z","size":434,"stargazers_count":0,"open_issues_count":1,"forks_count":1,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-03-10T17:00:35.890Z","etag":null,"topics":["annotations","llm-inference","tei-xml"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/cboulanger.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-02-28T19:00:48.000Z","updated_at":"2026-03-10T09:40:03.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/cboulanger/tei-annotator","commit_stats":null,"previous_names":["cboulanger/tei-annotator"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/cboulanger/tei-annotator","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cboulanger%2Ftei-annotator","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cboulanger%2Ftei-annotator/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cboulanger%2Ftei-annotator/releases","manife
sts_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cboulanger%2Ftei-annotator/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/cboulanger","download_url":"https://codeload.github.com/cboulanger/tei-annotator/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cboulanger%2Ftei-annotator/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31389391,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-04T04:26:24.776Z","status":"ssl_error","status_checked_at":"2026-04-04T04:23:34.147Z","response_time":60,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["annotations","llm-inference","tei-xml"],"created_at":"2026-04-04T05:52:05.807Z","updated_at":"2026-05-15T00:13:16.404Z","avatar_url":"https://github.com/cboulanger.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"---\ntitle: Tei Annotator\nemoji: 🦀\ncolorFrom: green\ncolorTo: pink\nsdk: gradio\nsdk_version: 6.9.0\npython_version: '3.12'\napp_file: app.py\nhardware: cpu-basic\npinned: false\nlicense: mit\nshort_description: Demo for cboulanger/tei-annotator\n---\n\nA Python library for annotating plain text with [TEI XML](https://tei-c.org/) tags using a two-stage pipeline (optional GLiNER pre-detection, then LLM annotation) followed by deterministic post-processing.\n\n1. 
**(Optional) GLiNER pre-detection** — fast CPU-based span labelling generates candidates for the LLM to verify and extend.\n2. **LLM annotation** — a prompted language model identifies entities and returns structured spans (element, verbatim text, surrounding context, attributes).\n3. **Deterministic post-processing** — spans are resolved to character offsets, validated against the schema, and injected as XML tags. The source text is **never modified** by any model call.\n\n---\n\n## Pipeline stages\n\n```text\n  Input text\n       │\n       ▼  strip existing XML tags\n       ▼  (optional) GLiNER pre-detection  ──→  tei_annotator/detection/\n       ▼  chunk text                        ──→  tei_annotator/chunking/\n       ▼  build LLM prompt                  ──→  tei_annotator/prompting/\n       ▼  LLM inference                     ──→  tei_annotator/inference/\n       ▼  parse JSON response               ──→  tei_annotator/postprocessing/\n       ▼  resolve spans → char offsets\n       ▼  validate against schema\n       ▼  inject XML tags\n       │\n       ▼\n  Annotated XML output\n```\n\nStage documentation:\n[Data models](tei_annotator/models/README.md) ·\n[GLiNER detection](tei_annotator/detection/README.md) ·\n[Chunking](tei_annotator/chunking/README.md) ·\n[Prompt building](tei_annotator/prompting/README.md) ·\n[Inference configuration](tei_annotator/inference/README.md) ·\n[Post-processing](tei_annotator/postprocessing/README.md) ·\n[Evaluation](tei_annotator/evaluation/README.md)\n\n---\n\n\u003e **Disclaimer:** The code in this repository was generated by [Claude](https://claude.ai) (Anthropic) based on prompts and direction provided by [@cboulanger](https://github.com/cboulanger).\n\n---\n\n## Installation\n\nRequires Python ≥ 3.12 and [uv](https://docs.astral.sh/uv/).\n\n```bash\ngit clone \u003crepo\u003e\ncd tei-annotator\nuv sync                    # runtime deps: jinja2, lxml, rapidfuzz\nuv sync --extra gliner     # also installs gliner for optional 
pre-detection\n```\n\nAPI keys for LLM endpoints go in `.env` (copy from `.env.template`).\n\n---\n\n## Quick start\n\n```python\nfrom tei_annotator import annotate, TEISchema, TEIElement, TEIAttribute\nfrom tei_annotator import EndpointConfig, EndpointCapability\n\nschema = TEISchema(\n    rules=[\n        \"Emit a 'surname' span within every enclosing 'persName' span.\",\n    ],\n    elements=[\n        TEIElement(\n            tag=\"persName\",\n            description=\"a person's name\",\n            attributes=[TEIAttribute(name=\"ref\", description=\"authority URI\")],\n        ),\n        TEIElement(tag=\"placeName\", description=\"a geographical place name\"),\n    ],\n)\n\ndef my_call_fn(prompt: str) -\u003e str:\n    ...  # any LLM: Anthropic, OpenAI, Gemini, Ollama, …\n\nendpoint = EndpointConfig(\n    capability=EndpointCapability.TEXT_GENERATION,\n    call_fn=my_call_fn,\n)\n\nresult = annotate(\n    text=\"Marie Curie was born in Warsaw and later worked in Paris.\",\n    schema=schema,\n    endpoint=endpoint,\n    gliner_model=None,   # pass e.g. 
\"numind/NuNER_Zero\" to enable pre-detection\n)\nprint(result.xml)\n# \u003cpersName\u003eMarie Curie\u003c/persName\u003e was born in \u003cplaceName\u003eWarsaw\u003c/placeName\u003e\n# and later worked in \u003cplaceName\u003eParis\u003c/placeName\u003e.\n```\n\nFor provider setup examples (Anthropic, OpenAI, Gemini, Ollama, vLLM) see [tei_annotator/inference/README.md](tei_annotator/inference/README.md).\n\n---\n\n## Built-in providers\n\nFive connectors live in [`tei_annotator/providers/`](tei_annotator/providers/), enabled by setting the corresponding env var:\n\n| Provider | Env var | ID |\n| --- | --- | --- |\n| HuggingFace Inference Router | `HF_TOKEN` | `hf` |\n| Google Gemini | `GEMINI_API_KEY` | `gemini` |\n| KISSKI academic cloud | `KISSKI_API_KEY` | `kisski` |\n| OpenAI | `OPENAI_API_KEY` | `openai` |\n| Anthropic Claude | `ANTHROPIC_API_KEY` | `claude` |\n\nAdding a new provider: create a module in `tei_annotator/providers/`, subclass `Connector`, add an instance to `_ALL_CONNECTORS` in `__init__.py`. See [tei_annotator/providers/README.md](tei_annotator/providers/README.md).\n\n---\n\n## Built-in schemas\n\nTwo annotation schemas are registered in [`tei_annotator/schemas/registry.py`](tei_annotator/schemas/registry.py):\n\n| Key | Task |\n| --- | --- |\n| `bibl` | Tag internal fields of a bibliographic reference (author, title, date, …) |\n| `bibl-reference-segmenter` | Segment a reference list into `\u003cbibl\u003e` spans with optional `\u003clabel\u003e` |\n\nEach schema ships with at least one gold-standard corpus file in `data/corpus/\u003cschema\u003e.default.tei.xml` used by the evaluator and webservice.\n\nAdding a new schema: register it in `SCHEMA_REGISTRY`. 
See [tei_annotator/schemas/README.md](tei_annotator/schemas/README.md).\n\n---\n\n## Evaluation and iterative improvement\n\n`scripts/evaluate_llm.py` runs any available provider against a gold-standard TEI file:\n\n```bash\n# quick run: 5 records, gemini, bibl-reference-segmenter schema\nuv run scripts/evaluate_llm.py \\\n    --provider gemini --schema bibl-reference-segmenter --max-items 5 --verbose\n\n# all available providers, all records, output to file\nuv run scripts/evaluate_llm.py --schema bibl --output-file results.txt\n```\n\nKey flags: `--provider`, `--model`, `--schema`, `--gold-file`, `--max-items`, `--batch-size`, `--match-mode`, `--verbose`, `--grep`, `--shuffle`.\n\n`scripts/collect_hard_examples.py` builds a gold fixture of challenging examples by evaluating items in mini-batches and retaining those the model handles poorly:\n\n```bash\n# collect 30 hard bibl-reference-segmenter examples using KISSKI gemma-4-31b-it\nuv run scripts/collect_hard_examples.py \\\n    --provider kisski --model gemma-4-31b-it \\\n    --limit 30 --batch-size 10 --f1-threshold 0.95 \\\n    --output data/hard-bibl-refseg-gemma.tei.xml\n```\n\nKey flags: `--schema`, `--provider`, `--model`, `--limit`, `--batch-size`, `--f1-threshold`, `--max-per-batch`, `--context`, `--shuffle`.\n\nFor the iterative schema-improvement workflow see [docs/tei-element-descriptions.md](docs/tei-element-descriptions.md). For metrics details see [tei_annotator/evaluation/README.md](tei_annotator/evaluation/README.md).\n\n---\n\n## Demo and webservice\n\n- **HuggingFace demo:** \u003chttps://huggingface.co/spaces/cmboulanger/tei-annotator\u003e\n- **`app.py`** — Gradio app for HuggingFace Spaces. See [docs/huggingface-deployment.md](docs/huggingface-deployment.md).\n- **`webservice/`** — FastAPI JSON API + browser UI, all five providers. 
See [webservice/README.md](webservice/README.md).\n\n---\n\n## Testing\n\n```bash\n# Unit tests (fully mocked, \u003c 0.5 s)\nuv run pytest\n\n# Integration tests (no model download needed)\nuv run pytest --override-ini=\"addopts=\" -m integration \\\n    tests/integration/test_pipeline_e2e.py -k \"not real_gliner\"\n\n# Integration tests with real GLiNER model (~400 MB on first run)\nuv run pytest --override-ini=\"addopts=\" -m integration \\\n    tests/integration/test_gliner_detector.py \\\n    tests/integration/test_pipeline_e2e.py::test_pipeline_with_real_gliner\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcboulanger%2Ftei-annotator","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcboulanger%2Ftei-annotator","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcboulanger%2Ftei-annotator/lists"}