{"id":50685110,"url":"https://github.com/bsesic/trace","last_synced_at":"2026-06-08T22:04:05.502Z","repository":{"id":354177121,"uuid":"1222456359","full_name":"bsesic/trace","owner":"bsesic","description":"Pairwise philological alignment library with pluggable language packs — semi-global Needleman–Wunsch with affine gaps and abbreviation-span lookahead. Ships with a Biblical/Rabbinic Hebrew pack.","archived":false,"fork":false,"pushed_at":"2026-05-28T17:43:00.000Z","size":249,"stargazers_count":1,"open_issues_count":1,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-05-28T19:09:45.771Z","etag":null,"topics":["alignment","alignment-algorithm","critical-edition","digital-humanities","hebrew","manuscripts","manuscripts-restoration","needleman-wunsch","needleman-wunsch-algorithm","synopsys","tei","tei-xml","textual-criticism"],"latest_commit_sha":null,"homepage":"https://tracealign.readthedocs.io","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/bsesic.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"docs/contributing.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":"CITATION.cff","codeowners":null,"security":null,"support":null,"governance":null,"roadmap":"docs/ROADMAP.md","authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-04-27T11:37:41.000Z","updated_at":"2026-05-28T17:43:08.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/bsesic/trace","commit_stats":null,"previous_names":["bsesic/trace"],"tags_count":5,"template":false,"template_full_name":null,"purl":"pkg:github/bsesic/trace","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bsesic%2Ftrace","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bsesic%2Ftrace/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bsesic%2Ftrace/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bsesic%2Ftrace/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/bsesic","download_url":"https://codeload.github.com/bsesic/trace/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bsesic%2Ftrace/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34082148,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-08T02:00:07.615Z","response_time":111,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["alignment","alignment-algorithm","critical-edition","digital-humanities","hebrew","manuscripts","manuscripts-restoration","needleman-wunsch","needleman-wunsch-algorithm","synopsys","tei","tei-xml","textual-criticism"],"created_at":"2026-06-08T22:04:04.663Z","updated_at":"2026-06-08T22:04:05.476Z","avatar_url":"https://github.com/bsesic.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# TRACE\n\n**Textual Reuse, Alignment, and Collation Engine** — a Python library for philological alignment with pluggable language packs. Pairwise (v0.1) and simultaneous multi-witness (v0.2) alignment.\n\n[![CI](https://github.com/bsesic/trace/actions/workflows/workflow.yml/badge.svg)](https://github.com/bsesic/trace/actions/workflows/workflow.yml)\n[![PyPI version](https://img.shields.io/pypi/v/tracealign.svg)](https://pypi.org/project/tracealign/)\n[![Python versions](https://img.shields.io/pypi/pyversions/tracealign.svg)](https://pypi.org/project/tracealign/)\n[![License: MIT](https://img.shields.io/badge/license-MIT-blue.svg)](LICENSE)\n[![Documentation Status](https://readthedocs.org/projects/tracealign/badge/?version=latest)](https://tracealign.readthedocs.io/en/latest/)\n[![DOI](https://zenodo.org/badge/1222456359.svg)](https://doi.org/10.5281/zenodo.20315408)\n\nTRACE is designed for textual criticism, manuscript witness comparison, and the creation of digital synopses and critical editions. The core is language-agnostic; the first shipped language pack covers Biblical and Rabbinic Hebrew (`hbo`).\n\n---\n\n## Highlights\n\n- **Tokenizer pipeline** with editorial-marker awareness (`[reconstructed]`, `⟦deletion⟧`, `〈insertion〉`, `(expanded)`, lacunae).\n- **Tiered scoring** returning `(score, reason)` per token pair — `EXACT`, `NIQQUD_STRIPPED`, `PLENE_DEFECTIVE`, `ABBREVIATION`, `ORTHOGRAPHIC`, `INSERTION`, `OMISSION`, `NO_MATCH`.\n- **Pairwise aligner** — semi-global Needleman–Wunsch with affine gap penalties (Gotoh) and a multi-token abbreviation lookahead (`ר\"י` ↔ `רבי ישמעאל`).\n- **Multi-witness aligner** (v0.2) — N witnesses aligned simultaneously into a canonical variant graph (DAG) plus a derived aligned table view, via pairwise distances → UPGMA guide tree → POA-based progressive merge. Determinism is pinned by a permutation-invariance property test; correctness by a lossless-reconstruction property test.\n- **Hebrew language pack** with niqqud strip, plene/defective skeleton matching, gershayim/maqqef tokenizer hooks, and a seed lexicon of rabbinic abbreviations (extendable via `Lexica.merge()`).\n- **I/O** for plain text, JSON (round-trip for both pairwise and multi-witness results), eScriptorium exports (with bbox + line metadata), and TEI XML (`\u003ctei:w\u003e` mode + flow-text fallback).\n- **Reproducible** — every `AlignmentResult` / `MultiAlignmentResult` carries `trace_version` and `language_pack_version` in its params.\n\n## Installation\n\n```bash\npip install tracealign\n```\n\nRequires Python 3.10, 3.11, or 3.12. Pulls `pydantic`, `numpy`, `lxml`, and `rapidfuzz`.\n\n### From source\n\n```bash\ngit clone https://github.com/bsesic/trace.git\ncd trace\npip install -e \".[dev]\"\n```\n\nThe `dev` extra adds `pytest` and `flake8` (the project's quality gates). For documentation contributions, use `pip install -e \".[docs]\"` to add Sphinx, furo, and myst-parser.\n\n### Verifying the install\n\n```bash\npython -c \"import tracealign; print(tracealign.__version__, tracealign.list_languages())\"\n```\n\nShould print the current version and `['hbo']` (the Hebrew language pack registers itself on import).\n\n## Quick start — pairwise\n\n```python\nimport tracealign\n\nw1 = tracealign.tokenize(\"שלום עולם רַבִּי דויד ר\\\"י אמר\", lang=\"hbo\", seq_label=\"W1\")\nw2 = tracealign.tokenize(\"שלום עולם רבי דוד רבי ישמעאל אמר\", lang=\"hbo\", seq_label=\"W2\")\n\nresult = tracealign.align(w1, w2, lang=\"hbo\")\n\nprint(f\"total score: {result.total_score:.2f}\")\nprint(f\"summary: {dict(result.summary)}\")\nfor m in result.matches:\n    a = m.token_a.text if m.token_a else \"—\"\n    b = m.token_b.text if m.token_b else \"—\"\n    print(f\"  {a:\u003e10} ↔ {b:\u003c10}  {m.reason.value:\u003c18} {m.score:.2f}\")\n```\n\nOutput (abridged):\n\n```\ntotal score: 0.91\nsummary: {EXACT: 3, NIQQUD_STRIPPED: 1, PLENE_DEFECTIVE: 1, ABBREVIATION: 1}\n       שלום ↔ שלום        exact              1.00\n       עולם ↔ עולם        exact              1.00\n      רַבִּי ↔ רבי         niqqud_stripped    0.95\n       דויד ↔ דוד          plene_defective    0.85\n        ר\"י ↔ רבי          abbreviation       0.85   (primary)\n        ר\"י ↔ ישמעאל       abbreviation       0.00   (continuation)\n        אמר ↔ אמר          exact              1.00\n```\n\n## Quick start — multi-witness (v0.2)\n\n```python\nimport tracealign\n\nwitnesses = {\n    \"W1\": tracealign.tokenize(\"שלום עולם רַבִּי דויד אמר\",  lang=\"hbo\", seq_label=\"W1\"),\n    \"W2\": tracealign.tokenize(\"שלום עולם רבי דוד אמר\",       lang=\"hbo\", seq_label=\"W2\"),\n    \"W3\": tracealign.tokenize(\"שלום עולם ר\\\"י אמר\",          lang=\"hbo\", seq_label=\"W3\"),\n    \"W4\": tracealign.tokenize(\"שלום עולם רבי דוד אמר טוב\",   lang=\"hbo\", seq_label=\"W4\"),\n}\n\nresult = tracealign.align_multi(witnesses, lang=\"hbo\")\n\nprint(result.guide_tree.format_text())\nprint(result.table.format_text())\n\nfor node in result.graph.variants():\n    readings = {wid: t.text for wid, t in node.tokens.items()}\n    print(node.id, readings)\n```\n\nThe `MultiAlignmentResult` exposes a canonical `VariantGraph` (DAG with witness trails), a derived `AlignedTable` (re-anchorable to any witness for presentation), a `GuideTree` (UPGMA-built, carrying the original distance matrix — useful for downstream stemmatic work), and the same reproducibility-aware `params` snapshot the pairwise aligner produces.\n\nJSON persistence works the same way as the pairwise aligner, in its own module:\n\n```python\nfrom tracealign.io import multi_result as mr_io\n\nmr_io.dump(result, \"alignment.json\")\nrestored = mr_io.load(\"alignment.json\")\n```\n\nSee **[the documentation](https://tracealign.readthedocs.io/en/latest/)** for the full API, more usage examples, the algorithm details, FAQs, and the design rationale.\n\n## Documentation\n\n| Section | What it covers |\n|---|---|\n| [Installation](https://tracealign.readthedocs.io/en/latest/installation.html) | pip / from source / dev setup / docs build |\n| [Usage](https://tracealign.readthedocs.io/en/latest/usage.html) | Tokenize, pairwise align, multi-witness align, work with the result, custom lexica, I/O |\n| [Details](https://tracealign.readthedocs.io/en/latest/details.html) | Tokenizer pipeline, scoring tiers, pairwise DP algorithm, multi-witness POA pipeline |\n| [FAQ](https://tracealign.readthedocs.io/en/latest/faq.html) | Common questions about scope, language packs, performance, multi-witness semantics |\n| [Contributing](https://tracealign.readthedocs.io/en/latest/contributing.html) | Development workflow, TDD discipline, branch model |\n\n## Project status\n\n| | |\n|---|---|\n| Current PyPI release | 0.1.3 (v0.2.0 in flight on `feature/v0.2-multi-witness`) |\n| Roadmap | [docs/ROADMAP.md](docs/ROADMAP.md) — ten-stage long-term vision |\n| v0.1 design spec | [docs/superpowers/specs/2026-04-28-trace-v0.1-design.md](docs/superpowers/specs/2026-04-28-trace-v0.1-design.md) |\n| v0.2 design spec | [docs/superpowers/specs/2026-05-21-trace-v0.2-multi-witness-design.md](docs/superpowers/specs/2026-05-21-trace-v0.2-multi-witness-design.md) |\n| Released stages | 1 (pairwise + Hebrew pack) |\n| In progress | 2 (master alignment graph / multi-witness) |\n| Future sub-projects | Geniza anchor detection · Text-reuse · Apparatus / critical edition · Cross-tradition Hexapla · Stemmatic reconstruction · Allusion detection · Citation graphs · Reception history |\n\n## Citation\n\nIf you use TRACE in academic work, please cite via the [Zenodo concept DOI](https://doi.org/10.5281/zenodo.20315408) (always resolves to the latest archived release) or pick a specific version DOI from the Zenodo record. A `CITATION.cff` is at the repo root — GitHub's \"Cite this repository\" button generates APA / BibTeX / RIS automatically from it.\n\n## License\n\n[MIT](LICENSE) © 2026 Benjamin Schnabel.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbsesic%2Ftrace","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbsesic%2Ftrace","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbsesic%2Ftrace/lists"}