{"id":42787762,"url":"https://github.com/recogito/tei-standoffconverter-js","last_synced_at":"2026-01-29T23:15:42.185Z","repository":{"id":290508900,"uuid":"974659568","full_name":"recogito/tei-standoffconverter-js","owner":"recogito","description":"Convert between TEI/XML and plaintext without losing markup context.","archived":false,"fork":false,"pushed_at":"2025-05-15T07:48:08.000Z","size":182,"stargazers_count":9,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-05-15T08:41:55.707Z","etag":null,"topics":["ner","nlp","tei","tei-xml"],"latest_commit_sha":null,"homepage":"","language":"TypeScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/recogito.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-04-29T05:45:48.000Z","updated_at":"2025-05-15T07:48:11.000Z","dependencies_parsed_at":"2025-05-15T08:35:11.045Z","dependency_job_id":"4caac7b3-523a-4576-ad56-3cfc3c54464f","html_url":"https://github.com/recogito/tei-standoffconverter-js","commit_stats":null,"previous_names":["rsimon/xml-standoff-converter-js","recogito/tei-standoffconverter-js"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/recogito/tei-standoffconverter-js","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/recogito%2Ftei-standoffconverter-js","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/recogito%2Ftei-standoffconverter-js/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/recogito
%2Ftei-standoffconverter-js/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/recogito%2Ftei-standoffconverter-js/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/recogito","download_url":"https://codeload.github.com/recogito/tei-standoffconverter-js/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/recogito%2Ftei-standoffconverter-js/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28889871,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-29T21:06:44.224Z","status":"ssl_error","status_checked_at":"2026-01-29T21:06:42.160Z","response_time":59,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ner","nlp","tei","tei-xml"],"created_at":"2026-01-29T23:15:42.119Z","updated_at":"2026-01-29T23:15:42.179Z","avatar_url":"https://github.com/recogito.png","language":"TypeScript","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Recogito TEI/XML Standoff Converter\n\nA JavaScript/TypeScript utility that bridges the gap between TEI/XML documents and plaintext processing tools. 
\n\nThis library creates a reversible mapping between TEI/XML markup and character offsets in plaintext, allowing you to apply text analysis tools to TEI documents without losing markup context.\n\n```\nTEI/XML:   \u003cp\u003eThis is a \u003chi\u003esample\u003c/hi\u003e text.\u003c/p\u003e\nPlaintext: This is a sample text.\n                     ^----^\n                     Identified entity\nResult:    \u003cp\u003eThis is a \u003chi\u003e\u003cplaceName\u003esample\u003c/placeName\u003e\u003c/hi\u003e text.\u003c/p\u003e\n```\n\nThe core logic was ported to TypeScript from the excellent Python [standoffconverter](https://github.com/standoff-nlp/standoffconverter) by [@millawell](https://github.com/millawell).\n\n## Installation\n\n```sh\nnpm install @recogito/standoff-converter\n```\n\n## Why?\n\nText analysis tools (e.g. for Named Entity Recognition) typically work with plaintext only. However, TEI/XML documents contain rich structural markup that must be stripped away before processing. When analysis tools identify entities or features at specific character positions in the plaintext, it's hard to map those positions back to the original TEI markup structure.\n\nThis library:\n- Creates a linearized representation that maintains the relationship between plaintext character positions and XML markup.\n- Allows you to process the plaintext with any text analysis tool.\n- Maps the identified features back to the exact location in the original XML.\n- Allows you to modify the TEI/XML structure while preserving all existing markup.\n\nPerfect for enriching TEI documents with automatically extracted entities, annotations, or other textual features!\n\n## Features\n\n- Extract plaintext from TEI/XML while preserving a bidirectional mapping between character offsets and markup.\n- Convert between plaintext character offsets and TEI XPointer expressions.\n- Insert new inline tags at specific character positions (e.g. 
add `\u003cplaceName\u003e` or `\u003cpersName\u003e` tags based on NER results).\n- Preserve all original markup when serializing changes back to TEI/XML.\n- Works in both Node.js and browser environments.\n\n## Usage in Node\n\nThis library works in Node (using [xmldom](https://github.com/xmldom/xmldom) and [xpath](https://github.com/goto100/xpath) internally).\n\n```ts\nimport { parseXML } from '@recogito/standoff-converter';\n\nconst xml = `\n  \u003cTEI xmlns=\"http://www.tei-c.org/ns/1.0\"\u003e\n    \u003cteiHeader\u003e\n      \u003cfileDesc\u003e\n        \u003ctitleStmt\u003e\n          \u003ctitle\u003eSample TEI Document\u003c/title\u003e\n        \u003c/titleStmt\u003e\n      \u003c/fileDesc\u003e\n    \u003c/teiHeader\u003e\n    \u003ctext\u003e\n      \u003cbody\u003e\n        \u003cp\u003eThis is a \u003chi rend=\"italic\"\u003esample\u003c/hi\u003e paragraph with \u003cterm\u003emarkup\u003c/term\u003e.\u003c/p\u003e\n      \u003c/body\u003e\n    \u003c/text\u003e\n  \u003c/TEI\u003e\n`;\n\nconst parsed = parseXML(xml);\n\n// Get plaintext\nconst text = parsed.text();\n\n// XPointer expression from character position\nconst xpointer = parsed.getXPointer(550);\n\n// Character position from XPointer expression (format: path::offset)\nconst position = parsed.getCharacterOffset('//TEI/text[1]/body[1]/p[1]::5');\n\n// Add inline tag at character position\nparsed.addInline(5, 7, 'rs', { resp: 'aboutgeo' });\n\n// Modified markup as a DOM Element\nconst el = parsed.xml();\n\n// Modified markup serialized to string\nconst serialized = parsed.xmlString();\n```\n\n## Usage in the Browser\n\nYou can use this library in the browser in combination with [CETEIcean](https://github.com/TEIC/CETEIcean).\n\n```ts\nimport { parseXML } from '@recogito/standoff-converter';\n\nwindow.onload = async function () {\n  const CETEIcean = new CETEI();\n\n  CETEIcean.getHTML5('paradise-lost.xml', data =\u003e {\n    document.getElementById('orig').appendChild(data);\n    const el = 
document.getElementById('orig').firstChild;\n\n    // Parse CETEIcean content\n    const parsed = parseXML(el);\n\n    // Get XPointer expressions from plaintext character offsets\n    console.log(parsed.getXPointer(550));\n\n    // Get character offsets from an XPointer expression (format: path::offset)\n    const xpointer = '//text[@xml:id=\"text-1\"]/body[1]/div[1]/p[4]/hi[1]::5';\n    console.log(parsed.getCharacterOffset(xpointer));\n\n    // Add inline tags at character positions\n    parsed.addInline(550, 560, 'tei-note', { type: 'comment', resp: 'aboutgeo' });\n\n    // Serialize back to TEI/XML\n    const teiElement = parsed.xml();\n    document.getElementById('serialized').appendChild(teiElement);\n  });\n};\n```\n\n## API\n\n### Core Functions\n\n| Function | Description | Parameters | Return Value |\n|----------------|-------------|------------|--------------|\n| `parseXML(input)` | Parse TEI/XML | `input`: XML string or Element | `parsed` instance |\n| `parsed.text()` | Get plaintext | None | `string` |\n| `parsed.tokens` | Access linearized token array | - | `Array` of token objects |\n| `parsed.getXPointer(offset)` | Convert plaintext character offset to XPointer | `offset`: number | `string` XPointer expression |\n| `parsed.getCharacterOffset(xpointer)` | Convert XPointer to character offset | `xpointer`: string | `number` |\n| `parsed.addInline(start, end, tagName, attrs)` | Insert inline tag at character positions | `start`: number\u003cbr\u003e`end`: number\u003cbr\u003e`tagName`: string\u003cbr\u003e`attrs`: object | `void` |\n| `parsed.xml()` | Get TEI/XML (DOM Element) | None | `Element` |\n| `parsed.xmlString()` | Get XML (serialized string) | None | `string` |\n\n### Recogito-Specific Functions\n\n| Function/Method | Description | Parameters | Return Value |\n|----------------|-------------|------------|--------------|\n| `parsed.annotations(standOffId?)` | Get standoff annotations from all or a specific TEI `\u003cstandOff\u003e` element 
| `standOffId?`: string | `Array` of standoff annotation objects |\n| `parsed.addStandOff(id)` | Add a new TEI `\u003cstandOff\u003e` element | `id`: string | `string` annotation ID |\n| `parsed.addAnnotation(standOffId, annotation)` | Add Recogito annotation to `standOff` element | `standOffId`: string\u003cbr\u003e`annotation`: standoff annotation | `void` |\n| `parsed.addStandOffTag(standOffId, start, end, tag)` | Add Recogito annotation to `standOff` element that represents a simple (NER) tag | `standOffId`: string\u003cbr\u003e`start`: number\u003cbr\u003e`end`: number\u003cbr\u003e`tag`: string or `{ id, label }` \u003cbr\u003e | `void` |\n\n## Known Issues\n\n- **standOff Anchors and inline markup modifications**. Using `.addInline` will change the TEI markup. This has the potential to break XPath anchors for annotations in `\u003cstandOff\u003e` blocks. In order to prevent this, we would need to do before/after checks for affected anchors, and update them accordingly, so that they remain in sync with the changed TEI document.\n\n\n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frecogito%2Ftei-standoffconverter-js","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Frecogito%2Ftei-standoffconverter-js","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frecogito%2Ftei-standoffconverter-js/lists"}