https://github.com/ntrkd/tei-xml-formatter
A VS Code extension to format TEI XML files
- Host: GitHub
- URL: https://github.com/ntrkd/tei-xml-formatter
- Owner: ntrkd
- License: mit
- Created: 2025-10-18T19:35:37.000Z (8 months ago)
- Default Branch: master
- Last Pushed: 2026-03-31T21:17:30.000Z (3 months ago)
- Last Synced: 2026-05-03T17:06:42.792Z (about 2 months ago)
- Topics: formatter, tei-xml, typescript, vscode-extension
- Language: TypeScript
- Homepage:
- Size: 166 KB
- Stars: 2
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE
README
> This project is under active development; expect ~~breaking~~ massive changes!
## About
This is a TypeScript library that formats [TEI XML](https://tei-c.org/) files. It uses [saxes](https://github.com/lddubeau/saxes/) to parse the XML, formats the parsed result, and outputs a formatted string. The formatter expects valid XML files as input.
## Demonstrations
### Unformatted
```xml
Letter from Emily to JohnDear John, I hope this letter finds you well.
The weather here has been unusually warm for October.
I have enclosed the sketches you asked for.
Original note: “See attached drawings.”
Yours sincerely,
Emily
80 808 0808 080808080808 0808008 8 8 08 08 08 80 80 80 8080 8080 8008 080 8080 8080 080 0
```
### Formatted
```xml
Letter from Emily to JohnDear John,
I hope this letter finds you well. The weather here has been
unusually warm
for October.
I have enclosed the sketches you asked for.
Original note: “See attached drawings.”
Yours sincerely, Emily
80 808 0808 080808080808 0808008 8 8 08 08 08 80 80 80 8080 8080 8008 080 8080 8080 080 0
```
## Importing and Usage
The package is published on [npmjs](https://www.npmjs.com/package/tei-xml-fmt). We publish both CJS and [ESM](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Statements/import) builds so that either type of project can import it.
```js
// ESM import
import { Formatter } from "tei-xml-fmt";
```
```js
// CJS require() import
const { Formatter } = require('tei-xml-fmt');
```
```js
// Make a new instance
const texfmt = new Formatter();
// A template literal lets the XML span several lines.
// The <p> wrapper is a placeholder element for this example.
const unformattedXML = `
<p>Hello World! Tags that are sufficiently long will be wrapped automatically.</p>
`;
// format() takes a single string containing the XML to format
// and returns the formatted text as a string.
const formattedText = texfmt.format(unformattedXML);
console.log(formattedText);
```
## Algorithm
### TL;DR
The formatter first uses saxes to parse the .xml file into a structure that is easier to process than raw text: a tree called an Abstract Syntax Tree (AST). The AST holds enough information to distinguish between Tag Nodes, Close Tag Nodes, Text, and Spaces. We then process it further and lower it into a Formatting Tree, which strips the information down to just Groups (which contain Text and Spacing nodes), Text, and Spacing Nodes (Line Indent, Line Deindent, or Space or Line). By this point most of the formatting decisions have been made and very little information is left to process, so we render the Formatting Tree back into raw text.
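As a rough illustration of the first stage, the sketch below collects SAX-style events into AST nodes. The node shapes, the `AstBuilder` class, and its method names are hypothetical and are not taken from the library. It also assumes a flat node list rather than a nested tree, and it leaves out the wiring to saxes so that the block has no dependencies.

```typescript
// Hypothetical AST node shapes; the library's real types may differ.
type AstNode =
  | { kind: "open"; name: string }
  | { kind: "close"; name: string }
  | { kind: "text"; value: string };

// Collects SAX-style events into a flat list of nodes.
// In a real setup these methods would be called from the parser's
// open-tag, close-tag, and text event handlers.
class AstBuilder {
  readonly nodes: AstNode[] = [];

  openTag(name: string): void {
    this.nodes.push({ kind: "open", name });
  }

  closeTag(name: string): void {
    this.nodes.push({ kind: "close", name });
  }

  text(chunk: string): void {
    const last = this.nodes[this.nodes.length - 1];
    if (last !== undefined && last.kind === "text") {
      last.value += chunk; // merge with the preceding text node
    } else {
      this.nodes.push({ kind: "text", value: chunk });
    }
  }
}

const builder = new AstBuilder();
builder.openTag("p");
builder.text("Dear ");
builder.text("John,"); // a parser may deliver one run of text in several chunks
builder.closeTag("p");
console.log(builder.nodes.length); // 3
```

Merging text chunks as they arrive means later passes never have to consider two text nodes sitting next to each other.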
### Steps
1. Construct an editable AST from the XML file.
   a. Combine adjacent text nodes into single text nodes.
   b. Normalize all runs of spaces ' ', new lines '\n', and tabs '\t' within text to a single space.
   c. A text node containing only a single space should be transformed into a Spacing node. If the text node contains text, its leading and trailing spaces become Spacing nodes.
      - If the Spacing node would reside next to another Spacing node, do not insert it.
2. Sanitize the AST using a Zipper to allow for better traversal.
   a. A Spacing node can be \n, \t, or ' ' as long as it does not reside between two text characters / nodes.
   b. Spacing nodes should be carried in both directions.
      - Carrying means inserting another Spacing node after the next node if the node in front of it can be crossed.
      - If we are carrying left, it can cross only open tags. If we are carrying right, it can cross only close tags.
      - If the Spacing node would reside next to another Spacing node, do not insert it.
   c. There should now be a single Spacing node everywhere a space can be inserted.
3. Translate the AST into a formatting tree.
   a. Convert all other nodes directly into text. Spacing nodes require more attention.
   b. When we encounter a Spacing node, we look backward and forward to decide which type of FMT node to insert.
      - LineIndent - If the previous node is an open tag and the next node is not a close tag
      - LineDeindent - If the previous node is not an open tag and the next node is a close tag
      - SpaceOrLine - Default

   // TODO: A group of carried Spacing nodes should be linked together: if none of them needs to be wrapped, only one of the Spacing nodes needs to become a space, not all of them.
4. Generate the final XML using the formatting tree.
   a. Use width() calculations on the FMT nodes to determine whether to wrap, then output the correct string literal.
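Two of the rules above lend themselves to a small sketch: turning whitespace in text into Spacing nodes, and choosing between LineIndent, LineDeindent, and SpaceOrLine. The node shapes and the function names `splitSpacing` and `classifySpacing` are hypothetical, the AST is assumed to be a flat list, and the carrying pass is left out entirely.

```typescript
// Hypothetical node shapes; the library's real types may differ.
type AstNode =
  | { kind: "open"; name: string }
  | { kind: "close"; name: string }
  | { kind: "text"; value: string }
  | { kind: "spacing" };

type SpacingKind = "LineIndent" | "LineDeindent" | "SpaceOrLine";

// Collapse whitespace runs inside text, then peel leading and trailing
// spaces off into spacing nodes, never placing two spacing nodes side by side.
function splitSpacing(nodes: AstNode[]): AstNode[] {
  const out: AstNode[] = [];
  const pushSpacing = (): void => {
    const last = out[out.length - 1];
    if (last === undefined || last.kind !== "spacing") out.push({ kind: "spacing" });
  };
  for (const node of nodes) {
    if (node.kind === "spacing") {
      pushSpacing();
      continue;
    }
    if (node.kind !== "text") {
      out.push(node);
      continue;
    }
    const collapsed = node.value.replace(/[ \n\t]+/g, " ");
    const core = collapsed.replace(/^ | $/g, ""); // at most one space per end remains
    if (collapsed.startsWith(" ")) pushSpacing();
    if (core.length > 0) {
      out.push({ kind: "text", value: core });
      if (collapsed.endsWith(" ")) pushSpacing();
    }
  }
  return out;
}

// Look backward and forward from the spacing node at index i.
function classifySpacing(nodes: AstNode[], i: number): SpacingKind {
  const prev = nodes[i - 1];
  const next = nodes[i + 1];
  const afterOpen = prev !== undefined && prev.kind === "open";
  const beforeClose = next !== undefined && next.kind === "close";
  if (afterOpen && !beforeClose) return "LineIndent";
  if (!afterOpen && beforeClose) return "LineDeindent";
  return "SpaceOrLine";
}

const ast = splitSpacing([
  { kind: "open", name: "p" },
  { kind: "text", value: "\n\tDear   John,\n" },
  { kind: "close", name: "p" },
]);
console.log(ast.map((n) => n.kind).join(" ")); // open spacing text spacing close
console.log(classifySpacing(ast, 1), classifySpacing(ast, 3)); // LineIndent LineDeindent
```

Note that a Spacing node sitting directly between an open tag and a close tag satisfies neither of the first two conditions, so it falls through to SpaceOrLine.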
## Definitions and Observations
- TEI XML prefers explicit spacing. It defines no standard for how implicit spaces are treated. Thus these formatting rules are specific to the renderer used in the Eartha M. M. White project. I would recommend using explicit spacing wherever possible.
- A single space is the same as multiple spaces. One Spacing node may be expanded to multiple.
- New lines and tabs are also treated as spaces.
- Block tags are tags that make their own spacing during rendering, thus ignoring the immediate spacing around them.
- Inline tags are tags that depend on the spacing near them. Having no space means the rendered text might be joined together. However, having even one space between multiple inline tags that aren't interrupted by text means that all of them can have spaces without changing the final layout.
- I have yet to encounter a tag that has asymmetrical spacing requirements, so for now we disregard that case.
- Ignore everything but open tags, close tags, and text nodes for now. Comments, CDATA, Processing Instructions, and the XML Declaration will be implemented at a later date.
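Because a space, a run of spaces, a new line, and a tab are all equivalent under these observations, a spacing position can be written either as a space or as a line break. The sketch below shows one way the final rendering step might use a width calculation to choose between the two. The `FmtNode` shapes, `flatWidth`, `render`, the `maxWidth` parameter, and the two-space indent are all hypothetical. It also assumes that every spacing node becomes exactly one space when its group fits on the line, which the library may handle differently.

```typescript
// Hypothetical formatting-tree shapes; the library's real types may differ.
type FmtNode =
  | { kind: "text"; value: string }
  | { kind: "lineIndent" }
  | { kind: "lineDeindent" }
  | { kind: "spaceOrLine" }
  | { kind: "group"; children: FmtNode[] };

// Width of a node if it were written on a single line.
function flatWidth(node: FmtNode): number {
  switch (node.kind) {
    case "text":
      return node.value.length;
    case "group":
      return node.children.reduce((sum, child) => sum + flatWidth(child), 0);
    default:
      return 1; // assumption: an unwrapped spacing node is one space
  }
}

function render(root: FmtNode, maxWidth: number, indentUnit = "  "): string {
  let out = "";
  let column = 0;
  let depth = 0;
  const newline = (): void => {
    out += "\n" + indentUnit.repeat(depth);
    column = indentUnit.length * depth;
  };
  const visit = (node: FmtNode, wrap: boolean): void => {
    switch (node.kind) {
      case "text":
        out += node.value;
        column += node.value.length;
        break;
      case "group": {
        // Wrap the group only if it does not fit on the rest of the line.
        const wrapGroup = column + flatWidth(node) > maxWidth;
        for (const child of node.children) visit(child, wrapGroup);
        break;
      }
      default:
        if (!wrap) {
          out += " ";
          column += 1;
          break;
        }
        if (node.kind === "lineIndent") depth += 1;
        if (node.kind === "lineDeindent") depth = Math.max(0, depth - 1);
        newline();
    }
  };
  visit(root, false);
  return out;
}

const doc: FmtNode = {
  kind: "group",
  children: [
    { kind: "text", value: "<p>" },
    { kind: "lineIndent" },
    { kind: "text", value: "Dear John," },
    { kind: "lineDeindent" },
    { kind: "text", value: "</p>" },
  ],
};
console.log(render(doc, 80)); // fits: everything stays on one line
console.log(render(doc, 10)); // too wide: spacing nodes become line breaks
```

Deciding per group rather than per spacing node keeps a wrapped element visually consistent: either all of its spacing positions break or none of them do.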
## Credits
Yorick Peterse - [How to write a code formatter](https://yorickpeterse.com/articles/how-to-write-a-code-formatter/)
Gerard Huet - [The Zipper](https://gallium.inria.fr/~huet/PUBLIC/zip.pdf)
TEI Council - [TEI Specification](https://tei-c.org/release/doc/tei-p5-doc/en/html/index.html)