TagSoup

TagSoup is [the fastest](#performance) pure JS SAX/DOM XML/HTML parser and serializer.

- Extremely low memory consumption.
- Tolerant of malformed tag nesting, missing end tags, etc.
- Recognizes CDATA sections, processing instructions, and DOCTYPE declarations.
- Supports both strict XML and forgiving HTML parsing modes.
- [20 kB gzipped](https://bundlephobia.com/result?p=tag-soup), including dependencies.
- Check out TagSoup dependencies: [Speedy Entities](https://github.com/smikhalevski/speedy-entities#readme)
and [Flyweight DOM](https://github.com/smikhalevski/flyweight-dom#readme).

```sh
npm install --save-prod tag-soup
```

- [API docs](https://smikhalevski.github.io/tag-soup/)
- [DOM parsing](#dom-parsing)
- [SAX parsing](#sax-parsing)
- [Tokenization](#tokenization)
- [Serialization](#serialization)
- [Performance](#performance)
- [Limitations](#limitations)

# DOM parsing

TagSoup exports preconfigured [`HTMLDOMParser`](https://smikhalevski.github.io/tag-soup/variables/HTMLDOMParser.html)
which parses HTML markup as a DOM node. This parser never throws errors during parsing and forgives malformed markup:

```ts
import { HTMLDOMParser, toHTML } from 'tag-soup';

const fragment = HTMLDOMParser.parseFragment('<p>hello<p>cool</br>');
// ⮕ DocumentFragment

toHTML(fragment);
// ⮕ '<p>hello</p><p>cool<br></p>'
```

`HTMLDOMParser` decodes both HTML entities and numeric character references with
[`decodeHTML`](https://smikhalevski.github.io/speedy-entities/variables/decodeHTML.html).

[`XMLDOMParser`](https://smikhalevski.github.io/tag-soup/variables/XMLDOMParser.html)
parses XML markup as a DOM node. It throws
[`ParserError`](https://smikhalevski.github.io/tag-soup/classes/ParserError.html) if the markup doesn't
conform to the XML spec:

```ts
import { XMLDOMParser, toXML } from 'tag-soup';

XMLDOMParser.parseFragment('<p>hello</br>');
// ❌ ParserError: Unexpected end tag.

const fragment = XMLDOMParser.parseFragment('<p>hello<br/></p>');
// ⮕ DocumentFragment

toXML(fragment);
// ⮕ '<p>hello<br/></p>'
```

`XMLDOMParser` decodes both XML entities and numeric character references with
[`decodeXML`](https://smikhalevski.github.io/speedy-entities/variables/decodeXML.html).
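
Because the XML parsers throw on malformed input, it can be convenient to turn the thrown error into a value. The helper below is a minimal sketch of that pattern. `tryParse` and `ParseResult` are hypothetical names used only for illustration and are not part of TagSoup:

```ts
// Result of a parse attempt: either the parsed value or the error that was thrown.
type ParseResult<T> = { ok: true; value: T } | { ok: false; error: Error };

// Hypothetical helper (not part of TagSoup): runs `parse` and converts a thrown
// error into a result object, so callers can branch without try/catch.
function tryParse<T>(parse: () => T): ParseResult<T> {
  try {
    return { ok: true, value: parse() };
  } catch (error) {
    // Normalize non-Error throwables so `error.message` is always available.
    return { ok: false, error: error instanceof Error ? error : new Error(String(error)) };
  }
}
```

For example, `tryParse(() => XMLDOMParser.parseFragment(markup))` would return `{ ok: false, error }` instead of throwing when the markup is rejected. Checking `error instanceof ParserError` should then distinguish parsing failures from other errors, assuming `ParserError` is imported from `tag-soup`.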

TagSoup uses [Flyweight DOM](https://github.com/smikhalevski/flyweight-dom#readme) nodes, which provide many standard
DOM manipulation features:

```ts
const document = HTMLDOMParser.parseDocument('<!DOCTYPE html><html>hello</html>');

document.doctype.name;
// ⮕ 'html'

document.textContent;
// ⮕ 'hello'
```

For example, you can use `TreeWalker` to traverse DOM nodes:

```ts
import { TreeWalker, NodeFilter } from 'flyweight-dom';
import { XMLDOMParser } from 'tag-soup';

const fragment = XMLDOMParser.parseFragment('<p>hello<br/></p>');

const treeWalker = new TreeWalker(fragment, NodeFilter.SHOW_TEXT);

treeWalker.nextNode();
// ⮕ Text { 'hello' }
```
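
A walker is usually drained in a loop rather than read once. The sketch below collects the text of every visited node. `TextWalker` and `collectText` are hypothetical names, and the sketch assumes that nodes expose `nodeValue` and that `nextNode()` returns `null` when the traversal ends, as in the standard DOM:

```ts
// Minimal structural view of a tree walker: `nextNode()` returns the next
// matching node, or `null` when the traversal is exhausted.
interface TextWalker {
  nextNode(): { nodeValue: string | null } | null;
}

// Hypothetical helper: drains the walker and returns the text of every visited node.
function collectText(walker: TextWalker): string[] {
  const chunks: string[] = [];

  for (let node = walker.nextNode(); node !== null; node = walker.nextNode()) {
    // Skip nodes that carry no value (for example, elements).
    if (node.nodeValue !== null) {
      chunks.push(node.nodeValue);
    }
  }
  return chunks;
}
```

With a Flyweight DOM walker this would be called as `collectText(new TreeWalker(fragment, NodeFilter.SHOW_TEXT))`.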

Create a custom DOM parser using
[`createDOMParser`](https://smikhalevski.github.io/tag-soup/functions/createDOMParser.html):

```ts
import { createDOMParser } from 'tag-soup';

const myParser = createDOMParser({
  voidTags: ['br'],
});

myParser.parseFragment('<p><br></p>');
// ⮕ DocumentFragment
```

# SAX parsing

TagSoup exports preconfigured [`HTMLSAXParser`](https://smikhalevski.github.io/tag-soup/variables/HTMLSAXParser.html)
which parses HTML markup and calls handler methods when a token is read. This parser never throws errors during parsing
and forgives malformed markup:

```ts
import { HTMLSAXParser } from 'tag-soup';

HTMLSAXParser.parseFragment('<p>hello<p>cool</br>', {
  onStartTagOpening(tagName) {
    // Called with 'p', 'p', and 'br'
  },
  onText(text) {
    // Called with 'hello' and 'cool'
  },
});
```
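
Handlers typically accumulate state while the parser runs. The following sketch gathers tag names and text in a single pass. `FragmentSAXParser` and `summarizeMarkup` are hypothetical names, and the interface is only a structural approximation of the handler methods shown in this section, not the library's actual type:

```ts
// Structural approximation of a SAX parser: only the two handler methods
// used in this sketch are declared.
interface FragmentSAXParser {
  parseFragment(
    markup: string,
    handler: {
      onStartTagOpening?(tagName: string): void;
      onText?(text: string): void;
    }
  ): void;
}

interface MarkupSummary {
  tagNames: string[];
  text: string;
}

// Hypothetical helper: records every start tag name and concatenates all text.
function summarizeMarkup(parser: FragmentSAXParser, markup: string): MarkupSummary {
  const tagNames: string[] = [];
  const textChunks: string[] = [];

  parser.parseFragment(markup, {
    onStartTagOpening(tagName) {
      tagNames.push(tagName);
    },
    onText(text) {
      textChunks.push(text);
    },
  });

  return { tagNames, text: textChunks.join('') };
}
```

Passing `HTMLSAXParser` as the first argument should work as long as its handler type is compatible with the interface above.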

[`XMLSAXParser`](https://smikhalevski.github.io/tag-soup/variables/XMLSAXParser.html) parses XML markup and calls
handler methods when a token is read. It throws
[`ParserError`](https://smikhalevski.github.io/tag-soup/classes/ParserError.html) if the markup doesn't conform to the XML spec:

```ts
import { XMLSAXParser } from 'tag-soup';

XMLSAXParser.parseFragment('<p>hello</br>', {});
// ❌ ParserError: Unexpected end tag.

XMLSAXParser.parseFragment('<p>hello<br/></p>', {
  onEndTag(tagName) {
    // Called with 'br' and 'p'
  },
});
```

Create a custom SAX parser using
[`createSAXParser`](https://smikhalevski.github.io/tag-soup/functions/createSAXParser.html):

```ts
import { createSAXParser } from 'tag-soup';

const myParser = createSAXParser({
  voidTags: ['br'],
});

myParser.parseFragment('<p><br></p>', {
  onStartTagOpening(tagName) {
    // Called with 'p' and 'br'
  },
});
```

# Tokenization

TagSoup exports preconfigured
[`HTMLTokenizer`](https://smikhalevski.github.io/tag-soup/variables/HTMLTokenizer.html) which tokenizes HTML markup and
invokes a callback when a token is read. This tokenizer never throws errors during tokenization and forgives malformed
markup:

```ts
import { HTMLTokenizer } from 'tag-soup';

HTMLTokenizer.tokenizeFragment('<p>hello<p>cool</br>', (token, startIndex, endIndex) => {
  // Handle token
});
```
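
The callback receives offsets rather than substrings, so the token source can be recovered by slicing the input. The sketch below does that. `FragmentTokenizer`, `TokenSpan`, and `collectTokenSpans` are hypothetical names, the token is typed as `unknown` because its exact type is not described here, and `endIndex` is assumed to be exclusive:

```ts
// Structural approximation of a tokenizer, based on the callback signature
// shown in this section.
interface FragmentTokenizer {
  tokenizeFragment(
    markup: string,
    callback: (token: unknown, startIndex: number, endIndex: number) => void
  ): void;
}

interface TokenSpan {
  token: unknown;
  startIndex: number;
  endIndex: number;
  source: string;
}

// Hypothetical helper: records every token together with the slice of the
// input that it covers. Assumes `endIndex` is exclusive, like `String.prototype.slice`.
function collectTokenSpans(tokenizer: FragmentTokenizer, markup: string): TokenSpan[] {
  const spans: TokenSpan[] = [];

  tokenizer.tokenizeFragment(markup, (token, startIndex, endIndex) => {
    spans.push({ token, startIndex, endIndex, source: markup.slice(startIndex, endIndex) });
  });

  return spans;
}
```

Calling `collectTokenSpans(HTMLTokenizer, markup)` and logging the result is a quick way to inspect which tokens a given input produces.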

[`XMLTokenizer`](https://smikhalevski.github.io/tag-soup/variables/XMLTokenizer.html) parses XML markup and invokes
a callback when a token is read. It throws
[`ParserError`](https://smikhalevski.github.io/tag-soup/classes/ParserError.html) if the markup doesn't conform to the XML spec:

```ts
import { XMLTokenizer } from 'tag-soup';

XMLTokenizer.tokenizeFragment('<p>hello</br>', (token, startIndex, endIndex) => {});
// ❌ ParserError: Unexpected end tag.

XMLTokenizer.tokenizeFragment('<p>hello<br/></p>', (token, startIndex, endIndex) => {
  // Handle token
});
```

Create a custom tokenizer using
[`createTokenizer`](https://smikhalevski.github.io/tag-soup/functions/createTokenizer.html):

```ts
import { createTokenizer } from 'tag-soup';

const myTokenizer = createTokenizer({
  voidTags: ['br'],
});

myTokenizer.tokenizeFragment('<p><br></p>', (token, startIndex, endIndex) => {
  // Handle token
});
```

# Serialization

TagSoup exports two preconfigured serializers:
[`toHTML`](https://smikhalevski.github.io/tag-soup/variables/toHTML.html) and
[`toXML`](https://smikhalevski.github.io/tag-soup/variables/toXML.html).

```ts
import { HTMLDOMParser, toHTML } from 'tag-soup';

const fragment = HTMLDOMParser.parseFragment('<p>hello<p>cool</br>');
// ⮕ DocumentFragment

toHTML(fragment);
// ⮕ '<p>hello</p><p>cool<br></p>'
```
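
Parsing followed by serialization amounts to markup normalization, and the two steps can be composed once and reused. This is a minimal sketch of that composition. `FragmentParser` and `createNormalizer` are hypothetical names, and the interface only assumes the `parseFragment(markup)` call shown in this document:

```ts
// Structural view of a DOM parser: anything with `parseFragment(markup)`.
interface FragmentParser<N> {
  parseFragment(markup: string): N;
}

// Hypothetical helper: composes a parser and a serializer into a single
// string-to-string function.
function createNormalizer<N>(
  parser: FragmentParser<N>,
  serialize: (node: N) => string
): (markup: string) => string {
  return markup => serialize(parser.parseFragment(markup));
}
```

For instance, `createNormalizer(HTMLDOMParser, toHTML)` would yield a function that accepts forgiving HTML and returns its serialized form, while `createNormalizer(XMLDOMParser, toXML)` would throw on input that the XML parser rejects.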

Create a custom serializer using
[`createSerializer`](https://smikhalevski.github.io/tag-soup/functions/createSerializer.html):

```ts
import { HTMLDOMParser, createSerializer } from 'tag-soup';

const mySerializer = createSerializer({
  voidTags: ['br'],
});

const fragment = HTMLDOMParser.parseFragment('<p>hello</br>');
// ⮕ DocumentFragment

mySerializer(fragment);
// ⮕ '<p>hello<br></p>'
```

# Performance

Execution performance is measured in operations per second (± 5%); higher is better.
Memory consumption (RAM) is measured in bytes; lower is better.

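| Library | Library size | DOM parsing, ops/sec | DOM parsing, RAM | SAX parsing, ops/sec | SAX parsing, RAM |
| --- | --- | --- | --- | --- | --- |
| tag-soup@3.0.0 | 20 kB | 26 Hz | 22 MB | 58 Hz | 22 kB |
| htmlparser2@10.0.0 | 58 kB | 19 Hz | 23 MB | 31 Hz | 10 MB |
| parse5@8.0.0 | 45 kB | 7 Hz | 105 MB | 12 Hz | 10 MB |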

Performance was measured when parsing [the 3.8 MB HTML file](./src/test/test.html).

Tests were conducted using [TooFast](https://github.com/smikhalevski/toofast#readme) on Apple M1 with Node.js v23.11.1.
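
The operations-per-second figures can be converted into approximate throughput by multiplying them by the input size. The sketch below does this arithmetic for the published numbers and also derives the speedup relative to TagSoup. The names are illustrative, and the results are rough estimates given the ± 5% measurement error:

```ts
interface BenchmarkRow {
  library: string;
  domHz: number; // DOM parsing, operations per second
  saxHz: number; // SAX parsing, operations per second
}

// Size of the benchmark input file, in megabytes.
const FILE_SIZE_MB = 3.8;

// Values copied from the performance table.
const tagSoup: BenchmarkRow = { library: 'tag-soup@3.0.0', domHz: 26, saxHz: 58 };

const rows: BenchmarkRow[] = [
  tagSoup,
  { library: 'htmlparser2@10.0.0', domHz: 19, saxHz: 31 },
  { library: 'parse5@8.0.0', domHz: 7, saxHz: 12 },
];

const summary = rows.map(row => ({
  library: row.library,
  // Throughput: parses per second multiplied by megabytes per parse.
  domMBps: row.domHz * FILE_SIZE_MB,
  saxMBps: row.saxHz * FILE_SIZE_MB,
  // How many times faster TagSoup is than this library.
  domSpeedup: tagSoup.domHz / row.domHz,
  saxSpeedup: tagSoup.saxHz / row.saxHz,
}));

for (const entry of summary) {
  console.log(
    `${entry.library}: DOM ${entry.domMBps.toFixed(1)} MB/s (x${entry.domSpeedup.toFixed(2)}), ` +
      `SAX ${entry.saxMBps.toFixed(1)} MB/s (x${entry.saxSpeedup.toFixed(2)})`
  );
}
```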

To reproduce [the performance test suite](./src/test/perf/overall.perf.js) results, clone this repo and run:

```shell
npm ci
npm run build
npm run perf
```

# Limitations

TagSoup doesn't resolve some quirky element structures that malformed HTML may cause.

Assume the following markup:

```html
<p><strong>okay
<p>nope
```

With [`DOMParser`](https://developer.mozilla.org/en-US/docs/Web/API/DOMParser) this markup would be transformed to:

```html
<p><strong>okay</strong></p>
<p><strong>nope</strong></p>
```

TagSoup doesn't insert the second `strong` tag:

```html
<p><strong>okay</strong></p>
<p>nope</p>
```