https://github.com/smikhalevski/tag-soup

🍜 The fastest pure JS SAX/DOM XML/HTML parser with streaming support.
dom html javascript parser sax xml

Host: GitHub
URL: https://github.com/smikhalevski/tag-soup
Owner: smikhalevski
License: mit
Created: 2020-05-31T13:12:24.000Z (about 6 years ago)
Default Branch: master
Last Pushed: 2023-03-05T03:22:38.000Z (over 3 years ago)
Last Synced: 2025-06-11T21:01:30.173Z (about 1 year ago)
Topics: dom, html, javascript, parser, sax, xml
Language: HTML
Homepage: https://smikhalevski.github.io/tag-soup
Size: 1.88 MB
Stars: 7
Watchers: 2
Forks: 2
Open Issues: 4
Metadata Files:
- Readme: README.md
- License: LICENSE.txt

TagSoup is [the fastest](#performance) pure JS SAX/DOM XML/HTML parser and serializer.

- Extremely low memory consumption.

- Tolerant of malformed tag nesting, missing end tags, etc.

- Recognizes CDATA sections, processing instructions, and DOCTYPE declarations.

- Supports both strict XML and forgiving HTML parsing modes.

- [20 kB gzipped](https://bundlephobia.com/result?p=tag-soup), including dependencies.

- Check out TagSoup dependencies: [Speedy Entities](https://github.com/smikhalevski/speedy-entities#readme) and [Flyweight DOM](https://github.com/smikhalevski/flyweight-dom#readme).

```sh
npm install --save-prod tag-soup
```

- [API docs](https://smikhalevski.github.io/tag-soup/)

- [DOM parsing](#dom-parsing)

- [SAX parsing](#sax-parsing)

- [Tokenization](#tokenization)

- [Serialization](#serialization)

- [Performance](#performance)

- [Limitations](#limitations)

# DOM parsing

TagSoup exports a preconfigured [`HTMLDOMParser`](https://smikhalevski.github.io/tag-soup/variables/HTMLDOMParser.html) that parses HTML markup into DOM nodes. This parser never throws during parsing and tolerates malformed markup:

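```ts
import { HTMLDOMParser, toHTML } from 'tag-soup';

// The second <p> implicitly closes the first; the stray </br> is treated as <br>
const fragment = HTMLDOMParser.parseFragment('<p>hello<p>cool</br>');
// ⮕ DocumentFragment

toHTML(fragment);
// ⮕ '<p>hello</p><p>cool<br></p>'
```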

`HTMLDOMParser` decodes both HTML entities and numeric character references with [`decodeHTML`](https://smikhalevski.github.io/speedy-entities/variables/decodeHTML.html).

[`XMLDOMParser`](https://smikhalevski.github.io/tag-soup/variables/XMLDOMParser.html) parses XML markup into DOM nodes. It throws a [`ParserError`](https://smikhalevski.github.io/tag-soup/classes/ParserError.html) if the markup doesn't satisfy the XML spec:

```ts
import { XMLDOMParser, toXML } from 'tag-soup';

// An end tag with no matching start tag is an error in XML mode
XMLDOMParser.parseFragment('<p>hello</br>');
// ❌ ParserError: Unexpected end tag.

const fragment = XMLDOMParser.parseFragment('<p>hello<br/></p>');
// ⮕ DocumentFragment

toXML(fragment);
// ⮕ '<p>hello<br/></p>'
```
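Because the XML parsers throw, callers that handle untrusted input may prefer a result value over an exception. The sketch below is one way to do that. `tryParse` and `ParseResult` are hypothetical names, not part of TagSoup, and the helper catches any thrown value rather than checking for `ParserError` specifically.

```ts
// Hypothetical helper (not part of TagSoup): converts a throwing parse
// function into one that returns a tagged result.
type ParseResult<T> = { ok: true; value: T } | { ok: false; error: unknown };

function tryParse<T>(parse: (input: string) => T, input: string): ParseResult<T> {
  try {
    return { ok: true, value: parse(input) };
  } catch (error) {
    // Assumption: every failure is reported by throwing, as described for XMLDOMParser
    return { ok: false, error };
  }
}

// Intended usage with the real parser (requires `tag-soup` to be installed):
//   const result = tryParse(input => XMLDOMParser.parseFragment(input), markup);
//   if (result.ok) { /* use result.value */ } else { /* report result.error */ }
```

Passing the parse function as an argument keeps the helper independent of which parser is used, so the same wrapper would also fit `XMLSAXParser` or `XMLTokenizer` calls wrapped in a closure.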

`XMLDOMParser` decodes both XML entities and numeric character references with [`decodeXML`](https://smikhalevski.github.io/speedy-entities/variables/decodeXML.html).

TagSoup uses [Flyweight DOM](https://github.com/smikhalevski/flyweight-dom#readme) nodes, which provide many standard DOM manipulation features:

```ts
import { HTMLDOMParser } from 'tag-soup';

const document = HTMLDOMParser.parseDocument('<!DOCTYPE html><html>hello</html>');

document.doctype.name;
// ⮕ 'html'

document.textContent;
// ⮕ 'hello'
```

For example, you can use `TreeWalker` to traverse DOM nodes:

```ts
import { XMLDOMParser } from 'tag-soup';
import { TreeWalker, NodeFilter } from 'flyweight-dom';

const fragment = XMLDOMParser.parseFragment('<p>hello<br/></p>');

// Visit text nodes only
const treeWalker = new TreeWalker(fragment, NodeFilter.SHOW_TEXT);

treeWalker.nextNode();
// ⮕ Text { 'hello' }
```
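To visit every matching node rather than only the first, the walker can be drained in a loop. The following is a minimal sketch that assumes `nextNode()` returns `null` once traversal is exhausted, as the standard DOM `TreeWalker` does. `collectNodes` and `NodeWalker` are hypothetical names, not part of TagSoup or Flyweight DOM.

```ts
// Minimal shape this sketch relies on; a TreeWalker is assumed to satisfy it.
interface NodeWalker<T> {
  nextNode(): T | null;
}

// Hypothetical helper: calls nextNode() until it returns null and
// returns the visited nodes in traversal order.
function collectNodes<T>(walker: NodeWalker<T>): T[] {
  const nodes: T[] = [];
  for (let node = walker.nextNode(); node !== null; node = walker.nextNode()) {
    nodes.push(node);
  }
  return nodes;
}

// Intended usage (requires `tag-soup` and `flyweight-dom` to be installed):
//   const textNodes = collectNodes(new TreeWalker(fragment, NodeFilter.SHOW_TEXT));
```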

Create a custom DOM parser using [`createDOMParser`](https://smikhalevski.github.io/tag-soup/functions/createDOMParser.html):

```ts
import { createDOMParser } from 'tag-soup';

const myParser = createDOMParser({
  // Tags that never have content or an end tag
  voidTags: ['br'],
});

myParser.parseFragment('<p><br></p>');
// ⮕ DocumentFragment
```

# SAX parsing

TagSoup exports a preconfigured [`HTMLSAXParser`](https://smikhalevski.github.io/tag-soup/variables/HTMLSAXParser.html) that parses HTML markup and calls handler methods as each token is read. This parser never throws during parsing and tolerates malformed markup:

```ts
import { HTMLSAXParser } from 'tag-soup';

HTMLSAXParser.parseFragment('<p>hello<p>cool</br>', {
  onStartTagOpening(tagName) {
    // Called with 'p', 'p', and 'br'
  },
  onText(text) {
    // Called with 'hello' and 'cool'
  },
});
```
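Since a handler is a plain object, it can carry state between callbacks. The sketch below shows one way to collect tag names and text while parsing. `createCollectingHandler` is a hypothetical helper, not part of TagSoup, and it uses only the `onStartTagOpening` and `onText` callbacks shown above.

```ts
// Hypothetical handler shape for this sketch; only the two callbacks
// documented above are assumed to exist.
interface CollectingHandler {
  tagNames: string[];
  texts: string[];
  onStartTagOpening(tagName: string): void;
  onText(text: string): void;
}

// Builds a handler that records every start tag name and text chunk
// in the order the parser reports them.
function createCollectingHandler(): CollectingHandler {
  const tagNames: string[] = [];
  const texts: string[] = [];
  return {
    tagNames,
    texts,
    onStartTagOpening(tagName) {
      tagNames.push(tagName);
    },
    onText(text) {
      texts.push(text);
    },
  };
}

// Intended usage (requires `tag-soup` to be installed):
//   const handler = createCollectingHandler();
//   HTMLSAXParser.parseFragment(markup, handler);
//   handler.tagNames and handler.texts then hold what was reported.
```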

[`XMLSAXParser`](https://smikhalevski.github.io/tag-soup/variables/XMLSAXParser.html) parses XML markup and calls handler methods as each token is read. It throws a [`ParserError`](https://smikhalevski.github.io/tag-soup/classes/ParserError.html) if the markup doesn't satisfy the XML spec:

```ts
import { XMLSAXParser } from 'tag-soup';

XMLSAXParser.parseFragment('<p>hello</br>', {});
// ❌ ParserError: Unexpected end tag.

XMLSAXParser.parseFragment('<p>hello<br/></p>', {
  onEndTag(tagName) {
    // Called with 'br' and 'p'
  },
});
```

Create a custom SAX parser using [`createSAXParser`](https://smikhalevski.github.io/tag-soup/functions/createSAXParser.html):

```ts
import { createSAXParser } from 'tag-soup';

const myParser = createSAXParser({
  voidTags: ['br'],
});

myParser.parseFragment('<p><br></p>', {
  onStartTagOpening(tagName) {
    // Called with 'p' and 'br'
  },
});
```

# Tokenization

TagSoup exports a preconfigured [`HTMLTokenizer`](https://smikhalevski.github.io/tag-soup/variables/HTMLTokenizer.html) that parses HTML markup and invokes a callback as each token is read. This tokenizer never throws during tokenization and tolerates malformed markup:

```ts
import { HTMLTokenizer } from 'tag-soup';

HTMLTokenizer.tokenizeFragment('<p>hello<p>cool</br>', (token, startIndex, endIndex) => {
  // Handle token
});
```
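The callback receives positions rather than substrings, so the token text has to be sliced from the input when it is needed. The sketch below records each token with its source text. It assumes `endIndex` is exclusive, which the text above does not state, and it keeps the token type generic because the concrete token values are not listed here. `createSpanRecorder` and `TokenSpan` are hypothetical names, not part of TagSoup.

```ts
// One recorded token together with the slice of input it covers.
interface TokenSpan<T> {
  token: T;
  startIndex: number;
  endIndex: number;
  source: string;
}

// Hypothetical helper: returns a tokenizer callback plus the list it fills.
function createSpanRecorder<T>(input: string) {
  const spans: TokenSpan<T>[] = [];
  const callback = (token: T, startIndex: number, endIndex: number): void => {
    // Assumption: endIndex is exclusive, matching String.prototype.slice
    spans.push({ token, startIndex, endIndex, source: input.slice(startIndex, endIndex) });
  };
  return { spans, callback };
}

// Intended usage (requires `tag-soup` to be installed):
//   const recorder = createSpanRecorder(markup);
//   HTMLTokenizer.tokenizeFragment(markup, recorder.callback);
```

Deferring the slice to the callback means no substrings are allocated for tokens the caller ignores.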

[`XMLTokenizer`](https://smikhalevski.github.io/tag-soup/variables/XMLTokenizer.html) parses XML markup and invokes a callback as each token is read. It throws a [`ParserError`](https://smikhalevski.github.io/tag-soup/classes/ParserError.html) if the markup doesn't satisfy the XML spec:

```ts
import { XMLTokenizer } from 'tag-soup';

XMLTokenizer.tokenizeFragment('<p>hello</br>', (token, startIndex, endIndex) => {});
// ❌ ParserError: Unexpected end tag.

XMLTokenizer.tokenizeFragment('<p>hello<br/></p>', (token, startIndex, endIndex) => {
  // Handle token
});
```

Create a custom tokenizer using [`createTokenizer`](https://smikhalevski.github.io/tag-soup/functions/createTokenizer.html):

```ts
import { createTokenizer } from 'tag-soup';

const myTokenizer = createTokenizer({
  voidTags: ['br'],
});

myTokenizer.tokenizeFragment('<p><br></p>', (token, startIndex, endIndex) => {
  // Handle token
});
```

# Serialization

TagSoup exports two preconfigured serializers: [`toHTML`](https://smikhalevski.github.io/tag-soup/variables/toHTML.html) and [`toXML`](https://smikhalevski.github.io/tag-soup/variables/toXML.html).

```ts
import { HTMLDOMParser, toHTML } from 'tag-soup';

const fragment = HTMLDOMParser.parseFragment('<p>hello<p>cool</br>');
// ⮕ DocumentFragment

toHTML(fragment);
// ⮕ '<p>hello</p><p>cool<br></p>'
```

Create a custom serializer using [`createSerializer`](https://smikhalevski.github.io/tag-soup/functions/createSerializer.html):

```ts
import { HTMLDOMParser, createSerializer } from 'tag-soup';

const mySerializer = createSerializer({
  // Tags serialized without an end tag
  voidTags: ['br'],
});

const fragment = HTMLDOMParser.parseFragment('<p>hello</br>');
// ⮕ DocumentFragment

mySerializer(fragment);
// ⮕ '<p>hello<br></p>'
```

# Performance

Execution performance is measured in operations per second (± 5%); higher is better.

Memory consumption (RAM) is measured in bytes; lower is better.

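| Library | Library size | DOM parsing, ops/sec | DOM parsing, RAM | SAX parsing, ops/sec | SAX parsing, RAM |
| --- | --- | --- | --- | --- | --- |
| tag-soup@3.0.0 | 20 kB | 26 Hz | 22 MB | 58 Hz | 22 kB |
| htmlparser2@10.0.0 | 58 kB | 19 Hz | 23 MB | 31 Hz | 10 MB |
| parse5@8.0.0 | 45 kB | 7 Hz | 105 MB | 12 Hz | 10 MB |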

Performance was measured when parsing [the 3.8 MB HTML file](./src/test/test.html).

Tests were conducted using [TooFast](https://github.com/smikhalevski/toofast#readme) on Apple M1 with Node.js v23.11.1.
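To put the throughput figures in relative terms, the ratios can be computed directly from the ops/sec values reported above. The snippet is only arithmetic over those published numbers, not a new measurement, and given the stated ± 5% margin the results should be read as approximate. `speedup` is a hypothetical helper name.

```ts
// Ops/sec values copied from the results above
const opsPerSecond = {
  'tag-soup': { dom: 26, sax: 58 },
  htmlparser2: { dom: 19, sax: 31 },
  parse5: { dom: 7, sax: 12 },
};

type Library = keyof typeof opsPerSecond;
type Mode = 'dom' | 'sax';

// How many times more operations per second tag-soup completes than `baseline`
function speedup(baseline: Library, mode: Mode): number {
  return opsPerSecond['tag-soup'][mode] / opsPerSecond[baseline][mode];
}

speedup('htmlparser2', 'dom').toFixed(2); // ⮕ '1.37'
speedup('htmlparser2', 'sax').toFixed(2); // ⮕ '1.87'
speedup('parse5', 'dom').toFixed(2); // ⮕ '3.71'
speedup('parse5', 'sax').toFixed(2); // ⮕ '4.83'
```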

To reproduce [the performance test suite](./src/test/perf/overall.perf.js) results, clone this repo and run:

```shell
npm ci
npm run build
npm run perf
```

# Limitations

TagSoup doesn't resolve some quirky element structures that malformed HTML may cause.

Assume the following markup:

```html
<p><strong>okay
<p>nope
```

With [`DOMParser`](https://developer.mozilla.org/en-US/docs/Web/API/DOMParser), this markup is transformed into:

```html
<p><strong>okay</strong></p>
<p><strong>nope</strong></p>
```

TagSoup doesn't insert the second `strong` tag:

```html
<p><strong>okay</strong></p>
<p>nope</p>
```
