https://github.com/smikhalevski/tag-soup
🍜 The fastest pure JS SAX/DOM XML/HTML parser with streaming support.
- Host: GitHub
- URL: https://github.com/smikhalevski/tag-soup
- Owner: smikhalevski
- License: mit
- Created: 2020-05-31T13:12:24.000Z (about 6 years ago)
- Default Branch: master
- Last Pushed: 2023-03-05T03:22:38.000Z (over 3 years ago)
- Last Synced: 2025-06-11T21:01:30.173Z (about 1 year ago)
- Topics: dom, html, javascript, parser, sax, xml
- Language: HTML
- Homepage: https://smikhalevski.github.io/tag-soup
- Size: 1.88 MB
- Stars: 7
- Watchers: 2
- Forks: 2
- Open Issues: 4
Metadata Files:
- Readme: README.md
- License: LICENSE.txt
README
TagSoup is [the fastest](#performance) pure JS SAX/DOM XML/HTML parser and serializer.
- Extremely low memory consumption.
- Tolerant of malformed tag nesting, missing end tags, etc.
- Recognizes CDATA sections, processing instructions, and DOCTYPE declarations.
- Supports both strict XML and forgiving HTML parsing modes.
- [20 kB gzipped](https://bundlephobia.com/result?p=tag-soup), including dependencies.
- Check out TagSoup dependencies: [Speedy Entities](https://github.com/smikhalevski/speedy-entities#readme)
and [Flyweight DOM](https://github.com/smikhalevski/flyweight-dom#readme).
```sh
npm install --save-prod tag-soup
```
- [API docs](https://smikhalevski.github.io/tag-soup/)
- [DOM parsing](#dom-parsing)
- [SAX parsing](#sax-parsing)
- [Tokenization](#tokenization)
- [Serialization](#serialization)
- [Performance](#performance)
- [Limitations](#limitations)
# DOM parsing
TagSoup exports a preconfigured [`HTMLDOMParser`](https://smikhalevski.github.io/tag-soup/variables/HTMLDOMParser.html),
which parses HTML markup into a DOM node. This parser never throws errors during parsing and forgives malformed markup:
```ts
import { HTMLDOMParser, toHTML } from 'tag-soup';
const fragment = HTMLDOMParser.parseFragment('<p>hello<p>cool</br>');
// ⮕ DocumentFragment
toHTML(fragment);
// ⮕ '<p>hello</p><p>cool<br></p>'
```
`HTMLDOMParser` decodes both HTML entities and numeric character references with
[`decodeHTML`](https://smikhalevski.github.io/speedy-entities/variables/decodeHTML.html).
[`XMLDOMParser`](https://smikhalevski.github.io/tag-soup/variables/XMLDOMParser.html)
parses XML markup into a DOM node. It throws
[`ParserError`](https://smikhalevski.github.io/tag-soup/classes/ParserError.html) if the markup doesn't
satisfy the XML spec:
```ts
import { XMLDOMParser, toXML } from 'tag-soup';
XMLDOMParser.parseFragment('<p>hello</br>');
// ❌ ParserError: Unexpected end tag.
const fragment = XMLDOMParser.parseFragment('<p>hello<br/></p>');
// ⮕ DocumentFragment
toXML(fragment);
// ⮕ '<p>hello<br/></p>'
```
`XMLDOMParser` decodes both XML entities and numeric character references with
[`decodeXML`](https://smikhalevski.github.io/speedy-entities/variables/decodeXML.html).
TagSoup uses [Flyweight DOM](https://github.com/smikhalevski/flyweight-dom#readme) nodes, which provide many standard
DOM manipulation features:
```ts
const document = HTMLDOMParser.parseDocument('<!DOCTYPE html><html>hello</html>');
document.doctype.name;
// ⮕ 'html'
document.textContent;
// ⮕ 'hello'
```
For example, you can use `TreeWalker` to traverse DOM nodes:
```ts
import { XMLDOMParser } from 'tag-soup';
import { TreeWalker, NodeFilter } from 'flyweight-dom';
const fragment = XMLDOMParser.parseFragment('<p>hello<br/></p>');
const treeWalker = new TreeWalker(fragment, NodeFilter.SHOW_TEXT);
treeWalker.nextNode();
// ⮕ Text { 'hello' }
```
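To visit every matching node rather than only the first, `nextNode` can be called in a loop until it returns `null`. The sketch below assumes that behavior (it is what the standard DOM `TreeWalker` does). The `NodeLike` and `WalkerLike` types and the `createFakeWalker` stand-in are hypothetical helpers, used only so the snippet runs without the library installed:

```ts
// Minimal structural types so the sketch runs without installing flyweight-dom.
// A real TreeWalker from 'flyweight-dom' is assumed to satisfy WalkerLike.
interface NodeLike {
  textContent: string | null;
}

interface WalkerLike {
  nextNode(): NodeLike | null;
}

// Drains the walker and returns the text content of every node it visits.
function collectText(walker: WalkerLike): string[] {
  const texts: string[] = [];
  for (let node = walker.nextNode(); node !== null; node = walker.nextNode()) {
    if (node.textContent !== null) {
      texts.push(node.textContent);
    }
  }
  return texts;
}

// Hypothetical stand-in for a walker created with NodeFilter.SHOW_TEXT.
function createFakeWalker(values: string[]): WalkerLike {
  let index = 0;
  return {
    nextNode() {
      const value = values[index++];
      return value === undefined ? null : { textContent: value };
    },
  };
}

const collectedTexts = collectText(createFakeWalker(['hello', 'cool']));
```

With the real library, `collectText(new TreeWalker(fragment, NodeFilter.SHOW_TEXT))` should behave the same way, assuming Flyweight DOM nodes expose `textContent` as in the example above.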
Create a custom DOM parser using
[`createDOMParser`](https://smikhalevski.github.io/tag-soup/functions/createDOMParser.html):
```ts
import { createDOMParser } from 'tag-soup';
const myParser = createDOMParser({
  voidTags: ['br'],
});
myParser.parseFragment('<p><br></p>');
// ⮕ DocumentFragment
```
# SAX parsing
TagSoup exports a preconfigured [`HTMLSAXParser`](https://smikhalevski.github.io/tag-soup/variables/HTMLSAXParser.html),
which parses HTML markup and calls handler methods when a token is read. This parser never throws errors during parsing
and forgives malformed markup:
```ts
import { HTMLSAXParser } from 'tag-soup';
HTMLSAXParser.parseFragment('<p>hello<p>cool</br>', {
  onStartTagOpening(tagName) {
    // Called with 'p', 'p', and 'br'
  },
  onText(text) {
    // Called with 'hello' and 'cool'
  },
});
```
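Because the handler is a plain object, it can close over state and accumulate results while parsing runs. The following is a minimal sketch of that pattern, limited to the two callbacks shown above. `Outline`, `OutlineHandler`, and `createOutlineHandler` are hypothetical names, and the callbacks are replayed by hand (in an assumed order) so the snippet runs without the library:

```ts
// Handler shape limited to the two callbacks used in the example above.
interface OutlineHandler {
  onStartTagOpening(tagName: string): void;
  onText(text: string): void;
}

interface Outline {
  tagNames: string[];
  texts: string[];
}

// Creates a handler that records every start tag name and text chunk it receives.
function createOutlineHandler(outline: Outline): OutlineHandler {
  return {
    onStartTagOpening(tagName) {
      outline.tagNames.push(tagName);
    },
    onText(text) {
      outline.texts.push(text);
    },
  };
}

const outline: Outline = { tagNames: [], texts: [] };
const outlineHandler = createOutlineHandler(outline);

// With the library installed, the handler would be passed as the second
// argument of HTMLSAXParser.parseFragment. Here the calls are replayed by hand;
// the interleaving of tag and text callbacks is an assumption.
outlineHandler.onStartTagOpening('p');
outlineHandler.onText('hello');
outlineHandler.onStartTagOpening('p');
outlineHandler.onText('cool');
outlineHandler.onStartTagOpening('br');
```

Keeping the accumulated state outside the handler makes it easy to reuse one handler factory across several parse calls.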
[`XMLSAXParser`](https://smikhalevski.github.io/tag-soup/variables/XMLSAXParser.html) parses XML markup and calls
handler methods when a token is read. It throws
[`ParserError`](https://smikhalevski.github.io/tag-soup/classes/ParserError.html) if the markup doesn't satisfy the XML spec:
```ts
import { XMLSAXParser } from 'tag-soup';
XMLSAXParser.parseFragment('<p>hello</br>', {});
// ❌ ParserError: Unexpected end tag.
XMLSAXParser.parseFragment('<p>hello<br/></p>', {
  onEndTag(tagName) {
    // Called with 'br' and 'p'
  },
});
```
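Since the strict XML parsers throw, callers that handle untrusted markup may want to turn the exception into a value. The sketch below shows one way to do that with a generic wrapper. `ParseResult` and `tryParse` are hypothetical helpers, not part of TagSoup, and a thrown `Error` stands in for `ParserError`, which is assumed here to extend `Error`:

```ts
type ParseResult<T> =
  | { ok: true; value: T }
  | { ok: false; message: string };

// Runs a parse callback and converts a thrown error into a plain result object.
function tryParse<T>(parse: () => T): ParseResult<T> {
  try {
    return { ok: true, value: parse() };
  } catch (error) {
    const message = error instanceof Error ? error.message : String(error);
    return { ok: false, message };
  }
}

// Hypothetical stand-in for a strict parser call that fails like the example above.
const failedResult = tryParse<string>(() => {
  throw new Error('Unexpected end tag.');
});

// Hypothetical stand-in for a parser call that succeeds.
const succeededResult = tryParse(() => 'parsed');
```

With the library installed, the callback would wrap the real call, for example `tryParse(() => XMLSAXParser.parseFragment(markup, handler))`.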
Create a custom SAX parser using
[`createSAXParser`](https://smikhalevski.github.io/tag-soup/functions/createSAXParser.html):
```ts
import { createSAXParser } from 'tag-soup';
const myParser = createSAXParser({
  voidTags: ['br'],
});
myParser.parseFragment('<p><br></p>', {
  onStartTagOpening(tagName) {
    // Called with 'p' and 'br'
  },
});
```
# Tokenization
TagSoup exports a preconfigured
[`HTMLTokenizer`](https://smikhalevski.github.io/tag-soup/variables/HTMLTokenizer.html), which parses HTML markup and
invokes a callback when a token is read. This tokenizer never throws errors during tokenization and forgives malformed
markup:
```ts
import { HTMLTokenizer } from 'tag-soup';
HTMLTokenizer.tokenizeFragment('<p>hello<p>cool</br>', (token, startIndex, endIndex) => {
  // Handle token
});
```
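The callback receives offsets rather than substrings, so the text of a token has to be read from the source string. The sketch below assumes that `endIndex` is exclusive, like the second argument of `String.prototype.slice`, and keeps the token type generic because its exact shape is not shown here. `TokenSlice` and `createSliceCollector` are hypothetical names, and the token names and offsets are made up for illustration:

```ts
// The token type is generic because its exact shape is not shown above.
type TokenCallback<T> = (token: T, startIndex: number, endIndex: number) => void;

interface TokenSlice<T> {
  token: T;
  text: string;
}

// Returns a callback that records each token together with the source text it covers.
// Assumes endIndex is exclusive, matching String.prototype.slice.
function createSliceCollector<T>(source: string, slices: TokenSlice<T>[]): TokenCallback<T> {
  return (token, startIndex, endIndex) => {
    slices.push({ token, text: source.slice(startIndex, endIndex) });
  };
}

const sampleSource = 'hello cool';
const tokenSlices: TokenSlice<string>[] = [];
const sliceCollector = createSliceCollector(sampleSource, tokenSlices);

// Hypothetical token names and offsets, replayed by hand in place of a real tokenizer.
sliceCollector('WORD', 0, 5);
sliceCollector('WORD', 6, 10);
```

With the library installed, the collector would be passed as the second argument of `tokenizeFragment`, with the same source string given to both.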
[`XMLTokenizer`](https://smikhalevski.github.io/tag-soup/variables/XMLTokenizer.html) parses XML markup and invokes
a callback when a token is read. It throws
[`ParserError`](https://smikhalevski.github.io/tag-soup/classes/ParserError.html) if the markup doesn't satisfy the XML spec:
```ts
import { XMLTokenizer } from 'tag-soup';
XMLTokenizer.tokenizeFragment('<p>hello</br>', (token, startIndex, endIndex) => {});
// ❌ ParserError: Unexpected end tag.
XMLTokenizer.tokenizeFragment('<p>hello<br/></p>', (token, startIndex, endIndex) => {
  // Handle token
});
```
Create a custom tokenizer using
[`createTokenizer`](https://smikhalevski.github.io/tag-soup/functions/createTokenizer.html):
```ts
import { createTokenizer } from 'tag-soup';
const myTokenizer = createTokenizer({
  voidTags: ['br'],
});
myTokenizer.tokenizeFragment('<p><br></p>', (token, startIndex, endIndex) => {
  // Handle token
});
```
# Serialization
TagSoup exports two preconfigured serializers:
[`toHTML`](https://smikhalevski.github.io/tag-soup/variables/toHTML.html) and
[`toXML`](https://smikhalevski.github.io/tag-soup/variables/toXML.html).
```ts
import { HTMLDOMParser, toHTML } from 'tag-soup';
const fragment = HTMLDOMParser.parseFragment('<p>hello<p>cool</br>');
// ⮕ DocumentFragment
toHTML(fragment);
// ⮕ '<p>hello</p><p>cool<br></p>'
```
Create a custom serializer using
[`createSerializer`](https://smikhalevski.github.io/tag-soup/functions/createSerializer.html):
```ts
import { HTMLDOMParser, createSerializer } from 'tag-soup';
const mySerializer = createSerializer({
  voidTags: ['br'],
});
const fragment = HTMLDOMParser.parseFragment('<p>hello</br>');
// ⮕ DocumentFragment
mySerializer(fragment);
// ⮕ '<p>hello<br></p>'
```
# Performance
Execution performance is measured in operations per second (± 5%); higher is better.
Memory consumption (RAM) is measured in bytes; lower is better.
| Library | DOM parsing, ops/sec | DOM parsing, RAM | SAX parsing, ops/sec | SAX parsing, RAM |
| --- | --- | --- | --- | --- |
| tag-soup@3.0.0 | 26 Hz | 22 MB | 58 Hz | 22 kB |
| htmlparser2@10.0.0 | 19 Hz | 23 MB | 31 Hz | 10 MB |
| parse5@8.0.0 | 7 Hz | 105 MB | 12 Hz | 10 MB |
Performance was measured when parsing [the 3.8 MB HTML file](./src/test/test.html).
Tests were conducted using [TooFast](https://github.com/smikhalevski/toofast#readme) on Apple M1 with Node.js v23.11.1.
To reproduce [the performance test suite](./src/test/perf/overall.perf.js) results, clone this repo and run:
```shell
npm ci
npm run build
npm run perf
```
# Limitations
TagSoup doesn't resolve some of the quirky element structures that malformed HTML may cause.
Consider the following markup:
```html
<p><strong>okay
<p>nope
```
With [`DOMParser`](https://developer.mozilla.org/en-US/docs/Web/API/DOMParser) this markup would be transformed to:
```html
<p><strong>okay</strong></p>
<p><strong>nope</strong></p>
```
TagSoup doesn't insert the second `strong` tag:
```html
<p><strong>okay</strong></p>
<p>nope</p>
```
