https://github.com/michaelfranzl/sanitize-dom

Isomorphic library for recursive manipulation of live WHATWG DOMs.
Host: GitHub
URL: https://github.com/michaelfranzl/sanitize-dom
Owner: michaelfranzl
License: mit
Created: 2017-11-20T11:44:10.000Z (over 7 years ago)
Default Branch: master
Last Pushed: 2023-03-04T05:52:42.000Z (over 2 years ago)
Last Synced: 2024-08-09T03:53:49.160Z (10 months ago)
Topics: dom, html, recursive-algorithm, sanitization, sanitize-html, sanitizer, whatwg-dom
Language: JavaScript
Homepage:
Size: 398 KB
Stars: 5
Watchers: 2
Forks: 1
Open Issues: 8
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE

# sanitize-dom

![Test](https://github.com/michaelfranzl/sanitize-dom/workflows/Test/badge.svg?branch=master)

Recursive sanitizer/filter to manipulate live [WHATWG DOM](https://dom.spec.whatwg.org)s rather than HTML, for the browser and Node.js.

## Rationale

Direct DOM manipulation has gotten a bad reputation in the last decade of web development. From Ruby on Rails to React, the DOM was seen as something to gloriously destroy and re-render from the server or even from the browser. Never mind that the browser had already spent considerable effort parsing HTML and constructing this tree! Mind-numbingly complex regular expression tests and string manipulations had to deal with low-level details of the HTML syntax to insert, delete and change elements, sometimes on every keystroke! In contrast, DOM functions like `createElement`, `remove` and `insertBefore` were largely unknown and unused, except perhaps in jQuery.

Processing of HTML is **destructive**: The original DOM is destroyed and garbage collected after some delay. Attached event handlers are detached and garbage collected. A completely new DOM is created by parsing the new HTML set via `.innerHTML =`. Event listeners have to be re-attached from userland (this is not an issue when using `on*` HTML attributes, but those have disadvantages of their own).

*It doesn't have to be this way. Do not eliminate, but manipulate!*

### Save the (DOM) trees!

`sanitize-dom` crawls a DOM subtree (beginning from a given node, all the way down to its descendant leaves) and filters and manipulates it non-destructively. This is very efficient: The browser doesn't have to re-render everything; it only re-renders what has *changed* (sound familiar from React?).

The benefits of direct DOM manipulation:

* Nodes stay alive.

* References to nodes (e.g. stored in a `Map` or `WeakMap`) stay alive.

* Already attached event handlers stay alive.

* The browser doesn't have to re-render entire sections of a page; thus no flickering, no scroll jumping, no big CPU spikes.

* CPU cycles for repeatedly parsing and dumping of HTML are eliminated.

`sanitize-dom`'s further advantages:

* No dependencies.

* Small footprint (only about 7 kB minified).

* Faster than other HTML sanitizers because there is no HTML parsing and serialization.

## Use cases

Aside from the browser, `sanitize-dom` can also be used in Node.js by supplying WHATWG DOM implementations like [jsdom](https://github.com/tmpvar/jsdom).

The [test file](test/run-tests.js) describes additional usage patterns and features.

For the usage examples below, I'll use `sanitizeHtml` just to be able to illustrate the HTML output.

By default, all tags are 'flattened', i.e. only their inner text is kept:

```javascript
sanitizeHtml(document, '
abc def');

"abc def"
```

Selective joining of same-tag siblings:

```javascript
// Joins the two I tags.
sanitizeHtml(document, 'Hello world! Goodbye world!', {
  allow_tags_deep: { '.*': '.*' },
  join_siblings: ['I'],
});

"Hello world! Goodbye world!"
```

Removal of redundant nested nodes (ubiquitous when using a WYSIWYG `contenteditable` editor):

```javascript
sanitizeHtml(document, 'Hello world! Goodbye world!', {
  allow_tags_deep: { '.*': '.*' },
  flatten_tags_deep: { i: 'i' },
});

"Hello  world! Goodbye world!"
```

Remove redundant empty tags:

```javascript
sanitizeHtml(document, 'Hello world!', {
  allow_tags_deep: { '.*': '.*' },
  remove_empty: true,
});

"Hello world!"
```

By default, all classes and attributes are removed:

```javascript
// Keep all nodes, but remove all of their attributes and classes:
sanitizeHtml(document, '
abc def', {
  allow_tags_deep: { '.*': '.*' },
});

"abc def"
```

Keep all nodes and all their attributes and classes:

```javascript
sanitizeHtml(document, '
abc def', {
  allow_tags_deep: { '.*': '.*' },
  allow_attributes_by_tag: { '.*': '.*' },
  allow_classes_by_tag: { '.*': '.*' },
});

'abc def'
```

White-listing of classes and attributes:

```javascript
// Keep only data- attributes and 'green' classes
sanitizeHtml(document, '
abc def', {
  allow_tags_deep: { '.*': '.*' },
  allow_attributes_by_tag: { '.*': 'data-.*' },
  allow_classes_by_tag: { '.*': 'green' },
});

'abc def'
```

White-listing of node tags to keep:

```javascript
// Keep only B tags anywhere in the document.
sanitizeHtml(document, 'abc def ghi', {
  allow_tags_deep: { '.*': '^b$' },
});

"abc def ghi"

// Keep only DIV children of BODY and I children of DIV.
sanitizeHtml(document, '
 abc def

 ghi', {
  allow_tags_direct: {
    body: 'div',
    div: '^i',
  },
});

" abc def ghi"
```

Selective flattening of nodes:

```javascript
// Flatten only EM children of DIV.
sanitizeHtml(document, '
 abc def

 ghi', {
  allow_tags_deep: { '.*': '.*' },
  flatten_tags_direct: {
    div: 'em',
  },
});

" abc def
 ghi"

// Flatten I tags anywhere in the document.
sanitizeHtml(document, '
 abc def

 ghi', {
  allow_tags_deep: { '.*': '.*' },
  flatten_tags_deep: {
    '.*': '^i',
  },
});

" abc def

 ghi"
```

Selective removal of tags:

```javascript
// Remove I children of DIVs.
sanitizeHtml(document, '
 abc def

 ghi', {
  allow_tags_deep: { '.*': '.*' },
  remove_tags_direct: {
    'div': 'i',
  },
});

"  def

 ghi"
```

Sometimes there is more than one way to accomplish the same result, as shown in this advanced example:

```javascript
// Keep all tags except B, anywhere in the document. Two different solutions:
sanitizeHtml(document, '
 abc def ghi ', {
  allow_tags_deep: { '.*': '.*' },
  flatten_tags_deep: { '.*': 'B' },
});

" abc def ghi 
"

sanitizeHtml(document, '
 abc def ghi ', {
  allow_tags_deep: { '.*': '^((?!b).)*$' }
});

" abc def ghi "
```

And finally, filter functions allow ultimate flexibility:

```javascript
// change B node to EM node with contextual inner text; attach an event listener.
sanitizeHtml(document, '
abc def ghi', {
  allow_tags_direct: {
    '.*': '.*',
  },
  filters_by_tag: {
    B: [
      function changesToEm(node, { parentNodes, parentNodenames, siblingIndex }) {
        const em = document.createElement('em');
        const text = `${parentNodenames.join(', ')} - ${siblingIndex}`;
        em.innerHTML = text;
        em.addEventListener('click', () => alert(text));
        return em;
      },
    ],
  },
});

// In a browser, the EM tags would be clickable and an alert box would pop up.
"abc I, P, BODY - 0 I, P, BODY - 2"
```

## Tests

Run in Node.js:

```sh
npm test
```

For the browser, run:

```sh
cd sanitize-dom
npm i -g [email protected] http-server
jspm install @jspm/[email protected]
http-server
```

Then, in a browser which supports `` (e.g. Google Chrome version >= 81), browse to http://127.0.0.1:8080/test

# API Reference

## Functions

* `sanitizeNode(doc, node, [opts], [nodePropertyMap])`

  Simple wrapper for sanitizeDom. Processes the node and its childNodes recursively.

* `sanitizeChildNodes(doc, node, [opts], [nodePropertyMap])`

  Simple wrapper for sanitizeDom. Processes only the node's childNodes recursively, but not the node itself.

* `sanitizeHtml(doc, html, [opts], [isDocument], [nodePropertyMap]) ⇒ String`

  Simple wrapper for sanitizeDom. Instead of a DomNode, it takes an HTML string.

* `sanitizeDom(doc, contextNode, [opts], [childrenOnly], [nodePropertyMap])`

  This function is not exported: Please use the wrapper functions instead: sanitizeHtml, sanitizeNode, and sanitizeChildNodes.

  Recursively processes a tree with `node` at the root.

  In all descriptions, the term "flatten" means that a node is replaced with the node's childNodes. For example, if the B node in `<i>abc<b>def<u>ghi</u></b></i>` is flattened, the result is `<i>abcdef<u>ghi</u></i>`.

  Each node is processed in the following sequence:

  1. Filters matching the `opts.filters_by_tag` spec are called. If the filter returns `null`, the node is removed and processing stops (see filters).
  2. If the `opts.remove_tags_*` spec matches, the node is removed and processing stops.
  3. If the `opts.flatten_tags_*` spec matches, the node is flattened and processing stops.
  4. If the `opts.allow_tags_*` spec matches:
      * All attributes not matching `opts.allow_attributes_by_tag` are removed.
      * All class names not matching `opts.allow_classes_by_tag` are removed.
      * The node is kept and processing stops.
  5. The node is flattened.


## Typedefs

* `DomDocument : Object`

  Implements the WHATWG DOM Document interface. In the browser, this is `window.document`. In Node.js, this may for example be `new JSDOM().window.document`.

* `DomNode : Object`

  Implements the WHATWG DOM Node interface. Custom properties for each node can be stored in a `WeakMap` passed as option `nodePropertyMap` to one of the sanitize functions.

* `Tagname : string`

  Node tag name. Even though in the WHATWG DOM text nodes (nodeType 3) have a tag name `#text`, these are referred to by the simpler string 'TEXT' for convenience.

* `Regex : string`

  A string which is compiled to a case-insensitive regular expression `new RegExp(regex, 'i')`. The regular expression is used to match a Tagname.

* `ParentChildSpec : Object.<Regex, Array.<Regex>>`

  Property names are matched against a (direct or ancestral) parent node's Tagname. Associated values are matched against the current node's Tagname.

* `TagAttributeNameSpec : Object.<Regex, Array.<Regex>>`

  Property names are matched against the current node's Tagname. Associated values are used to match its attribute names.

* `TagClassNameSpec : Object.<Regex, Array.<Regex>>`

  Property names are matched against the current node's Tagname. Associated values are used to match its class names.

* `FilterSpec : Object.<Regex, Array.<filter>>`

  Property names are matched against node Tagnames. Associated values are the filters which are run on the node.

* `filter ⇒ DomNode | Array.<DomNode> | null`

  Filter functions can either...

  1. return the same node (the first argument),
  2. return a single, or an Array of, newly created DomNode(s), in which case `node` is replaced with the new node(s),
  3. return `null`, in which case `node` is removed.

  Note that newly generated DomNode(s) are processed by running sanitizeDom on them, as if they had been part of the original tree. This has the following implication:

  If a filter returns a newly generated DomNode with the same Tagname as `node`, it would cause the same filter to be called again, which may lead to an infinite loop if the filter always returns the same result (this would be a badly behaved filter). To protect against infinite loops, the author of the filter must acknowledge this circumstance by setting a boolean property called `skip_filters` for the DomNode (in a `WeakMap` which the caller must provide to one of the sanitize functions as the argument `nodePropertyMap`). If `skip_filters` is not set, an error is thrown. With well-behaved filters it is possible to continue subsequent processing of the returned node without causing an infinite loop.


## sanitizeNode(doc, node, [opts], [nodePropertyMap])

Simple wrapper for [sanitizeDom](#sanitizeDom). Processes the node and its childNodes recursively.

**Kind**: global function  

| Param | Type | Default | Description |
| --- | --- | --- | --- |
| doc | [DomDocument](#DomDocument) |  |  |
| node | [DomNode](#DomNode) |  |  |
| [opts] | Object | {} |  |
| [nodePropertyMap] | WeakMap.<DomNode, Object> | new WeakMap() | Additional node properties |



## sanitizeChildNodes(doc, node, [opts], [nodePropertyMap])

Simple wrapper for [sanitizeDom](#sanitizeDom). Processes only the node's childNodes recursively, but not the node itself.

**Kind**: global function  

| Param | Type | Default | Description |
| --- | --- | --- | --- |
| doc | [DomDocument](#DomDocument) |  |  |
| node | [DomNode](#DomNode) |  |  |
| [opts] | Object | {} |  |
| [nodePropertyMap] | WeakMap.<DomNode, Object> | new WeakMap() | Additional node properties |



## sanitizeHtml(doc, html, [opts], [isDocument], [nodePropertyMap]) ⇒ String

Simple wrapper for [sanitizeDom](#sanitizeDom). Instead of a DomNode, it takes an HTML string.

**Kind**: global function  

**Returns**: String - The processed HTML  

| Param | Type | Default | Description |
| --- | --- | --- | --- |
| doc | [DomDocument](#DomDocument) |  |  |
| html | string |  |  |
| [opts] | Object | {} |  |
| [isDocument] | Boolean | false | Set this to `true` if you are passing an entire HTML document (beginning with the `<html>` tag). The context node name will be HTML. If `false`, then the context node name will be BODY. |
| [nodePropertyMap] | WeakMap.<DomNode, Object> | new WeakMap() | Additional node properties |



## sanitizeDom(doc, contextNode, [opts], [childrenOnly], [nodePropertyMap])

This function is not exported: Please use the wrapper functions instead: [sanitizeHtml](#sanitizeHtml), [sanitizeNode](#sanitizeNode), and [sanitizeChildNodes](#sanitizeChildNodes).

Recursively processes a tree with `node` at the root.

In all descriptions, the term "flatten" means that a node is replaced with the node's childNodes.

For example, if the B node in `<i>abc<b>def<u>ghi</u></b></i>` is flattened, the result is `<i>abcdef<u>ghi</u></i>`.
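
The flattening operation can be sketched on a minimal stand-in tree, so that the snippet runs without a DOM. The `flatten` and `serialize` helpers and the `{ tag, children }` node shape are assumptions made for illustration and are not part of `sanitize-dom`; on a real DOM, a similar effect could be had with `node.replaceWith(...node.childNodes)`.

```javascript
// Stand-in nodes (hypothetical): elements are { tag, children }, text nodes are strings.
const tree = {
  tag: 'I',
  children: ['abc', { tag: 'B', children: ['def', { tag: 'U', children: ['ghi'] }] }],
};

// Replace the child at `index` with that child's own children.
function flatten(parent, index) {
  const node = parent.children[index];
  parent.children.splice(index, 1, ...node.children);
}

// Serialize the stand-in tree to an HTML-like string for inspection.
function serialize(node) {
  if (typeof node === 'string') return node;
  const tag = node.tag.toLowerCase();
  return `<${tag}>${node.children.map(serialize).join('')}</${tag}>`;
}

flatten(tree, 1); // flatten the B node
console.log(serialize(tree)); // <i>abcdef<u>ghi</u></i>
```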

Each node is processed in the following sequence:

1. Filters matching the `opts.filters_by_tag` spec are called. If the filter returns `null`, the
   node is removed and processing stops (see [filter](#filter)s).
2. If the `opts.remove_tags_*` spec matches, the node is removed and processing stops.
3. If the `opts.flatten_tags_*` spec matches, the node is flattened and processing stops.
4. If the `opts.allow_tags_*` spec matches:
    * All attributes not matching `opts.allow_attributes_by_tag` are removed.
    * All class names not matching `opts.allow_classes_by_tag` are removed.
    * The node is kept and processing stops.
5. The node is flattened.
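
Steps 2 to 5 of this sequence can be sketched as a pure decision function. This is an illustration under stated assumptions, not the library's implementation: `specHit` and `decide` are hypothetical names, only element nodes are considered, filters (step 1) and the default value of `remove_tags_deep` are left out, a `*_direct` spec is assumed to consult only the nearest parent, and a `*_deep` spec is assumed to consult every ancestor. Spec values are accepted as a string or an array because the usage examples pass plain strings.

```javascript
const test = (regex, name) => new RegExp(regex, 'i').test(name);

// parentNodenames lists the nearest parent first, e.g. ['I', 'P', 'BODY'].
function specHit(spec = {}, parentNodenames, tagname, deep) {
  const parents = deep ? parentNodenames : parentNodenames.slice(0, 1);
  return Object.entries(spec).some(
    ([parentRegex, childRegexes]) =>
      parents.some((p) => test(parentRegex, p)) &&
      [].concat(childRegexes).some((c) => test(c, tagname)),
  );
}

// Returns what would happen to an element node: 'remove', 'flatten' or 'keep'.
function decide(tagname, parentNodenames, opts = {}) {
  const hit = (kind) =>
    specHit(opts[`${kind}_tags_direct`], parentNodenames, tagname, false) ||
    specHit(opts[`${kind}_tags_deep`], parentNodenames, tagname, true);
  if (hit('remove')) return 'remove';   // step 2
  if (hit('flatten')) return 'flatten'; // step 3
  if (hit('allow')) return 'keep';      // step 4
  return 'flatten';                     // step 5: the default
}

const opts = { allow_tags_deep: { '.*': '.*' }, remove_tags_direct: { div: 'i' } };
console.log(decide('B', ['DIV', 'BODY']));            // flatten
console.log(decide('I', ['DIV', 'BODY'], opts));      // remove
console.log(decide('I', ['P', 'DIV', 'BODY'], opts)); // keep
```

The order of the checks is what matters here: a node that matches both a remove spec and an allow spec is removed, because removal is tested first.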

**Kind**: global function  

| Param | Type | Default | Description |
| --- | --- | --- | --- |
| doc | [DomDocument](#DomDocument) |  | The document |
| contextNode | [DomNode](#DomNode) |  | The root node |
| [opts] | Object | {} | Options for processing. |
| [opts.filters_by_tag] | [FilterSpec](#FilterSpec) | {} | Matching filters are called with the node. |
| [opts.remove_tags_direct] | [ParentChildSpec](#ParentChildSpec) | {} | Matching nodes which are a direct child of the matching parent node are removed. |
| [opts.remove_tags_deep] | [ParentChildSpec](#ParentChildSpec) | {'.*': ['style','script','textarea','noscript']} | Matching nodes which are anywhere below the matching parent node are removed. |
| [opts.flatten_tags_direct] | [ParentChildSpec](#ParentChildSpec) | {} | Matching nodes which are a direct child of the matching parent node are flattened. |
| [opts.flatten_tags_deep] | [ParentChildSpec](#ParentChildSpec) | {} | Matching nodes which are anywhere below the matching parent node are flattened. |
| [opts.allow_tags_direct] | [ParentChildSpec](#ParentChildSpec) | {} | Matching nodes which are a direct child of the matching parent node are kept. |
| [opts.allow_tags_deep] | [ParentChildSpec](#ParentChildSpec) | {} | Matching nodes which are anywhere below the matching parent node are kept. |
| [opts.allow_attributes_by_tag] | [TagAttributeNameSpec](#TagAttributeNameSpec) | {} | Matching attribute names of a matching node are kept. Other attributes are removed. |
| [opts.allow_classes_by_tag] | [TagClassNameSpec](#TagClassNameSpec) | {} | Matching class names of a matching node are kept. Other class names are removed. If no class names remain, the class attribute is removed. |
| [opts.remove_empty] | boolean | false | Remove nodes which are completely empty |
| [opts.join_siblings] | [Array.<Tagname>](#Tagname) | [] | Join same-tag sibling nodes of given tag names, unless they are separated by non-whitespace textNodes. |
| [childrenOnly] | boolean | false | If false, then the node itself and its descendants are processed recursively. If true, then only the children and their descendants are processed recursively, but not the node itself (use when `node` is `BODY` or `DocumentFragment`). |
| [nodePropertyMap] | WeakMap.<DomNode, Object> | new WeakMap() | Additional properties for a [DomNode](#DomNode) can be stored in an object and will be looked up in this map. The properties of the object and their meaning: `skip`: If truthy, disables all processing for this node. `skip_filters`: If truthy, disables all filters for this node. `skip_classes`: If truthy, disables processing classes of this node. `skip_attributes`: If truthy, disables processing attributes of this node. See tests for usage details. |



## DomDocument : Object

Implements the WHATWG DOM Document interface.

In the browser, this is `window.document`. In Node.js, this may for example be [new JSDOM().window.document](https://github.com/tmpvar/jsdom).

**Kind**: global typedef  

**See**: [https://dom.spec.whatwg.org/#interface-document](https://dom.spec.whatwg.org/#interface-document)  



## DomNode : Object

Implements the WHATWG DOM Node interface.

Custom properties for each node can be stored in a `WeakMap` passed as option `nodePropertyMap` to one of the sanitize functions.

**Kind**: global typedef  

**See**: [https://dom.spec.whatwg.org/#interface-node](https://dom.spec.whatwg.org/#interface-node)  



## Tagname : string

Node tag name.

Even though in the WHATWG DOM text nodes (nodeType 3) have a tag name `#text`, these are referred to by the simpler string 'TEXT' for convenience.

**Kind**: global typedef  

**Example**  

```js
'DIV'
'H1'
'TEXT'
```



## Regex : string

A string which is compiled to a case-insensitive regular expression `new RegExp(regex, 'i')`.

The regular expression is used to match a [Tagname](#Tagname).

**Kind**: global typedef  

**Example**  

```js
'.*'           // matches any tag
'DIV'          // matches DIV
'(DIV|H[1-3])' // matches DIV, H1, H2 and H3
'P'            // matches P and SPAN
'^P$'          // matches P but not SPAN
'TEXT'         // matches text nodes (nodeType 3)
```
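
Because the expression is compiled without anchors, a short pattern also matches longer tag names that merely contain it. The following sketch makes this visible; `matchesTag` is a hypothetical helper that only mirrors the `new RegExp(regex, 'i')` compilation described above.

```javascript
// Hypothetical helper: compile the Regex string as described and test a tag name.
const matchesTag = (regex, tagname) => new RegExp(regex, 'i').test(tagname);

console.log(matchesTag('P', 'SPAN'));          // true: 'P' occurs inside 'SPAN'
console.log(matchesTag('^P$', 'SPAN'));        // false: anchored to the whole name
console.log(matchesTag('^p$', 'P'));           // true: matching is case-insensitive
console.log(matchesTag('I', 'DIV'));           // true: 'I' occurs inside 'DIV'
console.log(matchesTag('(DIV|H[1-3])', 'H4')); // false
```

Anchoring with `^` and `$` is therefore the safer choice whenever exactly one tag is meant.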



## ParentChildSpec : Object.<Regex, Array.<Regex>>

Property names are matched against a (direct or ancestral) parent node's [Tagname](#Tagname).

Associated values are matched against the current node's [Tagname](#Tagname).

**Kind**: global typedef  

**Example**  

```js
{
  '(DIV|SPAN)': ['H[1-3]', 'B'], // matches H1, H2, H3 and B within DIV or SPAN
  'STRONG': ['.*'] // matches all tags within STRONG
}
```



## TagAttributeNameSpec : Object.<Regex, Array.<Regex>>

Property names are matched against the current node's [Tagname](#Tagname). Associated values are used to match its attribute names.

**Kind**: global typedef  

**Example**  

```js
{
  'H[1-3]': ['id', 'class'], // matches 'id' and 'class' attributes of all H1, H2 and H3 nodes
  'STRONG': ['data-.*'] // matches all 'data-.*' attributes of STRONG nodes.
}
```
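
How such a spec could select the attribute names to keep is sketched below. `allowedAttributeNames` is a hypothetical helper, and the sketch assumes that attribute-name patterns are compiled the same way as [Regex](#Regex) strings for tag names (unanchored and case-insensitive); the library's actual matching may differ in detail.

```javascript
// Hypothetical helper: which of `attributeNames` would a spec keep for `tagname`?
function allowedAttributeNames(spec, tagname, attributeNames) {
  const test = (regex, name) => new RegExp(regex, 'i').test(name);
  // Collect the attribute patterns of every spec entry whose key matches the tag.
  const patterns = Object.entries(spec)
    .filter(([tagRegex]) => test(tagRegex, tagname))
    .flatMap(([, nameRegexes]) => [].concat(nameRegexes));
  return attributeNames.filter((name) => patterns.some((p) => test(p, name)));
}

const spec = { 'H[1-3]': ['id', 'class'], STRONG: ['data-.*'] };

console.log(allowedAttributeNames(spec, 'H2', ['id', 'style', 'class'])); // [ 'id', 'class' ]
console.log(allowedAttributeNames(spec, 'STRONG', ['data-x', 'title']));  // [ 'data-x' ]
console.log(allowedAttributeNames(spec, 'P', ['id']));                    // []
```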



## TagClassNameSpec : Object.<Regex, Array.<Regex>>

Property names are matched against the current node's [Tagname](#Tagname). Associated values are used to match its class names.

**Kind**: global typedef  

**Example**  

```js
{
  'DIV|SPAN': ['blue', 'red'] // matches 'blue' and 'red' class names of all DIV and SPAN nodes
}
```



## FilterSpec : Object.<Regex, Array.<filter>>

Property names are matched against node [Tagname](#Tagname)s. Associated values are the [filter](#filter)s which are run on the node.

**Kind**: global typedef  



## filter ⇒ [DomNode](#DomNode) \| [Array.<DomNode>](#DomNode) \| null

Filter functions can either...

1. return the same node (the first argument),
2. return a single, or an Array of, newly created [DomNode](#DomNode)(s), in which case `node` is
   replaced with the new node(s),
3. return `null`, in which case `node` is removed.

Note that newly generated [DomNode](#DomNode)(s) are processed by running [sanitizeDom](#sanitizeDom)
on them, as if they had been part of the original tree. This has the following implication:

If a filter returns a newly generated [DomNode](#DomNode) with the same [Tagname](#Tagname) as `node`, it
would cause the same filter to be called again, which may lead to an infinite loop if the filter
always returns the same result (this would be a badly behaved filter). To protect against
infinite loops, the author of the filter must acknowledge this circumstance by setting a boolean
property called `skip_filters` for the [DomNode](#DomNode) (in a `WeakMap` which the caller must
provide to one of the sanitize functions as the argument `nodePropertyMap`). If `skip_filters` is
not set, an error is thrown. With well-behaved filters it is possible to continue subsequent
processing of the returned node without causing an infinite loop.

**Kind**: global typedef  

| Param | Type | Description |
| --- | --- | --- |
| node | [DomNode](#DomNode) | Currently processed node |
| opts | Object |  |
| opts.parents | [Array.<DomNode>](#DomNode) | The parent nodes of `node`. |
| opts.parentNodenames | [Array.<Tagname>](#Tagname) | The tag names of the parent nodes |
| opts.siblingIndex | Integer | The number of the current node amongst its siblings |
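
A filter that returns a new node with the same tag name, and marks it with `skip_filters` as required above, might look like the following sketch. The `doc` object is a stand-in for a [DomDocument](#DomDocument) so that the snippet runs without a DOM, and `upperCaseB` is a hypothetical filter; only the `skip_filters` property, the `filters_by_tag` option and the `nodePropertyMap` argument come from the description above.

```javascript
// Caller-owned map; the same object is later passed as the nodePropertyMap argument.
const nodePropertyMap = new WeakMap();

// Stand-in for a DomDocument (hypothetical), sufficient for this sketch only.
const doc = {
  createElement: (tag) => ({ nodeName: tag.toUpperCase(), textContent: '' }),
};

// Hypothetical filter: replaces a B node with a fresh B node holding upper-cased text.
function upperCaseB(node) {
  const replacement = doc.createElement('b');
  replacement.textContent = node.textContent.toUpperCase();
  // Same Tagname as `node`: acknowledge this so that filters are not run on it again.
  nodePropertyMap.set(replacement, { skip_filters: true });
  return replacement;
}

const opts = { filters_by_tag: { '^B$': [upperCaseB] } };

const result = upperCaseB({ nodeName: 'B', textContent: 'abc' });
console.log(result.textContent);                       // ABC
console.log(nodePropertyMap.get(result).skip_filters); // true
```

With a real DOM, `doc` would be `window.document`, and `opts` and `nodePropertyMap` would be passed along, for example as `sanitizeNode(document, node, opts, nodePropertyMap)`.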