https://github.com/michaelfranzl/sanitize-dom
Isomorphic library for recursive manipulation of live WHATWG DOMs.
- Host: GitHub
- URL: https://github.com/michaelfranzl/sanitize-dom
- Owner: michaelfranzl
- License: mit
- Created: 2017-11-20T11:44:10.000Z (about 7 years ago)
- Default Branch: master
- Last Pushed: 2023-03-04T05:52:42.000Z (almost 2 years ago)
- Last Synced: 2024-08-09T03:53:49.160Z (6 months ago)
- Topics: dom, html, recursive-algorithm, sanitization, sanitize-html, sanitizer, whatwg-dom
- Language: JavaScript
- Homepage:
- Size: 398 KB
- Stars: 5
- Watchers: 2
- Forks: 1
- Open Issues: 8
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE
# sanitize-dom
![Test](https://github.com/michaelfranzl/sanitize-dom/workflows/Test/badge.svg?branch=master)
Recursive sanitizer/filter to manipulate live [WHATWG DOM](https://dom.spec.whatwg.org)s rather than HTML, for the browser and Node.js.
## Rationale
Direct DOM manipulation has gotten a bad reputation in the last decade of web development. From Ruby on Rails to React, the DOM was seen as something to gloriously destroy and re-render from the server or even from the browser. Never mind that the browser had already spent a lot of effort parsing HTML and constructing this tree! Mind-numbingly complex regular-expression tests and string manipulations had to deal with low-level details of HTML syntax to insert, delete and change elements, sometimes on every keystroke! In contrast, DOM functions like `createElement`, `remove` and `insertBefore` were largely unknown and unused, except perhaps in jQuery.
Processing of HTML is **destructive**: The original DOM is destroyed and garbage collected with a certain time delay. Attached event handlers are detached and garbage collected. A completely new DOM is created from parsing new HTML set via `.innerHTML =`. Event listeners have to be re-attached from userland (this is not an issue when using `on*` HTML attributes, but those have disadvantages as well).
*It doesn't have to be this way. Do not eliminate, but manipulate!*
### Save the (DOM) trees!
`sanitize-dom` crawls a DOM subtree (beginning from a given node, all the way down to its descendant leaves) and filters and manipulates it non-destructively. This is very efficient: The browser doesn't have to re-render everything; it only re-renders what has been *changed* (sound familiar from React?).
The benefits of direct DOM manipulation:
* Nodes stay alive.
* References to nodes (e.g. stored in a `Map` or `WeakMap`) stay alive.
* Already attached event handlers stay alive.
* The browser doesn't have to re-render entire sections of a page; thus no flickering, no scroll jumping, no big CPU spikes.
* CPU cycles for repeatedly parsing and dumping of HTML are eliminated.

`sanitize-dom`'s further advantages:
* No dependencies.
* Small footprint (only about 7 kB minimized).
* Faster than other HTML sanitizers because there is no HTML parsing and serialization.

## Use cases
Aside from the browser, `sanitize-dom` can also be used in Node.js by supplying WHATWG DOM implementations like [jsdom](https://github.com/tmpvar/jsdom).
The [test file](test/run-tests.js) describes additional usage patterns and features.
For the usage examples below, I'll use `sanitizeHtml` just to be able to illustrate the HTML output.
By default, all tags are 'flattened', i.e. only their inner text is kept:
```javascript
sanitizeHtml(document, '');abc def
"abc def"
```
Selective joining of same-tag siblings:
```javascript
// Joins the two I tags.
sanitizeHtml(document, 'Hello world! Goodbye world!', {
allow_tags_deep: { '.*': '.*' },
join_siblings: ['I'],
});
"Hello world! Goodbye world!"
```
Removal of redundant nested nodes (ubiquitous when using a WYSIWYG `contenteditable` editor):
```javascript
sanitizeHtml(document, 'Hello world! Goodbye world!', {
allow_tags_deep: { '.*': '.*' },
flatten_tags_deep: { i: 'i' },
});
"Hello world! Goodbye world!"
```
Removal of redundant empty tags:
```javascript
sanitizeHtml(document, 'Hello world!', {
allow_tags_deep: { '.*': '.*' },
remove_empty: true,
});
"Hello world!"
```
By default, all classes and attributes are removed:
```javascript
// Keep all nodes, but remove all of their attributes and classes:
sanitizeHtml(document, '', {abc def
allow_tags_deep: { '.*': '.*' },
});
""abc def
```
Keep all nodes and all their attributes and classes:
```javascript
sanitizeHtml(document, '', {abc def
allow_tags_deep: { '.*': '.*' },
allow_attributes_by_tag: { '.*': '.*' },
allow_classes_by_tag: { '.*': '.*' },
});
''abc def
```
White-listing of classes and attributes:
```javascript
// Keep only data- attributes and 'green' classes
sanitizeHtml(document, '', {abc def
allow_tags_deep: { '.*': '.*' },
allow_attributes_by_tag: { '.*': 'data-.*' },
allow_classes_by_tag: { '.*': 'green' },
});
''abc def
```
White-listing of node tags to keep:
```javascript
// Keep only B tags anywhere in the document.
sanitizeHtml(document, 'abc def ghi', {
allow_tags_deep: { '.*': '^b$' },
});
"abc def ghi"

// Keep only DIV children of BODY and I children of DIV.
sanitizeHtml(document, 'abc defghi', {
allow_tags_direct: {
body: 'div',
div: '^i',
},
});
"abc defghi"
```
Selective flattening of nodes:
```javascript
// Flatten only EM children of DIV.
sanitizeHtml(document, 'abc defghi', {
allow_tags_deep: { '.*': '.*' },
flatten_tags_direct: {
div: 'em',
},
});
"abc defghi"

// Flatten I tags anywhere in the document.
sanitizeHtml(document, 'abc defghi', {
allow_tags_deep: { '.*': '.*' },
flatten_tags_deep: {
'.*': '^i',
},
});
"abc defghi"
```
Selective removal of tags:
```javascript
// Remove I children of DIVs.
sanitizeHtml(document, 'abc defghi', {
allow_tags_deep: { '.*': '.*' },
remove_tags_direct: {
'div': 'i',
},
});
"defghi"
```
Sometimes there is more than one way to accomplish the same result, as shown in this advanced example:
```javascript
// Keep all tags except B, anywhere in the document. Two different solutions:

sanitizeHtml(document, 'abc def ghi', {
allow_tags_deep: { '.*': '.*' },
flatten_tags_deep: { '.*': 'B' },
});
"abc def ghi"

sanitizeHtml(document, 'abc def ghi', {
allow_tags_deep: { '.*': '^((?!b).)*$' }
});
"abc def ghi"
```
And finally, filter functions allow ultimate flexibility:
```javascript
// Change B node to EM node with contextual inner text; attach an event listener.
sanitizeHtml(document, 'abc def ghi', {
allow_tags_direct: {
'.*': '.*',
},
filters_by_tag: {
B: [
function changesToEm(node, { parentNodes, parentNodenames, siblingIndex }) {
const em = document.createElement('em');
const text = `${parentNodenames.join(', ')} - ${siblingIndex}`;
em.innerHTML = text;
em.addEventListener('click', () => alert(text));
return em;
},
],
},
});
// In a browser, the EM tags would be clickable and an alert box would pop up.
"abc I, P, BODY - 0 I, P, BODY - 2"
```
## Tests
Run in Node.js:
```sh
npm test
```
For the browser, run:
```sh
cd sanitize-dom
npm i -g [email protected] http-server
jspm install @jspm/[email protected]
http-server
```
Then, in a browser which supports `` (e.g. Google Chrome version >= 81), browse to http://127.0.0.1:8080/test

# API Reference
## Functions
- `sanitizeNode(doc, node, [opts], [nodePropertyMap])`: Simple wrapper for sanitizeDom. Processes the node and its childNodes recursively.
- `sanitizeChildNodes(doc, node, [opts], [nodePropertyMap])`: Simple wrapper for sanitizeDom. Processes only the node's childNodes recursively, but not the node itself.
- `sanitizeHtml(doc, html, [opts], [isDocument], [nodePropertyMap])` ⇒ `String`: Simple wrapper for sanitizeDom. Instead of a DomNode, it takes an HTML string.
- `sanitizeDom(doc, contextNode, [opts], [childrenOnly], [nodePropertyMap])`: This function is not exported. Please use the wrapper functions instead: sanitizeHtml, sanitizeNode, and sanitizeChildNodes.

  Recursively processes a tree with `node` at the root.

  In all descriptions, the term "flatten" means that a node is replaced with the node's childNodes. For example, if the B node in `<i>abc<b>def<u>ghi</u></b></i>` is flattened, the result is `<i>abcdef<u>ghi</u></i>`.

  Each node is processed in the following sequence:

  1. Filters matching the `opts.filters_by_tag` spec are called. If the filter returns `null`, the node is removed and processing stops (see filters).
  2. If the `opts.remove_tags_*` spec matches, the node is removed and processing stops.
  3. If the `opts.flatten_tags_*` spec matches, the node is flattened and processing stops.
  4. If the `opts.allow_tags_*` spec matches:
     * All attributes not matching `opts.allow_attributes_by_tag` are removed.
     * All class names not matching `opts.allow_classes_by_tag` are removed.
     * The node is kept and processing stops.
  5. The node is flattened.
## Typedefs
- `DomDocument` (`Object`): Implements the WHATWG DOM Document interface. In the browser, this is `window.document`. In Node.js, this may for example be `new JSDOM().window.document`.
- `DomNode` (`Object`): Implements the WHATWG DOM Node interface. Custom properties for each node can be stored in a `WeakMap` passed as option `nodePropertyMap` to one of the sanitize functions.
- `Tagname` (`string`): Node tag name. Even though in the WHATWG DOM text nodes (nodeType 3) have a tag name `#text`, these are referred to by the simpler string 'TEXT' for convenience.
- `Regex` (`string`): A string which is compiled to a case-insensitive regular expression `new RegExp(regex, 'i')`. The regular expression is used to match a Tagname.
- `ParentChildSpec` (`Object.<Regex, Array.<Regex>>`): Property names are matched against a (direct or ancestral) parent node's Tagname. Associated values are matched against the current node's Tagname.
- `TagAttributeNameSpec` (`Object.<Regex, Array.<Regex>>`): Property names are matched against the current node's Tagname. Associated values are used to match its attribute names.
- `TagClassNameSpec` (`Object.<Regex, Array.<Regex>>`): Property names are matched against the current node's Tagname. Associated values are used to match its class names.
- `FilterSpec` (`Object.<Regex, Array.<filter>>`): Property names are matched against node Tagnames. Associated values are the filters which are run on the node.
- `filter` ⇒ `DomNode` | `Array.<DomNode>` | `null`: Filter functions can either...

  1. return the same node (the first argument),
  2. return a single, or an Array of, newly created DomNode(s), in which case `node` is replaced with the new node(s),
  3. return `null`, in which case `node` is removed.

  Note that newly generated DomNode(s) are processed by running sanitizeDom on them, as if they had been part of the original tree. This has the following implication:

  If a filter returns a newly generated DomNode with the same Tagname as `node`, it would cause the same filter to be called again, which may lead to an infinite loop if the filter always returns the same result (this would be a badly behaved filter). To protect against infinite loops, the author of the filter must acknowledge this circumstance by setting a boolean property called `skip_filters` for the DomNode (in a `WeakMap` which the caller must provide to one of the sanitize functions as the argument `nodePropertyMap`). If `skip_filters` is not set, an error is thrown. With well-behaved filters it is possible to continue subsequent processing of the returned node without causing an infinite loop.
## sanitizeNode(doc, node, [opts], [nodePropertyMap])
Simple wrapper for [sanitizeDom](#sanitizeDom). Processes the node and its childNodes recursively.
**Kind**: global function
| Param | Type | Default | Description |
| --- | --- | --- | --- |
| doc | [`DomDocument`](#DomDocument) |  |  |
| node | [`DomNode`](#DomNode) |  |  |
| [opts] | `Object` | `{}` |  |
| [nodePropertyMap] | `WeakMap.<DomNode, Object>` | `new WeakMap()` | Additional node properties |
## sanitizeChildNodes(doc, node, [opts], [nodePropertyMap])
Simple wrapper for [sanitizeDom](#sanitizeDom). Processes only the node's childNodes recursively, but not
the node itself.
**Kind**: global function
| Param | Type | Default | Description |
| --- | --- | --- | --- |
| doc | [`DomDocument`](#DomDocument) |  |  |
| node | [`DomNode`](#DomNode) |  |  |
| [opts] | `Object` | `{}` |  |
| [nodePropertyMap] | `WeakMap.<DomNode, Object>` | `new WeakMap()` | Additional node properties |
## sanitizeHtml(doc, html, [opts], [isDocument], [nodePropertyMap]) ⇒ String
Simple wrapper for [sanitizeDom](#sanitizeDom). Instead of a DomNode, it takes an HTML string.
**Kind**: global function
**Returns**: `String` - The processed HTML
| Param | Type | Default | Description |
| --- | --- | --- | --- |
| doc | [`DomDocument`](#DomDocument) |  |  |
| html | `string` |  |  |
| [opts] | `Object` | `{}` |  |
| [isDocument] | `Boolean` | `false` | Set this to `true` if you are passing an entire HTML document (beginning with the `<html>` tag). The context node name will be HTML. If `false`, then the context node name will be BODY. |
| [nodePropertyMap] | `WeakMap.<DomNode, Object>` | `new WeakMap()` | Additional node properties |
## sanitizeDom(doc, contextNode, [opts], [childrenOnly], [nodePropertyMap])
This function is not exported: Please use the wrapper functions instead:
[sanitizeHtml](#sanitizeHtml), [sanitizeNode](#sanitizeNode), and [sanitizeChildNodes](#sanitizeChildNodes).
Recursively processes a tree with `node` at the root.
In all descriptions, the term "flatten" means that a node is replaced with the node's childNodes.
For example, if the B node in `<i>abc<b>def<u>ghi</u></b></i>` is flattened, the result is
`<i>abcdef<u>ghi</u></i>`.
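To make the term concrete, here is a minimal sketch of flattening on a plain-object tree. The `el`, `flatten` and `toHtml` helpers are hypothetical stand-ins written for this illustration; they are not part of `sanitize-dom`, which operates on real DOM nodes.

```javascript
// Plain-object stand-in for a DOM tree (assumption: real code would use DOM nodes).
const el = (tag, ...children) => ({ tag, children });

// Replaces `child` inside `parent.children` with the child's own children.
function flatten(parent, child) {
  const index = parent.children.indexOf(child);
  parent.children.splice(index, 1, ...child.children);
}

// Serializes the stand-in tree to HTML for display.
function toHtml(node) {
  if (typeof node === 'string') return node;
  return `<${node.tag}>${node.children.map(toHtml).join('')}</${node.tag}>`;
}

const b = el('b', 'def', el('u', 'ghi'));
const i = el('i', 'abc', b);
const before = toHtml(i);
flatten(i, b);
const after = toHtml(i);
console.log(before); // <i>abc<b>def<u>ghi</u></b></i>
console.log(after);  // <i>abcdef<u>ghi</u></i>
```

The U node survives because only the B node itself is dropped; its children move up one level.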
Each node is processed in the following sequence:
1. Filters matching the `opts.filters_by_tag` spec are called. If the filter returns `null`, the
node is removed and processing stops (see [filter](#filter)s).
2. If the `opts.remove_tags_*` spec matches, the node is removed and processing stops.
3. If the `opts.flatten_tags_*` spec matches, the node is flattened and processing stops.
4. If the `opts.allow_tags_*` spec matches:
* All attributes not matching `opts.allow_attributes_by_tag` are removed.
* All class names not matching `opts.allow_classes_by_tag` are removed.
* The node is kept and processing stops.
5. The node is flattened.
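The order of steps 2 to 5 can be sketched as a small decision function. This is an illustration under assumptions, not the library's implementation: `decideAction`, `specMatches` and `matchesEither` are hypothetical names, nodes are reduced to tag names, and step 1 (filters) is left out.

```javascript
// Hypothetical sketch of the documented decision order; not the library's implementation.
const reTest = (regex, value) => new RegExp(regex, 'i').test(value);

// `candidates` holds the parent tag names that the spec keys may match.
function specMatches(spec = {}, candidates, tagname) {
  return Object.entries(spec).some(([parentRe, childRes]) =>
    candidates.some((name) => reTest(parentRe, name))
    && [].concat(childRes).some((re) => reTest(re, tagname)));
}

// `ancestors` lists ancestor tag names, nearest parent first.
function matchesEither(opts, prefix, ancestors, tagname) {
  return specMatches(opts[`${prefix}_direct`], ancestors.slice(0, 1), tagname)
    || specMatches(opts[`${prefix}_deep`], ancestors, tagname);
}

function decideAction(tagname, ancestors, opts) {
  // Step 1 (filters) is omitted here to keep the sketch short.
  if (matchesEither(opts, 'remove_tags', ancestors, tagname)) return 'remove';   // step 2
  if (matchesEither(opts, 'flatten_tags', ancestors, tagname)) return 'flatten'; // step 3
  if (matchesEither(opts, 'allow_tags', ancestors, tagname)) return 'keep';      // step 4
  return 'flatten';                                                              // step 5
}

const opts = {
  remove_tags_deep: { '.*': ['style', 'script', 'textarea', 'noscript'] }, // documented default
  flatten_tags_deep: { '.*': '^i$' },
  allow_tags_direct: { body: 'div' },
};
console.log(decideAction('SCRIPT', ['DIV', 'BODY'], opts)); // remove
console.log(decideAction('I', ['DIV', 'BODY'], opts));      // flatten (step 3)
console.log(decideAction('DIV', ['BODY'], opts));           // keep
console.log(decideAction('P', ['DIV', 'BODY'], opts));      // flatten (step 5)
```

The last call shows why everything is flattened by default: a node that matches no spec at all falls through to step 5.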
**Kind**: global function
| Param | Type | Default | Description |
| --- | --- | --- | --- |
| doc | [`DomDocument`](#DomDocument) |  | The document |
| contextNode | [`DomNode`](#DomNode) |  | The root node |
| [opts] | `Object` | `{}` | Options for processing. |
| [opts.filters_by_tag] | [`FilterSpec`](#FilterSpec) | `{}` | Matching filters are called with the node. |
| [opts.remove_tags_direct] | [`ParentChildSpec`](#ParentChildSpec) | `{}` | Matching nodes which are a direct child of the matching parent node are removed. |
| [opts.remove_tags_deep] | [`ParentChildSpec`](#ParentChildSpec) | `{'.*': ['style','script','textarea','noscript']}` | Matching nodes which are anywhere below the matching parent node are removed. |
| [opts.flatten_tags_direct] | [`ParentChildSpec`](#ParentChildSpec) | `{}` | Matching nodes which are a direct child of the matching parent node are flattened. |
| [opts.flatten_tags_deep] | [`ParentChildSpec`](#ParentChildSpec) | `{}` | Matching nodes which are anywhere below the matching parent node are flattened. |
| [opts.allow_tags_direct] | [`ParentChildSpec`](#ParentChildSpec) | `{}` | Matching nodes which are a direct child of the matching parent node are kept. |
| [opts.allow_tags_deep] | [`ParentChildSpec`](#ParentChildSpec) | `{}` | Matching nodes which are anywhere below the matching parent node are kept. |
| [opts.allow_attributes_by_tag] | [`TagAttributeNameSpec`](#TagAttributeNameSpec) | `{}` | Matching attribute names of a matching node are kept. Other attributes are removed. |
| [opts.allow_classes_by_tag] | [`TagClassNameSpec`](#TagClassNameSpec) | `{}` | Matching class names of a matching node are kept. Other class names are removed. If no class names are remaining, the class attribute is removed. |
| [opts.remove_empty] | `boolean` | `false` | Remove nodes which are completely empty. |
| [opts.join_siblings] | [`Array.<Tagname>`](#Tagname) | `[]` | Join same-tag sibling nodes of given tag names, unless they are separated by non-whitespace textNodes. |
| [childrenOnly] | `Bool` | `false` | If false, then the node itself and its descendants are processed recursively. If true, then only the children and their descendants are processed recursively, but not the node itself (use when `node` is `BODY` or `DocumentFragment`). |
| [nodePropertyMap] | `WeakMap.<DomNode, Object>` | `new WeakMap()` | Additional properties for a [DomNode](#DomNode) can be stored in an object and will be looked up in this map. The properties of the object and their meaning: `skip`: If truthy, disables all processing for this node. `skip_filters`: If truthy, disables all filters for this node. `skip_classes`: If truthy, disables processing classes of this node. `skip_attributes`: If truthy, disables processing attributes of this node. See tests for usage details. |
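The `join_siblings` rule (join same-tag siblings unless non-whitespace text separates them) can be sketched on a plain-object tree. Everything here is hypothetical illustration: `joinSiblings`, `el` and `toHtml` are not library functions, tag names are compared as exact lowercase strings, and moving the separating whitespace into the joined node is an assumption about the result, not documented behavior.

```javascript
// Hypothetical sketch on plain objects; strings stand in for text nodes.
const el = (tag, ...children) => ({ tag, children });
const toHtml = (node) => (typeof node === 'string'
  ? node
  : `<${node.tag}>${node.children.map(toHtml).join('')}</${node.tag}>`);

function joinSiblings(parent, tagnames) {
  const out = [];
  for (const child of parent.children) {
    const isJoinable = typeof child !== 'string' && tagnames.includes(child.tag);
    if (isJoinable) {
      // Look back past whitespace-only text for a sibling with the same tag.
      let k = out.length - 1;
      while (k >= 0 && typeof out[k] === 'string' && out[k].trim() === '') k -= 1;
      const prev = out[k];
      if (prev && typeof prev !== 'string' && prev.tag === child.tag) {
        const between = out.splice(k + 1); // whitespace between the two siblings
        prev.children.push(...between, ...child.children);
        continue;
      }
    }
    out.push(child);
  }
  parent.children = out;
}

const body = el('body',
  el('i', 'Hello'), ' ', el('i', 'world!'), ' ',
  el('em', 'Goodbye'), 'x', el('em', 'world!'));
joinSiblings(body, ['i', 'em']);
const joined = toHtml(body);
console.log(joined);
// <body><i>Hello world!</i> <em>Goodbye</em>x<em>world!</em></body>
```

The two I nodes are merged because only whitespace separates them, while the two EM nodes stay apart because of the text `x` between them.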
## DomDocument : Object
Implements the WHATWG DOM Document interface.
In the browser, this is `window.document`. In Node.js, this may for example be
[new JSDOM().window.document](https://github.com/tmpvar/jsdom).
**Kind**: global typedef
**See**: [https://dom.spec.whatwg.org/#interface-document](https://dom.spec.whatwg.org/#interface-document)
## DomNode : Object
Implements the WHATWG DOM Node interface.
Custom properties for each node can be stored in a `WeakMap` passed as option `nodePropertyMap`
to one of the sanitize functions.
**Kind**: global typedef
**See**: [https://dom.spec.whatwg.org/#interface-node](https://dom.spec.whatwg.org/#interface-node)
## Tagname : string
Node tag name.
Even though in the WHATWG DOM text nodes (nodeType 3) have a tag name `#text`,
these are referred to by the simpler string 'TEXT' for convenience.
**Kind**: global typedef
**Example**
```js
'DIV'
'H1'
'TEXT'
```
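As a small sketch of this convention, the hypothetical helper `tagnameOf` below derives a Tagname from any object that exposes `nodeType` and `nodeName` the way DOM nodes do. It is an illustration only, not a function exported by the library.

```javascript
// Hypothetical helper: derives the Tagname described above from a node-like object.
// Assumes the object exposes `nodeType` and `nodeName` as DOM nodes do.
const TEXT_NODE = 3;

function tagnameOf(node) {
  return node.nodeType === TEXT_NODE ? 'TEXT' : node.nodeName.toUpperCase();
}

console.log(tagnameOf({ nodeType: 1, nodeName: 'DIV' }));   // DIV
console.log(tagnameOf({ nodeType: 3, nodeName: '#text' })); // TEXT
```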
## Regex : string
A string which is compiled to a case-insensitive regular expression `new RegExp(regex, 'i')`.
The regular expression is used to match a [Tagname](#Tagname).
**Kind**: global typedef
**Example**
```js
'.*' // matches any tag
'DIV' // matches DIV
'(DIV|H[1-3])' // matches DIV, H1, H2 and H3
'P' // matches P and SPAN
'^P$' // matches P but not SPAN
'TEXT' // matches text nodes (nodeType 3)
```
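The effect of anchors and case-insensitivity can be checked in plain JavaScript. The sketch relies only on the compilation rule stated above; the helper name `matchesTagname` is hypothetical.

```javascript
// Hypothetical helper: compiles a Regex string the way the typedef describes.
function matchesTagname(regex, tagname) {
  return new RegExp(regex, 'i').test(tagname);
}

const results = {
  any: matchesTagname('.*', 'DIV'),                   // true
  unanchored: matchesTagname('P', 'SPAN'),            // true: 'SPAN' contains 'P'
  anchored: matchesTagname('^P$', 'SPAN'),            // false
  lowercase: matchesTagname('^b$', 'B'),              // true: matching is case-insensitive
  alternation: matchesTagname('(DIV|H[1-3])', 'H4'),  // false
};
console.log(results);
```

Unanchored patterns match substrings, so a short pattern like `'P'` or `'B'` also hits longer tag names; use `^` and `$` when an exact tag is meant.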
## ParentChildSpec : `Object.<Regex, Array.<Regex>>`
Property names are matched against a (direct or ancestral) parent node's [Tagname](#Tagname).
Associated values are matched against the current node's [Tagname](#Tagname).
**Kind**: global typedef
**Example**
```js
{
'(DIV|SPAN)': ['H[1-3]', 'B'], // matches H1, H2, H3 and B within DIV or SPAN
'STRONG': ['.*'] // matches all tags within STRONG
}
```
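The difference between a direct and an ancestral parent match can be sketched as follows. `matchesParentChildSpec` is a hypothetical helper for illustration, not the library's code; it assumes the ancestor tag names are already known and accepts a single string as a value because the usage examples in this README pass strings as well as arrays.

```javascript
// Hypothetical helper, not the library's code: tests a node against a ParentChildSpec.
// `ancestors` lists the node's ancestor tag names, nearest parent first.
function matchesParentChildSpec(spec, ancestors, tagname, { deep }) {
  const candidates = deep ? ancestors : ancestors.slice(0, 1);
  return Object.entries(spec).some(([parentRegex, childRegexes]) => {
    const parentMatches = candidates.some((name) => new RegExp(parentRegex, 'i').test(name));
    // The README examples also pass a single string instead of an array.
    const childMatches = [].concat(childRegexes).some((re) => new RegExp(re, 'i').test(tagname));
    return parentMatches && childMatches;
  });
}

const spec = { '(DIV|SPAN)': ['H[1-3]', 'B'], STRONG: ['.*'] };
// An H2 nested as BODY > DIV > P > H2:
const ancestors = ['P', 'DIV', 'BODY'];
const direct = matchesParentChildSpec(spec, ancestors, 'H2', { deep: false });
const deep = matchesParentChildSpec(spec, ancestors, 'H2', { deep: true });
console.log(direct); // false: the direct parent is P
console.log(deep);   // true: DIV is an ancestor
```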
## TagAttributeNameSpec : `Object.<Regex, Array.<Regex>>`
Property names are matched against the current node's [Tagname](#Tagname). Associated values are
used to match its attribute names.
**Kind**: global typedef
**Example**
```js
{
'H[1-3]': ['id', 'class'], // matches 'id' and 'class' attributes of all H1, H2 and H3 nodes
'STRONG': ['data-.*'] // matches all 'data-.*' attributes of STRONG nodes.
}
```
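A minimal sketch of how such a spec selects attribute names, using plain arrays instead of DOM attributes. The helper `allowedAttributeNames` is hypothetical, and matching attribute names case-insensitively is an assumption carried over from the Regex typedef.

```javascript
// Hypothetical helper: returns the attribute names that a TagAttributeNameSpec would keep.
function allowedAttributeNames(spec, tagname, attributeNames) {
  const patterns = Object.entries(spec)
    .filter(([tagRegex]) => new RegExp(tagRegex, 'i').test(tagname))
    .flatMap(([, nameRegexes]) => [].concat(nameRegexes));
  return attributeNames.filter((name) => patterns.some((re) => new RegExp(re, 'i').test(name)));
}

const spec = { 'H[1-3]': ['id', 'class'], STRONG: ['data-.*'] };
console.log(allowedAttributeNames(spec, 'H2', ['id', 'style', 'class'])); // [ 'id', 'class' ]
console.log(allowedAttributeNames(spec, 'STRONG', ['id', 'data-ref']));  // [ 'data-ref' ]
console.log(allowedAttributeNames(spec, 'P', ['id']));                   // []
```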
## TagClassNameSpec : `Object.<Regex, Array.<Regex>>`
Property names are matched against the current node's [Tagname](#Tagname). Associated values are used
to match its class names.
**Kind**: global typedef
**Example**
```js
{
'DIV|SPAN': ['blue', 'red'] // matches 'blue' and 'red' class names of all DIV and SPAN nodes
}
```
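The same idea applied to class names, sketched on a class attribute string. `filterClassAttribute` is a hypothetical helper; returning `null` when nothing survives mirrors the documented removal of the `class` attribute, while the case-insensitive matching of class names is an assumption.

```javascript
// Hypothetical helper: filters a class attribute value against a TagClassNameSpec.
// Returns null when no class name survives (the class attribute would then be removed).
function filterClassAttribute(spec, tagname, classValue) {
  const patterns = Object.entries(spec)
    .filter(([tagRegex]) => new RegExp(tagRegex, 'i').test(tagname))
    .flatMap(([, classRegexes]) => [].concat(classRegexes));
  const kept = classValue
    .split(/\s+/)
    .filter((name) => name && patterns.some((re) => new RegExp(re, 'i').test(name)));
  return kept.length > 0 ? kept.join(' ') : null;
}

const spec = { 'DIV|SPAN': ['blue', 'red'] };
console.log(filterClassAttribute(spec, 'DIV', 'blue big red')); // blue red
console.log(filterClassAttribute(spec, 'DIV', 'big'));          // null
console.log(filterClassAttribute(spec, 'P', 'blue'));           // null
```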
## FilterSpec : `Object.<Regex, Array.<filter>>`
Property names are matched against node [Tagname](#Tagname)s. Associated values
are the [filter](#filter)s which are run on the node.
## filter ⇒ [`DomNode`](#DomNode) \| [`Array.<DomNode>`](#DomNode) \| `null`
Filter functions can either...
1. return the same node (the first argument),
2. return a single, or an Array of, newly created [DomNode](#DomNode)(s), in which case `node` is
replaced with the new node(s),
3. return `null`, in which case `node` is removed.
Note that newly generated [DomNode](#DomNode)(s) are processed by running [sanitizeDom](#sanitizeDom)
on them, as if they had been part of the original tree. This has the following implication:
If a filter returns a newly generated [DomNode](#DomNode) with the same [Tagname](#Tagname) as `node`, it
would cause the same filter to be called again, which may lead to an infinite loop if the filter
always returns the same result (this would be a badly behaved filter). To protect against
infinite loops, the author of the filter must acknowledge this circumstance by setting a boolean
property called `skip_filters` for the [DomNode](#DomNode) (in a `WeakMap` which the caller must
provide to one of the sanitize functions as the argument `nodePropertyMap`). If `skip_filters` is
not set, an error is thrown. With well-behaved filters it is possible to continue subsequent
processing of the returned node without causing an infinite loop.
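The guard described above can be illustrated with a tiny stand-in runner. This is a sketch under assumptions: `runFilter` is hypothetical and not the library's code, nodes are plain objects rather than DOM nodes, and Array return values are ignored for brevity. Only the `skip_filters` property name and the `WeakMap` mechanism come from the documentation.

```javascript
// Hypothetical mini runner illustrating the documented guard; not the library's code.
// Nodes are plain objects here (assumption); WeakMap keys only need to be objects.
function runFilter(filter, node, nodePropertyMap) {
  const result = filter(node);
  if (result === null || result === node) return result;
  const sameTag = result.tag === node.tag;
  const props = nodePropertyMap.get(result) || {};
  if (sameTag && !props.skip_filters) {
    throw new Error('Filter returned a new node with the same tag without setting skip_filters');
  }
  return result;
}

const nodePropertyMap = new WeakMap();

// Well-behaved: marks the replacement so that filters are not run on it again.
const wellBehaved = (node) => {
  const replacement = { tag: node.tag, text: node.text.toUpperCase() };
  nodePropertyMap.set(replacement, { skip_filters: true });
  return replacement;
};

// Badly behaved: same tag, no acknowledgement.
const badlyBehaved = (node) => ({ tag: node.tag, text: node.text });

const ok = runFilter(wellBehaved, { tag: 'B', text: 'abc' }, nodePropertyMap);
console.log(ok.text); // ABC

let errorMessage = null;
try {
  runFilter(badlyBehaved, { tag: 'B', text: 'abc' }, nodePropertyMap);
} catch (err) {
  errorMessage = err.message;
}
console.log(errorMessage !== null); // true
```

Because the map is a `WeakMap`, the extra properties do not keep discarded nodes alive.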
**Kind**: global typedef
| Param | Type | Description |
| --- | --- | --- |
| node | [`DomNode`](#DomNode) | Currently processed node |
| opts | `Object` |  |
| opts.parents | [`Array.<DomNode>`](#DomNode) | The parent nodes of `node`. |
| opts.parentNodenames | [`Array.<Tagname>`](#Tagname) | The tag names of the parent nodes |
| opts.siblingIndex | `Integer` | The number of the current node amongst its siblings |
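To show what these context values could look like, the sketch below walks a plain-object tree and records them for every B element. The `walk` and `el` helpers are hypothetical, and the tree shape is an assumption chosen so that the output matches the context strings in the filter usage example earlier in this README. The nearest-parent-first order and the counting of text nodes as siblings are inferred from that example's output.

```javascript
// Hypothetical walk over a plain-object tree (assumption: stand-in for DOM nodes).
const el = (tag, ...children) => ({ tag, children });

// Calls `visit` for every element with the context values described in the table.
function walk(node, visit, parents = [], siblingIndex = 0) {
  if (typeof node === 'string') return; // text nodes are skipped in this sketch
  visit(node, { parentNodenames: parents.map((p) => p.tag), siblingIndex });
  node.children.forEach((child, index) => walk(child, visit, [node, ...parents], index));
}

// Assumed shape: BODY > P > ('abc ', I > (B 'def', ' ', B 'ghi'))
const tree = el('BODY', el('P', 'abc ', el('I', el('B', 'def'), ' ', el('B', 'ghi'))));

const seen = [];
walk(tree, (node, { parentNodenames, siblingIndex }) => {
  if (node.tag === 'B') seen.push(`${parentNodenames.join(', ')} - ${siblingIndex}`);
});
console.log(seen); // [ 'I, P, BODY - 0', 'I, P, BODY - 2' ]
```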