https://github.com/atgreen/cl-sanitize-html

# cl-sanitize-html

An OWASP-style HTML sanitization library for Common Lisp, designed for safely rendering untrusted HTML such as HTML emails or user-generated content.

## Features

- **Whitelist-based sanitization** - Only explicitly allowed tags and attributes pass through
- **Multiple security policies** - Default, Strict, and Email policies included
- **XSS prevention** - Blocks script tags, event handlers, javascript: URLs, and other attack vectors
- **CSS sanitization** - Optional CSS property filtering for email content
- **Safe defaults** - Automatically adds `rel="noopener noreferrer"` and `target="_blank"` to links
- **Plump-based** - Built on the robust Plump HTML parser
- **Well-tested** - Comprehensive test suite covering OWASP attack vectors

## Quick Start

```lisp
(use-package :sanitize-html)

;; Basic usage with default policy
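(sanitize "<script>alert('XSS')</script><p>Hello</p>")
;; => "<p>Hello</p>"

;; Remove event handlers
(sanitize "<span onclick=\"alert('XSS')\">Click me</span>")
;; => the onclick attribute is stripped; the text "Click me" is kept

;; Use email policy for HTML emails
(sanitize "<table bgcolor=\"#ffffff\"><tr><td>Cell</td></tr></table>" *email-policy*)
;; => the table markup and its bgcolor attribute are kept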
```

## Security Policies

### Default Policy (`*default-policy*`)

Balanced security and usability for general web content:
- **Allowed tags**: Common formatting and semantic tags (p, div, span, a, strong, em, lists, tables, etc.)
- **Allowed protocols**: http, https, mailto, ftp
- **Inline styles**: Blocked
- **Comments**: Removed

### Strict Policy (`*strict-policy*`)

Maximum security with minimal formatting:
- **Allowed tags**: Only basic formatting (a, b, em, strong, ul, ol, li, p, br, code, pre)
- **Allowed protocols**: https, mailto only
- **Very limited attributes**: Only href, title, and class

### Email Policy (`*email-policy*`)

Designed for HTML emails with legacy formatting:
- **Allowed tags**: All email-safe tags including tables, font, center
- **Allowed protocols**: http, https, mailto, cid (inline images), data (base64)
- **Inline styles**: Allowed with filtered CSS properties
- **Table attributes**: bgcolor, cellpadding, cellspacing, etc.

## API

### Main Functions

```lisp
(sanitize html-string &optional policy)
(sanitize-html html-string &optional policy)
```

Both functions sanitize an HTML string according to a policy and return the sanitized HTML as a string.

**Parameters:**
- `html-string` - String containing HTML to sanitize
- `policy` - Security policy to apply (defaults to `*default-policy*`)

**Returns:** Sanitized HTML string

**Example:**
```lisp
(sanitize "<script>bad</script><p>good</p>")
;; => "<p>good</p>"
```

### Utility Functions

```lisp
(safe-url-p url &optional policy)
```

Check if URL uses a safe protocol according to policy.

```lisp
(sanitize-url url &optional policy)
```

Return URL if safe, nil otherwise.
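
The protocol check behind these helpers can be pictured as a scheme allowlist. The following is a minimal, self-contained sketch rather than the library's implementation: `url-scheme` and `scheme-allowed-p` are hypothetical names, and treating relative URLs as safe is an assumption made here for illustration.

```lisp
;; Hypothetical helpers for illustration; not part of sanitize-html.

(defun url-scheme (url)
  "Return the lowercased scheme of URL (the text before the first colon),
or NIL when URL has no scheme, i.e. it is a relative reference."
  (let* ((trimmed (string-left-trim '(#\Space #\Tab #\Newline #\Return) url))
         (colon (position #\: trimmed))
         (delimiter (position-if (lambda (c) (find c "/?#")) trimmed)))
    ;; A colon only marks a scheme when it comes before any /, ? or #.
    (when (and colon
               (plusp colon)
               (or (null delimiter) (< colon delimiter)))
      (string-downcase (subseq trimmed 0 colon)))))

(defun scheme-allowed-p (url allowed-schemes)
  "True when URL is relative (assumed safe here) or its scheme is
listed in ALLOWED-SCHEMES."
  (let ((scheme (url-scheme url)))
    (or (null scheme)
        (and (member scheme allowed-schemes :test #'string=) t))))

;; Usage with the protocols of the default policy described above:
(scheme-allowed-p "https://example.com" '("http" "https" "mailto" "ftp"))
(scheme-allowed-p "JavaScript:alert(1)" '("http" "https" "mailto" "ftp"))
```

The first call is true and the second is false, because the scheme is compared after lowercasing. Production sanitizers also have to handle control characters and entity-encoded schemes, which this sketch ignores.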

### Custom Policies

```lisp
(make-policy &key allowed-tags allowed-attributes allowed-protocols
                  allowed-css-properties remove-comments escape-cdata)
```

Create a custom security policy.

**Example:**
```lisp
(defparameter *my-policy*
  (make-policy
   :allowed-tags '("p" "br" "a" "strong" "em")
   :allowed-attributes '(("a" . ("href" "title")))
   :allowed-protocols '("https")
   :remove-comments t))

(sanitize html-string *my-policy*)
```

## Security Features

### XSS Prevention

- ✅ Script tags removed
- ✅ Event handlers (onclick, onload, etc.) removed
- ✅ javascript: protocol blocked
- ✅ data: protocol blocked (except in email policy with validation)
- ✅ Inline styles blocked (except in email policy with CSS filtering)
- ✅ Form elements blocked
- ✅ iframe/object/embed blocked
- ✅ meta/link/style/base blocked

### CSS Injection Prevention

- CSS properties filtered by whitelist (email policy only)
- `javascript:`, `expression()`, `@import` blocked in CSS values
- `behavior:` property blocked (IE-specific XSS vector)
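
The filtering described above can be sketched in plain Common Lisp. This is an illustration under stated assumptions, not the library's code: the property allowlist, the helper names (`split-on`, `safe-css-value-p`, `filter-style`), and the exact matching rules are all hypothetical. Note that `behavior` is rejected simply because it is not on the allowlist.

```lisp
;; Hypothetical sketch; the library's real allowlist and rules may differ.

(defparameter *allowed-css-properties*
  '("color" "background-color" "font-size" "text-align")
  "Illustrative allowlist of CSS property names.")

(defparameter *blocked-css-fragments*
  '("javascript:" "expression(" "@import")
  "Substrings that disqualify a CSS value.")

(defun split-on (char string)
  "Split STRING at every CHAR, returning a list of substrings."
  (loop with start = 0
        for pos = (position char string :start start)
        collect (subseq string start pos)
        while pos
        do (setf start (1+ pos))))

(defun safe-css-value-p (value)
  "True when VALUE contains none of the blocked fragments (case-insensitive)."
  (notany (lambda (fragment) (search fragment value :test #'char-equal))
          *blocked-css-fragments*))

(defun filter-style (style)
  "Return STYLE reduced to its allowlisted declarations with safe values."
  (let ((kept '())
        (blanks '(#\Space #\Tab #\Newline)))
    (dolist (declaration (split-on #\; style))
      ;; The first colon separates the property name from its value.
      (let ((colon (position #\: declaration)))
        (when colon
          (let ((property (string-downcase
                           (string-trim blanks (subseq declaration 0 colon))))
                (value (string-trim blanks (subseq declaration (1+ colon)))))
            (when (and (member property *allowed-css-properties* :test #'string=)
                       (safe-css-value-p value))
              (push (format nil "~a: ~a" property value) kept))))))
    (format nil "~{~a~^; ~}" (nreverse kept))))

(filter-style "color: red; behavior: url(evil.htc); background-color: expression(alert(1))")
;; => "color: red"
```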

### Safe Defaults

- Links automatically get `rel="noopener noreferrer"` (prevents tabnabbing)
- Links automatically get `target="_blank"` (open in new tab)
- Comments removed by default
- CDATA sections escaped by default

## Email HTML Example

```lisp
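(defun render-email-html (email-html-body)
  "Safely render HTML email content."
  (sanitize-html email-html-body *email-policy*))

;; Typical email HTML with inline styles and tables
;; (illustrative markup; single-quoted attributes avoid escaping inside the Lisp string)
(render-email-html "
<table bgcolor='#f0f0f0' cellpadding='10'>
  <tr>
    <td style='color: #333333; font-size: 14px'>Welcome to our newsletter!</td>
  </tr>
  <tr>
    <td><a href='https://example.com'>Visit our site</a></td>
  </tr>
</table>
")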
```

## Running Tests

```lisp
(asdf:test-system :sanitize-html)
```

Or manually:

```lisp
(asdf:load-system :sanitize-html/tests)
(fiveam:run! :sanitize-html-tests)
```

## Dependencies

- **plump** - Lenient HTML/XML parser
- **lquery** - DOM manipulation
- **cl-ppcre** - Regular expressions for CSS parsing
- **alexandria** - Utilities library

**Test dependencies:**
- **fiveam** - Unit testing framework

## Architecture

1. **Parser** - Uses Plump to parse HTML into a DOM tree
2. **Tree Walker** - Recursively visits each node in the DOM
3. **Policy Enforcer** - Checks each element/attribute against whitelist
4. **Sanitizer** - Removes or modifies unsafe content
5. **Serializer** - Converts sanitized DOM back to HTML string
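
Steps 2 to 4 can be illustrated with a toy model that needs no parser. The sketch below is an assumption-laden stand-in, not the library's code: it represents the DOM as nested lists of the form `(tag attributes . children)` instead of Plump nodes, uses hypothetical names (`*toy-allowed-tags*`, `*toy-allowed-attributes*`, `sanitize-node`), and assumes that a disallowed element is dropped together with its subtree.

```lisp
;; Toy model: a node is either a string (text) or a list
;; (tag attributes . children), where attributes is an alist of
;; (name . value) pairs. Hypothetical names throughout.

(defparameter *toy-allowed-tags* '("p" "a" "strong" "em"))

(defparameter *toy-allowed-attributes* '(("a" . ("href" "title"))))

(defun sanitize-node (node)
  "Return a fresh list of sanitized nodes that replace NODE (possibly empty)."
  (cond
    ;; Text nodes pass through unchanged.
    ((stringp node) (list node))
    ;; Policy enforcement: a tag outside the allowlist is dropped
    ;; together with everything below it.
    ((not (member (first node) *toy-allowed-tags* :test #'string-equal))
     '())
    (t
     (destructuring-bind (tag attributes &rest children) node
       (let ((allowed (cdr (assoc tag *toy-allowed-attributes*
                                  :test #'string-equal))))
         (list (list* tag
                      ;; Keep only attributes allowlisted for this tag.
                      (remove-if-not
                       (lambda (attribute)
                         (member (car attribute) allowed :test #'string-equal))
                       attributes)
                      ;; Tree walk: recurse into children and splice the results.
                      (mapcan #'sanitize-node children))))))))

(sanitize-node
 '("p" (("onclick" . "evil()"))
   "Hello "
   ("script" () "alert(1)")
   ("a" (("href" . "https://example.com") ("onmouseover" . "x()")) "link")))
;; => (("p" NIL "Hello " ("a" (("href" . "https://example.com")) "link")))
```

Returning a list of replacement nodes, rather than a single node, lets the walker delete an element by returning the empty list. Parsing (step 1) and serialization (step 5) are left out of this sketch.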

## Comparison with Other Libraries

| Feature | sanitize-html | bluemonday (Go) | ammonia (Rust) | bleach (Python) |
|---------|------------------|-----------------|----------------|-----------------|
| Whitelist-based | ✅ | ✅ | ✅ | ✅ |
| Multiple policies | ✅ | ✅ | ✅ | ❌ |
| CSS sanitization | ✅ | ✅ | ✅ | ✅ |
| URL validation | ✅ | ✅ | ✅ | ✅ |
| Link safety | ✅ | ✅ | ❌ | ❌ |
| OWASP-aligned | ✅ | ✅ | ✅ | ✅ |

## References

- [OWASP XSS Prevention Cheat Sheet](https://cheatsheetseries.owasp.org/cheatsheets/Cross_Site_Scripting_Prevention_Cheat_Sheet.html)
- [HTML5 Security Cheat Sheet](https://html5sec.org/)
- [Plump Documentation](https://shinmera.github.io/plump/)

## Author and License

`sanitize-html` was written by [Anthony Green](https://github.com/atgreen)
and is distributed under the terms of the MIT license.