https://github.com/lemonadejs/html-to-json

Convert an HTML string to a general JSON format.

# HTML/XML to JSON Converter

> A lightweight, zero-dependency library for bidirectional conversion between HTML/XML and JSON

[![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](https://opensource.org/licenses/MIT)
[![Tests](https://img.shields.io/badge/tests-58%20passing-brightgreen.svg)]()

Transform HTML/XML markup into clean JSON trees and render them back to markup with full fidelity. Perfect for parsing, manipulating, and generating HTML/XML programmatically.

## Features

- **Zero Dependencies** - Pure JavaScript, no external libraries required
- **TypeScript Support** - Fully typed with comprehensive type definitions
- **Bidirectional** - Parse HTML/XML to JSON and render JSON back to HTML/XML
- **High Fidelity** - Preserves structure, attributes, text nodes, and comments
- **Lightweight** - Minimal footprint, fast parsing
- **Flexible** - Works with HTML and XML, supports namespaces
- **Sanitization Ready** - Built-in option to ignore unwanted tags (script, style, etc.)
- **Pretty Printing** - Optional formatted output with customizable indentation
- **Well Tested** - 58 comprehensive tests covering all features

## Installation

```bash
npm install @lemonadejs/html-to-json
```

## Import Options

You can import both functions from the main package:

```javascript
// Recommended: Import both from main package
import { parser, render } from '@lemonadejs/html-to-json';
```

## TypeScript Usage

The library includes comprehensive type definitions:

```typescript
import { parser, render, type Node, type ParserOptions, type RenderOptions } from '@lemonadejs/html-to-json';

// Fully typed parser with options
const options: ParserOptions = { ignore: ['script', 'style'] };
const tree: Node | undefined = parser('<div>Hello</div>', options);

// Fully typed renderer with options
const renderOpts: RenderOptions = { pretty: true, indent: ' ' };
const html: string = render(tree, renderOpts);
```

## Quick Start

### Parse HTML/XML to JSON

```javascript
import { parser } from '@lemonadejs/html-to-json';

const html = '<div class="card"><h1>Title</h1><p>Content</p></div>';
const tree = parser(html);

console.log(JSON.stringify(tree, null, 2));
```

**Output:**
```json
{
  "type": "div",
  "props": [
    { "name": "class", "value": "card" }
  ],
  "children": [
    {
      "type": "h1",
      "children": [
        {
          "type": "#text",
          "props": [{ "name": "textContent", "value": "Title" }]
        }
      ]
    },
    {
      "type": "p",
      "children": [
        {
          "type": "#text",
          "props": [{ "name": "textContent", "value": "Content" }]
        }
      ]
    }
  ]
}
```
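
Because `props` is an array of `{ name, value }` pairs rather than a plain object, reading or writing an attribute means scanning that array. The following is a minimal sketch of two convenience helpers; `getAttr` and `setAttr` are hypothetical names and are not exported by the library. The example runs on a hand-written node in the shape shown above.

```javascript
// Hypothetical helper (not part of the library): read one attribute
// from a node's `props` array, or return undefined if it is absent.
function getAttr(node, name) {
  const prop = (node.props || []).find((p) => p.name === name);
  return prop ? prop.value : undefined;
}

// Hypothetical helper: set or overwrite an attribute in place.
function setAttr(node, name, value) {
  if (!node.props) node.props = [];
  const prop = node.props.find((p) => p.name === name);
  if (prop) {
    prop.value = value;
  } else {
    node.props.push({ name, value });
  }
  return node;
}

// A hand-written node in the same shape as the parser output above
const card = { type: 'div', props: [{ name: 'class', value: 'card' }], children: [] };

console.log(getAttr(card, 'class')); // card
setAttr(card, 'id', 'main');
console.log(getAttr(card, 'id')); // main
```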

### Render JSON back to HTML/XML

```javascript
import { parser, render } from '@lemonadejs/html-to-json';

const tree = parser('<div>Hello World</div>');
const html = render(tree);

console.log(html);
// Output: <div>Hello World</div>
```

### Pretty Printing

```javascript
import { render } from '@lemonadejs/html-to-json';

const tree = {
  type: 'article',
  props: [{ name: 'class', value: 'post' }],
  children: [
    {
      type: 'h2',
      children: [
        { type: '#text', props: [{ name: 'textContent', value: 'Article Title' }] }
      ]
    },
    {
      type: 'p',
      children: [
        { type: '#text', props: [{ name: 'textContent', value: 'Article content here.' }] }
      ]
    }
  ]
};

const html = render(tree, { pretty: true, indent: ' ' });

console.log(html);
```

**Output:**
```html
<article class="post">
 <h2>Article Title</h2>
 <p>Article content here.</p>
</article>
```

## 📖 API Reference

### `parser(html, options)`

Parses HTML or XML string into a JSON tree structure.

**Parameters:**
- `html` (string) - The HTML or XML string to parse
- `options` (Object, optional) - Parser options

**Options:**

| Option | Type | Default | Description |
|----------|----------|---------|------------------------------------------------|
| `ignore` | string[] | `[]` | Array of tag names to ignore during parsing |

**Returns:** `Object` - JSON tree representation

**Examples:**

```javascript
// Basic parsing
const tree = parser('<div>Hello</div>');

// Ignore script and style tags
const clean = parser(html, { ignore: ['script', 'style'] });

// Case-insensitive tag matching
const filtered = parser('<div><SCRIPT>bad</SCRIPT></div>', { ignore: ['script'] });
```

### `render(tree, options)`

Renders a JSON tree back into HTML or XML markup.

**Parameters:**
- `tree` (Object|Array) - The JSON tree to render
- `options` (Object, optional) - Rendering options

**Options:**

| Option | Type | Default | Description |
|-------------------|----------|------------|------------------------------------------------------|
| `pretty` | boolean | `false` | Format output with newlines and indentation |
| `indent` | string | `' '` | Indentation string (used when `pretty` is `true`) |
| `selfClosingTags` | string[] | See below* | Override default void elements list |
| `xmlMode` | boolean | `false` | Self-close all empty elements using `<tag />` syntax |

*Default self-closing tags: `area`, `base`, `br`, `col`, `embed`, `hr`, `img`, `input`, `link`, `meta`, `source`, `track`, `wbr`

**Returns:** `string` - Rendered HTML/XML markup

**Examples:**

```javascript
// Basic rendering
const html = render(tree);

// Pretty printing
const formatted = render(tree, { pretty: true });

// Custom indentation
const tabbed = render(tree, { pretty: true, indent: '\t' });

// XML mode
const xml = render(tree, { xmlMode: true });

// Custom self-closing tags
const custom = render(tree, {
  selfClosingTags: ['br', 'hr', 'img', 'custom-element']
});
```

## 🎯 JSON Tree Structure

### Element Node
```json
{
  "type": "tagName",
  "props": [
    { "name": "attributeName", "value": "attributeValue" }
  ],
  "children": [...]
}
```

### Text Node
```json
{
  "type": "#text",
  "props": [
    { "name": "textContent", "value": "text content here" }
  ]
}
```

### Comment Node
```json
{
  "type": "#comments",
  "props": [
    { "name": "text", "value": " comment text " }
  ]
}
```

### Template Wrapper (Multiple Root Elements)
```json
{
  "type": "template",
  "children": [
    { "type": "div", ... },
    { "type": "span", ... }
  ]
}
```
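
The four node shapes above are enough to write generic tree walkers. As an illustration, here is a small sketch that collects the text of every `#text` descendant, skips comments, and descends through elements and the `template` wrapper alike. `textOf` is a hypothetical helper name, not a library export, and the sample tree is written by hand.

```javascript
// Hypothetical helper: concatenate the text of every `#text` descendant.
function textOf(node) {
  if (node.type === '#text') {
    const prop = (node.props || []).find((p) => p.name === 'textContent');
    return prop ? String(prop.value) : '';
  }
  // Comment nodes carry their text in a `text` prop; it is not page text.
  if (node.type === '#comments') return '';
  // Elements and the `template` wrapper: recurse into children.
  return (node.children || []).map(textOf).join('');
}

// Hand-written tree with two roots and a comment between them
const sampleTree = {
  type: 'template',
  children: [
    { type: 'div', children: [{ type: '#text', props: [{ name: 'textContent', value: 'First' }] }] },
    { type: '#comments', props: [{ name: 'text', value: ' ignored ' }] },
    { type: 'span', children: [{ type: '#text', props: [{ name: 'textContent', value: 'Second' }] }] }
  ]
};

console.log(textOf(sampleTree)); // FirstSecond
```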

## 📦 TypeScript Types

The library exports the following TypeScript types:

### Core Types
- **`Node`** - Union type for all possible node types (ElementNode | TextNode | CommentNode | TemplateNode)
- **`ElementNode`** - HTML/XML element with type, props, and children
- **`TextNode`** - Text content node with `type: '#text'`
- **`CommentNode`** - Comment node with `type: '#comments'`
- **`TemplateNode`** - Wrapper for multiple root elements with `type: 'template'`
- **`NodeProp`** - Property object with name and value

### Options Types
- **`ParserOptions`** - Options for the parser function
- **`RenderOptions`** - Options for the render function

```typescript
import type {
  Node,
  ElementNode,
  TextNode,
  CommentNode,
  TemplateNode,
  NodeProp,
  ParserOptions,
  RenderOptions
} from '@lemonadejs/html-to-json';
```

## 💡 Use Cases

### 1. HTML Sanitization

```javascript
import { parser, render } from '@lemonadejs/html-to-json';

// Remove potentially dangerous tags using the ignore option
function sanitizeHTML(html) {
  const tree = parser(html, {
    ignore: ['script', 'style', 'iframe', 'object', 'embed']
  });
  return render(tree);
}

const dirty = '<div>Hello<script>alert("xss")</script><style>bad{}</style>World</div>';
const clean = sanitizeHTML(dirty);
console.log(clean); // <div>HelloWorld</div>
```
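
The `ignore` option removes whole tags, but it does not look at attributes, so markup such as `onclick="..."` or `href="javascript:..."` would pass through unchanged. The sketch below shows one way to post-process a parsed tree before rendering it. `stripUnsafeAttributes` is a hypothetical helper, not part of the library, and it is deliberately incomplete: treat it as an illustration of tree filtering, not as a full sanitizer.

```javascript
// Hypothetical helper: drop event-handler attributes and javascript: URLs.
// Assumes the node shapes documented in "JSON Tree Structure".
function stripUnsafeAttributes(node) {
  const isElement = !String(node.type).startsWith('#');
  if (isElement && Array.isArray(node.props)) {
    node.props = node.props.filter((p) => {
      const name = String(p.name).toLowerCase();
      // Remove whitespace so " JavaScript:..." is still detected
      const value = String(p.value == null ? '' : p.value).replace(/\s+/g, '').toLowerCase();
      if (name.startsWith('on')) return false; // onclick, onload, ... (conservative)
      if ((name === 'href' || name === 'src') && value.startsWith('javascript:')) return false;
      return true;
    });
  }
  (node.children || []).forEach(stripUnsafeAttributes);
  return node;
}

// Hand-written tree standing in for parser output
const link = {
  type: 'a',
  props: [
    { name: 'href', value: ' JavaScript:alert(1)' },
    { name: 'onclick', value: 'steal()' },
    { name: 'class', value: 'btn' }
  ],
  children: [{ type: '#text', props: [{ name: 'textContent', value: 'Click' }] }]
};

stripUnsafeAttributes(link);
console.log(link.props); // [ { name: 'class', value: 'btn' } ]
```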

### 2. HTML Transformation

```javascript
// Add class to all divs
function addClassToAllDivs(tree, className) {
  if (tree.type === 'div') {
    if (!tree.props) tree.props = [];
    const classAttr = tree.props.find(p => p.name === 'class');
    if (classAttr) {
      classAttr.value += ` ${className}`;
    } else {
      tree.props.push({ name: 'class', value: className });
    }
  }

  if (tree.children) {
    tree.children.forEach(child => addClassToAllDivs(child, className));
  }

  return tree;
}

const html = '<div><span>Nested</span></div>';
const tree = parser(html);
addClassToAllDivs(tree, 'highlight');
console.log(render(tree));
// <div class="highlight"><span>Nested</span></div>
```

### 3. XML Processing

```javascript
// Parse and extract data from XML
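const xml = `
<catalog>
  <book isbn="978-0-123456-78-9">
    <title>Sample Book</title>
    <author>John Doe</author>
    <price>29.99</price>
  </book>
</catalog>
`;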

const tree = parser(xml);

function extractBooks(node) {
  if (node.type === 'book') {
    const isbn = node.props?.find(p => p.name === 'isbn')?.value;
    const title = node.children?.find(c => c.type === 'title')
      ?.children?.[0]?.props?.[0]?.value;
    const author = node.children?.find(c => c.type === 'author')
      ?.children?.[0]?.props?.[0]?.value;

    return { isbn, title, author };
  }

  if (node.children) {
    return node.children.map(extractBooks).filter(Boolean).flat();
  }

  return [];
}

const books = extractBooks(tree);
console.log(books);
// [{ isbn: '978-0-123456-78-9', title: 'Sample Book', author: 'John Doe' }]
```
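
The lookup in `extractBooks` is specific to one tag. A more general variant is a depth-first search that returns every node of a given type, which can then be mapped to whatever fields are needed. The sketch below assumes only the documented node shape; `findAll` is a hypothetical helper and the tree is written by hand rather than produced by `parser`.

```javascript
// Hypothetical helper: collect every node whose `type` matches, depth-first.
function findAll(node, type, found = []) {
  if (node.type === type) found.push(node);
  (node.children || []).forEach((child) => findAll(child, type, found));
  return found;
}

// Hand-written tree standing in for a parsed XML document
const catalogTree = {
  type: 'catalog',
  children: [
    { type: 'book', props: [{ name: 'isbn', value: '111' }], children: [] },
    {
      type: 'shelf',
      children: [
        { type: 'book', props: [{ name: 'isbn', value: '222' }], children: [] }
      ]
    }
  ]
};

const isbns = findAll(catalogTree, 'book').map(
  (book) => book.props.find((p) => p.name === 'isbn').value
);
console.log(isbns); // [ '111', '222' ]
```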

### 4. Complex HTML with Inline CSS

```javascript
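const complexHTML = `
<div style="background: linear-gradient(135deg, #667eea 0%, #764ba2 100%); padding: 20px;">
  <h1 style="color: rgba(255, 255, 255, 0.95);">Welcome</h1>
  <p style="color: rgba(255, 255, 255, 0.8);">Beautiful styled content</p>
</div>
`;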

const tree = parser(complexHTML);
const rendered = render(tree, { pretty: true });

console.log(rendered);
// Perfectly preserves all inline CSS with gradients, rgba colors, etc.
```

## 🔍 Advanced Features

### XML Namespaces Support

```javascript
const xml = '<ns:root xmlns:ns="http://example.com/ns"><ns:item>Value</ns:item></ns:root>';
const tree = parser(xml);
const output = render(tree);
// Preserves namespace colons in tag names
```

### Self-Closing Tags

```javascript
const html = '<div><br><img src="photo.jpg" alt="Photo"><hr></div>';
const tree = parser(html);
const output = render(tree);
// Properly handles void elements
```

### Comments Preservation

```javascript
const html = '<div><!-- A comment -->Content</div>';
const tree = parser(html);
const output = render(tree);
// Comments are preserved in the output
```

### Multiple Root Elements

```javascript
const html = '<div>First</div><span>Second</span>';
const tree = parser(html);
// Returns: { type: 'template', children: [...] }
```
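
Code that consumes parser output therefore has to handle three cases: a single root node, a `template` wrapper around several roots, and (per the `Node | undefined` return type in the TypeScript example) no tree at all. One possible way to normalize these is sketched below. `rootNodes` is a hypothetical helper, and it assumes a root of type `template` is always the wrapper, so a document whose real root is a `<template>` element would be unwrapped too.

```javascript
// Hypothetical helper: always return an array of root nodes.
function rootNodes(tree) {
  if (!tree) return [];                 // nothing parsed
  if (Array.isArray(tree)) return tree; // already a list of nodes
  // Assumption: a `template` root is the multi-root wrapper
  return tree.type === 'template' ? (tree.children || []) : [tree];
}

const single = { type: 'div', children: [] };
const multi = { type: 'template', children: [{ type: 'div' }, { type: 'span' }] };

console.log(rootNodes(single).length);            // 1
console.log(rootNodes(multi).map((n) => n.type)); // [ 'div', 'span' ]
console.log(rootNodes(undefined));                // []
```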

## 🧪 Testing

Run the comprehensive test suite:

```bash
npm test
```

**Test Coverage:**
- ✅ Basic HTML elements (div, span, nested structures)
- ✅ Self-closing tags (br, img, input, hr, meta, link)
- ✅ Attributes (single, multiple, special characters, quotes)
- ✅ Text content with escaping
- ✅ HTML comments
- ✅ XML documents with namespaces
- ✅ Complex real-world examples (forms, navigation, tables)
- ✅ Edge cases (empty input, whitespace, consecutive tags)
- ✅ Parser behavior (no parent references, unclosed tags)
- ✅ Parser options (ignore tags - script, style, nested, case-insensitive)
- ✅ Renderer options (pretty printing, XML mode)
- ✅ Complex HTML with extensive inline CSS (11,000+ characters)

**58 tests passing** • 1 skipped

## ⚡ Performance

The parser is designed for speed and efficiency:

- **Streaming parser** - Single-pass character-by-character parsing
- **No regex in main loop** - Only simple character matching
- **Minimal allocations** - Reuses objects where possible
- **Stack-based** - Efficient memory usage for deeply nested structures

Typical performance:
- Small HTML (< 1KB): < 1ms
- Medium HTML (10KB): ~5ms
- Large HTML (100KB+): ~50ms
- Complex HTML with CSS (11KB): ~10ms

## ⚠️ Known Limitations

1. **HTML Entities**: Not decoded during parsing. They are stored as-is and escaped on render.
   - Input: `<p>&amp;</p>` → Stored: `"&amp;"` → Output: `<p>&amp;amp;</p>`
   - **Workaround**: Use raw characters instead of entities in source

2. **Whitespace**: Fully preserved in text nodes, no normalization applied.

3. **Doctype**: `<!DOCTYPE>` declarations are parsed as text nodes, not special nodes.

4. **CDATA**: `<![CDATA[...]]>` sections are not specially handled.

5. **Processing Instructions**: `<?xml ... ?>` are not parsed.

6. **Error Reporting**: Parser is lenient and produces a tree even for malformed HTML. No detailed error messages.

7. **Attribute Order**: May differ from source in rendered output.

8. **Quotes**: Renderer always uses double quotes for attributes.
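
For limitation 1, when the source cannot be changed to use raw characters, the entities can be decoded in the parsed tree before rendering. The sketch below is one possible approach, not library behavior: `decodeEntities` and `decodeTextNodes` are hypothetical helpers, the named-entity table is intentionally tiny, and only `textContent` props on `#text` nodes are touched. Since the renderer escapes text on output, a decoded `&` should then be written back as a single `&amp;`.

```javascript
// Small, deliberately incomplete table of named entities (assumption:
// extend it to whatever your input actually contains).
const ENTITIES = { amp: '&', lt: '<', gt: '>', quot: '"', apos: "'", nbsp: '\u00a0' };

// Decode named and numeric character references in a single pass,
// so "&amp;lt;" becomes "&lt;" and is not decoded twice.
function decodeEntities(text) {
  return text.replace(/&(#x[0-9a-f]+|#\d+|[a-z]+);/gi, (match, body) => {
    if (body[0] === '#') {
      const isHex = body[1].toLowerCase() === 'x';
      const code = parseInt(body.slice(isHex ? 2 : 1), isHex ? 16 : 10);
      // Leave out-of-range references untouched
      return code <= 0x10ffff ? String.fromCodePoint(code) : match;
    }
    // Unknown names are left as they were
    return Object.prototype.hasOwnProperty.call(ENTITIES, body) ? ENTITIES[body] : match;
  });
}

// Walk the tree and decode the text of every `#text` node in place.
function decodeTextNodes(node) {
  if (node.type === '#text') {
    for (const prop of node.props || []) {
      if (prop.name === 'textContent' && typeof prop.value === 'string') {
        prop.value = decodeEntities(prop.value);
      }
    }
  }
  (node.children || []).forEach(decodeTextNodes);
  return node;
}

// Hand-written tree holding entities as-is, as the parser would store them
const stored = {
  type: 'p',
  children: [
    { type: '#text', props: [{ name: 'textContent', value: 'Fish &amp; chips &#169; &lt;2024&gt;' }] }
  ]
};

decodeTextNodes(stored);
console.log(stored.children[0].props[0].value); // Fish & chips © <2024>
```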

## 🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

### Development Setup

```bash
# Clone the repository
git clone https://github.com/lemonadejs/html-to-json.git
cd html-to-json

# Install dependencies
npm install

# Run tests
npm test

# Run tests in watch mode
npm test -- --watch
```

## 📄 License

MIT © [Jspreadsheet Team](https://github.com/lemonadejs)

## 🔗 Links

- **Repository**: https://github.com/lemonadejs/html-to-json
- **NPM Package**: https://www.npmjs.com/package/@lemonadejs/html-to-json
- **Issues**: https://github.com/lemonadejs/html-to-json/issues
- **Documentation**: https://github.com/lemonadejs/html-to-json#readme

## 🙏 Acknowledgments

Built with ❤️ by the [Jspreadsheet Team](https://jspreadsheet.com/)

---

**Star this repo** ⭐ if you find it useful!