https://github.com/mykolaharmash/hyntax

# Hyntax

Straightforward HTML parser for JavaScript. [Live Demo](https://astexplorer.net/#/gist/6bf7f78077333cff124e619aebfb5b42/latest).

- **Simple.** API is straightforward, output is clear.
- **Forgiving.** Just like a browser, it parses invalid HTML without failing.
- **Supports streaming.** Can process HTML while it's still being loaded.
- **No dependencies.**

## Table Of Contents

- [Usage](#usage)
- [TypeScript Typings](#typescript-typings)
- [Streaming](#streaming)
- [Tokens](#tokens)
- [AST Format](#ast-format)
- [API Reference](#api-reference)
- [Types Reference](#types-reference)

## Usage

```bash
npm install hyntax
```

```javascript
const { tokenize, constructTree } = require('hyntax')
const util = require('util')

const inputHTML = `
<html>
  <body>
    <button>Don't press</button>
  </body>
</html>
`

const { tokens } = tokenize(inputHTML)
const { ast } = constructTree(tokens)

console.log(JSON.stringify(tokens, null, 2))
console.log(util.inspect(ast, { showHidden: false, depth: null }))
```

## TypeScript Typings

Hyntax is written in JavaScript but ships with [integrated TypeScript typings](./index.d.ts) to help you navigate its data structures. There is also a [Types Reference](#types-reference) section that covers the most common types.

## Streaming

Use the `StreamTokenizer` and `StreamTreeConstructor` classes to parse HTML chunk by chunk while it is still being loaded from the network or read from disk.

```javascript
const { StreamTokenizer, StreamTreeConstructor } = require('hyntax')
const http = require('http')
const util = require('util')

http.get('http://info.cern.ch', (res) => {
  const streamTokenizer = new StreamTokenizer()
  const streamTreeConstructor = new StreamTreeConstructor()

  let resultTokens = []
  let resultAst

  res.pipe(streamTokenizer).pipe(streamTreeConstructor)

  streamTokenizer
    .on('data', (tokens) => {
      resultTokens = resultTokens.concat(tokens)
    })
    .on('end', () => {
      console.log(JSON.stringify(resultTokens, null, 2))
    })

  streamTreeConstructor
    .on('data', (ast) => {
      resultAst = ast
    })
    .on('end', () => {
      console.log(util.inspect(resultAst, { showHidden: false, depth: null }))
    })
}).on('error', (err) => {
  throw err
})
```

## Tokens

Here are all the kinds of tokens that Hyntax extracts from an HTML string.

![Overview of all possible tokens](./tokens-list.png)

Each token conforms to the [Tokenizer.Token](#TokenizerToken) interface.

## AST Format

The resulting syntax tree always has one top-level [Document Node](#ast-node-types), with optional child nodes nested within it.

```javascript
{
  nodeType: TreeConstructor.NodeTypes.Document,
  content: {
    children: [
      {
        nodeType: TreeConstructor.NodeTypes.AnyNodeType,
        content: {…}
      },
      {
        nodeType: TreeConstructor.NodeTypes.AnyNodeType,
        content: {…}
      }
    ]
  }
}
```

The content of each node is specific to the node's type. All of them are described in the [AST Node Types](#ast-node-types) reference.
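Because every node shares the same `nodeType`/`content` shape, a generic depth-first walk over the tree can be sketched as below. This is a minimal illustration rather than part of the library: the sample tree is written by hand, its `nodeType` strings are placeholders, and the only assumption about the real AST is that child nodes, when present, live under `content.children`.

```javascript
// Depth-first traversal over a Hyntax-style AST.
// Assumption: child nodes, when present, are stored in `content.children`.
function walk(node, visit, depth = 0) {
  visit(node, depth)
  const children = (node.content && node.content.children) || []
  for (const child of children) {
    walk(child, visit, depth + 1)
  }
}

// Hand-written sample tree; the nodeType strings are illustrative placeholders.
const sampleAst = {
  nodeType: 'document',
  content: {
    children: [
      {
        nodeType: 'tag',
        content: {
          name: 'button',
          selfClosing: false,
          children: [
            { nodeType: 'text', content: { value: { content: "Don't press" } } }
          ]
        }
      }
    ]
  }
}

// Collect an indented outline of the tree, one line per node.
const visited = []
walk(sampleAst, (node, depth) => {
  visited.push(`${'  '.repeat(depth)}${node.nodeType}`)
})
console.log(visited.join('\n'))
```

Passing the visitor as a callback keeps the traversal independent of what is done with each node, so the same `walk` can back a printer, a linter, or a search.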

## API Reference

### Tokenizer

Hyntax has its tokenizer as a separate module. You can use generated tokens on their own or pass them further to a tree constructor to build an AST.

#### Interface

```typescript
tokenize(html: string): Tokenizer.Result
```

#### Arguments

- `html`
  HTML string to process.
  Required.
  Type: string.

#### Returns [Tokenizer.Result](#TokenizerResult)

### Tree Constructor

Once you have an array of tokens, you can pass it to the tree constructor to build an AST.

#### Interface

```typescript
constructTree(tokens: Tokenizer.AnyToken[]): TreeConstructor.Result
```

#### Arguments

- `tokens`
  Array of tokens received from the tokenizer.
  Required.
  Type: [Tokenizer.AnyToken[]](#tokenizeranytoken)

#### Returns [TreeConstructor.Result](#TreeConstructorResult)

## Types Reference

#### Tokenizer.Result

```typescript
interface Result {
  state: Tokenizer.State
  tokens: Tokenizer.AnyToken[]
}
```

- `state`
  The current state of the tokenizer. It can be persisted and passed to the next tokenizer call if the input is coming in chunks.
- `tokens`
  Array of resulting tokens.
  Type: [Tokenizer.AnyToken[]](#tokenizeranytoken)
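Carrying `state` from one call to the next might look like the sketch below. It rests on assumptions that this README does not spell out: that `tokenize` accepts the previous state as a second argument, and that a third `options` argument with an `isFinalChunk` flag tells the tokenizer whether more input will follow. Check `index.d.ts` for the signature in your version. The helper name `tokenizeInChunks` is hypothetical, and the tokenizer is passed in as a parameter so the snippet does not depend on how the package is loaded.

```javascript
// Feed chunks to a tokenizer one at a time, carrying `state` between calls.
// Assumption: `tokenize(html, existingState, options)` takes the previous
// state as its second argument and an `isFinalChunk` flag in `options`.
function tokenizeInChunks(tokenize, chunks) {
  let state // undefined on the first call, so the tokenizer starts fresh
  let allTokens = []

  chunks.forEach((chunk, index) => {
    const isFinalChunk = index === chunks.length - 1
    const result = tokenize(chunk, state, { isFinalChunk })

    state = result.state // persist for the next call
    allTokens = allTokens.concat(result.tokens)
  })

  return allTokens
}
```

Under those assumptions it would be called as `tokenizeInChunks(require('hyntax').tokenize, ['<but', 'ton>'])`. For input that arrives as a Node.js stream, `StreamTokenizer` is the simpler route.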

#### TreeConstructor.Result

```typescript
interface Result {
  state: State
  ast: AST
}
```

- `state`
  The current state of the tree constructor. It can be persisted and passed to the next tree constructor call when tokens arrive in chunks.

- `ast`
  Resulting AST.
  Type: [TreeConstructor.AST](#treeconstructorast)

#### Tokenizer.Token

Generic Token. Other interfaces use it to create a specific Token type.

```typescript
interface Token<T extends TokenTypes.AnyTokenType> {
  type: T
  content: string
  startPosition: number
  endPosition: number
}
```

- `type`
  One of the [Token types](#TokenizerTokenTypesAnyTokenType).

- `content`
  Piece of the original HTML string which was recognized as a token.

- `startPosition`
  Index of a character in the input HTML string where the token starts.

- `endPosition`
  Index of a character in the input HTML string where the token ends.
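Since every token carries the same four fields, small helpers for inspecting tokenizer output are easy to write. The following is a sketch only: the helper names are hypothetical, and the sample tokens are written by hand, with placeholder `type` strings and made-up positions rather than real tokenizer output.

```javascript
// Render a token as "type [start, end]: content" for quick debugging.
function describeToken(token) {
  const range = `[${token.startPosition}, ${token.endPosition}]`
  return `${token.type} ${range}: ${JSON.stringify(token.content)}`
}

// Count how many tokens of each type an array contains.
function countByType(tokens) {
  const counts = {}
  for (const token of tokens) {
    counts[token.type] = (counts[token.type] || 0) + 1
  }
  return counts
}

// Hand-written sample; type strings and positions are illustrative placeholders.
const sampleTokens = [
  { type: 'OpenTagStart', content: '<p', startPosition: 0, endPosition: 1 },
  { type: 'OpenTagEnd', content: '>', startPosition: 2, endPosition: 2 },
  { type: 'Text', content: 'Hi', startPosition: 3, endPosition: 4 },
  { type: 'CloseTag', content: '</p>', startPosition: 5, endPosition: 8 }
]

sampleTokens.forEach((token) => console.log(describeToken(token)))
console.log(countByType(sampleTokens))
```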

#### Tokenizer.TokenTypes.AnyTokenType

Shortcut type of all possible tokens.

```typescript
type AnyTokenType =
  | Text
  | OpenTagStart
  | AttributeKey
  | AttributeAssigment
  | AttributeValueWrapperStart
  | AttributeValue
  | AttributeValueWrapperEnd
  | OpenTagEnd
  | CloseTag
  | OpenTagStartScript
  | ScriptTagContent
  | OpenTagEndScript
  | CloseTagScript
  | OpenTagStartStyle
  | StyleTagContent
  | OpenTagEndStyle
  | CloseTagStyle
  | DoctypeStart
  | DoctypeEnd
  | DoctypeAttributeWrapperStart
  | DoctypeAttribute
  | DoctypeAttributeWrapperEnd
  | CommentStart
  | CommentContent
  | CommentEnd
```

#### Tokenizer.AnyToken

Shortcut to reference any possible token.

```typescript
type AnyToken = Token<TokenTypes.AnyTokenType>
```

#### TreeConstructor.AST

An alias for DocumentNode. The AST always has one top-level DocumentNode. See [AST Node Types](#ast-node-types).

```typescript
type AST = TreeConstructor.DocumentNode
```

### AST Node Types

There are 7 possible types of Node. Each type has a specific content.

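```typescript
type DocumentNode = Node<NodeTypes.Document, NodeContents.Document>
```

```typescript
type DoctypeNode = Node<NodeTypes.Doctype, NodeContents.Doctype>
```

```typescript
type TextNode = Node<NodeTypes.Text, NodeContents.Text>
```

```typescript
type TagNode = Node<NodeTypes.Tag, NodeContents.Tag>
```

```typescript
type CommentNode = Node<NodeTypes.Comment, NodeContents.Comment>
```

```typescript
type ScriptNode = Node<NodeTypes.Script, NodeContents.Script>
```

```typescript
type StyleNode = Node<NodeTypes.Style, NodeContents.Style>
```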

Interfaces for each content type:

- [Document](#TreeConstructorNodeContentsDocument)
- [Doctype](#TreeConstructorNodeContentsDoctype)
- [Text](#TreeConstructorNodeContentsText)
- [Tag](#TreeConstructorNodeContentsTag)
- [Comment](#TreeConstructorNodeContentsComment)
- [Script](#TreeConstructorNodeContentsScript)
- [Style](#TreeConstructorNodeContentsStyle)

#### TreeConstructor.Node

Generic Node. Other interfaces use it to create specific Nodes by providing the type of the Node and the type of the content inside it.

```typescript
interface Node<T extends NodeTypes.AnyNodeType, C extends NodeContents.AnyNodeContent> {
  nodeType: T
  content: C
}
```

#### TreeConstructor.NodeTypes.AnyNodeType

Shortcut type of all possible Node types.

```typescript
type AnyNodeType =
  | Document
  | Doctype
  | Tag
  | Text
  | Comment
  | Script
  | Style
```

### Node Content Types

#### TreeConstructor.NodeContents.AnyNodeContent

Shortcut type of all possible types of content inside a Node.

```typescript
type AnyNodeContent =
  | Document
  | Doctype
  | Text
  | Tag
  | Comment
  | Script
  | Style
```

#### TreeConstructor.NodeContents.Document

```typescript
interface Document {
  children: AnyNode[]
}
```

#### TreeConstructor.NodeContents.Doctype

```typescript
interface Doctype {
  start: Tokenizer.Token<Tokenizer.TokenTypes.DoctypeStart>
  attributes?: DoctypeAttribute[]
  end: Tokenizer.Token<Tokenizer.TokenTypes.DoctypeEnd>
}
```

#### TreeConstructor.NodeContents.Text

```typescript
interface Text {
  value: Tokenizer.Token<Tokenizer.TokenTypes.Text>
}
```
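As an illustration of how the Text content shape is used, the sketch below gathers the text of every Text node in a tree. It assumes that text nodes can be recognised by a `nodeType` equal to `'text'`. That value is an assumption, not something stated here, so compare against `TreeConstructor.NodeTypes` in your version. The helper name and the sample tree are hypothetical.

```javascript
// Assumption: Text nodes are identified by this nodeType value.
const TEXT_NODE_TYPE = 'text'

// Recursively collect the original text of every Text node, in document order.
function collectText(node, out = []) {
  if (node.nodeType === TEXT_NODE_TYPE) {
    // Text content holds a single token; its `content` is the raw text.
    out.push(node.content.value.content)
  }
  const children = (node.content && node.content.children) || []
  children.forEach((child) => collectText(child, out))
  return out
}

// Hand-written sample tree with placeholder nodeType strings.
const sampleTree = {
  nodeType: 'document',
  content: {
    children: [
      { nodeType: 'text', content: { value: { content: 'Hello, ' } } },
      {
        nodeType: 'tag',
        content: {
          name: 'b',
          selfClosing: false,
          children: [
            { nodeType: 'text', content: { value: { content: 'world' } } }
          ]
        }
      }
    ]
  }
}

console.log(collectText(sampleTree).join(''))
```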

#### TreeConstructor.NodeContents.Tag

```typescript
interface Tag {
  name: string
  selfClosing: boolean
  openStart: Tokenizer.Token<Tokenizer.TokenTypes.OpenTagStart>
  attributes?: TagAttribute[]
  openEnd: Tokenizer.Token<Tokenizer.TokenTypes.OpenTagEnd>
  children?: AnyNode[]
  close?: Tokenizer.Token<Tokenizer.TokenTypes.CloseTag>
}
```

#### TreeConstructor.NodeContents.Comment

```typescript
interface Comment {
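  start: Tokenizer.Token<Tokenizer.TokenTypes.CommentStart>
  value: Tokenizer.Token<Tokenizer.TokenTypes.CommentContent>
  end: Tokenizer.Token<Tokenizer.TokenTypes.CommentEnd>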
}
```

#### TreeConstructor.NodeContents.Script

```typescript
interface Script {
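  openStart: Tokenizer.Token<Tokenizer.TokenTypes.OpenTagStartScript>
  attributes?: TagAttribute[]
  openEnd: Tokenizer.Token<Tokenizer.TokenTypes.OpenTagEndScript>
  value: Tokenizer.Token<Tokenizer.TokenTypes.ScriptTagContent>
  close: Tokenizer.Token<Tokenizer.TokenTypes.CloseTagScript>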
}
```

#### TreeConstructor.NodeContents.Style

```typescript
interface Style {
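  openStart: Tokenizer.Token<Tokenizer.TokenTypes.OpenTagStartStyle>
  attributes?: TagAttribute[]
  openEnd: Tokenizer.Token<Tokenizer.TokenTypes.OpenTagEndStyle>
  value: Tokenizer.Token<Tokenizer.TokenTypes.StyleTagContent>
  close: Tokenizer.Token<Tokenizer.TokenTypes.CloseTagStyle>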
}
```

#### TreeConstructor.DoctypeAttribute

```typescript
interface DoctypeAttribute {
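  startWrapper?: Tokenizer.Token<Tokenizer.TokenTypes.DoctypeAttributeWrapperStart>
  value: Tokenizer.Token<Tokenizer.TokenTypes.DoctypeAttribute>
  endWrapper?: Tokenizer.Token<Tokenizer.TokenTypes.DoctypeAttributeWrapperEnd>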
}
```

#### TreeConstructor.TagAttribute

```typescript
interface TagAttribute {
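  key?: Tokenizer.Token<Tokenizer.TokenTypes.AttributeKey>
  startWrapper?: Tokenizer.Token<Tokenizer.TokenTypes.AttributeValueWrapperStart>
  value?: Tokenizer.Token<Tokenizer.TokenTypes.AttributeValue>
  endWrapper?: Tokenizer.Token<Tokenizer.TokenTypes.AttributeValueWrapperEnd>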
}
```
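Because `key` and `value` are both optional on a TagAttribute, code that reads attributes has to guard each of them. A minimal sketch of flattening a tag's attributes into a plain object is shown below. The helper name is hypothetical, the sample content is written by hand, and mapping valueless attributes to `true` is a choice made for this example rather than library behaviour.

```javascript
// Collapse a tag's attribute list into a plain { key: value } object.
// `attributes`, `key` and `value` are all optional, so each is guarded.
function attributesToObject(tagContent) {
  const result = {}
  for (const attribute of tagContent.attributes || []) {
    if (!attribute.key) continue // nothing to index a keyless attribute by
    // Valueless attributes (for example `disabled`) are stored as `true`.
    result[attribute.key.content] = attribute.value ? attribute.value.content : true
  }
  return result
}

// Hand-written Tag content; only the fields the helper reads are filled in.
const sampleTagContent = {
  name: 'input',
  selfClosing: true,
  attributes: [
    { key: { content: 'type' }, value: { content: 'text' } },
    { key: { content: 'disabled' } }
  ]
}

console.log(attributesToObject(sampleTagContent))
```

The wrapper tokens (`startWrapper`, `endWrapper`) are ignored here because they hold only the quote characters around a value.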