Hyntax project logo — lego bricks in the shape of a capital letter H

# Hyntax

Straightforward HTML parser for JavaScript. [Live Demo](https://astexplorer.net/#/gist/6bf7f78077333cff124e619aebfb5b42/latest).

- **Simple.** API is straightforward, output is clear.
- **Forgiving.** Like a browser, it parses invalid HTML gracefully.
- **Supports streaming.** Can process HTML while it's still being loaded.
- **No dependencies.**

## Table Of Contents

- [Usage](#usage)
- [TypeScript Typings](#typescript-typings)
- [Streaming](#streaming)
- [Tokens](#tokens)
- [AST Format](#ast-format)
- [API Reference](#api-reference)
- [Types Reference](#types-reference)

## Usage

```bash
npm install hyntax
```

```javascript
const { tokenize, constructTree } = require('hyntax')
const util = require('util')

const inputHTML = `
<html>
  <body>
      <input type="text" placeholder="Don't type">
      <button>Don't press</button>
  </body>
</html>
`

const { tokens } = tokenize(inputHTML)
const { ast } = constructTree(tokens)

console.log(JSON.stringify(tokens, null, 2))
console.log(util.inspect(ast, { showHidden: false, depth: null }))
```

## TypeScript Typings

Hyntax is written in JavaScript but ships with [integrated TypeScript typings](./index.d.ts) to help you navigate its data structures. There is also a [Types Reference](#types-reference) that covers the most common types.

## Streaming

Use the `StreamTokenizer` and `StreamTreeConstructor` classes to parse HTML chunk by chunk while it's still being loaded from the network or read from disk.

```javascript
const { StreamTokenizer, StreamTreeConstructor } = require('hyntax')
const http = require('http')
const util = require('util')

http.get('http://info.cern.ch', (res) => {
  const streamTokenizer = new StreamTokenizer()
  const streamTreeConstructor = new StreamTreeConstructor()

  let resultTokens = []
  let resultAst

  res.pipe(streamTokenizer).pipe(streamTreeConstructor)

  streamTokenizer
    .on('data', (tokens) => {
      resultTokens = resultTokens.concat(tokens)
    })
    .on('end', () => {
      console.log(JSON.stringify(resultTokens, null, 2))
    })

  streamTreeConstructor
    .on('data', (ast) => {
      resultAst = ast
    })
    .on('end', () => {
      console.log(util.inspect(resultAst, { showHidden: false, depth: null }))
    })
}).on('error', (err) => {
  throw err
})
```

## Tokens

Here are all the kinds of tokens Hyntax will extract out of an HTML string.

![Overview of all possible tokens](./tokens-list.png)

Each token conforms to the [Tokenizer.Token](#TokenizerToken) interface.
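
As a quick illustration, the sketch below tokenizes a small fragment and prints each token's type and content (the fragment is arbitrary and the output is not shown verbatim):

```javascript
const { tokenize } = require('hyntax')

// Tokenize a small fragment and inspect the resulting tokens.
// Every element of `tokens` conforms to Tokenizer.Token:
// { type, content, startPosition, endPosition }
const { tokens } = tokenize('<button>Don\'t press</button>')

tokens.forEach((token) => {
  console.log(token.type, JSON.stringify(token.content))
})
```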

## AST Format

The resulting syntax tree will have at least one top-level [Document Node](#ast-node-types), with optional child nodes nested within.

```javascript
{
  nodeType: TreeConstructor.NodeTypes.Document,
  content: {
    children: [
      {
        nodeType: TreeConstructor.NodeTypes.AnyNodeType,
        content: {…}
      },
      {
        nodeType: TreeConstructor.NodeTypes.AnyNodeType,
        content: {…}
      }
    ]
  }
}
```

The content of each node is specific to the node's type; all of them are described in the [AST Node Types](#ast-node-types) reference.
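
For example, here is a minimal sketch of walking such a tree to collect tag names. It assumes only the documented node shape: tag nodes carry `content.name`, container nodes carry `content.children`.

```javascript
const { tokenize, constructTree } = require('hyntax')

const { ast } = constructTree(tokenize('<ul><li>One</li><li>Two</li></ul>').tokens)

// Recursively walk content.children, collecting content.name where present.
// Identifying tag nodes by the presence of content.name is an assumption
// based on the Tag content interface described below.
function collectTagNames(node, names = []) {
  if (node.content.name) {
    names.push(node.content.name)
  }

  const children = node.content.children || []
  children.forEach((child) => collectTagNames(child, names))

  return names
}

console.log(collectTagNames(ast)) // expected along the lines of [ 'ul', 'li', 'li' ]
```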

## API Reference

### Tokenizer

Hyntax exposes its tokenizer as a separate module. You can use the generated tokens on their own or pass them on to the tree constructor to build an AST.

#### Interface

```typescript
tokenize(html: string): Tokenizer.Result
```

#### Arguments

- `html`
HTML string to process
Required.
Type: string.

#### Returns [Tokenizer.Result](#TokenizerResult)
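
A minimal call looks like this (a sketch; the sample HTML is arbitrary):

```javascript
const { tokenize } = require('hyntax')

// `state` can be kept and reused when the input arrives in chunks,
// `tokens` is the array to pass on to the tree constructor
const { state, tokens } = tokenize('<p>Hi</p>')

console.log(tokens.length)
```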

### Tree Constructor

After you've got an array of tokens, you can pass them into the tree constructor to build an AST.

#### Interface

```typescript
constructTree(tokens: Tokenizer.AnyToken[]): TreeConstructor.Result
```

#### Arguments

- `tokens`
Array of tokens received from the tokenizer.
Required.
Type: [Tokenizer.AnyToken[]](#tokenizeranytoken)

#### Returns [TreeConstructor.Result](#TreeConstructorResult)
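
Putting the two calls together (a minimal sketch with arbitrary input):

```javascript
const { tokenize, constructTree } = require('hyntax')

const { tokens } = tokenize('<p>Hi</p>')

// The result contains the tree constructor state and the AST itself;
// `ast` is the top-level Document node described in the AST Format section
const { state, ast } = constructTree(tokens)

console.log(ast.nodeType)
```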

## Types Reference

#### Tokenizer.Result

```typescript
interface Result {
  state: Tokenizer.State
  tokens: Tokenizer.AnyToken[]
}
```

- `state`
The current state of the tokenizer. It can be persisted and passed to the next tokenizer call if the input is coming in chunks.
- `tokens`
Array of resulting tokens.
Type: [Tokenizer.AnyToken[]](#tokenizeranytoken)

#### TreeConstructor.Result

```typescript
interface Result {
  state: State
  ast: AST
}
```

- `state`
The current state of the tree constructor. It can be persisted and passed to the next tree constructor call when tokens are coming in chunks.

- `ast`
Resulting AST.
Type: [TreeConstructor.AST](#treeconstructorast)

#### Tokenizer.Token

Generic Token; other interfaces use it to create specific Token types.

```typescript
interface Token<T extends TokenTypes.AnyTokenType> {
  type: T
  content: string
  startPosition: number
  endPosition: number
}
```

- `type`
One of the [Token types](#TokenizerTokenTypesAnyTokenType).

- `content`
Piece of the original HTML string which was recognized as a token.

- `startPosition`
Index of a character in the input HTML string where the token starts.

- `endPosition`
Index of a character in the input HTML string where the token ends.
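
As a small illustration of these fields (the sample input is arbitrary):

```javascript
const { tokenize } = require('hyntax')

const html = '<p>Hi</p>'
const { tokens } = tokenize(html)

// Print where each token sits in the original string and what it captured
tokens.forEach((token) => {
  console.log(token.startPosition, token.endPosition, JSON.stringify(token.content))
})
```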

#### Tokenizer.TokenTypes.AnyTokenType

Shortcut type covering all possible token types.

```typescript
type AnyTokenType =
  | Text
  | OpenTagStart
  | AttributeKey
  | AttributeAssigment
  | AttributeValueWrapperStart
  | AttributeValue
  | AttributeValueWrapperEnd
  | OpenTagEnd
  | CloseTag
  | OpenTagStartScript
  | ScriptTagContent
  | OpenTagEndScript
  | CloseTagScript
  | OpenTagStartStyle
  | StyleTagContent
  | OpenTagEndStyle
  | CloseTagStyle
  | DoctypeStart
  | DoctypeEnd
  | DoctypeAttributeWrapperStart
  | DoctypeAttribute
  | DoctypeAttributeWrapperEnd
  | CommentStart
  | CommentContent
  | CommentEnd

#### Tokenizer.AnyToken

Shortcut to reference any possible token.

```typescript
type AnyToken = Token<TokenTypes.AnyTokenType>
```

#### TreeConstructor.AST

Just an alias to DocumentNode. An AST always has one top-level DocumentNode. See [AST Node Types](#ast-node-types).

```typescript
type AST = TreeConstructor.DocumentNode
```

### AST Node Types

There are 7 possible types of Node, each with its own specific content.

```typescript
type DocumentNode = Node<NodeTypes.Document, NodeContents.Document>
```

```typescript
type DoctypeNode = Node<NodeTypes.Doctype, NodeContents.Doctype>
```

```typescript
type TextNode = Node<NodeTypes.Text, NodeContents.Text>
```

```typescript
type TagNode = Node<NodeTypes.Tag, NodeContents.Tag>
```

```typescript
type CommentNode = Node<NodeTypes.Comment, NodeContents.Comment>
```

```typescript
type ScriptNode = Node<NodeTypes.Script, NodeContents.Script>
```

```typescript
type StyleNode = Node<NodeTypes.Style, NodeContents.Style>
```

Interfaces for each content type:

- [Document](#TreeConstructorNodeContentsDocument)
- [Doctype](#TreeConstructorNodeContentsDoctype)
- [Text](#TreeConstructorNodeContentsText)
- [Tag](#TreeConstructorNodeContentsTag)
- [Comment](#TreeConstructorNodeContentsComment)
- [Script](#TreeConstructorNodeContentsScript)
- [Style](#TreeConstructorNodeContentsStyle)

#### TreeConstructor.Node

Generic Node; other interfaces use it to create specific Nodes by providing the Node type and the type of the content inside the Node.

```typescript
interface Node<T extends NodeTypes.AnyNodeType, C extends NodeContents.AnyNodeContent> {
  nodeType: T
  content: C
}
```

#### TreeConstructor.NodeTypes.AnyNodeType

Shortcut type of all possible Node types.

```typescript
type AnyNodeType =
  | Document
  | Doctype
  | Tag
  | Text
  | Comment
  | Script
  | Style
```

### Node Content Types

#### TreeConstructor.NodeTypes.AnyNodeContent

Shortcut type of all possible types of content inside a Node.

```typescript
type AnyNodeContent =
  | Document
  | Doctype
  | Text
  | Tag
  | Comment
  | Script
  | Style
```

#### TreeConstructor.NodeContents.Document

```typescript
interface Document {
  children: AnyNode[]
}
```

#### TreeConstructor.NodeContents.Doctype

```typescript
interface Doctype {
  start: Tokenizer.Token
  attributes?: DoctypeAttribute[]
  end: Tokenizer.Token
}
```

#### TreeConstructor.NodeContents.Text

```typescript
interface Text {
  value: Tokenizer.Token
}
```

#### TreeConstructor.NodeContents.Tag

```typescript
interface Tag {
  name: string
  selfClosing: boolean
  openStart: Tokenizer.Token
  attributes?: TagAttribute[]
  openEnd: Tokenizer.Token
  children?: AnyNode[]
  close?: Tokenizer.Token
}
```
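
As a sketch of reading these fields (assuming the parsed fragment's tag ends up as the document's first child):

```javascript
const { tokenize, constructTree } = require('hyntax')

const { ast } = constructTree(tokenize('<a href="/docs">Docs</a>').tokens)

// Assumption: with this input the first child of the document is the <a> tag node
const tagNode = ast.content.children[0]

console.log(tagNode.content.name)        // tag name, e.g. 'a'
console.log(tagNode.content.selfClosing) // false for <a>…</a>
console.log(tagNode.content.attributes)  // array of TreeConstructor.TagAttribute
```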

#### TreeConstructor.NodeContents.Comment

```typescript
interface Comment {
  start: Tokenizer.Token
  value: Tokenizer.Token
  end: Tokenizer.Token
}
```

#### TreeConstructor.NodeContents.Script

```typescript
interface Script {
  openStart: Tokenizer.Token
  attributes?: TagAttribute[]
  openEnd: Tokenizer.Token
  value: Tokenizer.Token
  close: Tokenizer.Token
}
```

#### TreeConstructor.NodeContents.Style

```typescript
interface Style {
  openStart: Tokenizer.Token,
  attributes?: TagAttribute[],
  openEnd: Tokenizer.Token,
  value: Tokenizer.Token,
  close: Tokenizer.Token
}
```

#### TreeConstructor.DoctypeAttribute

```typescript
interface DoctypeAttribute {
  startWrapper?: Tokenizer.Token,
  value: Tokenizer.Token,
  endWrapper?: Tokenizer.Token
}
```

#### TreeConstructor.TagAttribute

```typescript
interface TagAttribute {
  key?: Tokenizer.Token,
  startWrapper?: Tokenizer.Token,
  value?: Tokenizer.Token,
  endWrapper?: Tokenizer.Token
}
```
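
A small sketch of flattening a `TagAttribute` into plain strings via the tokens' `content` fields (the helper name is hypothetical, not part of Hyntax):

```javascript
// Hypothetical helper: turn a TreeConstructor.TagAttribute into { key, value } strings.
// All fields are optional, so guard each token before reading its content.
function attributeToPair(attribute) {
  return {
    key: attribute.key ? attribute.key.content : undefined,
    value: attribute.value ? attribute.value.content : undefined
  }
}
```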