https://github.com/mykolaharmash/hyntax
Straightforward HTML parser for JavaScript
- Host: GitHub
- URL: https://github.com/mykolaharmash/hyntax
- Owner: mykolaharmash
- License: mit
- Created: 2017-08-01T20:08:54.000Z (over 7 years ago)
- Default Branch: master
- Last Pushed: 2024-07-12T14:31:21.000Z (6 months ago)
- Last Synced: 2025-01-12T05:04:58.783Z (11 days ago)
- Topics: dom, html, html-parser, javascript
- Language: JavaScript
- Homepage:
- Size: 2.3 MB
- Stars: 139
- Watchers: 8
- Forks: 8
- Open Issues: 14
Metadata Files:
- Readme: README.md
- License: LICENSE
# Hyntax
Straightforward HTML parser for JavaScript. [Live Demo](https://astexplorer.net/#/gist/6bf7f78077333cff124e619aebfb5b42/latest).
- **Simple.** API is straightforward, output is clear.
- **Forgiving.** Like a browser, it parses invalid HTML without failing.
- **Supports streaming.** Can process HTML while it's still being loaded.
- **No dependencies.**

## Table Of Contents
- [Usage](#usage)
- [TypeScript Typings](#typescript-typings)
- [Streaming](#streaming)
- [Tokens](#tokens)
- [AST Format](#ast-format)
- [API Reference](#api-reference)
- [Types Reference](#types-reference)

## Usage
```bash
npm install hyntax
```

```javascript
const { tokenize, constructTree } = require('hyntax')
const util = require('util')

const inputHTML = `
<button>Don't press</button>
`

const { tokens } = tokenize(inputHTML)
const { ast } = constructTree(tokens)

console.log(JSON.stringify(tokens, null, 2))
console.log(util.inspect(ast, { showHidden: false, depth: null }))
```

## TypeScript Typings
Hyntax is written in JavaScript but ships with [integrated TypeScript typings](./index.d.ts) to help you navigate its data structures. There is also a [Types Reference](#types-reference) section that covers the most common types.
## Streaming
Use the `StreamTokenizer` and `StreamTreeConstructor` classes to parse HTML chunk by chunk while it is still being loaded from the network or read from disk.
```javascript
const { StreamTokenizer, StreamTreeConstructor } = require('hyntax')
const http = require('http')
const util = require('util')

http.get('http://info.cern.ch', (res) => {
  const streamTokenizer = new StreamTokenizer()
  const streamTreeConstructor = new StreamTreeConstructor()

  let resultTokens = []
  let resultAst

  res.pipe(streamTokenizer).pipe(streamTreeConstructor)

  streamTokenizer
    .on('data', (tokens) => {
      resultTokens = resultTokens.concat(tokens)
    })
    .on('end', () => {
      console.log(JSON.stringify(resultTokens, null, 2))
    })

  streamTreeConstructor
    .on('data', (ast) => {
      resultAst = ast
    })
    .on('end', () => {
      console.log(util.inspect(resultAst, { showHidden: false, depth: null }))
    })
}).on('error', (err) => {
  throw err;
})
```

## Tokens
Here are all the kinds of tokens that Hyntax extracts from an HTML string.
![Overview of all possible tokens](./tokens-list.png)
Each token conforms to the [Tokenizer.Token](#TokenizerToken) interface.
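
As a rough illustration of the token shape, the sketch below hand-writes four tokens for the input `<b>Hi</b>` instead of calling the tokenizer, then checks that each token's `content` matches the slice of the input between its positions. The `type` strings are placeholder names, and treating `endPosition` as an inclusive index is an assumption; compare both against real `tokenize` output before relying on them.

```javascript
// Hand-written tokens for illustration; not produced by Hyntax.
// The `type` values are placeholder names, and `endPosition` is
// assumed to be the index of the token's last character (inclusive).
const inputHTML = '<b>Hi</b>'

const sampleTokens = [
  { type: 'OpenTagStart', content: '<b', startPosition: 0, endPosition: 1 },
  { type: 'OpenTagEnd', content: '>', startPosition: 2, endPosition: 2 },
  { type: 'Text', content: 'Hi', startPosition: 3, endPosition: 4 },
  { type: 'CloseTag', content: '</b>', startPosition: 5, endPosition: 8 }
]

// Returns the piece of `html` that a token claims to cover.
function sliceForToken (html, token) {
  return html.slice(token.startPosition, token.endPosition + 1)
}

// True when every token's content equals its slice of the input.
const allMatch = sampleTokens.every(
  (token) => sliceForToken(inputHTML, token) === token.content
)

console.log(allMatch)
```

A check like this can be handy when mapping tokens back to source locations, for example to highlight a tag in an editor.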
## AST Format
The resulting syntax tree has a single top-level [Document Node](#ast-node-types) with optional child nodes nested within.
```javascript
{
  nodeType: TreeConstructor.NodeTypes.Document,
  content: {
    children: [
      {
        nodeType: TreeConstructor.NodeTypes.AnyNodeType,
        content: {…}
      },
      {
        nodeType: TreeConstructor.NodeTypes.AnyNodeType,
        content: {…}
      }
    ]
  }
}
```

The content of each node is specific to the node's type. All of them are described in the [AST Node Types](#ast-node-types) reference.
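
To make the nesting concrete, here is a minimal sketch of a depth-first walk over a tree of this shape. The tree is hand-built rather than produced by `constructTree`, the `nodeType` strings are placeholders for the real constants, and most token fields are omitted for brevity; only the `children`, `name`, and `value` fields described in the Types Reference are assumed.

```javascript
// Hand-built tree that follows the documented Node shape.
// The `nodeType` strings are placeholders, not Hyntax's real constants,
// and token fields other than `content` are left out for brevity.
const sampleAst = {
  nodeType: 'Document',
  content: {
    children: [
      {
        nodeType: 'Tag',
        content: {
          name: 'p',
          selfClosing: false,
          children: [
            { nodeType: 'Text', content: { value: { content: 'Hello ' } } },
            {
              nodeType: 'Tag',
              content: {
                name: 'b',
                selfClosing: false,
                children: [
                  { nodeType: 'Text', content: { value: { content: 'world' } } }
                ]
              }
            }
          ]
        }
      }
    ]
  }
}

// Depth-first walk: calls `visit` for a node, then for each of its children.
// `children` is optional, so fall back to an empty list.
function walk (node, visit) {
  visit(node)
  const children = node.content.children || []
  children.forEach((child) => walk(child, visit))
}

const tagNames = []
let text = ''

walk(sampleAst, (node) => {
  if (node.nodeType === 'Tag') tagNames.push(node.content.name)
  if (node.nodeType === 'Text') text += node.content.value.content
})

console.log(tagNames) // [ 'p', 'b' ]
console.log(text) // Hello world
```

The same `walk` helper works for nodes without children (comments, scripts, styles), because it treats a missing `children` field as an empty list.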
## API Reference
### Tokenizer
Hyntax has its tokenizer as a separate module. You can use generated tokens on their own or pass them further to a tree constructor to build an AST.
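
As one hedged example of using tokens on their own, the sketch below counts tokens per type and collects the text pieces from a token list. The list is hand-written to stand in for `tokenize(...).tokens`, and the `type` strings are placeholder names rather than the values Hyntax actually emits.

```javascript
// Hand-written stand-in for the `tokens` array of '<i>a</i><i>b</i>'.
// The `type` values are placeholder names, not Hyntax's real constants.
const handWrittenTokens = [
  { type: 'OpenTagStart', content: '<i', startPosition: 0, endPosition: 1 },
  { type: 'OpenTagEnd', content: '>', startPosition: 2, endPosition: 2 },
  { type: 'Text', content: 'a', startPosition: 3, endPosition: 3 },
  { type: 'CloseTag', content: '</i>', startPosition: 4, endPosition: 7 },
  { type: 'OpenTagStart', content: '<i', startPosition: 8, endPosition: 9 },
  { type: 'OpenTagEnd', content: '>', startPosition: 10, endPosition: 10 },
  { type: 'Text', content: 'b', startPosition: 11, endPosition: 11 },
  { type: 'CloseTag', content: '</i>', startPosition: 12, endPosition: 15 }
]

// Builds a { tokenType: count } summary of a token list.
function countByType (tokens) {
  const counts = {}
  for (const token of tokens) {
    counts[token.type] = (counts[token.type] || 0) + 1
  }
  return counts
}

// Keeps only the text tokens and returns their content.
const textContents = handWrittenTokens
  .filter((token) => token.type === 'Text')
  .map((token) => token.content)

console.log(countByType(handWrittenTokens))
console.log(textContents) // [ 'a', 'b' ]
```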
#### Interface
```typescript
tokenize(html: string): Tokenizer.Result
```

#### Arguments
- `html`
  HTML string to process.
  Required.
  Type: string.

#### Returns
[Tokenizer.Result](#TokenizerResult)
### Tree Constructor
Once you have an array of tokens, you can pass it to the tree constructor to build an AST.
#### Interface
```typescript
constructTree(tokens: Tokenizer.AnyToken[]): TreeConstructor.Result
```

#### Arguments
- `tokens`
  Array of tokens received from the tokenizer.
  Required.
  Type: [Tokenizer.AnyToken[]](#tokenizeranytoken)

#### Returns
[TreeConstructor.Result](#TreeConstructorResult)
## Types Reference
#### Tokenizer.Result
```typescript
interface Result {
  state: Tokenizer.State
  tokens: Tokenizer.AnyToken[]
}
```

- `state`
  The current state of the tokenizer. It can be persisted and passed to the next tokenizer call if the input is coming in chunks.
- `tokens`
  Array of resulting tokens.
  Type: [Tokenizer.AnyToken[]](#tokenizeranytoken)

#### TreeConstructor.Result
```typescript
interface Result {
  state: State
  ast: AST
}
```

- `state`
  The current state of the tree constructor. It can be persisted and passed to the next tree constructor call if the tokens are coming in chunks.
- `ast`
  Resulting AST.
  Type: [TreeConstructor.AST](#treeconstructorast)

#### Tokenizer.Token
Generic Token. Other interfaces use it to create specific Token types.
```typescript
interface Token<T extends TokenTypes.AnyTokenType> {
  type: T
  content: string
  startPosition: number
  endPosition: number
}
```

- `type`
  One of the [Token types](#TokenizerTokenTypesAnyTokenType).
- `content`
  Piece of the original HTML string that was recognized as a token.
- `startPosition`
  Index of the character in the input HTML string where the token starts.
- `endPosition`
  Index of the character in the input HTML string where the token ends.

#### Tokenizer.TokenTypes.AnyTokenType
Shortcut type of all possible tokens.
```typescript
type AnyTokenType =
  | Text
  | OpenTagStart
  | AttributeKey
  | AttributeAssigment
  | AttributeValueWrapperStart
  | AttributeValue
  | AttributeValueWrapperEnd
  | OpenTagEnd
  | CloseTag
  | OpenTagStartScript
  | ScriptTagContent
  | OpenTagEndScript
  | CloseTagScript
  | OpenTagStartStyle
  | StyleTagContent
  | OpenTagEndStyle
  | CloseTagStyle
  | DoctypeStart
  | DoctypeEnd
  | DoctypeAttributeWrapperStart
  | DoctypeAttribute
  | DoctypeAttributeWrapperEnd
  | CommentStart
  | CommentContent
  | CommentEnd
```

#### Tokenizer.AnyToken
Shortcut to reference any possible token.
```typescript
type AnyToken = Token<TokenTypes.AnyTokenType>
```

#### TreeConstructor.AST
An alias for DocumentNode. The AST always has one top-level DocumentNode. See [AST Node Types](#ast-node-types).
```typescript
type AST = TreeConstructor.DocumentNode
```

### AST Node Types
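There are 7 possible node types, each with its own content.
```typescript
type DocumentNode = Node<NodeTypes.Document, NodeContents.Document>
```

```typescript
type DoctypeNode = Node<NodeTypes.Doctype, NodeContents.Doctype>
```

```typescript
type TextNode = Node<NodeTypes.Text, NodeContents.Text>
```

```typescript
type TagNode = Node<NodeTypes.Tag, NodeContents.Tag>
```

```typescript
type CommentNode = Node<NodeTypes.Comment, NodeContents.Comment>
```

```typescript
type ScriptNode = Node<NodeTypes.Script, NodeContents.Script>
```

```typescript
type StyleNode = Node<NodeTypes.Style, NodeContents.Style>
```

Interfaces for each content type: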
- [Document](#TreeConstructorNodeContentsDocument)
- [Doctype](#TreeConstructorNodeContentsDoctype)
- [Text](#TreeConstructorNodeContentsText)
- [Tag](#TreeConstructorNodeContentsTag)
- [Comment](#TreeConstructorNodeContentsComment)
- [Script](#TreeConstructorNodeContentsScript)
- [Style](#TreeConstructorNodeContentsStyle)

#### TreeConstructor.Node
Generic Node. Other interfaces use it to create specific Nodes by providing the type of the Node and the type of the content inside it.
```typescript
interface Node<T extends NodeTypes.AnyNodeType, C extends NodeContents.AnyNodeContent> {
  nodeType: T
  content: C
}
```

#### TreeConstructor.NodeTypes.AnyNodeType
Shortcut type of all possible Node types.
```typescript
type AnyNodeType =
  | Document
  | Doctype
  | Tag
  | Text
  | Comment
  | Script
  | Style
```

### Node Content Types
#### TreeConstructor.NodeContents.AnyNodeContent
Shortcut type of all possible types of content inside a Node.
```typescript
type AnyNodeContent =
  | Document
  | Doctype
  | Text
  | Tag
  | Comment
  | Script
  | Style
```

#### TreeConstructor.NodeContents.Document
```typescript
interface Document {
  children: AnyNode[]
}
```

#### TreeConstructor.NodeContents.Doctype
```typescript
interface Doctype {
  start: Tokenizer.Token
  attributes?: DoctypeAttribute[]
  end: Tokenizer.Token
}
```

#### TreeConstructor.NodeContents.Text
```typescript
interface Text {
  value: Tokenizer.Token
}
```

#### TreeConstructor.NodeContents.Tag
```typescript
interface Tag {
  name: string
  selfClosing: boolean
  openStart: Tokenizer.Token
  attributes?: TagAttribute[]
  openEnd: Tokenizer.Token
  children?: AnyNode[]
  close?: Tokenizer.Token
}
```

#### TreeConstructor.NodeContents.Comment
```typescript
interface Comment {
  start: Tokenizer.Token
  value: Tokenizer.Token
  end: Tokenizer.Token
}
```

#### TreeConstructor.NodeContents.Script
```typescript
interface Script {
  openStart: Tokenizer.Token
  attributes?: TagAttribute[]
  openEnd: Tokenizer.Token
  value: Tokenizer.Token
  close: Tokenizer.Token
}
```

#### TreeConstructor.NodeContents.Style
```typescript
interface Style {
  openStart: Tokenizer.Token,
  attributes?: TagAttribute[],
  openEnd: Tokenizer.Token,
  value: Tokenizer.Token,
  close: Tokenizer.Token
}
```

#### TreeConstructor.DoctypeAttribute
```typescript
interface DoctypeAttribute {
  startWrapper?: Tokenizer.Token,
  value: Tokenizer.Token,
  endWrapper?: Tokenizer.Token
}
```

#### TreeConstructor.TagAttribute
```typescript
interface TagAttribute {
  key?: Tokenizer.Token,
  startWrapper?: Tokenizer.Token,
  value?: Tokenizer.Token,
  endWrapper?: Tokenizer.Token
}
```
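
Because every field of `TagAttribute` is optional, code that reads attributes needs to guard against missing keys and values. The sketch below shows one possible way to flatten a list of attributes into a plain object. The `attributesToObject` helper is hypothetical (not part of Hyntax), the sample data is hand-built, and mapping valueless attributes to an empty string is a design choice made here for illustration.

```javascript
// Hand-built attribute list following the documented TagAttribute shape.
// Only the `content` field of each token is filled in here.
const sampleAttributes = [
  {
    key: { content: 'type' },
    startWrapper: { content: '"' },
    value: { content: 'text' },
    endWrapper: { content: '"' }
  },
  { key: { content: 'disabled' } }
]

// Hypothetical helper: converts TagAttribute objects into { name: value }.
// Attributes without a key are skipped; attributes without a value
// are mapped to an empty string.
function attributesToObject (attributes = []) {
  const result = {}
  for (const attribute of attributes) {
    if (!attribute.key) continue
    result[attribute.key.content] = attribute.value
      ? attribute.value.content
      : ''
  }
  return result
}

console.log(attributesToObject(sampleAttributes))
// { type: 'text', disabled: '' }
```

Since `attributes` is itself optional on tag, script, and style contents, the default parameter lets callers pass `node.content.attributes` directly.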