https://github.com/iwillspeak/teasel
Teasing HTML Elements from Text
- Host: GitHub
- URL: https://github.com/iwillspeak/teasel
- Owner: iwillspeak
- Created: 2022-01-22T15:19:28.000Z (over 4 years ago)
- Default Branch: main
- Last Pushed: 2025-02-25T12:24:07.000Z (over 1 year ago)
- Last Synced: 2025-03-19T22:12:36.622Z (over 1 year ago)
- Topics: html, parser
- Language: TypeScript
- Homepage:
- Size: 754 KB
- Stars: 1
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Teasel
> Teasing HTML elements from plain text

Teasel is an HTML syntax tree parser written in TypeScript. Teasel aims to be
a fast and reliable full-fidelity parser for HTML linters and refactoring tools.
## Key Features
* **Full-fidelity tree** - Every byte in the input text will be represented
somewhere in the output syntax tree, in the order it was in the source text.
* **Fault-tolerant parser** - All input texts produce an output tree and a
set of errors. The closer the input is to a standards-compliant HTML document,
the fewer error diagnostics are produced.
* **Syntax, not Semantics** - Teasel parses HTML as a _syntax_ tree. The end
result is not an HTML DOM. This means that all the warts of the original
document are available to dig into; ideal for linters.
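
To make the first two features concrete, here is a minimal sketch of a lossless, fault-tolerant tokenizer. It is not Teasel's implementation or API: the `Token` and `Diagnostic` types and the `tokenize` and `reconstruct` functions are hypothetical names used only for illustration. The point is the invariant: joining the token text back together reproduces the input exactly, even when errors are reported.

```typescript
// Hypothetical types for illustration only; these are not part of Teasel.
type Token = {kind: 'tag' | 'text'; text: string};
type Diagnostic = {message: string; offset: number};

// Split the input into tag-like and text tokens without dropping any characters.
function tokenize(input: string): {tokens: Token[]; errors: Diagnostic[]} {
  const tokens: Token[] = [];
  const errors: Diagnostic[] = [];
  let pos = 0;
  while (pos < input.length) {
    if (input[pos] === '<') {
      const close = input.indexOf('>', pos);
      if (close === -1) {
        // Unterminated tag: record a diagnostic, but keep the text in the output.
        errors.push({message: 'unterminated tag', offset: pos});
        tokens.push({kind: 'tag', text: input.slice(pos)});
        pos = input.length;
      } else {
        tokens.push({kind: 'tag', text: input.slice(pos, close + 1)});
        pos = close + 1;
      }
    } else {
      // Plain text runs up to the next '<' or the end of the input.
      const next = input.indexOf('<', pos);
      const end = next === -1 ? input.length : next;
      tokens.push({kind: 'text', text: input.slice(pos, end)});
      pos = end;
    }
  }
  return {tokens, errors};
}

// Full fidelity: concatenating every token's text yields the original input.
function reconstruct(tokens: Token[]): string {
  return tokens.map((t) => t.text).join('');
}
```

A linter built on a structure like this can report problems against exact source offsets and rewrite a single token without disturbing the surrounding formatting.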
## Docs and Getting Started
To get started, install Teasel from [GitHub Packages][pkg]:
```
$ npm install @iwillspeak/teasel@0.3.0
```
Once installed you can then parse any string containing HTML into a syntax tree:
```typescript
import {Parser} from '@iwillspeak/teasel/lib/parse/Parser.js';
// Any string containing HTML can be passed; this input is only an example.
const result = Parser.parseDocument('<p>Hello World</p>');
```
Check out the [`teasel` docs][pkg-teasel] for where to go next.
## Repo Structure
This repository contains three main packages:
* [`teasel`][pkg-teasel] - The main parser library. This is the package
you want to reference as a consumer.
* [`pyracantha`][pkg-pyracantha] - The language agnostic low-level syntax
tree library used by `teasel` to represent parsed documents.
* [`teasel-cli`][pkg-teasel-cli] - A command line tool to test parsing
HTML documents with teasel.
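
The TODO list mentions a green tree builder with a node cache. As a rough sketch of that idea, and not pyracantha's actual API, the snippet below interns immutable tokens so that identical tokens are shared rather than allocated again. The `GreenToken` shape and the `NodeCache` class are assumptions made for illustration.

```typescript
// Hypothetical shape for illustration only; pyracantha's real types may differ.
interface GreenToken {
  readonly kind: string;
  readonly text: string;
}

// Interns tokens so that equal (kind, text) pairs share a single object.
class NodeCache {
  private readonly tokens = new Map<string, GreenToken>();

  token(kind: string, text: string): GreenToken {
    // JSON.stringify of the pair gives an unambiguous lookup key.
    const key = JSON.stringify([kind, text]);
    let cached = this.tokens.get(key);
    if (cached === undefined) {
      cached = {kind, text};
      this.tokens.set(key, cached);
    }
    return cached;
  }

  // Number of distinct tokens currently held by the cache.
  get size(): number {
    return this.tokens.size;
  }
}
```

Because the tokens are immutable and carry no position or parent information, sharing them is safe, and documents full of repeated whitespace and tag names need far fewer allocations.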
## 🐲 TODO 🐲:
* [x] Handle attributes on opening tags
* [x] Better error recovery when `expect` fails.
* [x] Tolerate and warn on some malformed whitespace. e.g.: `< p>`.
* [x] Malformed attribute lists synchronise on `>`.
* [x] Node cache should cache nodes in the green tree builder.
* [x] Node cache interface and implementation.
* [x] Parser should accept optional cache.
* [x] Handle closing of outer tags correctly.
* [x] Handle closing of non-nesting siblings.
* [x] Handling for implicit self closing of 'void' elements.
* [x] Support for esoteric DOCTYPEs e.g. `SYSTEM 'about:legacy-compat'`.
* [x] Document and fragment parse APIs.
* [x] Syntax builder / factory API for creating and updating nodes.
* [x] Handling of raw text elements. e.g. `script`, and `style`.
* [ ] Support for character references. e.g. `&amp;`.
* [ ] HTML / XML crossover
* [ ] Support for *processing instructions*.
* [ ] Support for `CDATA` values / tokens.
[pkg]: https://github.com/iwillspeak/Teasel/packages/1313956
[pkg-teasel]: packages/teasel/README.md
[pkg-teasel-cli]: packages/teasel-cli/README.md
[pkg-pyracantha]: packages/pyracantha/README.md