https://github.com/xcrap-dev/html-parser
Xcrap HTML Parser is an experimental library written in Rust, built with the NAPI-RS framework for compatibility with Node.js. Its goal is to be fast, lightweight, and support both CSS and XPath queries. Designed for the Xcrap framework ecosystem — but not limited to it — it natively provides query options and limits on processed elements.
- Host: GitHub
- URL: https://github.com/xcrap-dev/html-parser
- Owner: xcrap-dev
- License: mit
- Created: 2026-02-20T05:13:03.000Z (4 months ago)
- Default Branch: master
- Last Pushed: 2026-02-22T00:05:05.000Z (4 months ago)
- Last Synced: 2026-03-12T14:52:33.031Z (4 months ago)
- Topics: fast, html, javascript, napi-rs, nodejs, parser, rust, typescript, xcrap
- Language: Rust
- Homepage: https://www.npmjs.com/package/@xcrap/html-parser
- Size: 1.12 MB
- Stars: 5
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
# 🕷️ @xcrap/html-parser
**A blazing-fast HTML parser for Node.js, powered by Rust and NAPI-RS**
[@xcrap/html-parser](https://www.npmjs.com/package/@xcrap/html-parser) is an **experimental** HTML parsing library written in **Rust**, exposed to Node.js through the [NAPI-RS](https://napi.rs/) framework. It is designed to be **fast**, **lightweight**, and to support both **CSS selectors** and **XPath** queries — with built-in support for result limits and element nesting.
Although part of the [Xcrap](https://github.com/Xcrap-Cloud) scraping ecosystem, this library can be used as a **standalone package** in any Node.js project.
---
## 📋 Table of Contents
- [✨ Features](#-features)
- [⚡ Performance](#-performance)
- [📦 Installation](#-installation)
- [🚀 Quick Start](#-quick-start)
- [📖 API Reference](#-api-reference)
  - [`HtmlParser` / `HTMLParser`](#htmlparser--htmlparser)
  - [`HTMLElement`](#htmlelement)
  - [`css()` and `xpath()`](#css-and-xpath)
  - [Types](#types)
- [🔍 Usage Examples](#-usage-examples)
  - [CSS Selectors](#css-selectors)
  - [XPath Queries](#xpath-queries)
  - [Navigating Nested Elements](#navigating-nested-elements)
  - [Working with Attributes](#working-with-attributes)
- [🏗️ Architecture](#️-architecture)
- [🛠️ Development](#️-development)
- [🤝 Contributing](#-contributing)
- [📝 License](#-license)
---
## ✨ Features
- **⚡ Blazing Fast** — Core parsing done in Rust; significantly faster than JS-based parsers at instance initialization.
- **🎯 Dual Query Support** — Query elements using both **CSS selectors** (via `scraper`) and **XPath** expressions (via `sxd-xpath`).
- **🦥 Lazy Loading** — Internal CSS and XPath engines are only initialized when first needed, reducing unnecessary overhead.
- **🔢 Built-in Limits** — Pass a `limit` option to `selectMany` to cap the number of returned elements.
- **🌲 Element Traversal** — Navigate nested elements using `selectFirst` and `selectMany` directly on `HTMLElement` instances.
- **🔒 Type-Safe** — Fully typed TypeScript declarations included (`index.d.ts`).
- **🖥️ Platform Support** — Pre-built native binary currently available for **Windows x64** only. Other platforms require compilation from source (see [Development](#️-development)).
---
## ⚡ Performance
Benchmarks below compare parser **initialization speed** (instantiation time per file):
```
@xcrap/html-parser : 0.246214 ms/file ± 0.136808 ✅ Fastest
html-parser : 36.825500 ms/file ± 28.855100
htmljs-parser : 0.501577 ms/file ± 1.210800
html-dom-parser : 2.180280 ms/file ± 1.796170
html5parser : 1.674640 ms/file ± 1.222790
cheerio : 8.679980 ms/file ± 6.328520
parse5 : 4.821180 ms/file ± 2.668220
htmlparser2 : 1.497390 ms/file ± 1.398040
htmlparser : 16.171200 ms/file ± 109.076000
high5 : 2.982290 ms/file ± 1.927480
node-html-parser : 2.901670 ms/file ± 1.908040
```
> Benchmarks sourced from [node-html-parser repository](https://github.com/taoqf/node-html-parser).
The performance advantage comes from lazy loading: the internal `Html` (CSS engine) and `Package` (XPath engine) instances are only initialized on first use and reused across subsequent calls on the same parser instance.
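The lazy-loading idea described above can be pictured with a small TypeScript sketch. This is not the library's Rust implementation: the `LazyParserSketch` class, the `EngineSketch` type, and the `builds` counter are hypothetical stand-ins used only to show that each engine is built once, on first use, and then reused.

```ts
// Hypothetical stand-in for a native engine; not the library's real type.
type EngineSketch = { kind: "css" | "xpath"; source: string }

class LazyParserSketch {
  private readonly content: string
  private cssEngine: EngineSketch | null = null
  private xpathEngine: EngineSketch | null = null
  // Counts how often each engine was built (for demonstration only).
  public builds = { css: 0, xpath: 0 }

  constructor(content: string) {
    // Construction stores the input and does no parsing work.
    this.content = content
  }

  // Builds the CSS engine on first access, then returns the cached one.
  private getCssEngine(): EngineSketch {
    if (this.cssEngine === null) {
      this.builds.css += 1
      this.cssEngine = { kind: "css", source: this.content }
    }
    return this.cssEngine
  }

  // Same pattern for the XPath engine.
  private getXPathEngine(): EngineSketch {
    if (this.xpathEngine === null) {
      this.builds.xpath += 1
      this.xpathEngine = { kind: "xpath", source: this.content }
    }
    return this.xpathEngine
  }

  queryCss(selector: string): string {
    return `${this.getCssEngine().kind}:${selector}`
  }

  queryXPath(expression: string): string {
    return `${this.getXPathEngine().kind}:${expression}`
  }
}

const lazySketch = new LazyParserSketch("<p>hi</p>")
lazySketch.queryCss("p")
lazySketch.queryCss("p")
lazySketch.queryXPath("//p")
console.log(lazySketch.builds) // { css: 1, xpath: 1 }
```

Because construction does no engine work in this sketch, an initialization benchmark measures only the cost of storing the input; the parsing cost is paid on the first query of each kind.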
---
## 📦 Installation
Install via your preferred package manager:
```bash
# npm
npm install @xcrap/html-parser
# yarn
yarn add @xcrap/html-parser
# pnpm
pnpm add @xcrap/html-parser
```
**Requirements:**
- Node.js **>= 18.0.0**
A pre-built native binary is currently distributed only for Windows x64. Support by platform:
| Platform | Architecture | Support |
|------------------|--------------|-----------------|
| Windows | x64 | ✅ Pre-built |
| macOS | x64 | 🔧 Build from source |
| macOS | ARM64 | 🔧 Build from source |
| Linux | x64 (GNU) | 🔧 Build from source |
> **⚠️ Note:** Currently only the **Windows x64** binary is pre-built and included in the published package. Users on other platforms must compile the native addon locally — see the [Development](#️-development) section for instructions.
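If you want to fail early with a clear message on unsupported platforms, a guard along these lines may help. It is only a sketch: `hasPrebuiltBinary` and `PREBUILT_TARGETS` are hypothetical names, not part of the package, and the target list simply mirrors the table above using Node.js `process.platform` / `process.arch` values.

```ts
// Hypothetical helper; the package does not export this.
// Targets are written as "<process.platform>-<process.arch>".
const PREBUILT_TARGETS: ReadonlyArray<string> = ["win32-x64"]

function hasPrebuiltBinary(platform: string, arch: string): boolean {
  return PREBUILT_TARGETS.indexOf(`${platform}-${arch}`) !== -1
}

console.log(hasPrebuiltBinary("win32", "x64"))    // true
console.log(hasPrebuiltBinary("darwin", "arm64")) // false
console.log(hasPrebuiltBinary("linux", "x64"))    // false
```

In an application you would call it as `hasPrebuiltBinary(process.platform, process.arch)` before loading the package, and point users to the build-from-source instructions when it returns `false`.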
---
## 🚀 Quick Start
```ts
import { HtmlParser, css, xpath } from "@xcrap/html-parser"
const html = `
  <h1>Hello World</h1>
  <ul>
    <li class="item">Item 1</li>
    <li class="item">Item 2</li>
    <li class="item">Item 3</li>
  </ul>
`
const parser = new HtmlParser(html)
// Select a single element using a CSS selector
const heading = parser.selectFirst({ query: css("h1") })
console.log(heading?.text) // "Hello World"
// Select multiple elements and limit results
const items = parser.selectMany({ query: css("li.item"), limit: 2 })
console.log(items.map(el => el.text)) // ["Item 1", "Item 2"]
// Use XPath instead
const firstItem = parser.selectFirst({ query: xpath("//li[@class='item']") })
console.log(firstItem?.text) // "Item 1"
```
> **CommonJS** is also fully supported via `require`:
>
> ```js
> const { parse, css, xpath } = require("@xcrap/html-parser")
> const parser = parse(html)
> ```
---
## 📖 API Reference
### `HtmlParser` / `HTMLParser`
The main entry point for parsing an HTML string. CSS and XPath engines are lazily initialized on first use and reused across subsequent queries.
#### Constructor
```ts
new HtmlParser(content: string): HtmlParser
```
| Parameter | Type | Description |
|-----------|----------|--------------------------------|
| `content` | `string` | The raw HTML string to parse. |
> **Alias:** You can also use the `parse(content: string)` function as a convenience wrapper:
> ```ts
> import { parse } from "@xcrap/html-parser"
> const parser = parse(html)
> ```
#### `selectFirst(options)`
Selects the **first** element matching the given query.
```ts
parser.selectFirst(options: SelectFirstOptions): HTMLElement | null
```
| Parameter | Type | Description |
|------------------|-------------------|------------------------------------------|
| `options.query` | `QueryConfig` | A query config built with `css()` or `xpath()`. |
Returns `HTMLElement | null` — `null` if no element matches.
#### `selectMany(options)`
Selects **all** elements matching the given query.
```ts
parser.selectMany(options: SelectManyOptions): HTMLElement[]
```
| Parameter | Type | Description |
|------------------|-------------------|------------------------------------------|
| `options.query` | `QueryConfig` | A query config built with `css()` or `xpath()`. |
| `options.limit` | `number?` | Optional. Maximum number of elements to return. Values `<= 0` are ignored (returns all). |
Returns `HTMLElement[]` — an empty array if no matches.
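The `limit` rule can be summarized with a short sketch. `applyLimitSketch` is a hypothetical helper written for illustration; it only models the documented behavior (missing or `<= 0` means no limit) and says nothing about how the native code applies the cap internally.

```ts
// Hypothetical helper illustrating the documented `limit` semantics.
function applyLimitSketch<T>(matches: T[], limit?: number): T[] {
  // Missing, zero, or negative limits mean "return everything".
  if (limit === undefined || limit <= 0) {
    return matches.slice()
  }
  // Otherwise keep at most `limit` items, preserving order.
  return matches.slice(0, limit)
}

const sampleMatches = ["Item 1", "Item 2", "Item 3"]
console.log(applyLimitSketch(sampleMatches, 2))  // ["Item 1", "Item 2"]
console.log(applyLimitSketch(sampleMatches, 0))  // all three items
console.log(applyLimitSketch(sampleMatches, -5)) // all three items
console.log(applyLimitSketch(sampleMatches))     // all three items
```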
---
### `HTMLElement`
Represents a matched DOM element. Provides properties and methods to inspect and traverse its content.
> **Note:** `HTMLElement` instances also support `selectFirst` and `selectMany`, allowing scoped queries within a found element.
#### Properties
| Property | Type | Description |
|--------------|---------------------------|--------------------------------------------------------------------|
| `outerHTML` | `string` | The full HTML of the element, including its opening and closing tags. |
| `innerHTML` | `string` *(getter)* | The inner HTML content (children only, excluding the element's own tags). |
| `text` | `string` *(getter)* | The concatenated plain-text content of the element and its descendants. |
| `id` | `string \| null` *(getter)* | The element's `id` attribute, or `null` if not present. |
| `tagName` | `string` *(getter)* | The element's tag name in **UPPERCASE** (e.g., `"DIV"`, `"H1"`). |
| `className` | `string` *(getter)* | The full `class` attribute string (e.g., `"post featured"`). |
| `classList` | `string[]` *(getter)* | An array of individual class names. Empty array if no class. |
| `attributes` | `Record<string, string>` *(getter)* | All attributes as a key-value object. |
| `firstChild` | `HTMLElement \| null` *(getter)* | The first child element, or `null` if none. |
| `lastChild` | `HTMLElement \| null` *(getter)* | The last child element, or `null` if none. |
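The relationship between `className` and `classList` can be pictured as follows. This sketch assumes class names are separated by whitespace, as in the DOM; `toClassListSketch` is a hypothetical helper and the native implementation may differ in detail.

```ts
// Hypothetical helper: derives a classList-style array from a class string.
function toClassListSketch(className: string): string[] {
  // Split on runs of whitespace and drop empty fragments,
  // so an empty or blank class string yields an empty array.
  return className.split(/\s+/).filter(name => name.length > 0)
}

console.log(toClassListSketch("btn btn-primary")) // ["btn", "btn-primary"]
console.log(toClassListSketch(""))                // []
```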
#### Methods
##### `getAttribute(name)`
```ts
element.getAttribute(name: string): string | null
```
Returns the value of the named attribute, or `null` if the attribute does not exist.
##### `selectFirst(options)`
```ts
element.selectFirst(options: SelectFirstOptions): HTMLElement | null
```
Scoped version of `HtmlParser.selectFirst`. Searches **within** the current element.
##### `selectMany(options)`
```ts
element.selectMany(options: SelectManyOptions): HTMLElement[]
```
Scoped version of `HtmlParser.selectMany`. Searches **within** the current element.
##### `toString()`
```ts
element.toString(): string
```
Returns the `outerHTML` string of the element.
---
### `css()` and `xpath()`
Helper functions to create typed `QueryConfig` objects.
```ts
css(query: string): QueryConfig
xpath(query: string): QueryConfig
```
These functions are the **recommended way** to build query configurations. They ensure the correct query type is set.
```ts
import { css, xpath } from "@xcrap/html-parser"
css("article.post") // → { query: "article.post", type: QueryType.CSS }
xpath("//article[@class]") // → { query: "//article[@class]", type: QueryType.XPath }
```
---
### Types
```ts
// Identifies the query engine to use
export declare const enum QueryType {
  CSS = 0,
  XPath = 1,
}
// Holds a raw query string and its associated engine type
export interface QueryConfig {
  query: string
  type: QueryType
}
// Options for single-element selection
export interface SelectFirstOptions {
  query: QueryConfig
}
// Options for multi-element selection
export interface SelectManyOptions {
  query: QueryConfig
  limit?: number // <= 0 or undefined means no limit
}
```
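To show how these shapes fit together without loading the native addon, here is a self-contained sketch. Everything in it is a local stand-in: `QUERY_KIND` mirrors the numeric values of `QueryType`, and `cssSketch`, `xpathSketch`, and `describeQuery` are hypothetical names, not exports of the package.

```ts
// Local stand-in for QueryType (same numeric values as the declaration above).
const QUERY_KIND = { CSS: 0, XPath: 1 } as const
type QueryKind = (typeof QUERY_KIND)[keyof typeof QUERY_KIND]

// Local stand-in for QueryConfig.
interface QueryConfigSketch {
  query: string
  type: QueryKind
}

// Builders that tag a raw query string with its engine type.
function cssSketch(query: string): QueryConfigSketch {
  return { query, type: QUERY_KIND.CSS }
}

function xpathSketch(query: string): QueryConfigSketch {
  return { query, type: QUERY_KIND.XPath }
}

// A consumer can branch on `type` to decide which engine to use.
function describeQuery(config: QueryConfigSketch): string {
  return config.type === QUERY_KIND.CSS
    ? `CSS selector: ${config.query}`
    : `XPath expression: ${config.query}`
}

console.log(describeQuery(cssSketch("article.post")))        // "CSS selector: article.post"
console.log(describeQuery(xpathSketch("//article[@class]"))) // "XPath expression: //article[@class]"
```

Tagging the query with its type up front means the selection methods take a single `query` option rather than separate CSS and XPath entry points.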
---
## 🔍 Usage Examples
### CSS Selectors
```ts
import { HtmlParser, css } from "@xcrap/html-parser"
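const html = `
  <article id="post-1" class="post" data-author="alice">
    <h2>First Post</h2>
    <p>A short description.</p>
  </article>
  <article class="post">
    <h2>Second Post</h2>
    <p>Another description.</p>
  </article>
`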
const parser = new HtmlParser(html)
// Select by tag name
const firstArticle = parser.selectFirst({ query: css("article") })
console.log(firstArticle?.id) // "post-1"
// Select by class
const allPosts = parser.selectMany({ query: css(".post") })
console.log(allPosts.length) // 2
// Select by attribute
const featuredPost = parser.selectFirst({ query: css("[data-author='alice']") })
console.log(featuredPost?.getAttribute("data-author")) // "alice"
// Select with limit
const limited = parser.selectMany({ query: css("article"), limit: 1 })
console.log(limited.length) // 1
```
### XPath Queries
```ts
import { HtmlParser, xpath } from "@xcrap/html-parser"
const html = `
  <ul>
    <li class="tag">rust</li>
    <li class="tag">napi</li>
    <li class="tag">nodejs</li>
  </ul>
`
const parser = new HtmlParser(html)
// Select all
const tags = parser.selectMany({ query: xpath("//li[@class='tag']") })
console.log(tags.map(t => t.text)) // ["rust", "napi", "nodejs"]
// Limit XPath results
const limited = parser.selectMany({ query: xpath("//li"), limit: 2 })
console.log(limited.length) // 2
```
### Navigating Nested Elements
```ts
import { HtmlParser, css } from "@xcrap/html-parser"
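const html = `
  <nav id="main-nav">
    <ul>
      <li><a href="/home">Home</a></li>
      <li><a href="/about">About</a></li>
      <li><a href="/contact">Contact</a></li>
    </ul>
  </nav>
`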
const parser = new HtmlParser(html)
// Find the nav, then narrow down inside it
const nav = parser.selectFirst({ query: css("#main-nav") })
if (nav) {
  const links = nav.selectMany({ query: css("a") })
  links.forEach(link => {
    console.log(`${link.text} → ${link.getAttribute("href")}`)
    // "Home → /home"
    // "About → /about"
    // "Contact → /contact"
  })
  // First and last child shortcuts
  console.log(nav.firstChild?.tagName) // "UL"
  console.log(nav.lastChild?.tagName) // "UL"
}
```
### Working with Attributes
```ts
import { HtmlParser, css } from "@xcrap/html-parser"
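const html = `
  <a id="cta" class="btn btn-primary" href="https://example.com" target="_blank" data-track="click">Click here</a>
`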
const parser = new HtmlParser(html)
const link = parser.selectFirst({ query: css("a") })
if (link) {
  console.log(link.id) // "cta"
  console.log(link.tagName) // "A"
  console.log(link.className) // "btn btn-primary"
  console.log(link.classList) // ["btn", "btn-primary"]
  console.log(link.getAttribute("href")) // "https://example.com"
  console.log(link.getAttribute("target")) // "_blank"
  console.log(link.getAttribute("missing")) // null
  console.log(link.attributes)
  // {
  //   id: "cta",
  //   class: "btn btn-primary",
  //   href: "https://example.com",
  //   target: "_blank",
  //   "data-track": "click"
  // }
}
```
---
## 🏗️ Architecture
The library is structured as a native Node.js addon written in Rust, bridged via [NAPI-RS](https://napi.rs/).
```
src/
├── lib.rs # Crate entry point; exposes the `parse()` function via NAPI
├── parser.rs # HTMLParser struct — lazy-loads CSS (scraper) and XPath (sxd) engines
├── types.rs # HTMLElement struct — all DOM properties and methods
├── engines.rs # Internal: select_first/many by CSS and XPath (pure Rust)
└── query_builders.rs # css() and xpath() helper functions exposed to JS
```
### Key Design Decisions
- **Lazy Initialization**: `HTMLParser` holds `Option<Html>` and `Option<Package>` fields. Each engine is allocated only on first use and then reused, so calling `selectFirst` (CSS) and then `selectMany` (XPath) on the same parser performs only two parsing passes in total, one per engine.
- **Dual Engine**: CSS queries use the [`scraper`](https://crates.io/crates/scraper) crate; XPath queries use [`sxd-xpath`](https://crates.io/crates/sxd-xpath) with [`sxd_html`](https://crates.io/crates/sxd_html) for HTML→XML normalization.
- **Zero-copy Approach**: Elements are represented by their `outerHTML` string, avoiding complex lifetime management across the FFI boundary.
### Internal Rust Dependencies
| Crate | Version | Role |
|---------------|----------|-------------------------------------------|
| `napi` | `3.0.0` | NAPI-RS runtime for Node.js integration |
| `napi-derive` | `3.0.0` | Procedural macros for NAPI bindings |
| `scraper` | `0.25.0` | HTML parsing and CSS selector engine |
| `sxd-document`| `0.3.2` | XML document model (used for XPath) |
| `sxd-xpath` | `0.4.2` | XPath expression evaluator |
| `sxd_html` | `0.1.2` | HTML → sxd document converter |
---
## 🛠️ Development
### Prerequisites
- **Rust** (stable toolchain) — [Install](https://rustup.rs/)
- **Node.js** >= 18 — [Install](https://nodejs.org/)
- **Yarn** >= 4 — `npm install -g yarn`
- **NAPI-RS CLI** — installed automatically via dev dependencies
### Setup
```bash
# Clone the repository
git clone https://github.com/Xcrap-Cloud/html-parser.git
cd html-parser
# Install Node.js dependencies
yarn install
```
### Building
```bash
# Build native addon in release mode
yarn build
# Build in debug mode (faster compilation, slower runtime)
yarn build:debug
```
The output binary (`html-parser.<platform>.node`) will be placed in the project root.
### Running Tests
```bash
yarn test
```
Tests are written with [AVA](https://github.com/avajs/ava) and located in the `__test__/` directory.
### Formatting
```bash
# Format all (TypeScript/JS, Rust, TOML)
yarn format
# Individual formatters
yarn format:prettier # Prettier for TS/JS/JSON/YAML/Markdown
yarn format:rs # cargo fmt for Rust
yarn format:toml # Taplo for TOML files
```
### Linting
```bash
yarn lint # OXLint for TypeScript/JavaScript files
```
---
## 🤝 Contributing
Contributions are welcome! Please follow these steps:
1. **Fork** the repository.
2. **Create a branch**: `git checkout -b feat/your-feature` or `git checkout -b fix/your-bug`.
3. **Make your changes**, ensuring all tests pass: `yarn test`.
4. **Format your code**: `yarn format`.
5. **Commit** with a descriptive message: `git commit -m "feat: add support for XYZ"`.
6. **Push** your branch: `git push origin feat/your-feature`.
7. **Open a Pull Request** with a clear description of the changes.
Please see [CONTRIBUTING.md](./CONTRIBUTING.md) for detailed guidelines.
---
## 📝 License
Distributed under the [MIT License](./LICENSE).
© [Marcuth](https://github.com/Marcuth) and contributors.