https://github.com/hanivan/nestjs-html-parser



# ⚠️ **IMPORTANT: This package is archived and no longer maintained**

> **This package has been renamed and migrated.** Please use the new package instead:
> **[@hanivanrizky/nestjs-xpath-parser](https://github.com/Hanivan/nestjs-xpath-parser)**
>
> This repository is kept for historical reference only. No updates, bug fixes, or support will be provided.

## Migration Guide

### Old Package (Archived)
```bash
npm uninstall @hanivanrizky/nestjs-html-parser
```

### New Package (Active)
```bash
npm install @hanivanrizky/nestjs-xpath-parser
```

The new package has improved features, better documentation, and active maintenance.

---

# @hanivanrizky/nestjs-html-parser



A powerful NestJS package for parsing HTML content using XPath (primary) and CSS selectors (secondary) with comprehensive extraction capabilities.



## Table of Contents

- [Features](#features)
- [Installation](#installation)
- [Quick Start](#quick-start)
- [Import the Module](#import-the-module)
- [Async Configuration](#async-configuration)
- [Inject the Service](#inject-the-service)
- [Core Features](#core-features)
- [HTML Fetching with Response Metadata](#html-fetching-with-response-metadata)
- [Data Extraction Methods](#data-extraction-methods)
- [Proxy Support](#proxy-support)
- [Error Handling](#error-handling)
- [TypeScript Definitions & Types](#typescript-definitions--types)
- [Complete Interface Definitions](#complete-interface-definitions)
- [Implementation Guide](#implementation-guide)
- [API Reference](#api-reference)
- [Core Methods](#core-methods)
- [Advanced Methods](#advanced-methods)
- [Development](#development)
- [Contributing](#contributing)
- [License](#license)

## Features

- **🎯 XPath Support (Primary)**: Full XPath 1.0 support for precise element selection
- **🎨 CSS Selectors (Secondary)**: jQuery-style CSS selectors for familiar syntax
- **📋 Multiple Extraction Methods**: Single values, multiple values, attributes, and structured data
- **🔍 Element Analysis**: Check existence and count elements
- **📊 Structured Extraction**: Extract data using schema objects for complex data structures
- **📚 List Extraction**: Extract arrays of structured data
- **🌐 HTTP Fetching**: Built-in HTML fetching with customizable options
- **🛡️ Error Handling**: Graceful error handling and fallbacks
- **🔄 Random User Agents**: Built-in random user agent generation for stealth scraping
- **🔗 Proxy Support**: HTTP, HTTPS, and SOCKS proxy support with authentication
- **🔁 Retry Logic**: Configurable retry mechanism with exponential backoff
- **🔇 Verbose & Logger Level Control**: Optional verbose mode for debugging and configurable logger level (debug, log, warn, error, verbose)
- **🔒 SSL Error Handling**: Comprehensive SSL certificate error handling and bypass options
- **💀 Dead Domain Support**: Advanced error categorization for dead/unreachable domains
- **🔄 Smart Retry Logic**: Error-type-specific retry strategies for different network issues
- **🎯 TypeScript Generics**: Full generic type support for compile-time type safety
- **🧪 Fully Tested**: Comprehensive test suite with real-world examples

## Installation

```bash
yarn add @hanivanrizky/nestjs-html-parser
# or
npm install @hanivanrizky/nestjs-html-parser
```

## Quick Start

### Import the Module

```typescript
import { Module } from '@nestjs/common';
import { HtmlParserModule } from '@hanivanrizky/nestjs-html-parser';

@Module({
  imports: [
    // Default: loggerLevel: ['log', 'error'] (production ready)
    HtmlParserModule.forRoot(),
    // Or override for development (use instead of the line above):
    // HtmlParserModule.forRoot({ loggerLevel: 'debug' }),
    // Or enable multiple levels (use instead of the line above):
    // HtmlParserModule.forRoot({ loggerLevel: ['debug', 'warn'] }),
  ],
})
export class AppModule {}
```

#### Async Configuration

```typescript
import { Module } from '@nestjs/common';
import { ConfigModule, ConfigService } from '@nestjs/config';
import { HtmlParserModule } from '@hanivanrizky/nestjs-html-parser';

@Module({
  imports: [
    ConfigModule.forRoot(),
    HtmlParserModule.forRootAsync({
      imports: [ConfigModule],
      useFactory: (configService: ConfigService) => ({
        loggerLevel: configService.get<'debug'|'log'|'warn'|'error'|'verbose'|('debug'|'log'|'warn'|'error'|'verbose')[]>('HTML_PARSER_LOGGER_LEVEL', 'warn'),
      }),
      inject: [ConfigService],
    }),
  ],
})
export class AppModule {}
```
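
Environment variables arrive as plain strings, so a value such as `debug,warn` has to be split before it matches the `loggerLevel` shape. The following is a minimal sketch under that assumption; `parseLoggerLevel` is a hypothetical helper and not part of the package.

```typescript
// Hypothetical helper (not part of the package): turn a raw env string such as
// "debug,warn" into the LogLevel | LogLevel[] shape that loggerLevel accepts.
type LogLevel = 'debug' | 'log' | 'warn' | 'error' | 'verbose';

const LOG_LEVELS: readonly LogLevel[] = ['debug', 'log', 'warn', 'error', 'verbose'];

function isLogLevel(value: string): value is LogLevel {
  return (LOG_LEVELS as readonly string[]).includes(value);
}

function parseLoggerLevel(
  raw: string | undefined,
  fallback: LogLevel[] = ['log', 'error'], // the documented module default
): LogLevel | LogLevel[] {
  if (!raw) return fallback;
  // Split on commas, normalize case and whitespace, drop unknown names.
  const levels = raw
    .split(',')
    .map((part) => part.trim().toLowerCase())
    .filter(isLogLevel);
  if (levels.length === 0) return fallback;
  return levels.length === 1 ? levels[0] : levels;
}
```

Inside `useFactory`, this could be used as `loggerLevel: parseLoggerLevel(configService.get<string>('HTML_PARSER_LOGGER_LEVEL'))`.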

### Inject the Service

```typescript
import { Injectable } from '@nestjs/common';
import { HtmlParserService } from '@hanivanrizky/nestjs-html-parser';

@Injectable()
export class YourService {
  constructor(private readonly htmlParser: HtmlParserService) {}

  async parseHackerNews() {
    const response = await this.htmlParser.fetchHtml('https://news.ycombinator.com/');

    // Extract page title
    const title = this.htmlParser.extractSingle(response.data, '//title/text()');

    // Extract all story titles
    const storyTitles = this.htmlParser.extractMultiple(
      response.data,
      '//span[@class="titleline"]/a/text()'
    );

    return { title, storyTitles, status: response.status };
  }
}
```

## Core Features

### HTML Fetching with Response Metadata

```typescript
const response = await htmlParser.fetchHtml('https://example.com', {
timeout: 10000,
headers: { 'User-Agent': 'Custom Agent' },
retryOnErrors: {
ssl: true,
timeout: true,
dns: true,
connectionRefused: true
}
});

// Response includes:
// - data: HTML content
// - headers: Response headers
// - status: HTTP status code
// - statusText: HTTP status text
```
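
The feature list mentions retries with exponential backoff, controlled by the `retries` and `retryDelay` options. The exact formula is not documented here, so the sketch below assumes the delay simply doubles after each failed attempt; `backoffDelays` is a hypothetical name used only for illustration.

```typescript
// Hypothetical sketch: the delay (in ms) to wait before each retry attempt,
// assuming the delay doubles after every failed attempt.
function backoffDelays(retries: number, retryDelay: number): number[] {
  const delays: number[] = [];
  for (let attempt = 0; attempt < retries; attempt++) {
    delays.push(retryDelay * 2 ** attempt);
  }
  return delays;
}

// With retries: 3 and retryDelay: 1000 this yields [1000, 2000, 4000].
const schedule = backoffDelays(3, 1000);
```

Under this assumption, three retries with a one-second base delay add up to seven seconds of waiting in the worst case, which is worth keeping in mind when choosing `timeout`.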

### Data Extraction Methods

```typescript
// Single value extraction
const title = htmlParser.extractSingle(html, '//title/text()');

// Multiple values
const links = htmlParser.extractMultiple(html, '//a/text()');

// Attribute extraction
const urls = htmlParser.extractAttributes(html, '//a', 'href');

// Structured data extraction with advanced transform
class UppercasePipe {
transform(value: string) {
return value.toUpperCase();
}
}
class SuffixPipe {
constructor(private suffix: string) {}
transform(value: string) {
return value + this.suffix;
}
}
const schema = {
title: {
selector: '//h1/text()',
type: 'xpath',
transform: [
(title: string) => title.trim(),
UppercasePipe,
new SuffixPipe(' [ADVANCED]'),
],
},
episode: {
selector: '//div[@class="epz"]',
type: 'xpath',
transform: [
(text: any) => {
if (typeof text !== 'string') return 0;
let match = text.match(/Episode\s+(\d+)/i);
if (!match) match = text.match(/(\d+)/);
return match ? parseInt(match[1]) : 0;
},
new SuffixPipe(' (ep)'),
],
},
};
const data = htmlParser.extractStructured(html, schema);
```

### Proxy Support

```typescript
const proxyConfig = {
url: 'http://proxy.example.com:8080',
type: 'http',
username: 'user',
password: 'pass'
};

const html = await htmlParser.fetchHtml('https://example.com', {
proxy: proxyConfig,
useRandomUserAgent: true
});
```
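
The `type` field is optional because it can be inferred from the proxy URL. As a rough illustration of that idea (an assumption about the approach, not the package's code), a hypothetical `detectProxyType` helper could read the URL scheme:

```typescript
type ProxyType = 'http' | 'https' | 'socks4' | 'socks5';

// Hypothetical helper: infer the proxy type from the URL scheme.
// Returns null when the scheme is not one of the supported proxy types.
function detectProxyType(proxyUrl: string): ProxyType | null {
  // URL.protocol includes the trailing colon, e.g. "socks5:".
  const scheme = new URL(proxyUrl).protocol.replace(':', '').toLowerCase();
  const known: ProxyType[] = ['http', 'https', 'socks4', 'socks5'];
  return known.includes(scheme as ProxyType) ? (scheme as ProxyType) : null;
}
```

Setting `type` explicitly remains the safer choice when the URL scheme and the actual proxy protocol might differ.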

### SSL/TLS Configuration & Security

The HTML Parser Service provides three levels of SSL configuration for handling different certificate scenarios:

#### 🔒 SSL Configuration Options (Independent Controls)

The service provides **three independent SSL configuration options** that can be used alone or in combination:

1. **`rejectUnauthorized: false`** - Accept invalid/self-signed certificates
2. **`disableServerIdentityCheck: true`** - Skip hostname validation (independent option)
3. **`ignoreSSLErrors: true`** - Disable all SSL validation (⚠️ Use with extreme caution)

**🔑 Key Point:** `disableServerIdentityCheck` is **fully independent** of `ignoreSSLErrors` and only controls hostname validation.

```typescript
// Default: Full SSL validation (recommended for production)
const response = await htmlParser.fetchHtml('https://trusted-site.com');

// Accept self-signed certificates only
const response = await htmlParser.fetchHtml('https://self-signed-site.com', {
rejectUnauthorized: false
});

// Skip only hostname validation (certificate still validated)
const response = await htmlParser.fetchHtml('https://hostname-mismatch-site.com', {
disableServerIdentityCheck: true // Works independently
});

// Combine: Accept invalid certs + skip hostname validation
const response = await htmlParser.fetchHtml('https://problematic-site.com', {
rejectUnauthorized: false,
disableServerIdentityCheck: true // Both options work together
});

// Disable all SSL validation (⚠️ DANGEROUS - development only)
const response = await htmlParser.fetchHtml('https://any-ssl-issue-site.com', {
ignoreSSLErrors: true // Overrides all SSL checks
});

// Mixed configuration: Disable all SSL but explicitly control hostname check
const response = await htmlParser.fetchHtml('https://mixed-config-site.com', {
ignoreSSLErrors: true,
disableServerIdentityCheck: false // Independent: hostname check still works
});
```

#### ⚠️ **CRITICAL SECURITY WARNING: `disableServerIdentityCheck`**

The `disableServerIdentityCheck` parameter bypasses server identity (hostname) verification, the check that the certificate was actually issued for the host you requested. This is a **critical security mechanism** that:

- **Prevents man-in-the-middle attacks** by ensuring you're connecting to the intended server
- **Validates hostname matches** between the certificate and the requested domain
- **Protects against certificate spoofing** and domain impersonation

**🚨 NEVER use `disableServerIdentityCheck: true` in production unless:**
- You fully understand the security implications
- You have other security measures in place (e.g., certificate pinning)
- You are connecting to a known, trusted internal service with hostname mismatches
- You are in a controlled testing environment

**✅ Safe Use Cases:**
- Development environments with self-hosted services
- Testing against staging servers with certificate issues
- Internal corporate networks with hostname mismatches
- Temporary workarounds during certificate renewal periods

**❌ NEVER Use In:**
- Production applications handling sensitive data
- Public-facing services
- Financial or healthcare applications
- Any scenario where security is paramount

```typescript
// ❌ DANGEROUS: Complete SSL bypass (never in production)
const response = await htmlParser.fetchHtml(url, {
  ignoreSSLErrors: true // Disables ALL SSL validation including hostname check
});

// ⚠️ SELECTIVE: Independent hostname validation control
const response = await htmlParser.fetchHtml(url, {
  ignoreSSLErrors: true,
  disableServerIdentityCheck: false // Still enforces hostname validation despite ignoreSSLErrors
});

// ✅ BETTER: Minimal SSL relaxation
const response = await htmlParser.fetchHtml(url, {
  rejectUnauthorized: false, // Accept invalid certificates only
  disableServerIdentityCheck: false // Keep hostname validation (default)
});

// ✅ TARGETED: Skip only hostname validation
const response = await htmlParser.fetchHtml(url, {
  disableServerIdentityCheck: true // Only bypasses hostname check, certificate still validated
});

// ✅ PRODUCTION: Full SSL validation (default)
const response = await htmlParser.fetchHtml(url, {
  // All SSL validations enabled by default
});
```
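
To make the relationship between the three flags more concrete, the sketch below maps them onto the two TLS option names that Node.js understands, `rejectUnauthorized` and `checkServerIdentity`. This mapping is an assumption for illustration only; the package's internal handling is not shown in this README, and `toNodeTlsOptions` is a hypothetical name.

```typescript
// Hypothetical mapping (an assumption, not the package's actual code) from the
// three documented flags to the option names Node's TLS layer understands.
interface SslFlags {
  rejectUnauthorized?: boolean;         // default: true
  ignoreSSLErrors?: boolean;            // default: false
  disableServerIdentityCheck?: boolean; // default: false
}

interface NodeTlsOptions {
  rejectUnauthorized: boolean;
  // Returning undefined tells Node the hostname check passed.
  checkServerIdentity?: () => undefined;
}

function toNodeTlsOptions(flags: SslFlags = {}): NodeTlsOptions {
  const options: NodeTlsOptions = {
    // Certificate validation stays on unless one of the flags turns it off.
    rejectUnauthorized: !flags.ignoreSSLErrors && flags.rejectUnauthorized !== false,
  };
  if (flags.disableServerIdentityCheck) {
    // Skip only the hostname-versus-certificate comparison.
    options.checkServerIdentity = () => undefined;
  }
  return options;
}
```

An object of this shape could be passed to `new https.Agent(...)`. Keeping the two concerns in separate fields mirrors the README's point that hostname validation is controlled independently of certificate validation.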

### Error Handling

```typescript
try {
const response = await htmlParser.fetchHtml('https://example.com', {
rejectUnauthorized: false,
retryOnErrors: {
ssl: true,
timeout: true,
dns: true
}
});
} catch (error) {
// Error is categorized by type (ssl, dns, timeout, etc.)
console.error(`Failed: ${error.message}`);
}
```
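
The retry decision depends on which category an error falls into. As a sketch of how the error codes and the `retryOnErrors` flags could fit together, the hypothetical `categorize` and `shouldRetry` helpers below classify Node.js error codes and apply the documented defaults (`ssl: false`, all others `true`). Treating unknown errors as non-retryable is an assumption.

```typescript
type ErrorCategory = 'ssl' | 'timeout' | 'dns' | 'connectionRefused' | 'other';
type RetryOnErrors = Partial<Record<Exclude<ErrorCategory, 'other'>, boolean>>;

// Hypothetical classifier keyed on common Node.js error codes.
function categorize(code: string | undefined): ErrorCategory {
  if (!code) return 'other';
  if (code === 'ETIMEDOUT' || code === 'ECONNABORTED') return 'timeout';
  if (code === 'ENOTFOUND' || code === 'EAI_AGAIN') return 'dns';
  if (code === 'ECONNREFUSED') return 'connectionRefused';
  if (code.includes('CERT_') || code === 'DEPTH_ZERO_SELF_SIGNED_CERT') return 'ssl';
  return 'other';
}

// Documented defaults: SSL errors are not retried, the rest are.
const DEFAULT_RETRY: Required<RetryOnErrors> = {
  ssl: false,
  timeout: true,
  dns: true,
  connectionRefused: true,
};

function shouldRetry(code: string | undefined, retryOnErrors: RetryOnErrors = {}): boolean {
  const category = categorize(code);
  if (category === 'other') return false; // assumption: unknown errors are not retried
  return { ...DEFAULT_RETRY, ...retryOnErrors }[category];
}
```

For example, `shouldRetry('CERT_HAS_EXPIRED')` is `false` under the defaults and becomes `true` only when the caller passes `{ ssl: true }`.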

## TypeScript Definitions & Types

### Complete Interface Definitions

```typescript
// ===== CORE SERVICE INTERFACE =====
interface HtmlParserService {
// Main HTML fetching method
fetchHtml(url: string, options?: HtmlParserOptions): Promise<HtmlFetchResponse>;

// Single value extraction methods
extractSingle<T = string>(
html: string,
selector: string,
type?: 'xpath' | 'css', // Default: 'xpath'
attribute?: string,
options?: ExtractionOptions
): T | null;

extractText<T = string>(
html: string,
selector: string,
type?: 'xpath' | 'css', // Default: 'xpath'
options?: ExtractionOptions
): T | null;

// Multiple value extraction methods
extractMultiple<T = string>(
html: string,
selector: string,
type?: 'xpath' | 'css', // Default: 'xpath'
attribute?: string,
options?: ExtractionOptions
): T[];

extractAttributes<T = string>(
html: string,
selector: string,
attribute: string,
type?: 'xpath' | 'css', // Default: 'xpath'
options?: ExtractionOptions
): T[];

// Structured extraction methods
extractStructured<T = Record<string, any>>(
html: string,
schema: ExtractionSchema<T>,
options?: { verbose?: boolean }
): T;

extractStructuredList<T = Record<string, any>>(
html: string,
containerSelector: string,
schema: ExtractionSchema<T>,
containerType?: 'xpath' | 'css', // Default: 'xpath'
options?: { verbose?: boolean }
): T[];

// Utility methods
exists(html: string, selector: string, type?: 'xpath' | 'css', options?: { verbose?: boolean }): boolean;
count(html: string, selector: string, type?: 'xpath' | 'css', options?: { verbose?: boolean }): number;

// Advanced utility methods
getRandomUserAgent(): Promise<string>;
testProxy(proxy: ProxyConfig, testUrl?: string): Promise<boolean>;
}

// ===== CONFIGURATION TYPES =====
interface HtmlParserOptions {
timeout?: number; // Request timeout in milliseconds (default: 10000)
headers?: Record<string, string>; // Custom headers to send with request
userAgent?: string; // Custom user agent string (default: Mozilla/5.0...)
useRandomUserAgent?: boolean; // Use random user agent (default: false)
proxy?: ProxyConfig; // Proxy configuration
retries?: number; // Number of retry attempts (default: 3)
retryDelay?: number; // Delay between retries in ms (default: 1000)
verbose?: boolean; // Enable verbose logging (default: false)
rejectUnauthorized?: boolean; // Reject unauthorized SSL certificates (default: true)
ignoreSSLErrors?: boolean; // Skip SSL certificate verification (default: false)
disableServerIdentityCheck?: boolean; // ⚠️ SECURITY WARNING: Disable server identity (hostname) verification (default: false)
maxRedirects?: number; // Maximum redirects to follow (default: 5)
retryOnErrors?: { // Configure retry behavior for specific error types
ssl?: boolean; // Retry on SSL/TLS errors (default: false)
timeout?: boolean; // Retry on connection timeout (default: true)
dns?: boolean; // Retry on DNS resolution errors (default: true)
connectionRefused?: boolean; // Retry on connection refused errors (default: true)
};
}

interface HtmlFetchResponse {
data: string; // HTML content of the fetched page
headers: Record<string, string>; // HTTP response headers as key-value pairs
status: number; // HTTP status code (e.g., 200, 404, 500)
statusText: string; // HTTP status text (e.g., 'OK', 'Not Found')
}

interface ProxyConfig {
url: string; // Proxy server URL (e.g., 'http://proxy.example.com:8080')
type?: 'http' | 'https' | 'socks4' | 'socks5'; // Type of proxy server (auto-detected from URL)
username?: string; // Username for proxy authentication
password?: string; // Password for proxy authentication
}

// ===== EXTRACTION TYPES =====
type TransformFunction = (value: any) => any;
type TransformObject = { transform: (value: any) => any };
type TransformClass = new (...args: any[]) => TransformObject;
type TransformType =
| TransformFunction
| TransformObject
| TransformClass
| Array<TransformFunction | TransformObject | TransformClass>;

interface ExtractionOptions {
verbose?: boolean; // Enable verbose logging for this extraction
transform?: TransformType; // Transform to apply to extracted value
}

interface ExtractionField {
selector: string; // CSS selector or XPath expression
type: 'xpath' | 'css'; // Type of selector being used
attribute?: string; // HTML attribute to extract from selected element
transform?: TransformType; // Transform to apply to extracted value
multiple?: boolean; // If true, extract array of values instead of single value
raw?: boolean; // If true, return raw HTML of matched element(s)
}

// A mapped type has to be declared as a type alias, not an interface
type ExtractionSchema<T = Record<string, any>> = {
[K in keyof T]: ExtractionField;
};

// ===== MODULE CONFIGURATION TYPES =====
type LogLevel = 'debug' | 'log' | 'warn' | 'error' | 'verbose'; // same LogLevel type from @nestjs/common

interface HtmlParserConfig {
loggerLevel?: LogLevel | Array<LogLevel>; // Default: ['log', 'error']
}

interface HtmlParserModuleAsyncOptions extends Pick<ModuleMetadata, 'imports'> {
useExisting?: Type<HtmlParserConfigFactory>;
useClass?: Type<HtmlParserConfigFactory>;
useFactory?: (...args: any[]) => Promise<HtmlParserConfig> | HtmlParserConfig;
inject?: any[];
}

interface HtmlParserConfigFactory {
createHtmlParserConfig(): Promise<HtmlParserConfig> | HtmlParserConfig;
}
```

### Implementation Guide

#### ✅ **Production-Ready Configuration Patterns**

**For Health Checks / Monitoring:**
```typescript
const healthCheckOptions: HtmlParserOptions = {
timeout: 15000, // Shorter timeout for health checks
useRandomUserAgent: true, // Avoid being blocked
retries: 1, // Fast fail for health checks
retryDelay: 500, // Quick retry for transient issues
verbose: false, // Keep logging minimal in production
rejectUnauthorized: false, // Accept self-signed certificates
ignoreSSLErrors: true, // ⚠️ Ignore SSL errors for monitoring (development only)
disableServerIdentityCheck: false, // ✅ Independent: Keep hostname validation even with ignoreSSLErrors
maxRedirects: 2, // Limit redirects for performance
retryOnErrors: {
ssl: false, // Don't retry SSL errors
timeout: false, // Don't retry timeouts in health checks
dns: true, // Retry DNS errors only
connectionRefused: false, // Don't retry connection refused
},
};

const response: HtmlFetchResponse = await htmlParser.fetchHtml(url, healthCheckOptions);
```

**For Web Scraping / Data Extraction:**
```typescript
const scrapingOptions: HtmlParserOptions = {
timeout: 30000, // Longer timeout for content loading
useRandomUserAgent: true, // Rotate user agents to avoid blocking
retries: 3, // More persistent for data extraction
retryDelay: 2000, // Respect rate limits
verbose: false, // Enable only for debugging
rejectUnauthorized: false, // Handle various SSL configurations
disableServerIdentityCheck: true, // ⚠️ Independent: Skip only hostname validation
ignoreSSLErrors: false, // Prefer minimal SSL relaxation (keeps certificate validation)
maxRedirects: 5, // Follow redirects for content
retryOnErrors: {
ssl: false, // SSL errors usually permanent
timeout: true, // Retry timeouts for slow sites
dns: true, // Retry DNS resolution failures
connectionRefused: false, // Usually indicates server issues
},
};

const response: HtmlFetchResponse = await htmlParser.fetchHtml(url, scrapingOptions);
```

**For Development / Testing:**
```typescript
const devOptions: HtmlParserOptions = {
timeout: 10000,
useRandomUserAgent: false, // Consistent user agent for testing
retries: 1, // Fail fast during development
retryDelay: 1000,
verbose: true, // Enable detailed logging
rejectUnauthorized: false, // Handle local/test SSL certificates
disableServerIdentityCheck: true, // ✅ Independent: OK for development/testing only
ignoreSSLErrors: false, // Prefer targeted SSL relaxation (keeps certificate validation)
maxRedirects: 3,
retryOnErrors: {
ssl: false,
timeout: false, // Don't retry to see issues quickly
dns: true,
connectionRefused: false,
},
};
```

#### ✅ **Type-Safe Extraction Patterns**

**Single Value Extraction with Transformations:**
```typescript
// Extract and transform to number
const pageId = htmlParser.extractSingle<number>(
html,
'//meta[@name="page-id"]',
'xpath',
'content',
{ transform: (value: string) => parseInt(value, 10) }
);

// Extract and validate boolean
const isPublished = htmlParser.extractSingle<boolean>(
html,
'//meta[@property="article:published"]',
'xpath',
'content',
{ transform: (value: string) => value.toLowerCase() === 'true' }
);

// Extract date with validation
const publishedDate = htmlParser.extractSingle<Date | null>(
html,
'//time[@datetime]',
'xpath',
'datetime',
{
transform: (value: string) => {
const date = new Date(value);
return isNaN(date.getTime()) ? null : date;
}
}
);
```

**Multiple Value Extraction with Type Safety:**
```typescript
// Extract numeric arrays with validation
const prices = htmlParser.extractMultiple<number>(
html,
'//span[@class="price"]/text()',
'xpath',
undefined,
{
transform: (value: string) => {
const price = parseFloat(value.replace(/[$,]/g, ''));
return isNaN(price) ? 0 : price;
}
}
);

// Extract URLs with validation
const imageUrls = htmlParser.extractAttributes(
html,
'//img[@src]',
'src',
'xpath',
{
transform: (url: string) => {
try {
return new URL(url, 'https://example.com').href;
} catch {
return '';
}
}
}
).filter(url => url !== '');
```

**Advanced Structured Extraction:**
```typescript
// Define comprehensive interfaces
interface Article {
title: string;
author: string;
publishedDate: Date | null;
tags: string[];
excerpt: string;
content: string;
wordCount: number;
socialShares: number;
isSponsored: boolean;
metadata: {
description: string;
keywords: string[];
};
}

// Create production-ready schema
const articleSchema: ExtractionSchema<Article> = {
title: {
selector: '//h1[@class="article-title"]/text() | //title/text()',
type: 'xpath',
transform: (title: string) => title.trim().replace(/\s+/g, ' ')
},
author: {
selector: '//meta[@name="author"]',
type: 'xpath',
attribute: 'content',
transform: (author: string) => author || 'Unknown'
},
publishedDate: {
selector: '//time[@datetime] | //meta[@property="article:published_time"]',
type: 'xpath',
attribute: 'datetime',
transform: (dateStr: string) => {
if (!dateStr) return null;
const date = new Date(dateStr);
return isNaN(date.getTime()) ? null : date;
}
},
tags: {
selector: '//meta[@name="keywords"]',
type: 'xpath',
attribute: 'content',
transform: (keywords: string) =>
keywords ? keywords.split(',').map(k => k.trim()).filter(k => k) : []
},
excerpt: {
selector: '//meta[@name="description"]',
type: 'xpath',
attribute: 'content',
transform: (desc: string) => desc || ''
},
content: {
selector: '//article | //div[@class="content"]',
type: 'xpath',
raw: true
},
wordCount: {
selector: '//article//text() | //div[@class="content"]//text()',
type: 'xpath',
multiple: true,
transform: (texts: string[]) =>
texts.join(' ').split(/\s+/).filter(word => word.length > 0).length
},
socialShares: {
selector: '//span[@class="share-count"]/text()',
type: 'xpath',
transform: (shares: string) => parseInt(shares?.replace(/[^0-9]/g, '') || '0', 10)
},
isSponsored: {
selector: '//div[contains(@class, "sponsored")] | //span[contains(text(), "Sponsored")]',
type: 'xpath',
transform: () => true
},
metadata: {
selector: '//head',
type: 'xpath',
transform: (headElement: any) => {
// Extract nested metadata
const description = htmlParser.extractSingle(
headElement,
'//meta[@name="description"]',
'xpath',
'content'
) || '';

const keywords = htmlParser.extractSingle(
headElement,
'//meta[@name="keywords"]',
'xpath',
'content'
) || '';

return {
description,
keywords: keywords.split(',').map(k => k.trim()).filter(k => k)
};
}
}
};

const article: Article = htmlParser.extractStructured<Article>(html, articleSchema);
```

**Advanced Transform Pipeline:**
```typescript
// Define reusable transform classes
class UppercasePipe {
transform(value: string): string {
return value.toUpperCase();
}
}

class TrimPipe {
transform(value: string): string {
return value.trim().replace(/\s+/g, ' ');
}
}

class NumberPipe {
constructor(private defaultValue: number = 0) {}

transform(value: string): number {
const num = parseFloat(value.replace(/[^0-9.-]/g, ''));
return isNaN(num) ? this.defaultValue : num;
}
}

// Use in extraction schema
const productSchema: ExtractionSchema = {
name: {
selector: '//h1/text()',
type: 'xpath',
transform: [
TrimPipe,
UppercasePipe,
(name: string) => name.substring(0, 100) // Limit length
]
},
price: {
selector: '//span[@class="price"]/text()',
type: 'xpath',
transform: new NumberPipe(0)
}
};
```

#### ⚠️ **Common Implementation Mistakes to Avoid**

**Type and Method Signature Errors:**
```typescript
// ❌ WRONG: Missing type parameters and incorrect XPath
const result = htmlParser.extractSingle(html, '//wrongtag');

// ✅ CORRECT: Proper type and XPath for text content
const result: string | null = htmlParser.extractSingle<string>(html, '//title/text()');

// ❌ WRONG: Using wrong method for attribute extraction
const urls = htmlParser.extractSingle(html, '//a', 'xpath', 'href', { multiple: true });

// ✅ CORRECT: Use dedicated method for attributes
const urls: string[] = htmlParser.extractAttributes(html, '//a', 'href');

// ❌ WRONG: Mixing CSS and XPath syntax
const links = htmlParser.extractMultiple(html, 'a//text()', 'css');

// ✅ CORRECT: Use appropriate selector type
const links = htmlParser.extractMultiple(html, '//a/text()', 'xpath');
// OR
const links = htmlParser.extractMultiple(html, 'a', 'css');
```

**Configuration and Error Handling Mistakes:**
```typescript
// ❌ WRONG: Missing proper response typing and error handling
const response = await htmlParser.fetchHtml(url);
const title = response.data.match(/<title>(.*?)<\/title>/)?.[1];

// ✅ CORRECT: Proper typing and extraction
const response: HtmlFetchResponse = await htmlParser.fetchHtml(url, options);
const title: string | null = htmlParser.extractSingle(
response.data,
'//title/text()'
);

// ❌ WRONG: Ignoring status codes and error types
try {
const html = await htmlParser.fetchHtml(url);
} catch (error) {
console.log('Failed to fetch');
}

// ✅ CORRECT: Comprehensive error handling
try {
const response: HtmlFetchResponse = await htmlParser.fetchHtml(url, options);

if (response.status >= 400) {
throw new Error(`HTTP ${response.status}: ${response.statusText}`);
}

// Process response.data
} catch (error: any) {
if (error.code === 'ETIMEDOUT') {
// Handle timeout specifically
} else if (error.code === 'ECONNREFUSED') {
// Handle connection refused
} else if (error.code?.includes('CERT_')) {
// Handle SSL certificate errors
} else if (error.code === 'ENOTFOUND') {
// Handle DNS resolution errors
} else {
// Handle other errors
}
}
```

#### 🔧 **Advanced Usage Patterns**

**Proxy Testing and Validation:**
```typescript
const proxyConfig: ProxyConfig = {
url: 'http://proxy.example.com:8080',
type: 'http',
username: 'user',
password: 'pass'
};

// Test proxy before use
const isProxyWorking = await htmlParser.testProxy(proxyConfig);
if (!isProxyWorking) {
throw new Error('Proxy connection failed');
}

// Use proxy for requests
const response = await htmlParser.fetchHtml(url, { proxy: proxyConfig });
```

**User Agent Management:**
```typescript
// Get random user agent for stealth scraping
const randomUA = await htmlParser.getRandomUserAgent();
console.log('Using User Agent:', randomUA);

// Use in options
const options: HtmlParserOptions = {
userAgent: randomUA,
// OR use built-in random generation
useRandomUserAgent: true
};
```

**Conditional Extraction and Fallbacks:**
```typescript
// Check existence before extraction
if (htmlParser.exists(html, '//div[@class="premium-content"]')) {
const premiumContent = htmlParser.extractText(html, '//div[@class="premium-content"]');
} else {
const freeContent = htmlParser.extractText(html, '//div[@class="free-content"]');
}

// Count elements for validation
const commentCount = htmlParser.count(html, '//div[@class="comment"]');
console.log(`Found ${commentCount} comments`);

// Multiple selector fallback pattern
const title = htmlParser.extractSingle(html, '//h1/text()') ||
htmlParser.extractSingle(html, '//title/text()') ||
htmlParser.extractSingle(html, '//meta[@property="og:title"]', 'xpath', 'content') ||
'No title found';
```
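
The `||` chain above treats an empty string as missing but accepts a whitespace-only result. If that matters, the fallback logic can be pulled into a small helper. The sketch below uses a hypothetical `firstNonEmpty` function that takes lazy extractor callbacks, so later selectors are only evaluated when earlier ones yield nothing.

```typescript
// Hypothetical helper: run extractors in order and return the first result
// that is neither null nor blank; otherwise return the fallback.
function firstNonEmpty(
  extractors: Array<() => string | null>,
  fallback: string,
): string {
  for (const extract of extractors) {
    const value = extract();
    if (value !== null && value.trim() !== '') return value;
  }
  return fallback;
}

// Stand-in extractors; in practice each would wrap an htmlParser call.
const pageTitle = firstNonEmpty(
  [() => null, () => '   ', () => 'Example Domain'],
  'No title found',
);
```

With the service, each callback would wrap a call such as `() => htmlParser.extractSingle(html, '//h1/text()')`.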

## API Reference

### Core Methods

#### `fetchHtml(url: string, options?: HtmlParserOptions): Promise<HtmlFetchResponse>`

Fetch HTML content from a URL with comprehensive error handling and SSL configuration.

```typescript
const response = await htmlParser.fetchHtml('https://example.com', {
timeout: 10000,
headers: { 'User-Agent': 'Custom Agent' },
rejectUnauthorized: false, // Accept self-signed certificates
disableServerIdentityCheck: true, // ⚠️ Skip hostname validation (use with caution)
retryOnErrors: {
ssl: true,
timeout: true,
dns: true,
connectionRefused: true
}
});
```

#### `extractSingle<T = string>(html: string, selector: string, type?: 'xpath' | 'css', attribute?: string, options?: { verbose?: boolean; transform?: (value: string) => T }): T | null`

Extract a single value using XPath or CSS selector with type safety.

```typescript
// Using XPath (default)
const title = htmlParser.extractSingle(html, '//title/text()');

// Using CSS selector
const title = htmlParser.extractSingle(html, 'title', 'css');

// Extract attribute with transformation
const id = htmlParser.extractSingle(html, '//div[@data-id]', 'xpath', 'data-id', {
transform: (value: string) => parseInt(value)
});

// Extract with boolean transformation
const isActive = htmlParser.extractSingle(html, '//div/@data-active', 'xpath', undefined, {
transform: (value: string) => value === 'true'
});
```

#### `extractMultiple<T = string>(html: string, selector: string, type?: 'xpath' | 'css', attribute?: string, options?: { verbose?: boolean; transform?: (value: string) => T }): T[]`

Extract multiple matching values with type safety.

```typescript
// Extract all links
const links = htmlParser.extractMultiple(html, '//a/text()');

// Extract all href attributes
const urls = htmlParser.extractMultiple(html, '//a', 'xpath', 'href');

// Extract with transformation
const prices = htmlParser.extractMultiple(html, '//span[@class="price"]/text()', 'xpath', undefined, {
transform: (value: string) => parseFloat(value.replace('$', ''))
});
```

#### `extractText<T = string>(html: string, selector: string, type?: 'xpath' | 'css', options?: { verbose?: boolean; transform?: (value: string) => T }): T | null`

Extract text content specifically with type safety.

```typescript
const text = htmlParser.extractText(html, '//p[@class="content"]');

// Extract with transformation
const wordCount = htmlParser.extractText(html, '//p[@class="content"]', 'xpath', {
transform: (text: string) => text.split(' ').length
});
```

#### `extractAttributes<T = string>(html: string, selector: string, attribute: string, type?: 'xpath' | 'css', options?: { verbose?: boolean; transform?: (value: string) => T }): T[]`

Extract attribute values from multiple elements with type safety.

```typescript
const imgSources = htmlParser.extractAttributes(html, '//img', 'src');

// Extract with transformation
const ids = htmlParser.extractAttributes(html, '//div', 'data-id', 'xpath', {
transform: (value: string) => parseInt(value)
});
```

#### `exists(html: string, selector: string, type?: 'xpath' | 'css'): boolean`

Check if elements exist.

```typescript
const hasComments = htmlParser.exists(html, '//div[@class="comments"]');
```

#### `count(html: string, selector: string, type?: 'xpath' | 'css'): number`

Count matching elements.

```typescript
const commentCount = htmlParser.count(html, '//div[@class="comment"]');
```

#### `getRandomUserAgent(): Promise<string>`

Generate a random user agent string.

```typescript
const randomUA = await htmlParser.getRandomUserAgent();
console.log(randomUA);
// Output: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36...
```

#### `testProxy(proxy: ProxyConfig, testUrl?: string): Promise<boolean>`

Test if a proxy connection is working.

```typescript
const proxyConfig = {
url: 'http://proxy.example.com:8080',
type: 'http' as const,
username: 'user',
password: 'pass'
};

const isWorking = await htmlParser.testProxy(proxyConfig);
console.log(`Proxy is ${isWorking ? 'working' : 'not working'}`);
```

#### `transform` option in schema fields

The `transform` property in a schema field is highly flexible. You can use:
- A single function: `(value: string) => any`
- A single class whose instances have a `transform` method, e.g. `MyPipe` for `class MyPipe { transform(value) { ... } }` (the parser instantiates it automatically)
- A single instance (object with a `transform` method): `new MyPipe()`
- An array of any of the above (functions, classes, instances), applied in order

**Note:** The parser will always convert DOM elements to their text content before applying the transform, so your transform functions can safely expect a string.

**Important:**
- If you use a class or object for `transform`, it **must** have a method named `transform(value)`. The parser will call this method with the extracted value.
- If you pass a class itself (not an instance), the parser instantiates it through its constructor, so the class needs a default (no-argument) constructor. If the constructor requires arguments, instantiate the class yourself and pass the instance.

**Example of a valid custom class transform:**
```typescript
class MyCustomPipe {
  // Default constructor
  constructor() {}
  transform(value: string) {
    // your transformation logic
    return value + '!';
  }
}

class SuffixPipe {
  constructor(private suffix: string) {}
  transform(value: string) {
    return value + this.suffix;
  }
}

// Usage:
transform: MyCustomPipe
// or
transform: new MyCustomPipe()
```

**Examples:**
```typescript
// Single function
transform: (value: string) => value.toUpperCase()

// Single class
transform: UppercasePipe

// Single instance
transform: new SuffixPipe('!')

// Array of functions
transform: [
  (value: string) => value.trim(),
  (value: string) => value.toUpperCase(),
]

// Array of classes and/or instances and/or functions
transform: [
  (value: string) => value.trim(),
  UppercasePipe,
  new SuffixPipe(' [ADVANCED]'),
]
```
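
To make the accepted shapes concrete, here is a minimal, self-contained sketch of how a `transform` option could be normalized and applied in order. It is not the package's actual implementation: `applyTransform`, `isTransformClass`, the `Transform*` types, and `UppercasePipe` are hypothetical names introduced for illustration, and the sketch assumes classes are constructible without arguments.

```typescript
// Hypothetical types describing the shapes a transform step may take.
type TransformFn = (value: any) => unknown;
interface TransformObject {
  transform(value: any): unknown;
}
type TransformClass = new () => TransformObject;
type TransformStep = TransformFn | TransformObject | TransformClass;

// A class is told apart from a plain function by the `transform`
// method on its prototype (arrow functions have no prototype at all).
function isTransformClass(step: unknown): step is TransformClass {
  return (
    typeof step === 'function' &&
    typeof (step as any).prototype?.transform === 'function'
  );
}

// Applies a single step or an array of steps, left to right.
function applyTransform(
  value: string,
  option: TransformStep | TransformStep[],
): unknown {
  const steps: TransformStep[] = Array.isArray(option) ? option : [option];
  let current: unknown = value;
  for (const step of steps) {
    if (typeof step === 'function') {
      current = isTransformClass(step)
        ? new step().transform(current) // class: instantiate, then call
        : (step as TransformFn)(current); // plain function: call directly
    } else {
      current = step.transform(current); // instance: call its method
    }
  }
  return current;
}

// Example pipes (illustrative only).
class UppercasePipe {
  transform(value: string) {
    return value.toUpperCase();
  }
}

class SuffixPipe {
  constructor(private suffix: string) {}
  transform(value: string) {
    return value + this.suffix;
  }
}

const transformed = applyTransform('  hello ', [
  (value: string) => value.trim(),
  UppercasePipe,
  new SuffixPipe(' [ADVANCED]'),
]);
// transformed === 'HELLO [ADVANCED]'
```

Checking the prototype for a `transform` method is one simple way to distinguish a class from a plain function; a real implementation may use a different check.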

### Advanced Methods

#### `extractStructured<T>(html: string, schema: ExtractionSchema<T>, options?: { verbose?: boolean }): T`

Extract data using a typed schema object. Supports `multiple: true` for array extraction and `raw: true` for raw HTML extraction in any field.

```typescript
import { ExtractionSchema } from '@hanivanrizky/nestjs-html-parser';

// Define typed interface
interface Article {
  title: string;
  author: string;
  links: string[];
  titleHtml: string;
}

// Create typed schema
const schema: ExtractionSchema<Article> = {
  title: {
    selector: '//title/text()',
    type: 'xpath'
  },
  author: {
    selector: '//meta[@name="author"]',
    type: 'xpath',
    attribute: 'content'
  },
  links: {
    selector: '//a/@href',
    type: 'xpath',
    multiple: true
  },
  titleHtml: {
    selector: '//title',
    type: 'xpath',
    raw: true
  }
};

const result = htmlParser.extractStructured(html, schema);
// Result: { title: "Page Title", author: "John Doe", links: ["/home", "/about", ...], titleHtml: "<title>Page Title</title>" }
```

#### `extractStructuredList<T>(html: string, containerSelector: string, schema: ExtractionSchema<T>, containerType?: 'xpath' | 'css', options?: { verbose?: boolean }): T[]`

Extract arrays of typed structured data. Supports `multiple: true` for array extraction and `raw: true` for raw HTML extraction in any field.

```typescript
// Define typed interface
interface Product {
  name: string;
  price: number;
  tags: string[];
  nameHtml: string;
}

// Create typed schema
const productSchema: ExtractionSchema<Product> = {
  name: {
    selector: './/h2/text()',
    type: 'xpath'
  },
  price: {
    selector: './/span[@class="price"]/text()',
    type: 'xpath',
    transform: (value: string) => parseFloat(value.replace('$', ''))
  },
  tags: {
    selector: './/span[@class="tag"]/text()',
    type: 'xpath',
    multiple: true
  },
  nameHtml: {
    selector: './/h2',
    type: 'xpath',
    raw: true
  }
};

const products = htmlParser.extractStructuredList(
  html,
  '//div[@class="product"]',
  productSchema
);
// Result: Product[] with tags as array and nameHtml as raw HTML for each product
// [
//   { name: "Product A", price: 19.99, tags: ["electronics", "gadget"], nameHtml: "<h2>Product A</h2>" },
//   { name: "Product B", price: 29.99, tags: ["accessory"], nameHtml: "<h2>Product B</h2>" }
// ]
```
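
The `price` transform in the schema above only strips a `$` sign. If the pages you parse contain thousands separators or surrounding whitespace, a slightly more defensive transform function may help. The sketch below is an illustration only: `parsePrice` is a hypothetical helper, not part of the package, and it assumes `.` is the decimal separator.

```typescript
// Hypothetical helper: converts a price string such as "$1,299.50" to a number.
// Returns null when no numeric value can be parsed.
function parsePrice(value: string): number | null {
  // Drop everything except digits, the decimal point, and a minus sign.
  const cleaned = value.replace(/[^0-9.-]/g, '');
  const parsed = parseFloat(cleaned);
  return isNaN(parsed) ? null : parsed;
}

const withSeparator = parsePrice('$1,299.50'); // 1299.5
const notAPrice = parsePrice('N/A');           // null
```

Because it is a plain function, it can be passed directly as the field's `transform` (for example `transform: parsePrice`).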

## Development

```bash
# Install dependencies
yarn install

# Build
yarn build

# Test
yarn test
yarn test:cov
yarn test:watch
```

## Contributing

1. Fork the repository
2. Create your feature branch (`git checkout -b feature/yourusername/amazing-feature`)
3. Commit your changes (`git commit -m 'Add some amazing feature'`)
4. Push to the branch (`git push origin feature/yourusername/amazing-feature`)
5. Open a Pull Request

## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.