https://github.com/hanivan/nestjs-browser-parser
https://github.com/hanivan/nestjs-browser-parser
Last synced: 11 months ago
JSON representation
- Host: GitHub
- URL: https://github.com/hanivan/nestjs-browser-parser
- Owner: Hanivan
- License: mit
- Created: 2025-06-17T08:32:02.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2025-06-26T08:22:39.000Z (12 months ago)
- Last Synced: 2025-06-26T09:30:40.626Z (12 months ago)
- Language: TypeScript
- Size: 174 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# NestJS Browser Parser
A powerful NestJS module for parsing HTML content with JavaScript support using Playwright Core. This module provides comprehensive features for web scraping, data extraction, and automation with both CDP (Chrome DevTools Protocol) and built-in browser support.
## ๐ Features
- **๐ญ Playwright Integration**: Uses playwright-core for reliable JavaScript-enabled HTML parsing
- **๐ Dual Browser Mode**: Support for both CDP connection and built-in browser
- **๐ฑ Responsive**: Full viewport and device emulation support
- **๐ CSS & Limited XPath**: Extract data using CSS selectors (XPath support planned)
- **๐ธ Screenshots & PDFs**: Generate screenshots and PDF documents
- **โก JavaScript Execution**: Execute custom JavaScript on pages
- **๐ก๏ธ Proxy Support**: HTTP, HTTPS, SOCKS proxies with authentication
- **๐จ Rich Configuration**: Extensive customization options
- **๐ Response Metadata**: Headers, cookies, timing, and metrics
- **๐ง TypeScript**: Full type safety and IntelliSense support
- **๐งน Auto Cleanup**: Automatic resource management and cleanup
## ๐ฆ Installation
```bash
npm install playwright-core cheerio
# or
yarn add playwright-core cheerio
```
## ๐ ๏ธ Quick Start
### Basic Setup
```typescript
import { Module } from '@nestjs/common';
import { BrowserParserModule } from './browser-parser.module';
@Module({
imports: [BrowserParserModule.forRoot()],
})
export class AppModule {}
```
### Using the Service
```typescript
import { Injectable } from '@nestjs/common';
import { BrowserParserService } from './browser-parser.service';
@Injectable()
export class ScrapingService {
constructor(private readonly browserParser: BrowserParserService) {}
async scrapeWebsite(url: string) {
const response = await this.browserParser.fetchHtml(url, {
verbose: true,
timeout: 30000,
});
const title = this.browserParser.extractSingle(response.html, 'title');
return { title, status: response.status };
}
}
```
## ๐๏ธ Configuration
### Built-in Browser (Default)
```typescript
BrowserParserModule.forRoot({
loggerLevel: 'debug',
headless: true,
browserConnection: {
type: 'builtin',
args: ['--no-sandbox', '--disable-dev-shm-usage'],
},
})
```
### CDP (Chrome DevTools Protocol)
```typescript
JSParserModule.forRoot({
loggerLevel: 'debug',
browserConnection: {
type: 'cdp',
cdpUrl: 'ws://localhost:9222/devtools/browser',
},
})
```
### Async Configuration
```typescript
BrowserParserModule.forRootAsync({
useFactory: (configService: ConfigService) => ({
loggerLevel: configService.get('LOG_LEVEL', 'error'),
headless: configService.get('HEADLESS', 'true') === 'true',
browserConnection: {
type: configService.get('BROWSER_TYPE', 'builtin'),
cdpUrl: configService.get('CDP_URL'),
},
}),
inject: [ConfigService],
})
```
## ๐ API Reference
### BrowserParserService Methods
#### `fetchHtml(url, options?)`
Fetch HTML content from a URL with JavaScript execution.
```typescript
const response = await browserParser.fetchHtml('https://example.com', {
timeout: 30000,
waitForSelector: '.dynamic-content',
userAgent: 'Custom Bot 1.0',
viewport: { width: 1024, height: 768 },
proxy: {
server: 'http://proxy.example.com:8080',
username: 'user',
password: 'pass',
},
});
```
#### `extractSingle(html, selector, type?, attribute?, options?)`
Extract a single value from HTML.
```typescript
const title = jsParser.extractSingle(html, 'title');
const description = jsParser.extractSingle(html, 'meta[name="description"]', 'css', 'content');
```
#### `extractMultiple(html, selector, type?, attribute?, options?)`
Extract multiple values from HTML.
```typescript
const links = jsParser.extractMultiple(html, 'a', 'css', 'href');
const headings = jsParser.extractMultiple(html, 'h1, h2, h3');
```
#### `extractStructuredFromHtml(html, schema)`
Extract structured data using a schema.
```typescript
const data = jsParser.extractStructuredFromHtml(html, {
title: { selector: 'title', type: 'css' },
links: { selector: 'a', type: 'css', attribute: 'href', multiple: true },
price: {
selector: '.price',
type: 'css',
transform: (text) => parseFloat(text.replace('$', ''))
},
});
```
#### `takeScreenshot(url, options?)`
Capture a screenshot of a webpage.
```typescript
const screenshot = await jsParser.takeScreenshot('https://example.com', {
type: 'png',
fullPage: true,
clip: { x: 0, y: 0, width: 800, height: 600 },
});
```
#### `generatePDF(url, options?)`
Generate a PDF of a webpage.
```typescript
const pdf = await jsParser.generatePDF('https://example.com', {
format: 'A4',
printBackground: true,
margin: { top: '1cm', bottom: '1cm' },
});
```
#### `evaluateOnPage(url, evaluationFunction, options?)`
Execute JavaScript on a page.
```typescript
const result = await jsParser.evaluateOnPage(
'https://example.com',
'() => ({ title: document.title, elementCount: document.querySelectorAll("*").length })'
);
```
## ๐ Browser Configuration
### Built-in Browser
Uses Playwright's bundled Chromium:
```typescript
{
browserConnection: {
type: 'builtin',
executablePath: '/path/to/chrome', // Optional custom Chrome
args: ['--no-sandbox', '--disable-dev-shm-usage'],
ignoreDefaultArgs: false,
}
}
```
### CDP Connection
Connect to existing Chrome instance:
```bash
# Start Chrome with remote debugging
google-chrome --remote-debugging-port=9222 --no-first-run --no-default-browser-check
```
```typescript
{
browserConnection: {
type: 'cdp',
cdpUrl: 'ws://localhost:9222/devtools/browser',
}
}
```
## ๐ Run the Demo
```bash
npm run start:dev
```
## ๐ License
MIT