An open API service indexing awesome lists of open source software.

https://github.com/hanivan/nestjs-browser-parser


https://github.com/hanivan/nestjs-browser-parser

Last synced: 11 months ago
JSON representation

Awesome Lists containing this project

README

          

# NestJS Browser Parser

A powerful NestJS module for parsing HTML content with JavaScript support using Playwright Core. This module provides comprehensive features for web scraping, data extraction, and automation with both CDP (Chrome DevTools Protocol) and built-in browser support.

## ๐Ÿš€ Features

- **๐ŸŽญ Playwright Integration**: Uses playwright-core for reliable JavaScript-enabled HTML parsing
- **๐Ÿ”— Dual Browser Mode**: Support for both CDP connection and built-in browser
- **๐Ÿ“ฑ Responsive**: Full viewport and device emulation support
- **๐Ÿ” CSS & Limited XPath**: Extract data using CSS selectors (XPath support planned)
- **๐Ÿ“ธ Screenshots & PDFs**: Generate screenshots and PDF documents
- **โšก JavaScript Execution**: Execute custom JavaScript on pages
- **๐Ÿ›ก๏ธ Proxy Support**: HTTP, HTTPS, SOCKS proxies with authentication
- **๐ŸŽจ Rich Configuration**: Extensive customization options
- **๐Ÿ“Š Response Metadata**: Headers, cookies, timing, and metrics
- **๐Ÿ”ง TypeScript**: Full type safety and IntelliSense support
- **๐Ÿงน Auto Cleanup**: Automatic resource management and cleanup

## ๐Ÿ“ฆ Installation

```bash
npm install playwright-core cheerio
# or
yarn add playwright-core cheerio
```

## ๐Ÿ› ๏ธ Quick Start

### Basic Setup

```typescript
import { Module } from '@nestjs/common';
import { BrowserParserModule } from './browser-parser.module';

@Module({
imports: [BrowserParserModule.forRoot()],
})
export class AppModule {}
```

### Using the Service

```typescript
import { Injectable } from '@nestjs/common';
import { BrowserParserService } from './browser-parser.service';

@Injectable()
export class ScrapingService {
constructor(private readonly browserParser: BrowserParserService) {}

async scrapeWebsite(url: string) {
const response = await this.browserParser.fetchHtml(url, {
verbose: true,
timeout: 30000,
});

const title = this.browserParser.extractSingle(response.html, 'title');
return { title, status: response.status };
}
}
```

## ๐ŸŽ›๏ธ Configuration

### Built-in Browser (Default)

```typescript
BrowserParserModule.forRoot({
loggerLevel: 'debug',
headless: true,
browserConnection: {
type: 'builtin',
args: ['--no-sandbox', '--disable-dev-shm-usage'],
},
})
```

### CDP (Chrome DevTools Protocol)

```typescript
JSParserModule.forRoot({
loggerLevel: 'debug',
browserConnection: {
type: 'cdp',
cdpUrl: 'ws://localhost:9222/devtools/browser',
},
})
```

### Async Configuration

```typescript
BrowserParserModule.forRootAsync({
useFactory: (configService: ConfigService) => ({
loggerLevel: configService.get('LOG_LEVEL', 'error'),
headless: configService.get('HEADLESS', 'true') === 'true',
browserConnection: {
type: configService.get('BROWSER_TYPE', 'builtin'),
cdpUrl: configService.get('CDP_URL'),
},
}),
inject: [ConfigService],
})
```

## ๐Ÿ“– API Reference

### BrowserParserService Methods

#### `fetchHtml(url, options?)`

Fetch HTML content from a URL with JavaScript execution.

```typescript
const response = await browserParser.fetchHtml('https://example.com', {
timeout: 30000,
waitForSelector: '.dynamic-content',
userAgent: 'Custom Bot 1.0',
viewport: { width: 1024, height: 768 },
proxy: {
server: 'http://proxy.example.com:8080',
username: 'user',
password: 'pass',
},
});
```

#### `extractSingle(html, selector, type?, attribute?, options?)`

Extract a single value from HTML.

```typescript
const title = jsParser.extractSingle(html, 'title');
const description = jsParser.extractSingle(html, 'meta[name="description"]', 'css', 'content');
```

#### `extractMultiple(html, selector, type?, attribute?, options?)`

Extract multiple values from HTML.

```typescript
const links = jsParser.extractMultiple(html, 'a', 'css', 'href');
const headings = jsParser.extractMultiple(html, 'h1, h2, h3');
```

#### `extractStructuredFromHtml(html, schema)`

Extract structured data using a schema.

```typescript
const data = jsParser.extractStructuredFromHtml(html, {
title: { selector: 'title', type: 'css' },
links: { selector: 'a', type: 'css', attribute: 'href', multiple: true },
price: {
selector: '.price',
type: 'css',
transform: (text) => parseFloat(text.replace('$', ''))
},
});
```

#### `takeScreenshot(url, options?)`

Capture a screenshot of a webpage.

```typescript
const screenshot = await jsParser.takeScreenshot('https://example.com', {
type: 'png',
fullPage: true,
clip: { x: 0, y: 0, width: 800, height: 600 },
});
```

#### `generatePDF(url, options?)`

Generate a PDF of a webpage.

```typescript
const pdf = await jsParser.generatePDF('https://example.com', {
format: 'A4',
printBackground: true,
margin: { top: '1cm', bottom: '1cm' },
});
```

#### `evaluateOnPage(url, evaluationFunction, options?)`

Execute JavaScript on a page.

```typescript
const result = await jsParser.evaluateOnPage(
'https://example.com',
'() => ({ title: document.title, elementCount: document.querySelectorAll("*").length })'
);
```

## ๐ŸŒ Browser Configuration

### Built-in Browser

Uses Playwright's bundled Chromium:

```typescript
{
browserConnection: {
type: 'builtin',
executablePath: '/path/to/chrome', // Optional custom Chrome
args: ['--no-sandbox', '--disable-dev-shm-usage'],
ignoreDefaultArgs: false,
}
}
```

### CDP Connection

Connect to existing Chrome instance:

```bash
# Start Chrome with remote debugging
google-chrome --remote-debugging-port=9222 --no-first-run --no-default-browser-check
```

```typescript
{
browserConnection: {
type: 'cdp',
cdpUrl: 'ws://localhost:9222/devtools/browser',
}
}
```

## ๐Ÿš€ Run the Demo

```bash
npm run start:dev
```

## ๐Ÿ“„ License

MIT