# Enterprise AI Recursive Web Scraper

Advanced AI-powered recursive web scraper utilizing Groq LLMs, Puppeteer, and Playwright for intelligent content extraction





## ✨ Features

* πŸš€ **High Performance**: Blazing fast multi-threaded scraping with concurrent processing
* πŸ€– **AI-Powered**: Intelligent content extraction using Groq LLMs
* 🌐 **Multi-Browser**: Support for Chromium, Firefox, and WebKit
* πŸ“Š **Smart Extraction**:
  - Structured data extraction without LLMs using CSS selectors
  - Topic-based and semantic chunking strategies
  - Cosine similarity clustering for content deduplication (see the sketch after this list)
* 🎯 **Advanced Capabilities**:
  - Recursive domain crawling with boundary respect
  - Session management for complex multi-page flows
  - Custom JavaScript execution support
  - Enhanced screenshot capture with lazy-load detection
  - iframe content extraction
* πŸ”’ **Enterprise Ready**:
  - Proxy support with authentication
  - Custom headers and user-agent configuration
  - Comprehensive error handling
  - Flexible timeout management
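
The cosine-similarity deduplication mentioned above is easy to illustrate. The sketch below is not the library's internal implementation; it only shows the general technique: chunks whose term-frequency vectors score above a chosen threshold (0.9 here, an arbitrary value) are treated as near-duplicates.

```typescript
// Build a term-frequency vector for a text chunk.
function termFrequencies(text: string): Map<string, number> {
  const tf = new Map<string, number>();
  for (const token of text.toLowerCase().split(/\W+/).filter(Boolean)) {
    tf.set(token, (tf.get(token) ?? 0) + 1);
  }
  return tf;
}

// Cosine similarity between two term-frequency vectors.
function cosineSimilarity(a: Map<string, number>, b: Map<string, number>): number {
  let dot = 0;
  for (const [token, count] of a) dot += count * (b.get(token) ?? 0);
  const norm = (v: Map<string, number>) =>
    Math.sqrt([...v.values()].reduce((sum, c) => sum + c * c, 0));
  const denominator = norm(a) * norm(b);
  return denominator === 0 ? 0 : dot / denominator;
}

// Chunks scoring above the threshold are treated as duplicates.
const isDuplicate = (a: string, b: string, threshold = 0.9) =>
  cosineSimilarity(termFrequencies(a), termFrequencies(b)) >= threshold;
```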

## πŸš€ Quick Start

To install the package, run:

```bash
npm install enterprise-ai-recursive-web-scraper
```
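
The package can also be used programmatically through the `WebScraper` class shown in the Advanced Usage section below. Here is a minimal sketch; the option names mirror the CLI flags and the `scrape(url)` method is an assumption, so check the package's type definitions for the actual API:

```typescript
import { WebScraper } from "enterprise-ai-recursive-web-scraper";

async function main() {
  // Option names here mirror the CLI flags and are assumptions;
  // consult the package's exported types for the real shape.
  const scraper = new WebScraper({
    apiKey: process.env.API_KEY,
    maxDepth: 3,
    concurrency: 5
  });

  // `scrape` is an assumed method name, not confirmed by this README.
  const results = await scraper.scrape("https://example.com");
  console.log(results);
}

main().catch(console.error);
```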

### Using the CLI

The `enterprise-ai-recursive-web-scraper` package includes a command-line interface (CLI) that you can use to perform web scraping tasks directly from the terminal.

#### Installation

Ensure that the package is installed globally to use the CLI:

```bash
npm install -g enterprise-ai-recursive-web-scraper
```

#### Running the CLI

Once installed, you can use the `web-scraper` command to start scraping. Here’s a basic example of how to use it:

```bash
web-scraper --api-key YOUR_API_KEY --url https://example.com --output ./output
```

#### CLI Options

- `-k, --api-key <key>`: **(Required)** Your Google Gemini API key.
- `-u, --url <url>`: **(Required)** The URL of the website you want to scrape.
- `-o, --output <directory>`: The directory where scraped data will be saved. Defaults to `scraping_output`.
- `-d, --depth <number>`: Maximum crawl depth. Defaults to `3`.
- `-c, --concurrency <number>`: Concurrent scraping limit. Defaults to `5`.
- `-t, --timeout <seconds>`: Request timeout in seconds. Defaults to `30`.
- `-f, --format <format>`: Output format (`json`, `csv`, `markdown`). Defaults to `json`.
- `--screenshot`: Capture screenshots of pages.
- `--no-headless`: Run the browser in non-headless mode.
- `--proxy <url>`: Use a proxy server.
- `-v, --verbose`: Enable verbose logging.
- `--config <path>`: Path to a configuration file (see the example below).
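
The configuration file's schema is not documented here. As an illustration only, a JSON file mirroring the CLI flags above might look like the following; treat every key name as hypothetical:

```json
{
  "apiKey": "YOUR_API_KEY",
  "url": "https://example.com",
  "output": "./output",
  "depth": 3,
  "concurrency": 5,
  "format": "json"
}
```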

#### Example Command

```bash
web-scraper --api-key YOUR_API_KEY --url https://example.com --output ./output --depth 5 --concurrency 10 --format csv --verbose
```

This command will scrape the specified URL with a maximum depth of 5, using 10 concurrent requests, and save the output in CSV format in the `./output` directory with verbose logging enabled.

## πŸ”§ Advanced Usage

### Structured Data Extraction

To extract structured data using a JSON schema, you can use the `JsonExtractionStrategy`:

```typescript
import { WebScraper, JsonExtractionStrategy } from "enterprise-ai-recursive-web-scraper";

const schema = {
  baseSelector: "article",
  fields: [
    { name: "title", selector: "h1" },
    { name: "content", selector: ".content" },
    { name: "date", selector: "time", attribute: "datetime" }
  ]
};

const scraper = new WebScraper({
  extractionStrategy: new JsonExtractionStrategy(schema)
});
```
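
Because `JsonExtractionStrategy` is driven entirely by CSS selectors, these fields are extracted without any LLM calls, which keeps structured scraping fast and avoids per-page inference costs.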

### Custom Browser Session

You can customize the browser session with specific configurations:

```typescript
import { WebScraper } from "enterprise-ai-recursive-web-scraper";

const scraper = new WebScraper({
  browserConfig: {
    headless: false,
    proxy: "http://proxy.example.com",
    userAgent: "Custom User Agent"
  }
});
```
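
The feature list also mentions authenticated proxies and custom headers. Their exact option names are not documented here, so the sketch below extrapolates from the `browserConfig` shape above and should be treated as an assumption:

```typescript
import { WebScraper } from "enterprise-ai-recursive-web-scraper";

const scraper = new WebScraper({
  browserConfig: {
    headless: true,
    // Hypothetical shape: credentials embedded in the proxy URL.
    proxy: "http://user:pass@proxy.example.com:8080",
    userAgent: "Custom User Agent",
    // Hypothetical option: extra HTTP headers sent with each request.
    headers: {
      "Accept-Language": "en-US,en;q=0.9"
    }
  }
});
```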

## 🀝 Contributors





Mike Odnis ([@WomB0ComB0](https://github.com/WomB0ComB0)): πŸ’» code, πŸ–‹ content, πŸ€” ideas, πŸš‡ infrastructure


## πŸ“„ License

MIT Β© [Mike Odnis](https://github.com/WomB0ComB0)

> πŸ’™ Built with [`create-typescript-app`](https://github.com/JoshuaKGoldberg/create-typescript-app)