Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

https://github.com/womb0comb0/enterprise-ai-recursive-web-scraper

gemini-api npm-package playwright puppeteer typescript

Host: GitHub
URL: https://github.com/womb0comb0/enterprise-ai-recursive-web-scraper
Owner: WomB0ComB0
License: mit
Created: 2024-11-15T22:08:52.000Z (3 months ago)
Default Branch: main
Last Pushed: 2024-11-26T00:51:38.000Z (3 months ago)
Last Synced: 2025-01-24T09:43:58.062Z (21 days ago)
Topics: gemini-api, npm-package, playwright, puppeteer, typescript
Language: TypeScript
Homepage: https://www.npmjs.com/package/enterprise-ai-recursive-web-scraper
Size: 7.52 MB
Stars: 1
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- Contributing: .github/CONTRIBUTING.md
- License: LICENSE.md
- Code of conduct: .github/CODE_OF_CONDUCT.md
- Security: .github/SECURITY.md

README

        
# Enterprise AI Recursive Web Scraper

Advanced AI-powered recursive web scraper utilizing Groq LLMs, Puppeteer, and Playwright for intelligent content extraction

## ✨ Features

* 🚀 **High Performance**: 

  - Blazing fast multi-threaded scraping with concurrent processing

  - Smart rate limiting to prevent API throttling and server overload

  - Automatic request queuing and retry mechanisms

* 🤖 **AI-Powered**: Intelligent content extraction using Groq LLMs

* 🌐 **Multi-Browser**: Support for Chromium, Firefox, and WebKit

* 📊 **Smart Extraction**: 

  - Structured data extraction without LLMs using CSS selectors

  - Topic-based and semantic chunking strategies

  - Cosine similarity clustering for content deduplication

* 🎯 **Advanced Capabilities**:

  - Recursive domain crawling with boundary respect

  - Intelligent rate limiting with token bucket algorithm

  - Session management for complex multi-page flows

  - Custom JavaScript execution support

  - Enhanced screenshot capture with lazy-load detection

  - iframe content extraction

* 🔒 **Enterprise Ready**:

  - Proxy support with authentication

  - Custom headers and user-agent configuration

  - Comprehensive error handling and retry mechanisms

  - Flexible timeout and rate limit management

  - Detailed logging and monitoring
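
The cosine-similarity deduplication listed above can be illustrated with a small sketch. This is not the package's implementation: the helper names (`termFrequencies`, `cosineSimilarity`, `dedupeChunks`), the bag-of-words vectors, and the `0.9` threshold are all assumptions chosen for illustration. A production pipeline would more likely compare embedding vectors.

```typescript
// Hypothetical helpers for illustration only; not part of the package's API.

// Build a simple bag-of-words term-frequency vector for a text chunk.
function termFrequencies(text: string): Map<string, number> {
    const counts = new Map<string, number>();
    for (const word of text.toLowerCase().match(/[a-z0-9]+/g) ?? []) {
        counts.set(word, (counts.get(word) ?? 0) + 1);
    }
    return counts;
}

// Cosine similarity: dot(a, b) / (|a| * |b|). For non-negative counts it lies in [0, 1].
function cosineSimilarity(a: Map<string, number>, b: Map<string, number>): number {
    let dot = 0;
    let normA = 0;
    let normB = 0;
    a.forEach((count, word) => {
        dot += count * (b.get(word) ?? 0);
        normA += count * count;
    });
    b.forEach((count) => {
        normB += count * count;
    });
    if (normA === 0 || normB === 0) return 0; // an empty chunk matches nothing
    return dot / Math.sqrt(normA * normB);
}

// Keep a chunk only if it is not too similar to any chunk already kept.
function dedupeChunks(chunks: string[], threshold = 0.9): string[] {
    const kept: { text: string; vector: Map<string, number> }[] = [];
    for (const text of chunks) {
        const vector = termFrequencies(text);
        const isDuplicate = kept.some((k) => cosineSimilarity(k.vector, vector) >= threshold);
        if (!isDuplicate) kept.push({ text, vector });
    }
    return kept.map((k) => k.text);
}
```

Each chunk is compared against every chunk kept so far, so the cost grows quadratically with the number of chunks. That is acceptable for a single page but would need an index or clustering step for large crawls.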

## 🚀 Quick Start

To install the package, run:

```bash
npm install enterprise-ai-recursive-web-scraper
```

### Using the CLI

The `enterprise-ai-recursive-web-scraper` package includes a command-line interface (CLI) that you can use to perform web scraping tasks directly from the terminal.

#### Installation

Ensure that the package is installed globally to use the CLI:

```bash
npm install -g enterprise-ai-recursive-web-scraper
```

#### Running the CLI

Once installed, you can use the `web-scraper` command to start scraping. Here’s a basic example of how to use it:

```bash
web-scraper --api-key YOUR_API_KEY --url https://example.com --output ./output
```

#### CLI Options

- `-k, --api-key`: **(Required)** Your Google Gemini API key

- `-u, --url`: **(Required)** The URL of the website to scrape

- `-o, --output`: Output directory for scraped data (default: `scraping_output`)

- `-d, --depth`: Maximum crawl depth (default: `3`)

- `-c, --concurrency`: Concurrent scraping limit (default: `5`)

- `-r, --rate-limit`: Requests per second (default: `5`)

- `-t, --timeout`: Request timeout in milliseconds (default: `30000`)

- `-f, --format`: Output format: `json`, `csv`, or `markdown` (default: `json`)

- `-v, --verbose`: Enable verbose logging

- `--retry-attempts`: Number of retry attempts (default: `3`)

- `--retry-delay`: Delay between retries in ms (default: `1000`)
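
To make the `--retry-attempts` and `--retry-delay` options concrete, here is a minimal sketch of a fixed-delay retry loop. The `withRetry` helper is hypothetical and not part of the package. It also assumes that the first try does not count as a retry, which the option descriptions above do not confirm.

```typescript
// Hypothetical helper for illustration only; not part of the package's API.
async function withRetry<T>(
    operation: () => Promise<T>,
    retryAttempts = 3,   // mirrors the --retry-attempts default
    retryDelayMs = 1000, // mirrors the --retry-delay default
): Promise<T> {
    let lastError: unknown;
    // Assumption: one initial try, then up to `retryAttempts` retries.
    for (let attempt = 0; attempt <= retryAttempts; attempt++) {
        try {
            return await operation();
        } catch (error) {
            lastError = error;
            if (attempt < retryAttempts) {
                // Wait a fixed delay before trying again.
                await new Promise<void>((resolve) => setTimeout(resolve, retryDelayMs));
            }
        }
    }
    throw lastError; // every attempt failed
}
```

A fixed delay is the simplest reading of the two options. Exponential backoff is a common alternative when a server is overloaded.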

Example usage with rate limiting:

```bash
web-scraper --api-key YOUR_API_KEY --url https://example.com --output ./output \
  --depth 5 --concurrency 10 --rate-limit 2 --retry-attempts 3 --format csv --verbose
```
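
The interaction between `--depth` and same-domain crawling can be sketched as a breadth-first traversal. The code below is an illustration under stated assumptions, not the package's crawler: `crawlSite`, `resolveLink`, and the injected `fetchLinks` callback are hypothetical names. It treats the start URL as depth 0 and uses the URL origin as the crawl boundary. Concurrency and rate limiting are left out for brevity.

```typescript
// Hypothetical sketch for illustration only; not the package's crawler.
// `fetchLinks` stands in for whatever loads a page and returns the URLs it links to.
type LinkFetcher = (pageUrl: string) => Promise<string[]>;

// Resolve a possibly relative link against its page; return null if malformed.
function resolveLink(link: string, baseUrl: string): URL | null {
    try {
        const resolved = new URL(link, baseUrl);
        resolved.hash = ""; // treat #fragments as the same page
        return resolved;
    } catch {
        return null;
    }
}

async function crawlSite(
    startUrl: string,
    maxDepth: number,
    fetchLinks: LinkFetcher,
): Promise<string[]> {
    const start = new URL(startUrl);
    const seen = new Set<string>([start.href]);
    const queue: { pageUrl: string; depth: number }[] = [{ pageUrl: start.href, depth: 0 }];
    const scraped: string[] = [];

    while (queue.length > 0) {
        const item = queue.shift();
        if (item === undefined) break;
        const links = await fetchLinks(item.pageUrl);
        scraped.push(item.pageUrl);
        if (item.depth >= maxDepth) continue; // do not follow links past the depth limit

        for (const link of links) {
            const resolved = resolveLink(link, item.pageUrl);
            if (resolved === null) continue;
            // Boundary check: stay on the start URL's origin and skip repeats.
            if (resolved.origin !== start.origin || seen.has(resolved.href)) continue;
            seen.add(resolved.href);
            queue.push({ pageUrl: resolved.href, depth: item.depth + 1 });
        }
    }
    return scraped;
}
```

Because the page loader is passed in, the traversal logic can be exercised against an in-memory link graph without launching a browser.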

## 🔧 Advanced Usage

### Rate Limiting Configuration

Configure rate limiting to respect server limits and prevent throttling:

```typescript
import { WebScraper, RateLimiter } from "enterprise-ai-recursive-web-scraper";

const scraper = new WebScraper({
    rateLimiter: new RateLimiter({
        maxTokens: 5,      // Maximum number of tokens
        refillRate: 1,     // Tokens refilled per second
        retryAttempts: 3,  // Number of retry attempts
        retryDelay: 1000   // Delay between retries (ms)
    })
});
```
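
For readers unfamiliar with the token bucket algorithm behind `maxTokens` and `refillRate`, the following is a minimal sketch. The `TokenBucket` class and its injectable `now` clock are hypothetical, and the package's own `RateLimiter` may be implemented differently.

```typescript
// Hypothetical sketch for illustration only; the package's RateLimiter may differ.
class TokenBucket {
    private readonly maxTokens: number;   // bucket capacity (largest allowed burst)
    private readonly refillRate: number;  // tokens added per second
    private readonly now: () => number;   // clock in milliseconds, injectable for testing
    private tokens: number;
    private lastRefillMs: number;

    constructor(maxTokens: number, refillRate: number, now: () => number = () => Date.now()) {
        this.maxTokens = maxTokens;
        this.refillRate = refillRate;
        this.now = now;
        this.tokens = maxTokens; // start with a full bucket
        this.lastRefillMs = now();
    }

    // Add the tokens earned since the last refill, capped at capacity.
    private refill(): void {
        const nowMs = this.now();
        const elapsedSeconds = (nowMs - this.lastRefillMs) / 1000;
        this.tokens = Math.min(this.maxTokens, this.tokens + elapsedSeconds * this.refillRate);
        this.lastRefillMs = nowMs;
    }

    // Take one token if available; return false when the caller should wait.
    tryAcquire(): boolean {
        this.refill();
        if (this.tokens >= 1) {
            this.tokens -= 1;
            return true;
        }
        return false;
    }
}
```

With `maxTokens: 5` and `refillRate: 1`, a bucket like this allows a burst of five requests and then settles to a sustained rate of one request per second.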

### Structured Data Extraction

To extract structured data using a JSON schema, you can use the `JsonExtractionStrategy`:

```typescript
import { WebScraper, JsonExtractionStrategy } from "enterprise-ai-recursive-web-scraper";

const schema = {
    baseSelector: "article",
    fields: [
        { name: "title", selector: "h1" },
        { name: "content", selector: ".content" },
        { name: "date", selector: "time", attribute: "datetime" }
    ]
};

const scraper = new WebScraper({
    extractionStrategy: new JsonExtractionStrategy(schema)
});
```
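
One plausible reading of such a schema is that `baseSelector` picks the repeating items and each field is resolved relative to an item. The sketch below illustrates that reading only. The `ElementLike`, `FieldSpec`, and `ExtractionSchema` types and the `extractRecords` function are hypothetical, and `JsonExtractionStrategy` may behave differently, for example in how it handles missing fields.

```typescript
// Hypothetical types and helper for illustration only; not the package's API.
// `ElementLike` is the small subset of DOM element behavior this sketch relies on.
interface ElementLike {
    textContent: string | null;
    getAttribute(attr: string): string | null;
    querySelector(selector: string): ElementLike | null;
    querySelectorAll(selector: string): ArrayLike<ElementLike>;
}

interface FieldSpec {
    name: string;
    selector: string;
    attribute?: string; // read this attribute instead of the text content
}

interface ExtractionSchema {
    baseSelector: string;
    fields: FieldSpec[];
}

function extractRecords(
    root: Pick<ElementLike, "querySelectorAll">,
    schema: ExtractionSchema,
): Record<string, string | null>[] {
    // One record per element matching baseSelector.
    return Array.from(root.querySelectorAll(schema.baseSelector), (item) => {
        const record: Record<string, string | null> = {};
        for (const field of schema.fields) {
            const target = item.querySelector(field.selector);
            if (target === null) {
                record[field.name] = null; // assumption: missing fields become null
            } else if (field.attribute !== undefined) {
                record[field.name] = target.getAttribute(field.attribute);
            } else {
                record[field.name] = (target.textContent ?? "").trim();
            }
        }
        return record;
    });
}
```

Applied to the schema above, each `article` would yield an object with `title`, `content`, and `date` keys, with `date` taken from the `datetime` attribute of the `time` element.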

### Custom Browser Session

You can customize the browser session with specific configurations:

```typescript
import { WebScraper } from "enterprise-ai-recursive-web-scraper";

const scraper = new WebScraper({
    browserConfig: {
        headless: false,
        proxy: "http://proxy.example.com",
        userAgent: "Custom User Agent"
    }
});
```
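
Since the scraper builds on Playwright, a `browserConfig` like this presumably ends up as Playwright launch and context options. The mapping below is a guess for illustration, not taken from the package: `BrowserConfig`, `PlaywrightOptions`, and `toPlaywrightOptions` are hypothetical names. The target shapes follow Playwright's documented `headless` and `proxy.server` launch options and the `userAgent` context option.

```typescript
// Hypothetical mapping for illustration only; the package may wire this differently.
interface BrowserConfig {
    headless?: boolean;
    proxy?: string;
    userAgent?: string;
}

interface PlaywrightOptions {
    launch: { headless: boolean; proxy?: { server: string } }; // for browserType.launch()
    context: { userAgent?: string };                           // for browser.newContext()
}

function toPlaywrightOptions(config: BrowserConfig): PlaywrightOptions {
    const options: PlaywrightOptions = {
        launch: { headless: config.headless ?? true }, // assumption: headless by default
        context: {},
    };
    if (config.proxy !== undefined) {
        options.launch.proxy = { server: config.proxy };
    }
    if (config.userAgent !== undefined) {
        options.context.userAgent = config.userAgent;
    }
    return options;
}
```

Keeping the user agent on the context rather than the launch options matches how Playwright scopes it: one browser process can host several contexts, each with its own user agent.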

## 🤝 Contributors

- Mike Odnis 💻 🖋 🤔 🚇

## 📄 License

MIT © [Mike Odnis](https://github.com/WomB0ComB0)

> 💙 Built with [`create-typescript-app`](https://github.com/JoshuaKGoldberg/create-typescript-app)