Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/womb0comb0/enterprise-ai-recursive-web-scraper
- Host: GitHub
- URL: https://github.com/womb0comb0/enterprise-ai-recursive-web-scraper
- Owner: WomB0ComB0
- License: mit
- Created: 2024-11-15T22:08:52.000Z (3 months ago)
- Default Branch: main
- Last Pushed: 2024-11-26T00:51:38.000Z (3 months ago)
- Last Synced: 2025-01-24T09:43:58.062Z (21 days ago)
- Topics: gemini-api, npm-package, playwright, puppeteer, typescript
- Language: TypeScript
- Homepage: https://www.npmjs.com/package/enterprise-ai-recursive-web-scraper
- Size: 7.52 MB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- Contributing: .github/CONTRIBUTING.md
- License: LICENSE.md
- Code of conduct: .github/CODE_OF_CONDUCT.md
- Security: .github/SECURITY.md
README
# Enterprise AI Recursive Web Scraper
Advanced AI-powered recursive web scraper utilizing Groq LLMs, Puppeteer, and Playwright for intelligent content extraction
## Features
* **High Performance**:
- Blazing fast multi-threaded scraping with concurrent processing
- Smart rate limiting to prevent API throttling and server overload
- Automatic request queuing and retry mechanisms
* **AI-Powered**: Intelligent content extraction using Groq LLMs
* **Multi-Browser**: Support for Chromium, Firefox, and WebKit
* **Smart Extraction**:
- Structured data extraction without LLMs using CSS selectors
- Topic-based and semantic chunking strategies
- Cosine similarity clustering for content deduplication
* **Advanced Capabilities**:
- Recursive domain crawling with boundary respect
- Intelligent rate limiting with token bucket algorithm
- Session management for complex multi-page flows
- Custom JavaScript execution support
- Enhanced screenshot capture with lazy-load detection
- iframe content extraction
* **Enterprise Ready**:
- Proxy support with authentication
- Custom headers and user-agent configuration
- Comprehensive error handling and retry mechanisms
- Flexible timeout and rate limit management
- Detailed logging and monitoring

## Quick Start
To install the package, run:
```bash
npm install enterprise-ai-recursive-web-scraper
```

### Using the CLI
The `enterprise-ai-recursive-web-scraper` package includes a command-line interface (CLI) that you can use to perform web scraping tasks directly from the terminal.
#### Installation
Ensure that the package is installed globally to use the CLI:
```bash
npm install -g enterprise-ai-recursive-web-scraper
```

#### Running the CLI
Once installed, you can use the `web-scraper` command to start scraping. Here's a basic example of how to use it:
```bash
web-scraper --api-key YOUR_API_KEY --url https://example.com --output ./output
```

#### CLI Options
- `-k, --api-key <key>`: **(Required)** Your Google Gemini API key
- `-u, --url <url>`: **(Required)** The URL of the website to scrape
- `-o, --output <directory>`: Output directory for scraped data (default: `scraping_output`)
- `-d, --depth <number>`: Maximum crawl depth (default: `3`)
- `-c, --concurrency <number>`: Concurrent scraping limit (default: `5`)
- `-r, --rate-limit <number>`: Requests per second (default: `5`)
- `-t, --timeout <ms>`: Request timeout in milliseconds (default: `30000`)
- `-f, --format <format>`: Output format: `json`, `csv`, or `markdown` (default: `json`)
- `-v, --verbose`: Enable verbose logging
- `--retry-attempts <number>`: Number of retry attempts (default: `3`)
- `--retry-delay <ms>`: Delay between retries in ms (default: `1000`)

Example usage with rate limiting:
```bash
web-scraper --api-key YOUR_API_KEY --url https://example.com --output ./output \
--depth 5 --concurrency 10 --rate-limit 2 --retry-attempts 3 --format csv --verbose
```

## Advanced Usage
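The feature list describes rate limiting via a token bucket. As background, here is a minimal self-contained sketch of that algorithm in TypeScript. It is an illustration of the technique only, not the package's actual code; the `maxTokens` and `refillRate` names simply mirror the `RateLimiter` options used in the next section:

```typescript
// Minimal token-bucket sketch: `maxTokens` caps burst size and
// `refillRate` is tokens added per second. Illustrative only.
class TokenBucket {
  private tokens: number;
  private lastRefill: number;

  constructor(
    private readonly maxTokens: number,
    private readonly refillRate: number,
    now: number = Date.now(),
  ) {
    this.tokens = maxTokens;
    this.lastRefill = now;
  }

  // Refill based on elapsed time, then try to take one token.
  // Returns false when the caller should wait or queue the request.
  tryConsume(now: number = Date.now()): boolean {
    const elapsedSeconds = (now - this.lastRefill) / 1000;
    this.tokens = Math.min(
      this.maxTokens,
      this.tokens + elapsedSeconds * this.refillRate,
    );
    this.lastRefill = now;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }
}
```

A request is dispatched only when `tryConsume` returns `true`; otherwise it is delayed, which is what keeps burst traffic under the configured requests-per-second ceiling.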
### Rate Limiting Configuration
Configure rate limiting to respect server limits and prevent throttling:
```typescript
import { WebScraper, RateLimiter } from "enterprise-ai-recursive-web-scraper";

const scraper = new WebScraper({
rateLimiter: new RateLimiter({
maxTokens: 5, // Maximum number of tokens
refillRate: 1, // Tokens refilled per second
retryAttempts: 3, // Number of retry attempts
retryDelay: 1000 // Delay between retries (ms)
})
});
```

### Structured Data Extraction
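The feature list also mentions cosine-similarity clustering for deduplicating extracted content. As background, here is a minimal self-contained sketch of that metric over term-frequency vectors; it illustrates the idea only and is not the package's implementation:

```typescript
// Build a term-frequency vector from raw text.
function termFrequencies(text: string): Map<string, number> {
  const tf = new Map<string, number>();
  for (const word of text.toLowerCase().split(/\W+/).filter(Boolean)) {
    tf.set(word, (tf.get(word) ?? 0) + 1);
  }
  return tf;
}

// Cosine similarity: dot(a, b) / (|a| * |b|), in [0, 1] for
// non-negative counts. 1 means identical term distributions.
function cosineSimilarity(
  a: Map<string, number>,
  b: Map<string, number>,
): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (const [word, count] of a) {
    dot += count * (b.get(word) ?? 0);
    normA += count * count;
  }
  for (const count of b.values()) normB += count * count;
  return normA && normB ? dot / Math.sqrt(normA * normB) : 0;
}
```

Two pages whose similarity exceeds some threshold (say 0.9, an illustrative value) could then be clustered together and stored once.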
To extract structured data using a JSON schema, you can use the `JsonExtractionStrategy`:
```typescript
import { WebScraper, JsonExtractionStrategy } from "enterprise-ai-recursive-web-scraper";

const schema = {
baseSelector: "article",
fields: [
{ name: "title", selector: "h1" },
{ name: "content", selector: ".content" },
{ name: "date", selector: "time", attribute: "datetime" }
]
};

const scraper = new WebScraper({
extractionStrategy: new JsonExtractionStrategy(schema)
});
```

### Custom Browser Session
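The browser configuration below takes the proxy as a plain string. If you assemble that string dynamically (for example from environment variables), the standard WHATWG `URL` API can validate it before it reaches the browser. This is a small sketch independent of the package:

```typescript
// Validate a proxy string with the built-in URL parser before passing
// it to browserConfig. `new URL()` throws a TypeError on malformed input.
function parseProxy(proxy: string): { protocol: string; host: string } {
  const url = new URL(proxy);
  return { protocol: url.protocol.replace(":", ""), host: url.host };
}
```

Failing fast here gives a clear error at configuration time instead of an opaque browser launch failure later.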
You can customize the browser session with specific configurations:
```typescript
import { WebScraper } from "enterprise-ai-recursive-web-scraper";

const scraper = new WebScraper({
browserConfig: {
headless: false,
proxy: "http://proxy.example.com",
userAgent: "Custom User Agent"
}
});
```

## Contributors
Mike Odnis
## License
MIT © [Mike Odnis](https://github.com/WomB0ComB0)
> Built with [`create-typescript-app`](https://github.com/JoshuaKGoldberg/create-typescript-app)