# PuppetMaster 🤖

A powerful microservice for web automation, scraping, and data processing, integrating Puppeteer for browser control and Crawl4AI for advanced crawling and AI-powered extraction.

[Ask DeepWiki](https://deepwiki.com/mzazakeith/PuppetMaster)

## Features

- **Puppeteer Core:**
  - 🌐 Headless browser automation with Puppeteer and Chromium
  - 🖱️ Standard browser interactions: navigate, click, type, scroll, select
  - 🖼️ Screenshot generation (full page or element)
  - 📄 PDF generation
  - ⚙️ Custom JavaScript evaluation
- **Crawl4AI Integration:**
  - 🕷️ Advanced crawling strategies (schema-based, LLM-driven)
  - 🧩 Flexible data extraction (CSS, XPath, LLM)
  - 🧠 Dynamic schema generation using LLMs
  - ✅ Content verification
  - 🔗 Deep link crawling
  - ⏳ Element waiting and filtering
  - 📄 PDF text extraction
  - 📝 Webpage to Markdown conversion
  - 🌐 Webpage to PDF conversion (via Crawl4AI)
- **System:**
  - 🔄 Bull queue system for robust job management (separate queues for Puppeteer & Crawl4AI)
  - 📊 MongoDB for job persistence, status tracking, and results storage
  - 💾 Local file storage for generated assets (screenshots, PDFs, Markdown files)
  - 📈 API endpoints for job management and queue monitoring

## Key Technologies

* **Backend:** Node.js, Express.js
* **Web Automation:** Puppeteer
* **Crawling & AI:** Python, FastAPI, Crawl4AI
* **Job Queue:** BullMQ, Redis
* **Database:** MongoDB (with Mongoose)
* **Language:** JavaScript, Python

## Installation

### Prerequisites

- Node.js (v18 or later recommended)
- npm or yarn
- Python (v3.8 or later recommended)
- pip
- MongoDB (local instance or Atlas)
- Redis (local instance or cloud provider)

### Setup

1. **Clone the repository:**
```bash
git clone https://github.com/mzazakeith/puppetmaster.git PuppetMaster
cd PuppetMaster
```

2. **Install Node.js dependencies:**
```bash
npm install
# or
# yarn install
```

3. **Set up Python environment for Crawl4AI:**
```bash
# Create a virtual environment (recommended)
python3 -m venv .venv
source .venv/bin/activate  # On Windows use: .venv\Scripts\activate

# Install Python dependencies
pip install -r requirements.txt
```

4. **Configure Environment Variables:**
Create a `.env` file in the project root and configure the following variables:

```dotenv
# Node.js App Configuration
PORT=3000
NODE_ENV=development # or production
MONGODB_URI=mongodb://localhost:27017/puppet-master # Replace with your MongoDB connection string
REDIS_HOST=localhost
REDIS_PORT=6379
RATE_LIMIT_WINDOW_MS=60000
RATE_LIMIT_MAX=100

# Puppeteer Worker Configuration
PUPPETEER_HEADLESS=true # Set to false to run browser in non-headless mode
PUPPETEER_TIMEOUT=60000 # Default timeout for Puppeteer operations (ms)
JOB_CONCURRENCY=2 # Max concurrent Puppeteer jobs

# Crawl4AI Worker & Service Configuration
CRAWL4AI_API_URL=http://localhost:8000 # URL of the Python Crawl4AI service
CRAWL4AI_API_TIMEOUT=120000 # Timeout for requests to Crawl4AI service (ms)
CRAWL4AI_PORT=8000 # Port for the Python Crawl4AI service
JOB_ATTEMPTS=3 # Default Bull queue job attempts
JOB_TIMEOUT=300000 # Default Bull queue job timeout (ms)
# Add necessary API keys for LLM providers if using LLMExtractionStrategy
# Example for OpenAI (only required if using OpenAI models):
# OPENAI_API_KEY=your_openai_api_key

# Example for Google Gemini (only required if using Gemini models):
# GOOGLE_API_KEY=your_google_ai_api_key
```

5. **Start the Services and Workers:**

You can start everything concurrently using the provided npm scripts:

```bash
# For development (with nodemon for Node.js app/worker)
npm run dev:all

# For production
npm run start:all
```

These scripts run the following components:
* Node.js API Server (`src/index.js`) - Also processes jobs from the `crawl4ai-jobs` queue.
* Puppeteer Worker (`src/workers/puppeteer.worker.js`) - Processes jobs from the `puppeteer-jobs` queue.
* Crawl4AI Python Service (`src/crawl4ai/main.py`) - Handles Crawl4AI API requests from the Node.js worker.

Alternatively, you can start components individually:

```bash
# Start Node.js API (Terminal 1)
# This process also handles processing for Crawl4AI jobs.
npm start # or npm run dev

# Start Puppeteer Worker (Terminal 2)
# Processes only Puppeteer-specific jobs.
npm run start:worker # or npm run dev:worker

# Start Crawl4AI Python Service (Terminal 3)
npm run start:crawl4ai
# or directly: ./start-crawl4ai.sh
# or: source .venv/bin/activate && python src/crawl4ai/main.py
```

## Architecture Overview

PuppetMaster uses a microservice architecture:

* **Node.js API Server (`src/index.js`):**
  * Exposes REST API endpoints for job management and queue monitoring.
  * Uses Express.js, Mongoose (for MongoDB interaction), and Bull for queue management.
  * Handles incoming job requests and saves them to MongoDB.
  * Adds jobs to either the Puppeteer or Crawl4AI Bull queue based on their action types (see the routing sketch below).
  * Processes jobs from the `crawl4ai-jobs` queue by interacting with the Crawl4AI Python Service.
* **Puppeteer Worker (`src/workers/puppeteer.worker.js`):**
  * A separate Node.js process that listens to the `puppeteer-jobs` Bull queue.
  * Executes Puppeteer-specific browser automation tasks (navigate, click, screenshot, etc.).
  * Updates job status and results in MongoDB.
* **Crawl4AI Python Service (`src/crawl4ai/`):**
  * A FastAPI application providing endpoints for advanced crawling and extraction tasks.
  * Uses the `Crawl4AI` library internally.
  * Communicates with the Node.js API/worker process via HTTP requests.
* **Bull Queues (Redis):** Manage job processing, ensuring robustness and retries.
* **MongoDB:** Persists job definitions, status, results, and generated asset metadata.
* **Local File Storage (`/public`):** Stores generated files such as screenshots, PDFs, and Markdown files.

* **Error Handling:** A centralized error handler (`src/middleware/errorHandler.js`) provides consistent JSON error responses (see the `ApiError` class).
* **Validation:** Incoming requests for specific endpoints (such as job creation) are validated using Joi schemas (`src/middleware/validation.js`).
* **Job Model:** Job details, including status, results, assets, and progress, are stored in MongoDB using the schema defined in `src/models/Job.js`.
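
For orientation, the queue-routing decision can be pictured as a simple check over a job's action types. The sketch below is illustrative only (the real logic lives in `src/index.js`); the Crawl4AI action names are the ones listed in the Action Types tables further down.

```javascript
// Illustrative routing sketch, not the actual implementation in src/index.js.
// Queue names match those described above; the action set mirrors the
// Crawl4AI Actions table below. Note that `wait` exists in both tables and is
// treated as a Puppeteer action here for simplicity.
const CRAWL4AI_ACTIONS = new Set([
  'crawl', 'extract', 'generateSchema', 'verify', 'crawlLinks',
  'filter', 'extractPDF', 'toMarkdown', 'toPDF',
]);

function pickQueueName(job) {
  // A job containing any Crawl4AI action goes to the crawl4ai-jobs queue;
  // otherwise it is handled by the Puppeteer worker.
  const usesCrawl4AI = job.actions.some((action) => CRAWL4AI_ACTIONS.has(action.type));
  return usesCrawl4AI ? 'crawl4ai-jobs' : 'puppeteer-jobs';
}
```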

## API Documentation

The API allows you to create, manage, and monitor automation jobs.

### Base URL: `/api`

### Job Management (`/jobs`)

#### `POST /jobs`

Create a new job. The job will be routed to the appropriate queue (Puppeteer or Crawl4AI) based on its actions.

**Request Body:**

```json
{
  "name": "Unique Job Name",
  "description": "Optional job description",
  "priority": 0, // Optional: Bull queue priority (-100 to 100)
  "actions": [
    {
      "type": "action_type_1", // See Action Types section below
      "params": { ... } // Parameters specific to the action type
    },
    {
      "type": "action_type_2",
      "params": { ... }
    }
    // ... more actions
  ],
  "metadata": { ... } // Optional: Any additional data to store with the job
}
```

**Response (Success: 201 Created):**

```json
{
  "status": "success",
  "message": "Job created successfully",
  "data": {
    "jobId": "unique-job-id",
    "name": "Unique Job Name",
    "status": "pending"
  }
}
```
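
For example, assuming the API is running locally on the default `PORT=3000`, a job can be created with a plain `fetch` call from Node.js 18+ (the payload below is purely illustrative):

```javascript
// Minimal sketch: create a job against a local instance (PORT=3000, base path /api).
// The payload follows the request body documented above; the URL and actions are illustrative.
const res = await fetch('http://localhost:3000/api/jobs', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    name: 'Example screenshot job',
    actions: [
      { type: 'navigate', params: { url: 'https://example.com' } },
      { type: 'screenshot', params: { fullPage: true } },
    ],
  }),
});

const { data } = await res.json();
console.log(data.jobId, data.status); // e.g. "unique-job-id", "pending"
```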

#### `GET /jobs`

Get a list of jobs with filtering and pagination.

**Query Parameters:**

* `status` (string, optional): Filter by job status (e.g., `pending`, `processing`, `completed`, `failed`, `cancelled`).
* `page` (number, optional, default: 1): Page number for pagination.
* `limit` (number, optional, default: 10): Number of jobs per page.
* `sort` (string, optional, default: `createdAt`): Field to sort by.
* `order` (string, optional, default: `desc`): Sort order (`asc` or `desc`).

**Response (Success: 200 OK):**

```json
{
  "status": "success",
  "data": {
    "jobs": [ ... ], // Array of job objects
    "pagination": {
      "total": 100,
      "page": 1,
      "limit": 10,
      "pages": 10
    }
  }
}
```

#### `GET /jobs/:id`

Get details of a specific job by its `jobId`.

**Response (Success: 200 OK):**

```json
{
  "status": "success",
  "data": {
    "job": { ... } // Full job object
  }
}
```
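
Because jobs are processed asynchronously, clients typically poll this endpoint until the job reaches a terminal state. A minimal polling helper (a sketch; the base URL assumes a local instance, and the field names follow the responses shown above):

```javascript
// Illustrative polling helper: wait until a job leaves the pending/processing states.
// Assumes the API runs locally on PORT=3000; adjust baseUrl for other deployments.
async function waitForJob(jobId, { baseUrl = 'http://localhost:3000/api', intervalMs = 2000 } = {}) {
  for (;;) {
    const res = await fetch(`${baseUrl}/jobs/${jobId}`);
    const { data } = await res.json();
    if (!['pending', 'processing'].includes(data.job.status)) {
      return data.job; // completed, failed, or cancelled
    }
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}
```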

#### `GET /jobs/:id/assets`

Get assets generated by a specific job (e.g., screenshot URLs, PDF URLs).

**Response (Success: 200 OK):**

```json
{
  "status": "success",
  "data": {
    "assets": [
      { "type": "screenshot", "url": "/public/screenshots/...", "createdAt": "..." },
      { "type": "pdf", "url": "/public/pdfs/...", "createdAt": "..." }
      // ... other assets such as Markdown URLs
    ]
  }
}
```

#### `POST /jobs/:id/cancel`

Cancel a pending or processing job.

**Response (Success: 200 OK):**

```json
{
  "status": "success",
  "message": "Job cancelled successfully",
  "data": { "jobId": "unique-job-id" }
}
```

#### `POST /jobs/:id/retry`

Retry a job that has failed. Resets status to `pending` and adds it back to the queue.

**Response (Success: 200 OK):**

```json
{
  "status": "success",
  "message": "Job retried successfully",
  "data": { "jobId": "unique-job-id" }
}
```

#### `DELETE /jobs/:id`

Delete a job from the database and remove it from the queue if pending.

**Response (Success: 200 OK):**

```json
{
  "status": "success",
  "message": "Job deleted successfully"
}
```

### Queue Management (`/queue`)

#### `GET /queue/metrics`

Get statistics about both the Puppeteer and Crawl4AI job queues.

**Response (Success: 200 OK):**

```json
{
  "status": "success",
  "data": {
    "metrics": {
      "puppeteer": { "waiting": 0, "active": 1, "completed": 50, "failed": 2, "delayed": 0, "total": 53 },
      "crawl4ai": { "waiting": 2, "active": 0, "completed": 25, "failed": 1, "delayed": 0, "total": 28 },
      "total": { "waiting": 2, "active": 1, "completed": 75, "failed": 3, "delayed": 0, "total": 81 }
    }
  }
}
```

#### `GET /queue/jobs`

Get jobs currently in the queues based on their state.

**Query Parameters:**

* `types` (string, optional, default: `active,waiting,delayed,failed,completed`): Comma-separated list of job states to retrieve.
* `limit` (number, optional, default: 10): Maximum number of jobs to return across all specified types.

**Response (Success: 200 OK):**

```json
{
  "status": "success",
  "data": {
    "jobs": [
      {
        "id": "bull-job-id", // Bull queue job ID
        "name": "Job Name",
        "jobId": "unique-db-job-id", // Database job ID
        "timestamp": 1678886400000,
        // ... other Bull job details
        "state": "active" // or waiting, completed, etc.
      }
      // ... more jobs
    ]
  }
}
```

#### `DELETE /queue/clear`

**(Admin/Protected Endpoint)** Clears all jobs from all queues (waiting, active, delayed, failed, completed). Use with caution!

**Response (Success: 200 OK):**

```json
{
  "status": "success",
  "message": "Queue cleared successfully"
}
```

#### `GET /queue/status`

Provides a simple status check for the Node.js API process (not individual workers).

**Response (Success: 200 OK):**

```json
{
  "status": "success",
  "data": {
    "isRunning": true,
    "uptime": 12345.67,
    "memory": { ... }, // Node.js process memory usage
    "cpuUsage": { ... } // Node.js process CPU usage
  }
}
```

## Action Types

Jobs consist of a sequence of actions. Each action has a `type` and `params`.

### Puppeteer Actions (Handled by `puppeteer.worker.js`)

| Action Type  | Description                      | Parameters (`params`)                                                                                                                        | Notes                                                          |
| :----------- | :------------------------------- | :------------------------------------------------------------------------------------------------------------------------------------------ | :------------------------------------------------------------- |
| `navigate`   | Go to a URL                      | `url` (string, required)                                                                                                                     |                                                                |
| `scrape`     | Extract content from element(s)  | `selector` (string, required), `attribute` (string, optional, default: `textContent`), `multiple` (boolean, optional)                        | `multiple: true` scrapes all matching elements into an array.  |
| `click`      | Click an element                 | `selector` (string, required)                                                                                                                |                                                                |
| `type`       | Type text into an input          | `selector` (string, required), `value` (string, required), `delay` (number, optional, ms)                                                   |                                                                |
| `screenshot` | Take a screenshot                | `selector` (string, optional), `fullPage` (boolean, optional, default: false)                                                                | Saves to `/public/screenshots` and returns the URL.            |
| `pdf`        | Generate PDF of the current page | `format` (string, optional, e.g., `A4`), `margin` (object, optional, e.g., `{top: '10mm', ...}`), `printBackground` (boolean, optional)      | Saves to `/public/pdfs` and returns the URL.                   |
| `wait`       | Wait for element or timeout      | `selector` (string, optional), `timeout` (number, optional, ms, default: 30000)                                                             | Waits for the element to appear or the specified timeout.      |
| `evaluate`   | Run custom JavaScript on page    | `script` (string, required) - *must be a self-contained function body or expression*                                                        | Returns the result of the script evaluation (see the example below the table). |
| `scroll`     | Scroll page or element           | `selector` (string, optional - scrolls element into view), `x` (number, optional - scrolls window), `y` (number, optional - scrolls window) | Scrolls the window or brings the element into view.            |
| `select`     | Select an option in a dropdown   | `selector` (string, required), `value` (string, required)                                                                                   |                                                                |
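
As an illustration of the `evaluate` and `screenshot` parameters above, here is a hypothetical job payload, expressed as a JavaScript object; the URL and selector are placeholders:

```javascript
// Hypothetical job payload combining several Puppeteer actions; values are placeholders.
// The `evaluate` script must be a self-contained expression or function body, as noted above.
const exampleJob = {
  name: 'Evaluate and screenshot example',
  actions: [
    { type: 'navigate', params: { url: 'https://example.com' } },
    { type: 'wait', params: { selector: 'h1', timeout: 15000 } },
    // The evaluation result (the page title here) is stored with the job's results.
    { type: 'evaluate', params: { script: 'document.title' } },
    { type: 'screenshot', params: { fullPage: true } },
  ],
};
```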

### Crawl4AI Actions (Handled by `crawl4ai.worker.js` via Python Service)

*Note: These actions are forwarded to the Crawl4AI Python microservice. Jobs containing any of these actions will be processed by the `crawl4ai-jobs` queue and `crawl4ai.worker.js`.*

| Action Type | Description | Parameters (`params`) | Notes |
| :--------------- | :--------------------------------------------------------- | :----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | :----------------------------------------------------------------------------------------------------------------------------------------------- |
| `crawl` | Crawl & extract using schema/strategy | `url` (string, required), `schema` (object, optional), `strategy` (string, optional, e.g., `JsonCssExtractionStrategy`, `LLMExtractionStrategy`), `baseSelector` (string, optional), **For LLM:** `llm_provider` (string, e.g., `openai/gpt-4o-mini`, `gemini/gemini-1.5-pro-latest`), `llm_api_key_env_var` (string, e.g., `OPENAI_API_KEY`, `GOOGLE_API_KEY`), `llm_instruction` (string), `llm_extraction_type` (string, `schema` or `block`), `llm_extra_args` (object, optional) | For `LLMExtractionStrategy`, ensure the corresponding API key (`OPENAI_API_KEY` or `GOOGLE_API_KEY`) is set in the `.env` file if the provider requires it. |
| `extract` | Extract specific content (text, html, attribute) | `url` (string, required), `selector` (string, required), `type` (string, optional, default: `text`), `attribute` (string, optional) | Uses Playwright directly in the Python service for extraction. |
| `generateSchema` | Generate extraction schema using LLM | `url` (string, required), `prompt` (string, required), `model` (string, optional, e.g., `openai/gpt-4o-mini`, `gemini/gemini-1.5-pro-latest`) | Requires appropriate API key in `.env` if the provider requires it. |
| `verify` | Verify element existence or content | `url` (string, required), `selector` (string, required), `expected` (string, optional) | Uses Playwright directly in the Python service. |
| `crawlLinks` | Follow links and extract data | `url` (string, required), `link_selector` (string, required), `schema` (object, optional), `max_depth` (number, optional, default: 1) | |
| `wait` (Crawl4AI)| Wait for an element (delegated to Crawl4AI service) | `url` (string, required), `selector` (string, required), `timeout` (number, optional, ms, default: 30000) | Uses Playwright directly in the Python service. |
| `filter` | Filter elements based on condition | `url` (string, required), `selector` (string, required), `condition` (string, e.g., `href.includes("pdf")`, `text.includes("Report")`) | Uses Playwright directly in the Python service. |
| `extractPDF` | Extract text content from a PDF URL | `url` (string, required) | Fetches and parses PDF content. |
| `toMarkdown` | Convert webpage content to Markdown | `url` (string, required), `options` (object, optional, see Crawl4AI docs) | Saves to `/public/markdown` and returns the URL/path. |
| `toPDF` | Convert webpage to PDF (via Crawl4AI) | `url` (string, required) | Saves to `/public/pdfs` and returns the URL/path. |
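
The `crawl` action with `LLMExtractionStrategy` has the largest parameter set. The sketch below shows one plausible payload; the parameter names come from the table above, while the URL, instruction, and model are purely illustrative:

```javascript
// Hypothetical `crawl` action using LLMExtractionStrategy; values are illustrative only.
// The provider's API key (here OPENAI_API_KEY) must be set in .env, as noted in the table.
const llmCrawlJob = {
  name: 'LLM extraction example',
  actions: [
    {
      type: 'crawl',
      params: {
        url: 'https://example.com/products',
        strategy: 'LLMExtractionStrategy',
        llm_provider: 'openai/gpt-4o-mini',
        llm_api_key_env_var: 'OPENAI_API_KEY',
        llm_extraction_type: 'schema',
        llm_instruction: 'Extract each product name and price as a JSON object.',
        llm_extra_args: { temperature: 0 },
      },
    },
  ],
};
```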

## Job Action Execution Flow

PuppetMaster processes jobs containing multiple actions sequentially within a single worker process (either `puppeteer.worker.js` or `crawl4ai.worker.js` based on the action types).

- **Sequential Execution:** Actions defined in the `actions` array of a job are executed one after another, in the order they are listed.
- **State Management:**
  - The Puppeteer worker maintains a single browser page instance across actions within a job (e.g., navigating first, then clicking, then scraping).
  - The Crawl4AI worker typically sends each action as a separate request to the Python service, which is stateless between requests for different actions within the same job.
- **Result Passing:** **Currently, the result of one action is *not* automatically passed as input to the `params` of the next action.** The parameters for each action are fixed when the job is created.
- **Workaround:** For complex workflows requiring intermediate results, you need to (see the chaining sketch after the examples below):
  1. Create a job for the first action(s).
  2. Wait for the job to complete and retrieve its result (e.g., a scraped URL) from the API (`GET /jobs/:id`).
  3. Create a *new* job for the subsequent action(s), using the retrieved result in its `params`.
- **Future Enhancement:** Template variables in action parameters (e.g., `"url": "{{results.action_0.url}}"`), resolved by the worker before executing the action, could remove the need for manual chaining.

### Example: Simple Job (Single Worker)

```json
{
  "name": "Login and Scrape Dashboard",
  "actions": [
    { "type": "navigate", "params": { "url": "https://example.com/login" } },
    { "type": "type", "params": { "selector": "#username", "value": "user" } },
    { "type": "type", "params": { "selector": "#password", "value": "pass" } },
    { "type": "click", "params": { "selector": "button[type='submit']" } },
    { "type": "wait", "params": { "selector": "#dashboard-title" } }, // Wait for the dashboard to load
    { "type": "scrape", "params": { "selector": ".widget-data", "multiple": true } }
  ]
}
```
This entire job would be handled by the `puppeteer.worker.js`.

### Example: Mixed Job (Requires Manual Chaining)

```json
// --- JOB 1 ---
{
  "name": "Navigate and Get PDF Link",
  "actions": [
    { "type": "navigate", "params": { "url": "https://www.example.com/some-page-with-pdf-link" } },
    { "type": "scrape", "params": { "selector": "a.pdf-link", "attribute": "href" } }
    // Worker executes these; result saved to DB: { "action_0": { "url": "..." }, "action_1": "https://example.com/document.pdf" }
  ]
}

// --- After Job 1 completes, retrieve the result (e.g., "https://example.com/document.pdf") ---

// --- JOB 2 ---
{
  "name": "Extract PDF Text",
  "actions": [
    // Use the result from Job 1 here
    { "type": "extractPDF", "params": { "url": "https://example.com/document.pdf" } }
    // Worker sends this to the Crawl4AI service
  ]
}
```
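
Programmatically, the same manual chaining amounts to: create Job 1, poll until it finishes, read the scraped link from its results, then create Job 2. A rough sketch follows; it reuses the `waitForJob` helper from the `GET /jobs/:id` section, and the shape of `job.results` is an assumption, so inspect a real job document to confirm where the scraped value is stored.

```javascript
// Rough chaining sketch; assumes a local API (PORT=3000) and the waitForJob()
// polling helper shown earlier. The results shape (action_0, action_1, ...) follows
// the comment in Job 1 above, but is an assumption; verify against a real job document.
const base = 'http://localhost:3000/api';

async function createJob(payload) {
  const res = await fetch(`${base}/jobs`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(payload),
  });
  return (await res.json()).data.jobId;
}

// Job 1: find the PDF link.
const job1Id = await createJob({
  name: 'Navigate and Get PDF Link',
  actions: [
    { type: 'navigate', params: { url: 'https://www.example.com/some-page-with-pdf-link' } },
    { type: 'scrape', params: { selector: 'a.pdf-link', attribute: 'href' } },
  ],
});

const job1 = await waitForJob(job1Id);
const pdfUrl = job1.results?.action_1; // assumption: scrape result keyed by action index

// Job 2: extract the PDF text (routed to the Crawl4AI queue).
await createJob({
  name: 'Extract PDF Text',
  actions: [{ type: 'extractPDF', params: { url: pdfUrl } }],
});
```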

## Contributing

Contributions are welcome! Please refer to the contribution guidelines.

## License

MIT

## Author

Keith Mzaza