https://github.com/mzazakeith/puppetmaster
Puppeteer & Crawl4AI microservice for web automation, scraping, and AI processing with Bull queues
- Host: GitHub
- URL: https://github.com/mzazakeith/puppetmaster
- Owner: mzazakeith
- Created: 2025-05-01T17:54:20.000Z (5 months ago)
- Default Branch: main
- Last Pushed: 2025-05-03T08:26:28.000Z (5 months ago)
- Topics: agent, ai, automation, bull, bullmq, chrome, crawl4ai, crawler, data, data-extraction, extraction, gemini, llm, llms, openai, playwright, puppeteer, web-automation
- Language: Python
- Size: 39.1 KB
- Stars: 2
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
# PuppetMaster 🤖
A powerful microservice for web automation, scraping, and data processing, integrating Puppeteer for browser control and Crawl4AI for advanced crawling and AI-powered extraction.
[View on DeepWiki](https://deepwiki.com/mzazakeith/PuppetMaster)
## Features
- **Puppeteer Core:**
- 🌐 Headless browser automation with Puppeteer and Chromium
- 🖱️ Standard browser interactions: navigate, click, type, scroll, select
- 🖼️ Screenshot generation (full page or element)
- 📄 PDF generation
- ⚙️ Custom JavaScript evaluation
- **Crawl4AI Integration:**
- 🕷️ Advanced crawling strategies (schema-based, LLM-driven)
- 🧩 Flexible data extraction (CSS, XPath, LLM)
- 🧠 Dynamic schema generation using LLMs
- ✅ Content verification
- 🔗 Deep link crawling
- ⏳ Element waiting and filtering
- 📄 PDF text extraction
- 📝 Webpage to Markdown conversion
- 🌐 Webpage to PDF conversion (via Crawl4AI)
- **System:**
- 🔄 Bull queue system for robust job management (separate queues for Puppeteer & Crawl4AI)
- 📊 MongoDB for job persistence, status tracking, and results storage
- 💾 Local file storage for generated assets (screenshots, PDFs, Markdown files)
- 📈 API endpoints for job management and queue monitoring

## Key Technologies
* **Backend:** Node.js, Express.js
* **Web Automation:** Puppeteer
* **Crawling & AI:** Python, FastAPI, Crawl4AI
* **Job Queue:** BullMQ, Redis
* **Database:** MongoDB (with Mongoose)
* **Language:** JavaScript, Python

## Installation
### Prerequisites
- Node.js (v18 or later recommended)
- npm or yarn
- Python (v3.8 or later recommended)
- pip
- MongoDB (local instance or Atlas)
- Redis (local instance or cloud provider)

### Setup
1. **Clone the repository:**
```bash
git clone https://github.com/mzazakeith/PuppetMaster.git
cd PuppetMaster
```
2. **Install Node.js dependencies:**
```bash
npm install
# or
# yarn install
```
3. **Set up Python environment for Crawl4AI:**
```bash
# Create a virtual environment (recommended)
python3 -m venv .venv
source .venv/bin/activate  # On Windows use `.venv\Scripts\activate`

# Install Python dependencies
pip install -r requirements.txt
```
4. **Configure Environment Variables:**
Create a `.env` file in the project root and configure the following variables:
```dotenv
# Node.js App Configuration
PORT=3000
NODE_ENV=development # or production
MONGODB_URI=mongodb://localhost:27017/puppet-master # Replace with your MongoDB connection string
REDIS_HOST=localhost
REDIS_PORT=6379
RATE_LIMIT_WINDOW_MS=60000
RATE_LIMIT_MAX=100

# Puppeteer Worker Configuration
PUPPETEER_HEADLESS=true # Set to false to run browser in non-headless mode
PUPPETEER_TIMEOUT=60000 # Default timeout for Puppeteer operations (ms)
JOB_CONCURRENCY=2 # Max concurrent Puppeteer jobs

# Crawl4AI Worker & Service Configuration
CRAWL4AI_API_URL=http://localhost:8000 # URL of the Python Crawl4AI service
CRAWL4AI_API_TIMEOUT=120000 # Timeout for requests to Crawl4AI service (ms)
CRAWL4AI_PORT=8000 # Port for the Python Crawl4AI service
JOB_ATTEMPTS=3 # Default Bull queue job attempts
JOB_TIMEOUT=300000 # Default Bull queue job timeout (ms)
# Add necessary API keys for LLM providers if using LLMExtractionStrategy
# Example for OpenAI (only required if using OpenAI models):
# OPENAI_API_KEY=your_openai_api_key

# Example for Google Gemini (only required if using Gemini models):
# GOOGLE_API_KEY=your_google_ai_api_key
```
5. **Start the Services and Workers:**
You can start everything concurrently using the provided npm scripts:
```bash
# For development (with nodemon for Node.js app/worker)
npm run dev:all

# For production
npm run start:all
```
These scripts run the following components:
* Node.js API Server (`src/index.js`) - Also processes jobs from the `crawl4ai-jobs` queue.
* Puppeteer Worker (`src/workers/puppeteer.worker.js`) - Processes jobs from the `puppeteer-jobs` queue.
* Crawl4AI Python Service (`src/crawl4ai/main.py`) - Handles Crawl4AI API requests from the Node.js worker.

Alternatively, you can start components individually:
```bash
# Start Node.js API (Terminal 1)
# This process also handles processing for Crawl4AI jobs.
npm start  # or npm run dev

# Start Puppeteer Worker (Terminal 2)
# Processes only Puppeteer-specific jobs.
npm run start:worker  # or npm run dev:worker

# Start Crawl4AI Python Service (Terminal 3)
npm run start:crawl4ai
# or directly: ./start-crawl4ai.sh
# or: source .venv/bin/activate && python src/crawl4ai/main.py
```

## Architecture Overview
PuppetMaster uses a microservice architecture:
* **Node.js API Server (`src/index.js`):**
* Exposes REST API endpoints for job management and queue monitoring.
* Uses Express.js, Mongoose (for MongoDB interaction), and Bull for queue management.
* Handles incoming job requests, saving them to MongoDB.
* Adds jobs to either the Puppeteer or Crawl4AI Bull queue based on action types (a minimal routing sketch follows this list).
* Processes jobs from the `crawl4ai-jobs` queue by interacting with the Crawl4AI Python Service.
* **Puppeteer Worker (`src/workers/puppeteer.worker.js`):**
* A separate Node.js process that listens to the `puppeteer-jobs` Bull queue.
* Executes Puppeteer-specific browser automation tasks (navigate, click, screenshot, etc.).
* Updates job status and results in MongoDB.
* **Crawl4AI Python Service (`src/crawl4ai/`):**
* A FastAPI application providing endpoints for advanced crawling and extraction tasks.
* Uses the `Crawl4AI` library internally.
* Communicates with the Node.js API/worker process via HTTP requests.
* **Bull Queues (Redis):** Manages job processing, ensuring robustness and retries.
* **MongoDB:** Persists job definitions, status, results, and generated asset metadata.
* **Local File Storage (`/public`):** Stores generated files like screenshots, PDFs, and Markdown files.
* **Error Handling:** Uses a centralized error handler (`src/middleware/errorHandler.js`) providing consistent JSON error responses (see `ApiError` class).
* **Validation:** Incoming requests for specific endpoints (like job creation) are validated using Joi schemas (`src/middleware/validation.js`).
* **Job Model:** Job details, including status, results, assets, and progress, are stored in MongoDB using the schema defined in `src/models/Job.js`.
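The routing decision itself lives in the Node.js API code; the sketch below only illustrates the idea of checking each action type against the Crawl4AI action list (the queue objects, helper name, and handling of the shared `wait` action are illustrative, not taken from the repository):

```javascript
// Illustrative routing sketch: send a job to the crawl4ai-jobs queue if any of its
// actions is Crawl4AI-specific, otherwise to the puppeteer-jobs queue.
const CRAWL4AI_ACTIONS = new Set([
  'crawl', 'extract', 'generateSchema', 'verify', 'crawlLinks',
  'filter', 'extractPDF', 'toMarkdown', 'toPDF',
]);

function pickQueue(job, puppeteerQueue, crawl4aiQueue) {
  // Note: `wait` appears in both action tables; how the real router breaks that tie is not shown here.
  const needsCrawl4ai = job.actions.some((action) => CRAWL4AI_ACTIONS.has(action.type));
  return needsCrawl4ai ? crawl4aiQueue : puppeteerQueue;
}
```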
## API Documentation
The API allows you to create, manage, and monitor automation jobs.
### Base URL: `/api`
### Job Management (`/jobs`)
#### `POST /jobs`
Create a new job. The job will be routed to the appropriate queue (Puppeteer or Crawl4AI) based on its actions.
**Request Body:**
```json
{
"name": "Unique Job Name",
"description": "Optional job description",
"priority": 0, // Optional: Bull queue priority (-100 to 100)
"actions": [
{
"type": "action_type_1", // See Action Types section below
"params": { ... } // Parameters specific to the action type
},
{
"type": "action_type_2",
"params": { ... }
}
// ... more actions
],
"metadata": { ... } // Optional: Any additional data to store with the job
}
```
**Response (Success: 201 Created):**
```json
{
"status": "success",
"message": "Job created successfully",
"data": {
"jobId": "unique-job-id",
"name": "Unique Job Name",
"status": "pending"
}
}
```
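Any HTTP client can create jobs; the snippet below is a minimal usage sketch using Node 18's built-in `fetch` against a local instance (the base URL and payload values are illustrative):

```javascript
// Create a simple screenshot job against a local PuppetMaster instance (illustrative values).
const res = await fetch('http://localhost:3000/api/jobs', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    name: 'Screenshot example.com',
    actions: [
      { type: 'navigate', params: { url: 'https://example.com' } },
      { type: 'screenshot', params: { fullPage: true } },
    ],
  }),
});

const { data } = await res.json();
console.log('Created job:', data.jobId);
```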
#### `GET /jobs`
Get a list of jobs with filtering and pagination.
**Query Parameters:**
* `status` (string, optional): Filter by job status (e.g., `pending`, `processing`, `completed`, `failed`, `cancelled`).
* `page` (number, optional, default: 1): Page number for pagination.
* `limit` (number, optional, default: 10): Number of jobs per page.
* `sort` (string, optional, default: `createdAt`): Field to sort by.
* `order` (string, optional, default: `desc`): Sort order (`asc` or `desc`).

**Response (Success: 200 OK):**
```json
{
"status": "success",
"data": {
"jobs": [ ... ], // Array of job objects
"pagination": {
"total": 100,
"page": 1,
"limit": 10,
"pages": 10
}
}
}
```
#### `GET /jobs/:id`
Get details of a specific job by its `jobId`.
**Response (Success: 200 OK):**
```json
{
"status": "success",
"data": {
"job": { ... } // Full job object
}
}
```
#### `GET /jobs/:id/assets`
Get assets generated by a specific job (e.g., screenshot URLs, PDF URLs).
**Response (Success: 200 OK):**
```json
{
"status": "success",
"data": {
"assets": [
{ "type": "screenshot", "url": "/public/screenshots/...", "createdAt": "..." },
{ "type": "pdf", "url": "/public/pdfs/...", "createdAt": "..." }
// ... other assets like markdown URLs
]
}
}
```
#### `POST /jobs/:id/cancel`
Cancel a pending or processing job.
**Response (Success: 200 OK):**
```json
{
"status": "success",
"message": "Job cancelled successfully",
"data": { "jobId": "unique-job-id" }
}
```
#### `POST /jobs/:id/retry`
Retry a job that has failed. Resets status to `pending` and adds it back to the queue.
**Response (Success: 200 OK):**
```json
{
"status": "success",
"message": "Job retried successfully",
"data": { "jobId": "unique-job-id" }
}
```
#### `DELETE /jobs/:id`
Delete a job from the database and remove it from the queue if pending.
**Response (Success: 200 OK):**
```json
{
"status": "success",
"message": "Job deleted successfully"
}
```

### Queue Management (`/queue`)
#### `GET /queue/metrics`
Get statistics about both the Puppeteer and Crawl4AI job queues.
**Response (Success: 200 OK):**
```json
{
"status": "success",
"data": {
"metrics": {
"puppeteer": { "waiting": 0, "active": 1, "completed": 50, "failed": 2, "delayed": 0, "total": 53 },
"crawl4ai": { "waiting": 2, "active": 0, "completed": 25, "failed": 1, "delayed": 0, "total": 28 },
"total": { "waiting": 2, "active": 1, "completed": 75, "failed": 3, "delayed": 0, "total": 81 }
}
}
}
```
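For basic monitoring, this endpoint can be polled on an interval; a minimal sketch assuming a local instance (base URL and interval are illustrative):

```javascript
// Log overall queue depth every 30 seconds (illustrative monitoring loop).
setInterval(async () => {
  const res = await fetch('http://localhost:3000/api/queue/metrics');
  const { metrics } = (await res.json()).data;
  console.log(`waiting: ${metrics.total.waiting}, active: ${metrics.total.active}, failed: ${metrics.total.failed}`);
}, 30_000);
```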
#### `GET /queue/jobs`
Get jobs currently in the queues based on their state.
**Query Parameters:**
* `types` (string, optional, default: `active,waiting,delayed,failed,completed`): Comma-separated list of job states to retrieve.
* `limit` (number, optional, default: 10): Maximum number of jobs to return across all specified types.

**Response (Success: 200 OK):**
```json
{
"status": "success",
"data": {
"jobs": [
{
"id": "bull-job-id", // Bull queue job ID
"name": "Job Name",
"jobId": "unique-db-job-id", // Database job ID
"timestamp": 1678886400000,
// ... other Bull job details
"state": "active" // or waiting, completed, etc.
}
// ... more jobs
]
}
}
```
#### `DELETE /queue/clear`
**(Admin/Protected Endpoint)** Clears all jobs from all queues (waiting, active, delayed, failed, completed). Use with caution!
**Response (Success: 200 OK):**
```json
{
"status": "success",
"message": "Queue cleared successfully"
}
```
#### `GET /queue/status`
Provides a simple status check for the Node.js API process (not individual workers).
**Response (Success: 200 OK):**
```json
{
"status": "success",
"data": {
"isRunning": true,
"uptime": 12345.67,
"memory": { ... }, // Node.js process memory usage
"cpuUsage": { ... } // Node.js process CPU usage
}
}
```

## Action Types
Jobs consist of a sequence of actions. Each action has a `type` and `params`.
### Puppeteer Actions (Handled by `puppeteer.worker.js`)
| Action Type | Description | Parameters (`params`) | Notes |
| :----------- | :------------------------------- | :--------------------------------------------------------------------------------------------------------------------------------------- | :--------------------------------------------------------- |
| `navigate` | Go to a URL | `url` (string, required) | |
| `scrape` | Extract content from element(s) | `selector` (string, required), `attribute` (string, optional, default: `textContent`), `multiple` (boolean, optional) | `multiple: true` scrapes all matching elements into an array. |
| `click` | Click an element | `selector` (string, required) | |
| `type` | Type text into an input | `selector` (string, required), `value` (string, required), `delay` (number, optional, ms) | |
| `screenshot` | Take a screenshot | `selector` (string, optional), `fullPage` (boolean, optional, default: false) | Saves to `/public/screenshots` and returns the URL. |
| `pdf` | Generate PDF of the current page | `format` (string, optional, e.g., `A4`), `margin` (object, optional, e.g., `{top: '10mm', ...}`), `printBackground` (boolean, optional) | Saves to `/public/pdfs` and returns the URL. |
| `wait` | Wait for element or timeout | `selector` (string, optional), `timeout` (number, optional, ms, default: 30000) | Waits for the element to appear or the specified timeout. |
| `evaluate` | Run custom JavaScript on page | `script` (string, required) - *must be a self-contained function body or expression* | Returns the result of the script evaluation. |
| `scroll` | Scroll page or element | `selector` (string, optional - scrolls element into view), `x` (number, optional - scrolls window), `y` (number, optional - scrolls window) | Scrolls window or brings element into view. |
| `select` | Select an option in a dropdown | `selector` (string, required), `value` (string, required) | |
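As an illustration, an `actions` array combining several of the entries above might look like the following (all selectors and values are made up; note that `script` for `evaluate` must be a self-contained expression or function body):

```javascript
// Illustrative Puppeteer action sequence (selectors and values are examples only).
const actions = [
  { type: 'navigate',   params: { url: 'https://example.com/pricing' } },
  { type: 'wait',       params: { selector: '.price-table', timeout: 15000 } },
  { type: 'scrape',     params: { selector: '.price-table .plan-name', multiple: true } },
  // The script runs in the page context; its return value becomes the action result.
  { type: 'evaluate',   params: { script: 'document.querySelectorAll(".plan").length' } },
  { type: 'screenshot', params: { fullPage: true } },
];
```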
### Crawl4AI Actions (Handled by `crawl4ai.worker.js` via Python Service)
*Note: These actions are forwarded to the Crawl4AI Python microservice. Jobs containing any of these actions will be processed by the `crawl4ai-jobs` queue and `crawl4ai.worker.js`.*
| Action Type | Description | Parameters (`params`) | Notes |
| :--------------- | :--------------------------------------------------------- | :----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | :----------------------------------------------------------------------------------------------------------------------------------------------- |
| `crawl` | Crawl & extract using schema/strategy | `url` (string, required), `schema` (object, optional), `strategy` (string, optional, e.g., `JsonCssExtractionStrategy`, `LLMExtractionStrategy`), `baseSelector` (string, optional), **For LLM:** `llm_provider` (string, e.g., `openai/gpt-4o-mini`, `gemini/gemini-1.5-pro-latest`), `llm_api_key_env_var` (string, e.g., `OPENAI_API_KEY`, `GOOGLE_API_KEY`), `llm_instruction` (string), `llm_extraction_type` (string, `schema` or `block`), `llm_extra_args` (object, optional) | For `LLMExtractionStrategy`, ensure the corresponding API key (`OPENAI_API_KEY` or `GOOGLE_API_KEY`) is set in the `.env` file if the provider requires it. |
| `extract` | Extract specific content (text, html, attribute) | `url` (string, required), `selector` (string, required), `type` (string, optional, default: `text`), `attribute` (string, optional) | Uses Playwright directly in the Python service for extraction. |
| `generateSchema` | Generate extraction schema using LLM | `url` (string, required), `prompt` (string, required), `model` (string, optional, e.g., `openai/gpt-4o-mini`, `gemini/gemini-1.5-pro-latest`) | Requires appropriate API key in `.env` if the provider requires it. |
| `verify` | Verify element existence or content | `url` (string, required), `selector` (string, required), `expected` (string, optional) | Uses Playwright directly in the Python service. |
| `crawlLinks` | Follow links and extract data | `url` (string, required), `link_selector` (string, required), `schema` (object, optional), `max_depth` (number, optional, default: 1) | |
| `wait` (Crawl4AI)| Wait for an element (delegated to Crawl4AI service) | `url` (string, required), `selector` (string, required), `timeout` (number, optional, ms, default: 30000) | Uses Playwright directly in the Python service. |
| `filter` | Filter elements based on condition | `url` (string, required), `selector` (string, required), `condition` (string, e.g., `href.includes("pdf")`, `text.includes("Report")`) | Uses Playwright directly in the Python service. |
| `extractPDF` | Extract text content from a PDF URL | `url` (string, required) | Fetches and parses PDF content. |
| `toMarkdown` | Convert webpage content to Markdown | `url` (string, required), `options` (object, optional, see Crawl4AI docs) | Saves to `/public/markdown` and returns the URL/path. |
| `toPDF` | Convert webpage to PDF (via Crawl4AI) | `url` (string, required) | Saves to `/public/pdfs` and returns the URL/path. |
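As an illustration of the `crawl` action with `LLMExtractionStrategy`, the params below follow the table above; the provider, instruction, and schema shape are made-up examples (see the Crawl4AI docs for the exact schema format), and the matching API key must be set in `.env`:

```javascript
// Illustrative LLM-driven crawl action (provider, instruction, and schema are examples only).
const crawlAction = {
  type: 'crawl',
  params: {
    url: 'https://example.com/blog',
    strategy: 'LLMExtractionStrategy',
    llm_provider: 'openai/gpt-4o-mini',
    llm_api_key_env_var: 'OPENAI_API_KEY', // the key itself must be present in .env
    llm_instruction: 'Extract each article title and its publication date.',
    llm_extraction_type: 'schema',
    schema: {
      title: 'Article title as plain text',
      published_at: 'Publication date in ISO format',
    },
  },
};
```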
## Job Action Execution Flow
PuppetMaster processes jobs containing multiple actions sequentially within a single worker process (either `puppeteer.worker.js` or `crawl4ai.worker.js`, based on the action types).
- **Sequential Execution:** Actions defined in the `actions` array of a job are executed one after another in the order they are listed.
- **State Management:**
- The Puppeteer worker maintains a single browser page instance across actions within a job (e.g., navigating first, then clicking, then scraping).
- The Crawl4AI worker typically sends each action as a separate request to the Python service, which is stateless between requests for different actions within the same job.
- **Result Passing:** **Currently, the result of one action is *not* automatically passed as input to the `params` of the next action.** The parameters for each action are fixed when the job is initially created.
- **Workaround:** For complex workflows requiring intermediate results, you need to:
1. Create a job for the first action(s).
2. Wait for the job to complete and retrieve its result (e.g., a scraped URL) from the API (`GET /jobs/:id`).
3. Create a *new* job for the subsequent action(s), using the retrieved result in its `params`. (A client-side sketch of this chaining flow follows the Mixed Job example below.)
- **Future Enhancement:** Action parameters could support template variables (e.g., `"url": "{{results.action_0.url}}"`) that the worker resolves before executing the action.

### Example: Simple Job (Single Worker)
```json
{
"name": "Login and Scrape Dashboard",
"actions": [
{ "type": "navigate", "params": { "url": "https://example.com/login" } },
{ "type": "type", "params": { "selector": "#username", "value": "user" } },
{ "type": "type", "params": { "selector": "#password", "value": "pass" } },
{ "type": "click", "params": { "selector": "button[type='submit']" } },
{ "type": "wait", "params": { "selector": "#dashboard-title" } }, // Wait for dashboard
{ "type": "scrape", "params": { "selector": ".widget-data", "multiple": true } }
]
}
```
This entire job would be handled by the `puppeteer.worker.js`.

### Example: Mixed Job (Requires Manual Chaining)
```json
// --- JOB 1 ---
{
"name": "Navigate and Get PDF Link",
"actions": [
{ "type": "navigate", "params": { "url": "https://www.example.com/some-page-with-pdf-link" } },
{ "type": "scrape", "params": { "selector": "a.pdf-link", "attribute": "href" } }
// Worker executes these, result saved to DB: { "action_0": { "url": "..." }, "action_1": "https://example.com/document.pdf" }
]
}

// --- After Job 1 completes, retrieve the result (e.g., "https://example.com/document.pdf") ---
// --- JOB 2 ---
{
"name": "Extract PDF Text",
"actions": [
// Use the result from Job 1 here
{ "type": "extractPDF", "params": { "url": "https://example.com/document.pdf" } }
// Worker sends this to Crawl4AI service
]
}
```
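A minimal client-side sketch of this manual chaining, assuming a local instance and Node 18's built-in `fetch` (the polling interval, selectors, and the exact location of the scraped result in the job document are illustrative):

```javascript
// Illustrative chaining client: create Job 1, poll until it completes, then create Job 2.
const BASE = 'http://localhost:3000/api';

async function createJob(job) {
  const res = await fetch(`${BASE}/jobs`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(job),
  });
  return (await res.json()).data.jobId;
}

async function waitForJob(jobId) {
  // Poll GET /jobs/:id until the job reaches a terminal state.
  for (;;) {
    const res = await fetch(`${BASE}/jobs/${jobId}`);
    const { job } = (await res.json()).data;
    if (job.status === 'completed') return job;
    if (job.status === 'failed' || job.status === 'cancelled') {
      throw new Error(`Job ${jobId} ended with status ${job.status}`);
    }
    await new Promise((resolve) => setTimeout(resolve, 2000));
  }
}

const firstJobId = await createJob({
  name: 'Navigate and Get PDF Link',
  actions: [
    { type: 'navigate', params: { url: 'https://www.example.com/some-page-with-pdf-link' } },
    { type: 'scrape', params: { selector: 'a.pdf-link', attribute: 'href' } },
  ],
});
const firstJob = await waitForJob(firstJobId);

// The exact shape of the stored results depends on src/models/Job.js;
// adjust this lookup to wherever the scraped href ends up.
const pdfUrl = firstJob.results?.action_1;
if (!pdfUrl) throw new Error('Scraped PDF link not found in job results');

await createJob({
  name: 'Extract PDF Text',
  actions: [{ type: 'extractPDF', params: { url: pdfUrl } }],
});
```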
## Contributing
Contributions are welcome! Please refer to the contribution guidelines in `CONTRIBUTING.md`.
## License
MIT
## Author
Keith Mzaza