{"id":28089223,"url":"https://github.com/mzazakeith/puppetmaster","last_synced_at":"2025-05-13T12:55:09.765Z","repository":{"id":291232775,"uuid":"976216021","full_name":"mzazakeith/PuppetMaster","owner":"mzazakeith","description":"Puppeteer \u0026 Crawl4AI microservice for web automation, scraping, and AI processing with Bull queues","archived":false,"fork":false,"pushed_at":"2025-05-03T08:26:28.000Z","size":40,"stargazers_count":2,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-05-03T09:36:12.491Z","etag":null,"topics":["agent","ai","automation","bull","bullmq","chrome","crawl4ai","crawler","data","data-extraction","extraction","gemini","llm","llms","openai","playwright","puppeteer","web-automation"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/mzazakeith.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-05-01T17:54:20.000Z","updated_at":"2025-05-03T08:26:32.000Z","dependencies_parsed_at":"2025-05-03T09:37:36.096Z","dependency_job_id":"5b87531e-38ff-4b9e-80a7-7328000e9627","html_url":"https://github.com/mzazakeith/PuppetMaster","commit_stats":null,"previous_names":["mzazakeith/puppetmaster"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mzazakeith%2FPuppetMaster","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mzazakeith%2FPuppetMaster/tags","releases_url":"https://repos.ecosys
te.ms/api/v1/hosts/GitHub/repositories/mzazakeith%2FPuppetMaster/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mzazakeith%2FPuppetMaster/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/mzazakeith","download_url":"https://codeload.github.com/mzazakeith/PuppetMaster/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":253948331,"owners_count":21988953,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["agent","ai","automation","bull","bullmq","chrome","crawl4ai","crawler","data","data-extraction","extraction","gemini","llm","llms","openai","playwright","puppeteer","web-automation"],"created_at":"2025-05-13T12:55:09.176Z","updated_at":"2025-05-13T12:55:09.737Z","avatar_url":"https://github.com/mzazakeith.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# PuppetMaster 🤖\n\nA powerful microservice for web automation, scraping, and data processing, integrating Puppeteer for browser control and Crawl4AI for advanced crawling and AI-powered extraction.\n\n[\u003cimg src=\"https://devin.ai/assets/askdeepwiki.png\" alt=\"Ask https://DeepWiki.com\" height=\"20\"/\u003e](https://deepwiki.com/mzazakeith/PuppetMaster)\n\n## Features\n\n- **Puppeteer Core:**\n  - 🌐 Headless browser automation with Puppeteer and Chromium\n  - 🖱️ Standard browser interactions: navigate, click, type, scroll, select\n  - 🖼️ Screenshot generation (full page or element)\n  - 📄 PDF generation\n  - ⚙️ Custom JavaScript evaluation\n- **Crawl4AI 
Integration:**\n  - 🕷️ Advanced crawling strategies (schema-based, LLM-driven)\n  - 🧩 Flexible data extraction (CSS, XPath, LLM)\n  - 🧠 Dynamic schema generation using LLMs\n  - ✅ Content verification\n  - 🔗 Deep link crawling\n  - ⏳ Element waiting and filtering\n  - 📄 PDF text extraction\n  - 📝 Webpage to Markdown conversion\n  - 🌐 Webpage to PDF conversion (via Crawl4AI)\n- **System:**\n  - 🔄 Bull queue system for robust job management (separate queues for Puppeteer \u0026 Crawl4AI)\n  - 📊 MongoDB for job persistence, status tracking, and results storage\n  - 💾 Local file storage for generated assets (screenshots, PDFs, Markdown files)\n  - 📈 API endpoints for job management and queue monitoring\n\n## Key Technologies\n\n*   **Backend:** Node.js, Express.js\n*   **Web Automation:** Puppeteer\n*   **Crawling \u0026 AI:** Python, FastAPI, Crawl4AI\n*   **Job Queue:** BullMQ, Redis\n*   **Database:** MongoDB (with Mongoose)\n*   **Language:** JavaScript, Python\n\n## Installation\n\n### Prerequisites\n\n- Node.js (v18 or later recommended)\n- npm or yarn\n- Python (v3.8 or later recommended)\n- pip\n- MongoDB (local instance or Atlas)\n- Redis (local instance or cloud provider)\n\n### Setup\n\n1.  **Clone the repository:**\n    ```bash\n    git clone \u003crepository-url\u003e\n    cd PuppetMaster\n    ```\n\n2.  **Install Node.js dependencies:**\n    ```bash\n    npm install\n    # or\n    # yarn install\n    ```\n\n3.  **Set up Python environment for Crawl4AI:**\n    ```bash\n    # Create a virtual environment (recommended)\n    python3 -m venv .venv\n    source .venv/bin/activate  # On Windows use `.venv\\\\Scripts\\\\activate`\n\n    # Install Python dependencies\n    pip install -r requirements.txt\n    ```\n\n4.  
**Configure Environment Variables:**\n    Create a `.env` file in the project root and configure the following variables:\n\n    ```dotenv\n    # Node.js App Configuration\n    PORT=3000\n    NODE_ENV=development # or production\n    MONGODB_URI=mongodb://localhost:27017/puppet-master # Replace with your MongoDB connection string\n    REDIS_HOST=localhost\n    REDIS_PORT=6379\n    RATE_LIMIT_WINDOW_MS=60000\n    RATE_LIMIT_MAX=100\n\n    # Puppeteer Worker Configuration\n    PUPPETEER_HEADLESS=true # Set to false to run browser in non-headless mode\n    PUPPETEER_TIMEOUT=60000 # Default timeout for Puppeteer operations (ms)\n    JOB_CONCURRENCY=2 # Max concurrent Puppeteer jobs\n\n    # Crawl4AI Worker \u0026 Service Configuration\n    CRAWL4AI_API_URL=http://localhost:8000 # URL of the Python Crawl4AI service\n    CRAWL4AI_API_TIMEOUT=120000 # Timeout for requests to Crawl4AI service (ms)\n    CRAWL4AI_PORT=8000 # Port for the Python Crawl4AI service\n    JOB_ATTEMPTS=3 # Default Bull queue job attempts\n    JOB_TIMEOUT=300000 # Default Bull queue job timeout (ms)\n    # Add necessary API keys for LLM providers if using LLMExtractionStrategy\n    # Example for OpenAI (only required if using OpenAI models):\n    # OPENAI_API_KEY=your_openai_api_key\n\n    # Example for Google Gemini (only required if using Gemini models):\n    # GOOGLE_API_KEY=your_google_ai_api_key\n    ```\n\n5.  
**Start the Services and Workers:**\n\n    You can start everything concurrently using the provided npm scripts:\n\n    ```bash\n    # For development (with nodemon for Node.js app/worker)\n    npm run dev:all\n\n    # For production\n    npm run start:all\n    ```\n\n    These scripts run the following components:\n    *   Node.js API Server (`src/index.js`) - Also processes jobs from the `crawl4ai-jobs` queue.\n    *   Puppeteer Worker (`src/workers/puppeteer.worker.js`) - Processes jobs from the `puppeteer-jobs` queue.\n    *   Crawl4AI Python Service (`src/crawl4ai/main.py`) - Handles Crawl4AI API requests from the Node.js worker.\n\n    Alternatively, you can start components individually:\n\n    ```bash\n    # Start Node.js API (Terminal 1)\n    # This process also handles processing for Crawl4AI jobs.\n    npm start  # or npm run dev\n\n    # Start Puppeteer Worker (Terminal 2)\n    # Processes only Puppeteer-specific jobs.\n    npm run start:worker # or npm run dev:worker\n\n    # Start Crawl4AI Python Service (Terminal 3)\n    npm run start:crawl4ai\n    # or directly: ./start-crawl4ai.sh\n    # or: source .venv/bin/activate \u0026\u0026 python src/crawl4ai/main.py\n    ```\n\n## Architecture Overview\n\nPuppetMaster uses a microservice architecture:\n\n*   **Node.js API Server (`src/index.js`):** \n    *   Exposes REST API endpoints for job management and queue monitoring.\n    *   Uses Express.js, Mongoose (for MongoDB interaction), and Bull for queue management.\n    *   Handles incoming job requests, saving them to MongoDB.\n    *   Adds jobs to either the Puppeteer or Crawl4AI Bull queue based on action types.\n    *   Processes jobs from the `crawl4ai-jobs` queue by interacting with the Crawl4AI Python Service.\n*   **Puppeteer Worker (`src/workers/puppeteer.worker.js`):**\n    *   A separate Node.js process that listens to the `puppeteer-jobs` Bull queue.\n    *   Executes Puppeteer-specific browser automation tasks (navigate, click, screenshot, 
etc.).\n    *   Updates job status and results in MongoDB.\n*   **Crawl4AI Python Service (`src/crawl4ai/`):**\n    *   A FastAPI application providing endpoints for advanced crawling and extraction tasks.\n    *   Uses the `Crawl4AI` library internally.\n    *   Communicates with the Node.js API/worker process via HTTP requests.\n*   **Bull Queues (Redis):** Manages job processing, ensuring robustness and retries.\n*   **MongoDB:** Persists job definitions, status, results, and generated asset metadata.\n*   **Local File Storage (`/public`):** Stores generated files like screenshots, PDFs, and Markdown files.\n\n*   **Error Handling:** Uses a centralized error handler (`src/middleware/errorHandler.js`) providing consistent JSON error responses (see `ApiError` class).\n*   **Validation:** Incoming requests for specific endpoints (like job creation) are validated using Joi schemas (`src/middleware/validation.js`).\n*   **Job Model:** Job details, including status, results, assets, and progress, are stored in MongoDB using the schema defined in `src/models/Job.js`.\n\n## API Documentation\n\nThe API allows you to create, manage, and monitor automation jobs.\n\n### Base URL: `/api`\n\n### Job Management (`/jobs`)\n\n#### `POST /jobs`\n\nCreate a new job. The job will be routed to the appropriate queue (Puppeteer or Crawl4AI) based on its actions.\n\n**Request Body:**\n\n```json\n{\n  \"name\": \"Unique Job Name\",\n  \"description\": \"Optional job description\",\n  \"priority\": 0, // Optional: Bull queue priority (-100 to 100)\n  \"actions\": [\n    {\n      \"type\": \"action_type_1\", // See Action Types section below\n      \"params\": { ... } // Parameters specific to the action type\n    },\n    {\n      \"type\": \"action_type_2\",\n      \"params\": { ... }\n    }\n    // ... more actions\n  ],\n  \"metadata\": { ... 
} // Optional: Any additional data to store with the job\n}\n```\n\n**Response (Success: 201 Created):**\n\n```json\n{\n  \"status\": \"success\",\n  \"message\": \"Job created successfully\",\n  \"data\": {\n    \"jobId\": \"unique-job-id\",\n    \"name\": \"Unique Job Name\",\n    \"status\": \"pending\"\n  }\n}\n```\n\n#### `GET /jobs`\n\nGet a list of jobs with filtering and pagination.\n\n**Query Parameters:**\n\n*   `status` (string, optional): Filter by job status (e.g., `pending`, `processing`, `completed`, `failed`, `cancelled`).\n*   `page` (number, optional, default: 1): Page number for pagination.\n*   `limit` (number, optional, default: 10): Number of jobs per page.\n*   `sort` (string, optional, default: `createdAt`): Field to sort by.\n*   `order` (string, optional, default: `desc`): Sort order (`asc` or `desc`).\n\n**Response (Success: 200 OK):**\n\n```json\n{\n  \"status\": \"success\",\n  \"data\": {\n    \"jobs\": [ ... ], // Array of job objects\n    \"pagination\": {\n      \"total\": 100,\n      \"page\": 1,\n      \"limit\": 10,\n      \"pages\": 10\n    }\n  }\n}\n```\n\n#### `GET /jobs/:id`\n\nGet details of a specific job by its `jobId`.\n\n**Response (Success: 200 OK):**\n\n```json\n{\n  \"status\": \"success\",\n  \"data\": {\n    \"job\": { ... } // Full job object\n  }\n}\n```\n\n#### `GET /jobs/:id/assets`\n\nGet assets generated by a specific job (e.g., screenshot URLs, PDF URLs).\n\n**Response (Success: 200 OK):**\n\n```json\n{\n  \"status\": \"success\",\n  \"data\": {\n    \"assets\": [\n      { \"type\": \"screenshot\", \"url\": \"/public/screenshots/...\", \"createdAt\": \"...\" },\n      { \"type\": \"pdf\", \"url\": \"/public/pdfs/...\", \"createdAt\": \"...\" }\n      // ... 
other assets like markdown URLs\n    ]\n  }\n}\n```\n\n#### `POST /jobs/:id/cancel`\n\nCancel a pending or processing job.\n\n**Response (Success: 200 OK):**\n\n```json\n{\n  \"status\": \"success\",\n  \"message\": \"Job cancelled successfully\",\n  \"data\": { \"jobId\": \"unique-job-id\" }\n}\n```\n\n#### `POST /jobs/:id/retry`\n\nRetry a job that has failed. Resets status to `pending` and adds it back to the queue.\n\n**Response (Success: 200 OK):**\n\n```json\n{\n  \"status\": \"success\",\n  \"message\": \"Job retried successfully\",\n  \"data\": { \"jobId\": \"unique-job-id\" }\n}\n```\n\n#### `DELETE /jobs/:id`\n\nDelete a job from the database and remove it from the queue if pending.\n\n**Response (Success: 200 OK):**\n\n```json\n{\n  \"status\": \"success\",\n  \"message\": \"Job deleted successfully\"\n}\n```\n\n### Queue Management (`/queue`)\n\n#### `GET /queue/metrics`\n\nGet statistics about both the Puppeteer and Crawl4AI job queues.\n\n**Response (Success: 200 OK):**\n\n```json\n{\n  \"status\": \"success\",\n  \"data\": {\n    \"metrics\": {\n      \"puppeteer\": { \"waiting\": 0, \"active\": 1, \"completed\": 50, \"failed\": 2, \"delayed\": 0, \"total\": 53 },\n      \"crawl4ai\": { \"waiting\": 2, \"active\": 0, \"completed\": 25, \"failed\": 1, \"delayed\": 0, \"total\": 28 },\n      \"total\": { \"waiting\": 2, \"active\": 1, \"completed\": 75, \"failed\": 3, \"delayed\": 0, \"total\": 81 }\n    }\n  }\n}\n```\n\n#### `GET /queue/jobs`\n\nGet jobs currently in the queues based on their state.\n\n**Query Parameters:**\n\n*   `types` (string, optional, default: `active,waiting,delayed,failed,completed`): Comma-separated list of job states to retrieve.\n*   `limit` (number, optional, default: 10): Maximum number of jobs to return across all specified types.\n\n**Response (Success: 200 OK):**\n\n```json\n{\n  \"status\": \"success\",\n  \"data\": {\n    \"jobs\": [\n      {\n        \"id\": \"bull-job-id\", // Bull queue job ID\n        \"name\": 
\"Job Name\",\n        \"jobId\": \"unique-db-job-id\", // Database job ID\n        \"timestamp\": 1678886400000,\n        // ... other Bull job details\n        \"state\": \"active\" // or waiting, completed, etc.\n      }\n      // ... more jobs\n    ]\n  }\n}\n```\n\n#### `DELETE /queue/clear`\n\n**(Admin/Protected Endpoint)** Clears all jobs from all queues (waiting, active, delayed, failed, completed). Use with caution!\n\n**Response (Success: 200 OK):**\n\n```json\n{\n  \"status\": \"success\",\n  \"message\": \"Queue cleared successfully\"\n}\n```\n\n#### `GET /queue/status`\n\nProvides a simple status check for the Node.js API process (not individual workers).\n\n**Response (Success: 200 OK):**\n\n```json\n{\n  \"status\": \"success\",\n  \"data\": {\n    \"isRunning\": true,\n    \"uptime\": 12345.67,\n    \"memory\": { ... }, // Node.js process memory usage\n    \"cpuUsage\": { ... } // Node.js process CPU usage\n  }\n}\n```\n\n## Action Types\n\nJobs consist of a sequence of actions. Each action has a `type` and `params`.\n\n### Puppeteer Actions (Handled by `puppeteer.worker.js`)\n\n| Action Type  | Description                      | Parameters (`params`)                                                                                                | Notes                                                                                                               |\n| :----------- | :------------------------------- | :------------------------------------------------------------------------------------------------------------------- | :------------------------------------------------------------------------------------------------------------------ |\n| `navigate`   | Go to a URL                      | `url` (string, required)                                                                                             |\n| `scrape`     | Extract content from element(s)  | `selector` (string, required), `attribute` (string, optional, default: `textContent`), `multiple` (boolean, optional) | `multiple: true` scrapes all matching elements into an array.                                                           
|\n| `click`      | Click an element                 | `selector` (string, required)                                                                                        |\n| `type`       | Type text into an input          | `selector` (string, required), `value` (string, required), `delay` (number, optional, ms)                             |\n| `screenshot` | Take a screenshot                | `selector` (string, optional), `fullPage` (boolean, optional, default: false)                                        | Saves to `/public/screenshots` and returns the URL.                                                                 |\n| `pdf`        | Generate PDF of the current page | `format` (string, optional, e.g., `A4`), `margin` (object, optional, e.g., `{top: '10mm', ...}`), `printBackground` (boolean, optional) | Saves to `/public/pdfs` and returns the URL.                                                                        |\n| `wait`       | Wait for element or timeout      | `selector` (string, optional), `timeout` (number, optional, ms, default: 30000)                                       | Waits for the element to appear or the specified timeout.                                                           |\n| `evaluate`   | Run custom JavaScript on page    | `script` (string, required) - *Must be a self-contained function body or expression*                                 | Returns the result of the script evaluation.                                                                        |\n| `scroll`     | Scroll page or element           | `selector` (string, optional - scrolls element into view), `x` (number, optional - scrolls window), `y` (number, optional - scrolls window) | Scrolls window or brings element into view.                                                                         
|\n| `select`     | Select an option in a dropdown   | `selector` (string, required), `value` (string, required)                                                            |\n\n### Crawl4AI Actions (Handled by `crawl4ai.worker.js` via Python Service)\n\n*Note: These actions are forwarded to the Crawl4AI Python microservice. Jobs containing any of these actions will be processed by the `crawl4ai-jobs` queue and `crawl4ai.worker.js`.*\n\n| Action Type      | Description                                                | Parameters (`params`)                                                                                                                                                                                                                                                                                                                      | Notes                                                                                                                                            |\n| :--------------- | :--------------------------------------------------------- | :----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | :----------------------------------------------------------------------------------------------------------------------------------------------- |\n| `crawl`          | Crawl \u0026 extract using schema/strategy                    | `url` (string, required), `schema` (object, optional), `strategy` (string, optional, e.g., `JsonCssExtractionStrategy`, `LLMExtractionStrategy`), `baseSelector` (string, optional), **For LLM:** `llm_provider` (string, e.g., `openai/gpt-4o-mini`, `gemini/gemini-1.5-pro-latest`), `llm_api_key_env_var` (string, e.g., `OPENAI_API_KEY`, 
`GOOGLE_API_KEY`), `llm_instruction` (string), `llm_extraction_type` (string, `schema` or `block`), `llm_extra_args` (object, optional) | For `LLMExtractionStrategy`, ensure the corresponding API key (`OPENAI_API_KEY` or `GOOGLE_API_KEY`) is set in the `.env` file if the provider requires it. |\n| `extract`        | Extract specific content (text, html, attribute)           | `url` (string, required), `selector` (string, required), `type` (string, optional, default: `text`), `attribute` (string, optional)                                                                                                                                                                                                         | Uses Playwright directly in the Python service for extraction.                                                                                   |\n| `generateSchema` | Generate extraction schema using LLM                     | `url` (string, required), `prompt` (string, required), `model` (string, optional, e.g., `openai/gpt-4o-mini`, `gemini/gemini-1.5-pro-latest`)                                                                                                                                                                                                    | Requires appropriate API key in `.env` if the provider requires it.                                                                              |\n| `verify`         | Verify element existence or content                        | `url` (string, required), `selector` (string, required), `expected` (string, optional)                                                                                                                                                                                                                                                     | Uses Playwright directly in the Python service.                                                                                                  
|\n| `crawlLinks`     | Follow links and extract data                              | `url` (string, required), `link_selector` (string, required), `schema` (object, optional), `max_depth` (number, optional, default: 1)                                                                                                                                                                                                         |                                                                                                                                                  |\n| `wait` (Crawl4AI)| Wait for an element (delegated to Crawl4AI service)      | `url` (string, required), `selector` (string, required), `timeout` (number, optional, ms, default: 30000)                                                                                                                                                                                                                                 | Uses Playwright directly in the Python service.                                                                                                  |\n| `filter`         | Filter elements based on condition                         | `url` (string, required), `selector` (string, required), `condition` (string, e.g., `href.includes(\"pdf\")`, `text.includes(\"Report\")`)                                                                                                                                                                                                   | Uses Playwright directly in the Python service.                                                                                                  
|\n| `extractPDF`     | Extract text content from a PDF URL                        | `url` (string, required)                                                                                                                                                                                                                                                                                                                   | Fetches and parses PDF content.                                                                                                                  |\n| `toMarkdown`     | Convert webpage content to Markdown                        | `url` (string, required), `options` (object, optional, see Crawl4AI docs)                                                                                                                                                                                                                                                                  | Saves to `/public/markdown` and returns the URL/path.                                                                                            |\n| `toPDF`          | Convert webpage to PDF (via Crawl4AI)                      | `url` (string, required)                                                                                                                                                                                                                                                                                                                   | Saves to `/public/pdfs` and returns the URL/path.                                                                                                
|\n\n## Job Action Execution Flow\n\nPuppetMaster processes jobs containing multiple actions sequentially within a single worker process (either `puppeteer.worker.js` or `crawl4ai.worker.js` based on the action types).\n\n- **Sequential Execution:** Actions defined in the `actions` array of a job are executed one after another in the order they are listed.\n- **State Management:**\n    - The Puppeteer worker maintains a single browser page instance across actions within a job (e.g., navigating first, then clicking, then scraping).\n    - The Crawl4AI worker typically sends each action as a separate request to the Python service, which is stateless between requests for different actions within the same job.\n- **Result Passing:** **Currently, the result of one action is *not* automatically passed as input to the `params` of the next action.** The parameters for each action are fixed when the job is initially created.\n    - **Workaround:** For complex workflows requiring intermediate results, you need to:\n        1.  Create a job for the first action(s).\n        2.  Wait for the job to complete and retrieve its result (e.g., a scraped URL) from the API (`GET /jobs/:id`).\n        3.  
Create a *new* job for the subsequent action(s), using the retrieved result in its `params`.\n    - **Future Enhancement:** A potential future enhancement could involve allowing template variables in action parameters (e.g., `\"url\": \"{{results.action_0.url}}\"`), which the worker would resolve before executing the action.\n\n### Example: Simple Job (Single Worker)\n\n```json\n{\n  \"name\": \"Login and Scrape Dashboard\",\n  \"actions\": [\n    { \"type\": \"navigate\", \"params\": { \"url\": \"https://example.com/login\" } },\n    { \"type\": \"type\", \"params\": { \"selector\": \"#username\", \"value\": \"user\" } },\n    { \"type\": \"type\", \"params\": { \"selector\": \"#password\", \"value\": \"pass\" } },\n    { \"type\": \"click\", \"params\": { \"selector\": \"button[type='submit']\" } },\n    { \"type\": \"wait\", \"params\": { \"selector\": \"#dashboard-title\" } }, // Wait for dashboard\n    { \"type\": \"scrape\", \"params\": { \"selector\": \".widget-data\", \"multiple\": true } }\n  ]\n}\n```\nThis entire job would be handled by the `puppeteer.worker.js`.\n\n### Example: Mixed Job (Requires Manual Chaining)\n\n```json\n// --- JOB 1 ---\n{\n  \"name\": \"Navigate and Get PDF Link\",\n  \"actions\": [\n    { \"type\": \"navigate\", \"params\": { \"url\": \"https://www.example.com/some-page-with-pdf-link\" } },\n    { \"type\": \"scrape\", \"params\": { \"selector\": \"a.pdf-link\", \"attribute\": \"href\" } }\n    // Worker executes these, result saved to DB: { \"action_0\": { \"url\": \"...\" }, \"action_1\": \"https://example.com/document.pdf\" }\n  ]\n}\n\n// --- After Job 1 completes, retrieve the result (e.g., \"https://example.com/document.pdf\") ---\n\n// --- JOB 2 ---\n{\n  \"name\": \"Extract PDF Text\",\n  \"actions\": [\n    // Use the result from Job 1 here\n    { \"type\": \"extractPDF\", \"params\": { \"url\": \"https://example.com/document.pdf\" } }\n    // Worker sends this to Crawl4AI service\n  ]\n}\n```\n\n## 
Contributing\n\nContributions are welcome! Please refer to the contribution guidelines in `CONTRIBUTING.md`.\n\n## License\n\nMIT\n\n## Author\n\nKeith Mzaza\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmzazakeith%2Fpuppetmaster","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmzazakeith%2Fpuppetmaster","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmzazakeith%2Fpuppetmaster/lists"}