{"id":25096542,"url":"https://github.com/mjunaidca/prompt_evaluator_ai_workflow","last_synced_at":"2025-04-02T01:40:33.689Z","repository":{"id":275913516,"uuid":"927578466","full_name":"mjunaidca/prompt_evaluator_ai_workflow","owner":"mjunaidca","description":null,"archived":false,"fork":false,"pushed_at":"2025-02-05T08:05:07.000Z","size":61,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-02-05T09:20:19.909Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/mjunaidca.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2025-02-05T07:32:18.000Z","updated_at":"2025-02-05T08:05:10.000Z","dependencies_parsed_at":"2025-02-05T09:20:26.453Z","dependency_job_id":"615b79e2-5d87-47b1-a756-5547fd96ac60","html_url":"https://github.com/mjunaidca/prompt_evaluator_ai_workflow","commit_stats":null,"previous_names":["mjunaidca/prompt_evaluator_ai_workflow"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mjunaidca%2Fprompt_evaluator_ai_workflow","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mjunaidca%2Fprompt_evaluator_ai_workflow/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mjunaidca%2Fprompt_evaluator_ai_workflow/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mjunaidca%2Fprompt_evaluator_ai_workflow/manifests","o
wner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/mjunaidca","download_url":"https://codeload.github.com/mjunaidca/prompt_evaluator_ai_workflow/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":246741187,"owners_count":20826063,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-02-07T16:40:08.016Z","updated_at":"2025-04-02T01:40:33.656Z","avatar_url":"https://github.com/mjunaidca.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Prompt Evaluations Agentic Workflow\n\nThis project implements a robust **Parallelization Workflow** for evaluating system prompt instructions. Rather than processing a complex task with a single LLM call, the workflow splits the task into multiple parallel subtasks and then aggregates the outputs to produce a final evaluation report.\n\n### Mermaid Flowchart\n\n```mermaid\nflowchart TD\n    A[User Input: System Prompt Instructions \u0026 Gold Standard] --\u003e F[Task Simulated Personas]\n    F --\u003e C[Judge Evaluation]\n    C --\u003e D[Prompt Improvement Research]\n    D --\u003e E[Aggregate Final Report]\n\n    style A fill:#663399,stroke:#333,stroke-width:2px\n    style E fill:#663399,stroke:#333,stroke-width:2px\n```\n\nThe workflow leverages two key strategies from Anthropic's Parallelization Workflow:\n\n- **Sectioning:**  \n  The system prompt instructions are used to simulate multiple persona tests in parallel. 
Each persona simulation produces a sample user input (derived from the prompt), an expected outcome (from an externally provided gold standard), and an actual outcome (generated by the LLM using the prompt as a system instruction with the sample input as user query).\n\n- **Voting (Repeated Evaluations):**  \n  By running multiple persona simulations (golden examples), the system gathers a range of outputs. These multiple outputs are then compared to identify variations, ensuring that the aggregated result is robust and reliable.\n\nAfter simulating persona interactions, the workflow uses a **Judge Evaluation** task to review and provide an overall assessment. Finally, an **Industry Research** task leverages the judge’s feedback to supply supporting evidence and actionable recommendations for improving the prompt. All results are then aggregated into a professional Markdown report.\n\n\n### How the Workflow Works\n\n1. **User Input:**  \n   The system accepts three inputs:\n   - **System Prompt Instructions:** A detailed set of instructions for the AI agent (e.g., for a dentist receptionist assistant). These instructions specify how the assistant should process inquiries (e.g., regarding appointment scheduling and patient instructions).\n   - **Gold Standard:** A carefully curated expected outcome that represents the ideal response.\n   - **Number of Persona Runs:** Specifies how many golden examples (persona simulations) to generate.\n\n2. **Simulate Persona Tests (Sectioning):**  \n   Multiple persona simulations are executed concurrently. For each persona:\n   - A sample user query is generated from the system prompt.\n   - The gold standard is used as the expected outcome.\n   - The actual outcome is generated by the LLM using the sample query.\n   This approach mimics real user behavior: the system prompt serves as the system instruction, and the sample query generated from it serves as the user message.\n\n3. 
**Judge Evaluation:**  \n   An expert evaluator (or a dedicated LLM) reviews the concatenated persona simulation details. This task compares the actual outcomes against the gold standard and provides an overall evaluation with actionable recommendations.\n\n4. **Industry Research:**  \n   Using the judge’s evaluation as context, industry research is performed to gather additional supporting evidence and to suggest improvements. This step ensures that recommendations are grounded in best practices and real-world experience.\n\n5. **Final Aggregation:**  \n   All outputs—the system prompt, persona simulation results, judge evaluation, and industry research findings—are aggregated into a final Markdown report. This report is structured into clear sections, making it suitable for internal review, further refinement, or conversion to PDF.\n\n---\n\n### How to Run the Project\n\n#### Prerequisites\n\n- Python 3.10 or higher\n- API Key from Google AI Studio\n- uv (our preferred command-line runner)\n\n#### Installation\n\n1. **Clone the Repository**\n\n   Open your terminal and run:\n\n   ```bash\n   git clone ...\n   ```\n\n2. **Navigate to the Project Directory**\n\n   ```bash\n   cd ...\n   ```\n\n3. **Configure Environment:**\n   - Rename `.env.example` to `.env` and add your `GOOGLE_API_KEY`.\n   - Optionally, set up additional environment variables (e.g., for LangChain tracing).\n\n4. **Install Required Packages:**\n\n   ```bash\n   uv sync\n   ```\n\n#### Running the Workflow\n\nYou can run the workflow in two ways:\n\n- **One-Time Run:**\n\n   ```bash\n   uv run invoke\n   ```\n\n- **Streaming Output (Real-Time Updates):**\n\n   ```bash\n   uv run stream\n   ```\n\n\n### The Science Behind the Approach\n\n- **Parallelization (Sectioning \u0026 Voting):**  \n  The core idea is to split a complex task into independent subtasks that run concurrently. 
This is based on principles from parallel computing (Amdahl’s Law, Gustafson’s Law) and ensemble methods in machine learning, which show that combining multiple outputs can reduce variance and increase reliability.\n\n- **Ensemble \u0026 Aggregation Methods:**  \n  Aggregating multiple persona simulations ensures that any single anomalous output does not skew the final evaluation. This is analogous to ensemble learning methods that improve robustness.\n\n- **Human-AI Collaboration:**  \n  By including a judge evaluation (which may involve human review or a dedicated LLM), the workflow incorporates expert judgment, which has been shown to enhance decision-making quality in hybrid systems.\n\n#### Supporting Research \u0026 References\n\n- **Parallel Computing Research:**  \n  Studies on automatic parallelization confirm that breaking tasks into independent subtasks significantly enhances efficiency.\n  \n- **Ensemble Learning:**  \n  Research on ensemble techniques (bagging, boosting) demonstrates that combining outputs from multiple models leads to more stable and accurate predictions.\n  \n- **Human-AI Collaboration:**  \n  Numerous studies have shown that integrating human expertise with AI-generated outputs leads to higher quality and more actionable results.\n\n---\n\n### Conclusion\n\nThis Prompt Evaluations Agentic Workflow implements the parallelization strategy by splitting the evaluation task into multiple parallel persona simulations, followed by judge evaluation and industry research. The resulting professional Markdown report provides actionable insights for prompt improvement and is ideal for internal review. \n\nBy running the project, you can see the Parallelization Workflow in action—demonstrating both sectioning (by simulating multiple independent user interactions) and voting (aggregating multiple outputs to refine the final evaluation). 
This robust, data-driven approach provides real value for refining AI system prompts.\n\nFeel free to use and adapt this workflow for your internal evaluations and to drive strategic improvements in your AI-based solutions.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmjunaidca%2Fprompt_evaluator_ai_workflow","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmjunaidca%2Fprompt_evaluator_ai_workflow","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmjunaidca%2Fprompt_evaluator_ai_workflow/lists"}