Prompt Design & LLM Judge
https://github.com/anmolian/prompt_eval_llm_judge
- Host: GitHub
- URL: https://github.com/anmolian/prompt_eval_llm_judge
- Owner: Anmolian
- License: MIT
- Created: 2025-02-10T21:46:55.000Z (3 months ago)
- Default Branch: main
- Last Pushed: 2025-02-10T21:59:17.000Z (3 months ago)
- Last Synced: 2025-02-10T22:32:09.854Z (3 months ago)
- Topics: contrastive-cot-prompting, cot-prompting, few-shot-prompting, llm-judge, llms, one-shot-prompting, prompt-engineering, role-playing-prompting, self-consistency-prompting, trec-rag-2024, zero-shot-prompting
- Language: Python
- Homepage:
- Size: 0 Bytes
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README

# Evaluating Prompt Strategies with an LLM Judge

## Technologies Used: LLMs, Prompt Design, OpenAI API
- Designed and implemented seven prompt strategies (Zero-Shot, Few-Shot, Chain-of-Thought, etc.) to systematically test LLM-generated responses on 150 queries from the TREC 2024 RAG Track dataset (MS MARCO V2.1); a sketch of such templates follows this list.
- Developed an LLM Judge, an automated evaluation framework that scored the resulting 1,050 responses (150 queries × 7 strategies) on Relevance, Correctness, Coherence, Conciseness, and Consistency using the GPT-4o-mini API (see the judge sketch below).
- Engineered a Python-based pipeline to automate response generation, scoring, and visualisation, revealing that structured prompting techniques like Chain-of-Thought achieved the highest average normalised score (9.36/10), improving LLM performance on complex reasoning tasks.
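
As a rough illustration of the strategy templates mentioned above, the sketch below shows how three of the seven strategies might be expressed in Python. The template wording, the `PROMPT_TEMPLATES` dict, and the `build_prompt` helper are hypothetical illustrations, not the repository's actual code.

```python
# Hypothetical templates for three of the seven strategies; the wording is
# illustrative only and does not reflect the repository's real prompts.
PROMPT_TEMPLATES = {
    "zero_shot": "Answer the following query.\n\nQuery: {query}",
    "one_shot": (
        "Query: What is retrieval-augmented generation?\n"
        "Answer: A technique that grounds LLM answers in retrieved documents.\n\n"
        "Query: {query}\nAnswer:"
    ),
    "chain_of_thought": (
        "Answer the following query. Think through the problem step by step "
        "before giving your final answer.\n\nQuery: {query}"
    ),
}

def build_prompt(strategy: str, query: str) -> str:
    """Fill the chosen strategy's template with the user query."""
    return PROMPT_TEMPLATES[strategy].format(query=query)
```

With templates like these, running all strategies over all 150 queries is a simple nested loop, which is what yields the 1,050 responses the judge then scores.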
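The judging step can be sketched in a similarly hedged way: one Chat Completions call asking GPT-4o-mini to return per-criterion scores as JSON. Only the model name, the five criteria, and the use of the OpenAI API come from the README; the `JUDGE_PROMPT` wording and the `judge` helper are assumptions for illustration.

```python
import json
from openai import OpenAI  # official openai>=1.0 Python client

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# The five criteria named in the README.
CRITERIA = ["Relevance", "Correctness", "Coherence", "Conciseness", "Consistency"]

# Hypothetical rubric prompt; the repository's actual judge prompt may differ.
JUDGE_PROMPT = (
    "You are an impartial judge. Score the response to the query on each "
    "criterion from 1 to 10. Reply with a JSON object mapping criterion to score.\n\n"
    "Criteria: {criteria}\nQuery: {query}\nResponse: {response}"
)

def judge(query: str, response: str) -> dict[str, int]:
    """Ask GPT-4o-mini to score one response; returns {criterion: score}."""
    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(
                criteria=", ".join(CRITERIA), query=query, response=response
            ),
        }],
        response_format={"type": "json_object"},  # force valid JSON output
        temperature=0,  # deterministic scoring for consistency across runs
    )
    return json.loads(completion.choices[0].message.content)
```

Averaging the five scores per response and normalising to a 10-point scale is one plausible way to arrive at aggregate figures like the 9.36/10 reported for Chain-of-Thought.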
---

*Image credit: [Designed by Freepik](http://www.freepik.com/)*