{"id":19932092,"url":"https://github.com/amazon-science/auto-rag-eval","last_synced_at":"2025-06-28T17:36:12.865Z","repository":{"id":244302476,"uuid":"803206335","full_name":"amazon-science/auto-rag-eval","owner":"amazon-science","description":"Code repo for the ICML 2024 paper \"Automated Evaluation of Retrieval-Augmented Language Models with Task-Specific Exam Generation\"","archived":false,"fork":false,"pushed_at":"2024-06-13T20:50:39.000Z","size":383,"stargazers_count":72,"open_issues_count":3,"forks_count":14,"subscribers_count":3,"default_branch":"main","last_synced_at":"2025-04-06T05:32:24.841Z","etag":null,"topics":["evaluation","genai","llm","machine-learning"],"latest_commit_sha":null,"homepage":"https://arxiv.org/abs/2405.13622","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/amazon-science.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-05-20T09:24:49.000Z","updated_at":"2025-03-25T14:20:23.000Z","dependencies_parsed_at":"2024-06-13T23:44:14.949Z","dependency_job_id":null,"html_url":"https://github.com/amazon-science/auto-rag-eval","commit_stats":null,"previous_names":["amazon-science/auto-rag-eval"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/amazon-science/auto-rag-eval","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/amazon-science%2Fauto-rag-eval","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/amazon-science%2Fauto-rag-eval/tags","releases_url":"https://repos.ecosyste.ms
/api/v1/hosts/GitHub/repositories/amazon-science%2Fauto-rag-eval/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/amazon-science%2Fauto-rag-eval/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/amazon-science","download_url":"https://codeload.github.com/amazon-science/auto-rag-eval/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/amazon-science%2Fauto-rag-eval/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":262470348,"owners_count":23316481,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["evaluation","genai","llm","machine-learning"],"created_at":"2024-11-12T23:09:00.311Z","updated_at":"2025-06-28T17:36:12.847Z","avatar_url":"https://github.com/amazon-science.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Automated Evaluation of Retrieval-Augmented Language Models with Task-Specific Exam Generation\n\nThis repository is the companion to the ICML 2024 paper [Automated Evaluation of Retrieval-Augmented Language Models with Task-Specific Exam Generation](https://arxiv.org/abs/2405.13622) ([Blog](https://www.amazon.science/blog/automated-evaluation-of-rag-pipelines-with-exam-generation))\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"images/generation_summary.png\" alt=\"Alt Text\" width=\"800\"/\u003e\n\u003c/p\u003e\n\n**Goal**: For a given knowledge corpus:\n* Leverage an LLM to generate a multi-choice exam associated with the task of interest.\n* Evaluate variants of 
RAG systems on this exam.\n* Evaluate and iteratively improve the exam.\n\nThe only thing you need to experiment with this code is a `json` file with your knowledge corpus in the format described below.\n\n## I - Package Structure\n\n* `Data`: For each use case, contains:\n  * Preprocessing Code\n  * Knowledge Corpus Data\n  * Exam Data (Raw and Processed)\n  * Retrieval Index\n* `ExamGenerator`: Code to generate and process the multi-choice exam using the knowledge corpus and LLM generator(s).\n* `ExamEvaluator`: Code to evaluate the exam using a combination `(Retrieval System, LLM, ExamCorpus)`, relying on the `lm-harness` library.\n* `LLMServer`: Unified LLM endpoints to generate the exam.\n* `RetrievalSystems`: Unified Retrieval System classes (e.g. DPR, BM25, Embedding Similarity...).\n\n## II - Exam Data Generation Process\n\nWe illustrate our methodology on 4 tasks of interest: AWS DevOps Troubleshooting, StackExchange Q\u0026A, Sec Filings Q\u0026A and Arxiv Q\u0026A. We then show how to adapt the methodology to any task.\n\n### StackExchange\n\nRun the commands below, where `question-date` is the date of the raw data generation. 
Add `--save-exam` if you want to save the exam and remove it if you're only interested in analytics.\n\n```bash\ncd auto-rag-eval\nrm -rf Data/StackExchange/KnowledgeCorpus/main/*\npython3 -m Data.StackExchange.preprocessor\npython3 -m ExamGenerator.question_generator --task-domain StackExchange\npython3 -m ExamGenerator.multi_choice_exam --task-domain StackExchange --question-date \"question-date\" --save-exam\n```\n\n\n### Arxiv\n\n```bash\ncd auto-rag-eval\nrm -rf Data/Arxiv/KnowledgeCorpus/main/*\npython3 -m Data.Arxiv.preprocessor\npython3 -m ExamGenerator.question_generator --task-domain Arxiv\npython3 -m ExamGenerator.multi_choice_exam --task-domain Arxiv --question-date \"question-date\" --save-exam\n```\n\n### Sec Filings\n\n```bash\ncd auto-rag-eval\nrm -rf Data/SecFilings/KnowledgeCorpus/main/*\npython3 -m Data.SecFilings.preprocessor\npython3 -m ExamGenerator.question_generator --task-domain SecFilings\npython3 -m ExamGenerator.multi_choice_exam --task-domain SecFilings --question-date \"question-date\" --save-exam\n```\n\n### Add your own task MyOwnTask\n\n#### Create file structure\n\n```bash\ncd src/llm_automated_exam_evaluation/Data/\nmkdir MyOwnTask\nmkdir MyOwnTask/KnowledgeCorpus\nmkdir MyOwnTask/KnowledgeCorpus/main\nmkdir MyOwnTask/RetrievalIndex\nmkdir MyOwnTask/RetrievalIndex/main\nmkdir MyOwnTask/ExamData\nmkdir MyOwnTask/RawExamData\n```\n\n#### Create documentation corpus\n\nStore in `MyOwnTask/KnowledgeCorpus/main` a `json` file, which contains a list of documents, each with the format below. 
See `DevOps/html_parser.py`, `DevOps/preprocessor.py` or `StackExchange/preprocessor.py` for some examples.\n\n```json\n{\"source\": \"my_own_source\",\n\"docs_id\": \"Doc1022\",\n\"title\": \"Dev Desktop Set Up\",\n\"section\": \"How to [...]\",\n\"text\": \"Documentation Text, should be long enough to make informative questions but short enough to fit into context\",\n\"start_character\": \"N/A\",\n\"end_character\": \"N/A\",\n\"date\": \"N/A\"\n}\n```\n\n#### Generate Exam and Retrieval index\n\nFirst generate the raw exam and the retrieval index.\nNote that you might need to add support for your own LLM, more on this below.\nYou might want to modify the prompt used for the exam generation in the `LLMExamGenerator` class in `ExamGenerator/question_generator.py`.\n\n```bash\npython3 -m ExamGenerator.question_generator --task-domain MyOwnTask\n```\n\nOnce this is done (it can take a couple of hours depending on the documentation size), generate the processed exam.\nTo do so, look up `MyRawExamDate` in `RawExamData` (e.g. 2023091223) and run:\n\n```bash\npython3 -m ExamGenerator.multi_choice_exam --task-domain MyOwnTask --question-date MyRawExamDate --save-exam\n```\n\n### Bring your own LLM\n\nWe currently support endpoints for Bedrock (Claude) in `LLMServer`.\nThe only thing needed to bring your own is a class with an `inference` function that takes a prompt as input and outputs both the prompt and the completed text.\nModify the `LLMExamGenerator` class in `ExamGenerator/question_generator.py` to incorporate it.\nDifferent LLMs generate different types of questions. 
Hence, you might want to modify the raw exam parsing in `ExamGenerator/multi_choice_questions.py`.\nYou can experiment using the `failed_questions.ipynb` notebook from `ExamGenerator`.\n\n## IV - Exam Evaluation Process\n\nWe leverage the [lm-harness](https://github.com/EleutherAI/lm-evaluation-harness/tree/big-refactor) package to evaluate the (LLM\u0026Retrieval) system on the generated exam.\nTo do so, follow the steps below:\n\n### Create a benchmark\n\nCreate a benchmark folder for your task, here `DevOpsExam`; see `ExamEvaluator/DevOpsExam` for the template.\nIt contains a code file `preprocess_exam.py` for prompt templates and, more importantly, a set of tasks to evaluate models on:\n\n* `DevOpsExam` contains the tasks associated with ClosedBook (no retrieval) and OpenBook (Oracle Retrieval).\n* `DevOpsRagExam` contains the tasks associated with Retrieval variants (DPR/Embeddings/BM25...).\n\nThe provided script `task_evaluation.sh` illustrates the evaluation of `Llamav2:Chat:13B` and `Llamav2:Chat:70B` on the task, using In-Context-Learning (ICL) with respectively 0, 1 and 2 samples.\n\n## Citation\n\nTo cite this work, please use\n```bibtex\n@misc{autorageval2024,\n      title={Automated Evaluation of Retrieval-Augmented Language Models with Task-Specific Exam Generation},\n      author={Gauthier Guinet and Behrooz Omidvar-Tehrani and Anoop Deoras and Laurent Callot},\n      year={2024},\n      eprint={2405.13622},\n      archivePrefix={arXiv},\n      primaryClass={cs.CL}\n}\n```\n\n\n## Security\n\nSee [CONTRIBUTING](CONTRIBUTING.md#security-issue-notifications) for more information.\n\n## License\n\nThis project is licensed under the Apache-2.0 License.\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Famazon-science%2Fauto-rag-eval","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Famazon-science%2Fauto-rag-eval","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Famazon-science%2Fauto-rag-eval/lists"}