{"id":21839094,"url":"https://github.com/illuin-tech/grouse","last_synced_at":"2025-04-14T10:35:55.174Z","repository":{"id":256570711,"uuid":"824500776","full_name":"illuin-tech/grouse","owner":"illuin-tech","description":"Evaluate Grounded Question Answering models and Grounded Question Answering evaluator models","archived":false,"fork":false,"pushed_at":"2024-12-30T09:37:21.000Z","size":1319,"stargazers_count":12,"open_issues_count":0,"forks_count":2,"subscribers_count":6,"default_branch":"main","last_synced_at":"2025-04-12T13:17:47.395Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/illuin-tech.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":"CITATION.cff","codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2024-07-05T09:10:08.000Z","updated_at":"2025-01-29T19:14:20.000Z","dependencies_parsed_at":"2024-09-11T20:29:56.535Z","dependency_job_id":"677adc80-0372-4c97-a396-0ff199e31d64","html_url":"https://github.com/illuin-tech/grouse","commit_stats":null,"previous_names":["illuin-tech/grouse"],"tags_count":8,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/illuin-tech%2Fgrouse","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/illuin-tech%2Fgrouse/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/illuin-tech%2Fgrouse/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/illuin-tech%2Fgrouse/manifests","owner_url":"https://
repos.ecosyste.ms/api/v1/hosts/GitHub/owners/illuin-tech","download_url":"https://codeload.github.com/illuin-tech/grouse/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248863634,"owners_count":21174043,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-27T21:15:54.614Z","updated_at":"2025-04-14T10:35:55.154Z","avatar_url":"https://github.com/illuin-tech.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# GroUSE\n\n[![arXiv](https://img.shields.io/badge/arXiv-2409.06595-b31b1b.svg?style=for-the-badge)](https://arxiv.org/abs/2409.06595)\n[![Hugging Face](https://img.shields.io/badge/Grouse_Dataset-FFD21E?style=for-the-badge\u0026logo=huggingface\u0026logoColor=000)](https://huggingface.co/datasets/illuin/grouse)\n[![Blog](https://img.shields.io/badge/Blog-Check%20it%20out-blue?style=for-the-badge)](https://huggingface.co/spaces/illuin/grouse)\n[![Tutorial](https://img.shields.io/badge/Tutorial-Get%20started-purple?style=for-the-badge)](https://github.com/NirDiamant/RAG_Techniques/blob/main/evaluation/evaluation_grouse.ipynb)\n\n---\n\nEvaluate Grounded Question Answering (GQA) models and GQA evaluator models. 
We implement the evaluation methods described in GroUSE: A Benchmark to Evaluate Evaluators in Grounded Question Answering.\n\n- [Install](#install)\n- [Command Line Usage](#command-line-usage)\n  - [Evaluation of the Grounded Question Answering task](#evaluation-of-the-grounded-question-answering-task)\n  - [Unit Testing of Evaluators with GroUSE](#unit-testing-of-evaluators-with-grouse)\n  - [Plot Matrices of unit tests success](#plot-matrices-of-unit-tests-success)\n- [Python Usage](#python-usage)\n- [Links](#links)\n- [Citation](#citation)\n\n## Install\n\n```bash\npip install grouse\n```\n\nThen, set up your OpenAI credentials: copy the `.env.dist` file to `.env`, fill in your OpenAI API key and organization ID, and export the environment variables with `export $(cat .env | xargs)`.\n\n## Command Line Usage\n\n### Evaluation of the Grounded Question Answering task\n\nYou can build a dataset in a `jsonl` file with the following format per line:\n\n```json\n{\n    \"references\": [\"\", ...], // List of references\n    \"input\": \"\", // Query\n    \"actual_output\": \"\", // Predicted answer generated by the model we want to evaluate\n    \"expected_output\": \"\" // Ground truth answer to the input\n}\n```\n\nSee `example_data/grounded_qa.jsonl` for an example.\n\nThen, run this command:\n\n```bash\ngrouse evaluate {PATH_TO_DATASET_WITH_GENERATIONS} outputs/gpt-4o\n```\n\nWe recommend using GPT-4 as the evaluator model because we optimised the prompts for it, but you can change the model and prompts using the optional arguments:\n- `--evaluator_model_name`: Name of the evaluator model. It can be any LiteLLM model. The default model is GPT-4.\n- `--prompts_path`: Path to the folder containing the prompts of the evaluator. 
By default, the prompts are those optimized for GPT-4.\n\n### Unit Testing of Evaluators with GroUSE\n\nMeta-evaluation consists of evaluating GQA evaluators with the GroUSE unit tests.\n\n```bash\ngrouse meta-evaluate gpt-4o meta-outputs/gpt-4o\n```\n\nOptional arguments:\n- `--prompts_path`: Path to the folder containing the prompts of the evaluator. By default, the prompts are those optimized for GPT-4.\n- `--train_set`: Optional flag to meta-evaluate on the train set (16 tests) instead of the test set (144 tests). The train set is meant to be used during the prompt engineering phase.\n\n### Plot Matrices of unit tests success\n\nYou can plot the results of the unit tests as matrices:\n\n```bash\ngrouse plot meta-outputs/gpt-4o\n```\n\nThe resulting matrices look like this:\n\n![result_matrices_plot](assets/result_matrices_plot.png)\n\n## Python Usage\n\n```python\nfrom grouse import EvaluationSample, GroundedQAEvaluator\n\nsample = EvaluationSample(\n    input=\"What is the capital of France?\",\n    # Replace this with the actual output from your LLM application\n    actual_output=\"The capital of France is Marseille.[1]\",\n    expected_output=\"The capital of France is Paris.[1]\",\n    references=[\"Paris is the capital of France.\"]\n)\nevaluator = GroundedQAEvaluator()\nevaluator.evaluate([sample])\n```\n\n### Tutorial\n\nSee this [tutorial](https://github.com/NirDiamant/RAG_Techniques/blob/main/evaluation/evaluation_grouse.ipynb) for worked examples to get started.\n\n## Links\n\n- [Paper](https://arxiv.org/abs/2409.06595)\n- [Unit Tests](https://huggingface.co/datasets/illuin/grouse)\n- [Finetuned model](https://huggingface.co/illuin/llama-3-grouse)\n\n## Citation\n\n```latex\n@misc{muller2024grousebenchmarkevaluateevaluators,\n      title={GroUSE: A Benchmark to Evaluate Evaluators in Grounded Question Answering}, \n      author={Sacha Muller and António Loison and Bilel Omrani and Gautier Viaud},\n      year={2024},\n      
eprint={2409.06595},\n      archivePrefix={arXiv},\n      primaryClass={cs.CL},\n      url={https://arxiv.org/abs/2409.06595}, \n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Filluin-tech%2Fgrouse","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Filluin-tech%2Fgrouse","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Filluin-tech%2Fgrouse/lists"}