Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/mlabonne/llm-autoeval
Automatically evaluate your LLMs in Google Colab
- Host: GitHub
- URL: https://github.com/mlabonne/llm-autoeval
- Owner: mlabonne
- License: mit
- Created: 2024-01-01T23:13:42.000Z (about 1 year ago)
- Default Branch: master
- Last Pushed: 2024-05-07T15:28:17.000Z (9 months ago)
- Last Synced: 2025-01-11T03:14:24.053Z (10 days ago)
- Language: Python
- Size: 135 KB
- Stars: 574
- Watchers: 7
- Forks: 94
- Open Issues: 12
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- awesome-production-machine-learning - LLM AutoEval - LLM AutoEval simplifies the process of evaluating LLMs using a convenient Colab notebook. (Evaluation and Monitoring)
README
🧐 LLM AutoEval
🐦 Follow me on X •
🤗 Hugging Face •
💻 Blog •
📙 Hands-on GNN
Simplify LLM evaluation using a convenient Colab notebook.
## 🔍 Overview
LLM AutoEval **simplifies the process of evaluating LLMs** using a convenient [Colab notebook](https://colab.research.google.com/drive/1Igs3WZuXAIv9X0vwqiE90QlEPys8e8Oa?usp=sharing). You just need to specify the name of your model, a benchmark, a GPU, and press run!
### Key Features
* Automated setup and execution using [RunPod](https://runpod.io?ref=9nvk2srl).
* Customizable evaluation parameters for tailored benchmarking.
* Summary generation and upload to [GitHub Gist](https://gist.github.com/) for easy sharing and reference. [View a sample summary here.](https://gist.github.com/mlabonne/1d33c86824b3a11d2308e36db1ba41c1)
*Note: This project is in the early stages and primarily designed for personal use. Use it carefully and feel free to contribute.*
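For reference, the summary upload mentioned in the key features corresponds to a single call to GitHub's Gist REST API. Below is a minimal, hedged sketch of such an upload using `requests`; the function name, gist filename, and description are illustrative placeholders, not the project's actual code.

```python
import requests

def upload_summary_to_gist(summary_md: str, github_token: str, public: bool = True) -> str:
    """Create a GitHub Gist holding a benchmark summary and return its URL.

    Generic sketch of the public Gist API, not the project's exact implementation.
    """
    response = requests.post(
        "https://api.github.com/gists",
        headers={
            "Authorization": f"token {github_token}",
            "Accept": "application/vnd.github+json",
        },
        json={
            "description": "LLM AutoEval benchmark summary",  # illustrative description
            "public": public,
            "files": {"summary.md": {"content": summary_md}},  # illustrative filename
        },
        timeout=30,
    )
    response.raise_for_status()
    return response.json()["html_url"]
```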
## ⚡ Quick Start
### Evaluation
* **`MODEL_ID`**: Enter the model id from Hugging Face.
* **`BENCHMARK`**:
* `nous`: List of tasks: AGIEval, GPT4ALL, TruthfulQA, and Bigbench (popularized by [Teknium](https://github.com/teknium1) and [NousResearch](https://github.com/NousResearch)). This is recommended.
* `lighteval`: This is a [new library](https://github.com/huggingface/lighteval) from Hugging Face. It allows you to specify your tasks as shown in the readme. Check the list of [recommended tasks](https://github.com/huggingface/lighteval/blob/main/examples/tasks/recommended_set.txt) to see what you can use (e.g., HELM, PIQA, GSM8K, MATH, etc.)
* `openllm`: List of tasks: ARC, HellaSwag, MMLU, Winogrande, GSM8K, and TruthfulQA (like the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)). It uses the [vllm](https://docs.vllm.ai/) implementation to enhance speed (note that the results will not be identical to those obtained without using vllm). "mmlu" is currently missing because of a problem with vllm.
* **`LIGHTEVAL_TASK`**: You can select one or several tasks as specified in the [readme](https://github.com/huggingface/lighteval?tab=readme-ov-file#usage) or in the list of [recommended tasks](https://github.com/huggingface/lighteval/blob/main/examples/tasks/recommended_set.txt).
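Conceptually, the evaluation options above boil down to a few strings that the Colab notebook exposes as form fields. The values below are illustrative placeholders rather than defaults:

```python
# Illustrative values for the evaluation options; the notebook exposes them as Colab form fields.
MODEL_ID = "mistralai/Mistral-7B-Instruct-v0.2"  # any Hugging Face model id
BENCHMARK = "nous"                               # "nous", "lighteval", or "openllm"

# Only used when BENCHMARK == "lighteval". The string below is a placeholder;
# copy valid task strings from lighteval's recommended_set.txt.
LIGHTEVAL_TASK = "lighteval|gsm8k|5|1"
```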
### Cloud GPU
* **`GPU`**: Select the GPU you want for evaluation (see prices [here](https://www.runpod.io/console/gpu-cloud)). I recommend using beefy GPUs (RTX 3090 or higher), especially for the Open LLM benchmark suite.
* **`Number of GPUs`**: Self-explanatory (adding GPUs is more cost-efficient than renting a bigger GPU if you only need more VRAM).
* **`CONTAINER_DISK`**: Size of the disk in GB.
* **`CLOUD_TYPE`**: RunPod offers a community cloud (cheaper) and a secure cloud (more reliable).
* **`REPO`**: If you made a fork of this repo, you can specify its URL here (the image only runs `runpod.sh`).
* **`TRUST_REMOTE_CODE`**: Models like Phi require this flag to run.
* **`PRIVATE_GIST`**: (W.I.P.) Make the Gist with the results private (true) or public (false).
* **`DEBUG`**: If enabled, the pod will not be destroyed at the end of the run (not recommended).
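Grouped together, the cloud settings above look roughly like this; the names mirror the notebook fields and the values are illustrative placeholders:

```python
# Illustrative cloud settings mirroring the fields above.
GPU = "NVIDIA GeForce RTX 3090"   # RunPod GPU type
NUMBER_OF_GPUS = 1
CONTAINER_DISK = 100              # disk size in GB
CLOUD_TYPE = "COMMUNITY"          # "COMMUNITY" (cheaper) or "SECURE" (more reliable)
REPO = "https://github.com/mlabonne/llm-autoeval"  # or the URL of your fork
TRUST_REMOTE_CODE = False         # set to True for models like Phi
PRIVATE_GIST = False              # make the results Gist private (work in progress)
DEBUG = False                     # True keeps the pod alive after the run
```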
### Tokens
Tokens are read from Colab's Secrets tab. Create two secrets called "runpod" and "github" and add the corresponding tokens, which you can find as follows:
* **`RUNPOD_TOKEN`**: Please consider using my [referral link](https://runpod.io?ref=9nvk2srl) if you don't have an account yet. You can create your token [here](https://www.runpod.io/console/user/settings) under "API keys" (read & write permission). You'll also need to transfer some money there to start a pod.
* **`GITHUB_TOKEN`**: You can create your token [here](https://github.com/settings/tokens) (read & write, can be restricted to "gist" only).
* **`HF_TOKEN`**: Optional. You can find your Hugging Face token [here](https://huggingface.co/settings/tokens) if you have an account.
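In Colab, the secrets above are read through `google.colab.userdata`, and the RunPod key is then handed to the `runpod` Python SDK to start a pod. The sketch below uses the SDK's public `create_pod` helper, but the image name, environment variable names, and exact arguments are assumptions rather than the notebook's actual values:

```python
from google.colab import userdata  # Colab Secrets tab
import runpod

RUNPOD_TOKEN = userdata.get("runpod")  # secret named "runpod"
GITHUB_TOKEN = userdata.get("github")  # secret named "github"

runpod.api_key = RUNPOD_TOKEN

# Hedged sketch of a pod launch; the image name and env variable names are
# illustrative placeholders, not the notebook's exact configuration.
pod = runpod.create_pod(
    name="llm-autoeval",
    image_name="your-docker-image:latest",    # placeholder image
    gpu_type_id="NVIDIA GeForce RTX 3090",
    cloud_type="COMMUNITY",                   # or "SECURE"
    gpu_count=1,
    container_disk_in_gb=100,
    env={
        "MODEL_ID": "mistralai/Mistral-7B-Instruct-v0.2",
        "BENCHMARK": "nous",
        "GITHUB_TOKEN": GITHUB_TOKEN,         # placeholder variable name
    },
)
print(f"Started pod {pod['id']}")
```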
## 📊 Benchmark suites
### Nous
You can compare your results with:
* [YALL - Yet Another LLM Leaderboard](https://huggingface.co/spaces/mlabonne/Yet_Another_LLM_Leaderboard), my leaderboard made with the gists produced by LLM AutoEval.
* Models like [OpenHermes-2.5-Mistral-7B](https://huggingface.co/teknium/OpenHermes-2.5-Mistral-7B#benchmark-results), [Nous-Hermes-2-SOLAR-10.7B](https://huggingface.co/NousResearch/Nous-Hermes-2-SOLAR-10.7B), or [Nous-Hermes-2-Yi-34B](https://huggingface.co/NousResearch/Nous-Hermes-2-Yi-34B).
* Teknium stores his evaluations in his [LLM-Benchmark-Logs](https://github.com/teknium1/LLM-Benchmark-Logs).
### Lighteval
You can compare your results on a case-by-case basis, depending on the tasks you have selected.
### Open LLM
You can compare your results with those listed on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).
## 🏆 Leaderboard
I use the summaries produced by LLM AutoEval to create [YALL - Yet Another LLM Leaderboard](https://huggingface.co/spaces/mlabonne/Yet_Another_LLM_Leaderboard), with plots like the following:
![image](https://github.com/mlabonne/llm-autoeval/assets/81252890/a9a7e24a-ee29-4c8f-b587-4549d16bf142)
Let me know if you're interested in building your own leaderboard from your gists in one click; the space can easily be converted into a small notebook for that purpose.
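If you want to prototype something similar yourself, the gists produced by LLM AutoEval can be listed through GitHub's public REST API. The sketch below only fetches a user's public gists and keeps the markdown files; the filtering heuristic is an assumption, and parsing the summaries into scores is left out:

```python
import requests

def list_summary_gists(username: str) -> list[dict]:
    """Return description/raw_url entries for a user's public markdown gists.

    Generic GitHub API usage, not the code behind the YALL space.
    """
    response = requests.get(
        f"https://api.github.com/users/{username}/gists",
        headers={"Accept": "application/vnd.github+json"},
        timeout=30,
    )
    response.raise_for_status()
    summaries = []
    for gist in response.json():
        for filename, meta in gist["files"].items():
            if filename.endswith(".md"):  # heuristic: AutoEval summaries are markdown
                summaries.append({"description": gist["description"], "raw_url": meta["raw_url"]})
    return summaries

# Example: list_summary_gists("mlabonne")
```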
## 🛠️ Troubleshooting
* **"Error: File does not exist"**: This task didn't produce the JSON file that is parsed for the summary. Activate debug mode and rerun the evaluation to inspect the issue in the logs.
* **"700 Killed" Error**: The hardware is not powerful enough for the evaluation. This happens, for example, when you try to run the Open LLM benchmark suite on an RTX 3070.
* **Outdated CUDA Drivers**: That's unlucky. You'll need to start a new pod in this case.
* **"triu_tril_cuda_template" not implemented for 'BFloat16'**: Switch the image as explained in [this issue](https://github.com/mlabonne/llm-autoeval/issues/22).
## Acknowledgements
Special thanks to [burtenshaw](https://github.com/burtenshaw) for integrating lighteval, EleutherAI for the [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness), [dmahan93](https://github.com/dmahan93) for his fork that adds AGIEval to the lm-evaluation-harness, Hugging Face for the [lighteval](https://github.com/huggingface/lighteval) library, [NousResearch](https://github.com/NousResearch) and [Teknium](https://github.com/teknium1) for the Nous benchmark suite, and [vllm](https://docs.vllm.ai/) for the additional inference speed.