{"id":31236467,"url":"https://github.com/codecentric/llm-eval","last_synced_at":"2025-09-22T16:02:28.452Z","repository":{"id":299623353,"uuid":"1003472266","full_name":"codecentric/llm-eval","owner":"codecentric","description":"A flexible, extensible, and reproducible framework for evaluating LLM workflows, applications, retrieval-augmented generation pipelines, and standalone models across custom and standard datasets.","archived":false,"fork":false,"pushed_at":"2025-08-15T09:20:09.000Z","size":6052,"stargazers_count":12,"open_issues_count":5,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-08-15T09:20:43.870Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"TypeScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/codecentric.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":".github/SECURITY_SCANNING.md","support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-06-17T07:46:42.000Z","updated_at":"2025-08-15T07:29:53.000Z","dependencies_parsed_at":"2025-07-23T14:32:20.781Z","dependency_job_id":"448e1e70-a931-49aa-9441-4e6c47a5fa45","html_url":"https://github.com/codecentric/llm-eval","commit_stats":null,"previous_names":["codecentric/llm-eval"],"tags_count":15,"template":false,"template_full_name":null,"purl":"pkg:github/codecentric/llm-eval","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/codecentric%2Fllm-eval","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/codecentric%2Fllm-eval/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/re
positories/codecentric%2Fllm-eval/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/codecentric%2Fllm-eval/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/codecentric","download_url":"https://codeload.github.com/codecentric/llm-eval/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/codecentric%2Fllm-eval/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":276430839,"owners_count":25641123,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-09-22T02:00:08.972Z","response_time":79,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-09-22T16:00:45.045Z","updated_at":"2025-09-22T16:02:28.434Z","avatar_url":"https://github.com/codecentric.png","language":"TypeScript","readme":"# LLM-Eval\n\nA flexible, extensible, and reproducible framework for evaluating LLM workflows, applications, retrieval-augmented generation pipelines, and standalone models across custom and standard datasets.\n\n## 🚀 Key Features\n\n- 📚 **Document-Based Q\u0026A Generation**: Transform your technical documentation, guides, and knowledge bases into comprehensive question-answer test catalogs\n- 📊 **Multi-Dimensional Evaluation Metrics**:\n  - ✅ **Answer Relevancy**: Measures how well responses address the actual question\n  - 🧠 **G-Eval**: 
Sophisticated evaluation using other LLMs as judges\n  - 🔍 **Faithfulness**: Assesses adherence to source material facts\n  - 🚫 **Hallucination Detection**: Identifies fabricated information not present in source documents\n- 📈 **Long-Term Quality Tracking**:\n  - 📆 **Temporal Performance Analysis**: Monitor model degradation or improvement over time\n  - 🔄 **Regression Testing**: Automatically detect when model updates negatively impact performance\n  - 📊 **Trend Visualization**: Track quality metrics across model versions with interactive charts\n- 🔄 **Universal Compatibility**: Seamlessly works with all OpenAI-compatible endpoints including local solutions like Ollama\n- 🏷️ **Version Control for Q\u0026A Catalogs**: Easily track changes in your evaluation sets over time\n- 📊 **Comparative Analysis**: Visualize performance differences between models on identical question sets\n- 🚀 **Batch Processing**: Evaluate multiple models simultaneously for efficient workflows\n- 🔌 **Extensible Plugin System**: Add new providers, metrics, and dataset generation techniques\n\n### Available Providers\n\n- **OpenAI**: Integrate and evaluate models from OpenAI's API, including support for custom base URLs, temperature, and language control\n- **Azure OpenAI**: Use Azure-hosted OpenAI models with deployment, API version, and custom language output support\n- **C4**: Connect to C4 endpoints for LLM evaluation with custom configuration and API key support\n\n## 📖 Table of Contents\n\n1. [🚀 Key Features](#-key-features)\n2. [📖 Table of Contents](#-table-of-contents)\n3. [📝 Introduction](#-introduction)\n4. [Getting Started](#getting-started)\n   1. [Running LLM-Eval Locally](#running-llm-eval-locally)\n      - [Prerequisites](#prerequisites)\n      - [Quick Start - for local usage](#quick-start---for-local-usage)\n   2. 
[Development Setup](#development-setup)\n      - [Development prerequisites](#development-prerequisites)\n      - [Installation \u0026 Local Development](#installation--local-development)\n      - [Keycloak Setup (Optional if you want to override defaults)](#keycloak-setup-optional-if-you-want-to-override-defaults)\n      - [Troubleshooting](#troubleshooting)\n5. [🤝 Contributing \u0026 Code of Conduct](#-contributing--code-of-conduct)\n6. [📜 License](#-license)\n\n## 📝 Introduction\n\nLLM-Eval is an open-source toolkit designed to evaluate large language model workflows, applications, retrieval-augmented generation pipelines, and standalone models. Whether you're developing a conversational agent, a summarization service, or a RAG-based search tool, LLM-Eval provides a clear, reproducible framework to test and compare performance across providers, metrics, and datasets.\n\n_Key benefits include:_ end-to-end evaluation of real-world applications, reproducible reports, and an extensible platform for custom metrics and datasets.\n\n## Getting Started\n\n### Running LLM-Eval Locally\n\nTo run LLM-Eval locally (for evaluation and usage, not development), use our pre-configured Docker Compose setup.\n\n#### Prerequisites\n\n- Docker\n- Docker Compose\n\n#### Quick Start - for local usage\n\n1. **Clone the repository:**\n\n   ```bash\n   git clone \u003cLLM-Eval github url\u003e\n   cd llm-eval\n   ```\n\n2. **Copy and configure environment:**\n\n   ```bash\n   cp .env.example .env\n   # Edit .env to add your API keys and secrets as needed\n   ```\n\n   **Required:**\n\n   - Generate the encryption keys currently set to `CHANGEME`, using the commands noted in comments next to them in `.env`\n   - Set the Azure OpenAI keys and `AZURE_OPENAI_EMBEDDING_DEPLOYMENT=`; without these, catalog generation will fail.\n\n3. 
**Enable host networking in Docker Desktop (for macOS users):**\n\n   Go to `Settings -\u003e Resources -\u003e Network` and check `Enable host networking`. Without this step on macOS, the frontend will not be reachable on localhost.\n\n4. **Start the stack:**\n\n   ```bash\n   docker compose -f docker-compose.yaml -f docker-compose.local.yaml up -d\n   ```\n\n5. **Access the application:**\n\n   - Web UI: [http://localhost:3000](http://localhost:3000) (Default login: `username`:`password`)\n   - API: [http://localhost:8070/docs](http://localhost:8070/docs)\n\n6. **Log in with the default user:**\n\n   Username: `username`, password: `password`.\n\nTo stop the app:\n\n```bash\ndocker compose -f docker-compose.yaml -f docker-compose.local.yaml down\n```\n\n### Development Setup\n\nIf you want to contribute to LLM-Eval or run it in a development environment, follow these steps:\n\n#### Development prerequisites\n\n- Python 3.12\n- [Poetry](https://python-poetry.org/docs/#installation)\n- Docker (for required services)\n- Node.js \u0026 npm (for frontend)\n\n#### Installation \u0026 Local Development\n\n```bash\ngit clone \u003cLLM-Eval github url\u003e\ncd llm-eval\npoetry install --only=main,dev,test\npoetry self add poetry-plugin-shell\n```\n\n- Install the Git pre-commit hook:\n\n  ```bash\n  pre-commit install\n  ```\n\n1. **Start Poetry shell:**\n\n   ```bash\n   poetry shell\n   ```\n\n2. **Copy and configure environment:**\n\n   ```bash\n   cp .env.example .env\n   # Add your API keys and secrets to .env\n   # Replace each CHANGEME with an appropriate key\n   ```\n\n3. **Comment out the following in `.env`:**\n\n   from\n\n   ```bash\n   # container variables\n   KEYCLOAK_HOST=keycloak\n   CELERY_BROKER_HOST=rabbit-mq\n   PG_HOST=eval-db\n   ```\n\n   to\n\n   ```bash\n   # container variables\n   # KEYCLOAK_HOST=keycloak\n   # CELERY_BROKER_HOST=rabbit-mq\n   # PG_HOST=eval-db\n   ```\n\n4. 
**Start databases and other services:**\n\n   ```bash\n   docker compose up -d\n   ```\n\n5. **Start the backend:**\n\n   ```bash\n   cd backend\n   uvicorn llm_eval.main:app --host 0.0.0.0 --port 8070 --reload\n   ```\n\n6. **Start the Celery worker:**\n\n   ```bash\n   cd backend\n   celery -A llm_eval.tasks worker --loglevel=INFO --concurrency=4\n   ```\n\n7. **Start the frontend:**\n\n   ```bash\n   cd frontend\n   npm install\n   npm run dev\n   ```\n\n8. **Log in with the default user:**\n\n   Username: `username`, password: `password`.\n\n#### Keycloak Setup (Optional if you want to override defaults)\n\nUser access is managed through Keycloak, available at [localhost:8080](localhost:8080) (Default admin credentials: `admin`:`admin`). Select the `llm-eval` realm to manage users.\n\n- If you want to adjust Keycloak manually, see [docs/keycloak-setup-guide.md](docs/keycloak-setup-guide.md) for a step-by-step guide.\n- Otherwise, the default configuration in [keycloak-config](.devcontainer/import/keycloak/llm-eval-realm.json) is used when Docker Compose launches.\n\n##### Acquiring tokens from Keycloak\n\nOnce Keycloak is up and running, tokens can be requested as follows:\n\n**Without a session**, via the service client `dev-ide` (direct backend API calls):\n\n```shell\n$ curl -X POST \\\n  'http://localhost:8080/realms/llm-eval/protocol/openid-connect/token' \\\n  -H 'Content-Type: application/x-www-form-urlencoded' \\\n  -d 'client_id=dev-ide' \\\n  -d 'client_secret=dev-ide' \\\n  -d 'grant_type=client_credentials' | jq\n```\n\nOr **with a session**, using the client `llm-eval-ui` (frontend calls):\n\n```shell\n$ curl -X POST \\\n  'http://localhost:8080/realms/llm-eval/protocol/openid-connect/token' \\\n  -H 'Content-Type: application/x-www-form-urlencoded' \\\n  -d 'client_id=llm-eval-ui' \\\n  -d 'client_secret=llm-eval-ui' \\\n  -d 'username=username' \\\n  -d 'password=password' \\\n  -d 'grant_type=password' | jq\n```\n\n## 🤝 Contributing \u0026 Code 
of Conduct\n\nAs the repository is not yet fully prepared for contributions, we are not accepting them at the moment.\n\n## 📜 License\n\nThis project is licensed under the [Apache 2.0 License](LICENSE).\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcodecentric%2Fllm-eval","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcodecentric%2Fllm-eval","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcodecentric%2Fllm-eval/lists"}