{"id":19029195,"url":"https://github.com/danilop/llm-test-mate","last_synced_at":"2025-04-23T15:48:44.116Z","repository":{"id":261062388,"uuid":"881480638","full_name":"danilop/llm-test-mate","owner":"danilop","description":"A simple testing framework to evaluate and validate LLM-generated content using string similarity, semantic similarity, and model-based evaluation.","archived":false,"fork":false,"pushed_at":"2025-01-23T15:27:33.000Z","size":85,"stargazers_count":8,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-04-18T00:58:15.794Z","etag":null,"topics":["ci-cd","devops","devops-tools","generative-ai","llm-evaluation","test-automation","testing-tools"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/danilop.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2024-10-31T16:52:26.000Z","updated_at":"2025-01-23T15:27:37.000Z","dependencies_parsed_at":null,"dependency_job_id":"11cb4a5f-dc58-45b4-9980-9273a09a0701","html_url":"https://github.com/danilop/llm-test-mate","commit_stats":null,"previous_names":["danilop/llm-test-mate"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/danilop%2Fllm-test-mate","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/danilop%2Fllm-test-mate/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/danilop%2Fllm-test-mate/releases","manifests_url
":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/danilop%2Fllm-test-mate/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/danilop","download_url":"https://codeload.github.com/danilop/llm-test-mate/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":250464858,"owners_count":21435089,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ci-cd","devops","devops-tools","generative-ai","llm-evaluation","test-automation","testing-tools"],"created_at":"2024-11-08T21:13:33.397Z","updated_at":"2025-04-23T15:48:44.083Z","avatar_url":"https://github.com/danilop.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# LLM Test Mate 🤝\n\nA simple testing framework to evaluate and validate LLM-generated content using string similarity, semantic similarity, and model-based (LLM as a judge) evaluation.\n\n## 🚀 Features\n\n- 📝 String similarity testing using Damerau-Levenshtein distance and other methods\n- 📊 Semantic similarity testing using sentence transformers\n- 🤖 LLM-based evaluation of content quality and correctness\n- 🔧 Easy integration with pytest\n- 📝 Comprehensive test reports\n- 🎯 Sensible defaults with flexible overrides\n\n## 📚 Overview\n\n### Initialization\n\nUsing default models:\n\n```python\ntester = LLMTestMate(\n    similarity_threshold=0.8,\n    temperature=0.7\n)\n```\n\n### Semantic Similarity Testing\n\n```python\ntester.semantic_similarity(text: str, reference_text: str, threshold: Optional[float] = None)\n```\nCalculate semantic 
similarity between two texts using sentence transformers. Returns a similarity score and pass/fail status.\n\n```python\ntester.semantic_similarity_list(text: str, reference_texts: list[str], threshold: Optional[float] = None)\n```\nCompare text against multiple references using semantic similarity. Returns results sorted by similarity score.\n\n### String Similarity Testing\n\n```python\ntester.string_similarity(text: str, reference_text: str, threshold: Optional[float] = None, \n    normalize_case: bool = True, normalize_whitespace: bool = True,\n    remove_punctuation: bool = True, method: str = \"damerau-levenshtein\")\n```\nCalculate string similarity using various distance metrics (damerau-levenshtein, levenshtein, hamming, jaro, jaro-winkler, indel).\n\n```python\ntester.string_similarity_list(text: str, reference_texts: list[str], threshold: Optional[float] = None,\n    normalize_case: bool = True, normalize_whitespace: bool = True,\n    remove_punctuation: bool = True, method: str = \"damerau-levenshtein\")\n```\nCompare text against multiple references using string similarity. Returns results sorted by similarity score.\n\n### LLM-Based Evaluation\n\n```python\ntester.llm_evaluate(text: str, reference_text: str, criteria: Optional[str] = None,\n    model: Optional[str] = None, temperature: Optional[float] = None,\n    max_tokens: Optional[int] = None)\n```\nEvaluate text quality and correctness using an LLM as judge. Returns detailed analysis in JSON format.\n\n```python\ntester.llm_evaluate_list(text: str, reference_texts: list[str], criteria: Optional[str] = None,\n    model: Optional[str] = None, temperature: Optional[float] = None,\n    max_tokens: Optional[int] = None)\n```\nEvaluate text against multiple references using LLM. 
Returns results sorted by similarity if available.\n\n## 🏃‍♂️ Quick Start\n\n### Installation\n\n```bash\n# Create and activate virtual environment\npython -m venv venv\nsource venv/bin/activate  # On Windows, use: venv\\Scripts\\activate\n\n# Install dependencies\npip install -r requirements.txt\n\n# To run the examples\npython examples.py\n\n# To run the tests\npytest                      # Run all tests\npytest test_examples.py     # Run all tests in file\npytest test_examples.py -v  # Run with verbose output\npytest test_examples.py::test_semantic_similarity  # Run a specific test\n```\n\nThe test examples (`test_examples.py`) include:\n- Semantic similarity testing\n- LLM-based evaluation\n- Custom evaluation criteria with Llama\n- Model comparison tests\n- Parameterized threshold testing\n\nHere's how to get started with this tool (see `quickstart.py`):\n\n```python\nimport json\n\nfrom llm_test_mate import LLMTestMate\n\n# Initialize the test mate with your preferences\ntester = LLMTestMate(\n    llm_model=\"bedrock/anthropic.claude-3-5-sonnet-20240620-v1:0\",\n    similarity_threshold=0.8,\n    temperature=0.7\n)\n\n# Example 1a: String similarity test (Single Reference)\nprint(\"\\n=== Example 1a: String Similarity (Single Reference) ===\")\ntext1 = \"The quick brown fox jumps over the lazy dog.\"\ntext2 = \"The quikc brown fox jumps over the lasy dog.\"\n\nresult = tester.string_similarity(text1, text2)\nprint(f\"Text 1: {text1}\")\nprint(f\"Text 2: {text2}\")\nprint(f\"String similarity score: {result['similarity']:.2f}\")\nprint(f\"Edit distance: {result['distance']:.2f}\")\nprint(f\"Passed threshold: {result['passed']}\")\n\n# Example 1b: String similarity test (Multiple References)\nprint(\"\\n=== Example 1b: String Similarity (Multiple References) ===\")\ntest_text = \"The quick brown fox jumps over the lazy dog.\"\nreference_texts = [\n    \"The quikc brown fox jumps over the lasy dog.\",\n    \"The quick brwon fox jumps over the layz dog.\",\n 
   \"The quick brown fox jumps over the lazy dog.\"\n]\n\nprint(f\"Test text: {test_text}\")\nprint(\"Reference texts:\")\nfor i, ref in enumerate(reference_texts, 1):\n    print(f\"{i}. {ref}\")\n\nresults = tester.string_similarity_list(test_text, reference_texts)\nprint(\"\\nResults sorted by similarity (highest first):\")\nfor result in results:\n    print(f\"\\nReference: {result['reference_text']}\")\n    print(f\"Similarity score: {result['similarity']:.2f}\")\n    print(f\"Edit distance: {result['distance']:.2f}\")\n    print(f\"Passed threshold: {result['passed']}\")\n\n# Example 2a: Semantic similarity test (Single Reference)\nprint(\"\\n=== Example 2a: Semantic Similarity (Single Reference) ===\")\ntext1 = \"The quick brown fox jumps over the lazy dog.\"\ntext2 = \"A swift brown fox leaps above a sleepy canine.\"\n\nresult = tester.semantic_similarity(text1, text2)\nprint(f\"Text 1: {text1}\")\nprint(f\"Text 2: {text2}\")\nprint(f\"Semantic similarity score: {result['similarity']:.2f}\")\nprint(f\"Passed threshold: {result['passed']}\")\n\n# Example 2b: Semantic similarity test (Multiple References)\nprint(\"\\n=== Example 2b: Semantic Similarity (Multiple References) ===\")\ntest_text = \"A swift brown fox leaps above a sleepy canine.\"\nreference_texts = [\n    \"The quick brown fox jumps over the lazy dog.\",\n    \"A fast brown fox leaps over a sleeping dog.\",\n    \"The agile brown fox bounds over the tired dog.\"\n]\n\nprint(f\"Test text: {test_text}\")\nprint(\"Reference texts:\")\nfor i, ref in enumerate(reference_texts, 1):\n    print(f\"{i}. 
{ref}\")\n\nresults = tester.semantic_similarity_list(test_text, reference_texts)\nprint(\"\\nResults sorted by similarity (highest first):\")\nfor result in results:\n    print(f\"\\nReference: {result['reference_text']}\")\n    print(f\"Similarity score: {result['similarity']:.2f}\")\n    print(f\"Passed threshold: {result['passed']}\")\n\n# Example 3a: LLM-based evaluation (Single Reference)\nprint(\"\\n=== Example 3a: LLM Evaluation (Single Reference) ===\")\ntext1 = \"The quick brown fox jumps over the lazy dog.\"\ntext2 = \"A swift brown fox leaps above a sleepy canine.\"\n\nresult = tester.llm_evaluate(text1, text2)\nprint(f\"Text 1: {text1}\")\nprint(f\"Text 2: {text2}\")\nprint(\"Evaluation result:\")\nprint(json.dumps(result, indent=2))\n\n# Example 3b: LLM-based evaluation (Multiple References)\nprint(\"\\n=== Example 3b: LLM Evaluation (Multiple References) ===\")\ntest_text = \"A swift brown fox leaps above a sleepy canine.\"\nreference_texts = [\n    \"The quick brown fox jumps over the lazy dog.\",\n    \"A fast brown fox leaps over a sleeping dog.\",\n    \"The agile brown fox bounds over the tired dog.\"\n]\n\nprint(f\"Test text: {test_text}\")\nprint(\"Reference texts:\")\nfor i, ref in enumerate(reference_texts, 1):\n    print(f\"{i}. {ref}\")\n\nresults = tester.llm_evaluate_list(test_text, reference_texts)\nprint(\"\\nResults sorted by similarity (highest first):\")\nfor result in results:\n    print(f\"\\nReference: {result['reference_text']}\")\n    print(json.dumps(result, indent=2))\n```\n\nSample output:\n```\n=== Example 1a: String Similarity (Single Reference) ===\nText 1: The quick brown fox jumps over the lazy dog.\nText 2: The quikc brown fox jumps over the lasy dog.\nString similarity score: 0.95\nEdit distance: 0.05\nPassed threshold: True\n\n=== Example 1b: String Similarity (Multiple References) ===\nTest text: The quick brown fox jumps over the lazy dog.\nReference texts:\n1. The quikc brown fox jumps over the lasy dog.\n2. 
The quick brwon fox jumps over the layz dog.\n3. The quick brown fox jumps over the lazy dog.\n\nResults sorted by similarity (highest first):\n\nReference: The quick brown fox jumps over the lazy dog.\nSimilarity score: 1.00\nEdit distance: 0.00\nPassed threshold: True\n\nReference: The quikc brown fox jumps over the lasy dog.\nSimilarity score: 0.95\nEdit distance: 0.05\nPassed threshold: True\n\nReference: The quick brwon fox jumps over the layz dog.\nSimilarity score: 0.95\nEdit distance: 0.05\nPassed threshold: True\n\n=== Example 2a: Semantic Similarity (Single Reference) ===\nText 1: The quick brown fox jumps over the lazy dog.\nText 2: A swift brown fox leaps above a sleepy canine.\nSemantic similarity score: 0.79\nPassed threshold: False\n\n=== Example 2b: Semantic Similarity (Multiple References) ===\nTest text: A swift brown fox leaps above a sleepy canine.\nReference texts:\n1. The quick brown fox jumps over the lazy dog.\n2. A fast brown fox leaps over a sleeping dog.\n3. The agile brown fox bounds over the tired dog.\n\nResults sorted by similarity (highest first):\n\nReference: A fast brown fox leaps over a sleeping dog.\nSimilarity score: 0.88\nPassed threshold: True\n\nReference: The quick brown fox jumps over the lazy dog.\nSimilarity score: 0.79\nPassed threshold: False\n\nReference: The agile brown fox bounds over the tired dog.\nSimilarity score: 0.72\nPassed threshold: False\n\n=== Example 3a: LLM Evaluation (Single Reference) ===\nText 1: The quick brown fox jumps over the lazy dog.\nText 2: A swift brown fox leaps above a sleepy canine.\nEvaluation result:\n{\n  \"passed\": true,\n  \"similarity\": 0.9,\n  \"analysis\": {\n    \"semantic_match\": \"Both sentences convey the same core meaning of a fox jumping over a dog, with only minor variations in word choice.\",\n    \"content_match\": \"The key elements (fox, brown, jumping, dog) are present in both texts, with slight differences in adjectives and verbs used.\",\n    \"key_differences\": 
[\n      \"Use of 'quick' vs 'swift'\",\n      \"Use of 'jumps' vs 'leaps'\",\n      \"Use of 'lazy' vs 'sleepy'\",\n      \"Use of 'dog' vs 'canine'\"\n    ]\n  },\n  \"model_used\": \"anthropic.claude-3-5-sonnet-20240620-v1:0\"\n}\n\n=== Example 3b: LLM Evaluation (Multiple References) ===\nTest text: A swift brown fox leaps above a sleepy canine.\nReference texts:\n1. The quick brown fox jumps over the lazy dog.\n2. A fast brown fox leaps over a sleeping dog.\n3. The agile brown fox bounds over the tired dog.\n\nResults sorted by similarity (highest first):\n\nReference: A fast brown fox leaps over a sleeping dog.\n{\n  \"passed\": true,\n  \"similarity\": 0.9,\n  \"analysis\": {\n    \"semantic_match\": \"Both texts convey the same core meaning of a fox quickly moving over a resting dog.\",\n    \"content_match\": \"The key elements (fox, brown, leaping, dog) are present in both texts with minor variations in descriptors.\",\n    \"key_differences\": [\n      \"Use of 'swift' vs 'fast' to describe the fox\",\n      \"Use of 'above' vs 'over' for the fox's action\",\n      \"Description of the dog as 'sleepy' vs 'sleeping'\",\n      \"Use of 'canine' instead of 'dog' in the generated text\"\n    ]\n  },\n  \"model_used\": \"anthropic.claude-3-5-sonnet-20240620-v1:0\",\n  \"reference_text\": \"A fast brown fox leaps over a sleeping dog.\"\n}\n\nReference: The agile brown fox bounds over the tired dog.\n{\n  \"passed\": true,\n  \"similarity\": 0.9,\n  \"analysis\": {\n    \"semantic_match\": \"Both sentences convey the same core meaning of a fox moving quickly over a dog.\",\n    \"content_match\": \"The key elements (fox, brown, jumping over, dog) are present in both texts with slight variations in descriptors.\",\n    \"key_differences\": [\n      \"The generated text uses 'swift' instead of 'agile'\",\n      \"The generated text uses 'leaps above' instead of 'bounds over'\",\n      \"The generated text describes the dog as 'sleepy' instead of 'tired'\",\n      
\"'A' is used instead of 'The' at the beginning of the generated text\"\n    ]\n  },\n  \"model_used\": \"anthropic.claude-3-5-sonnet-20240620-v1:0\",\n  \"reference_text\": \"The agile brown fox bounds over the tired dog.\"\n}\n\nReference: The quick brown fox jumps over the lazy dog.\n{\n  \"passed\": true,\n  \"similarity\": 0.85,\n  \"analysis\": {\n    \"semantic_match\": \"Both sentences convey the same core meaning of a fox moving quickly over a dog.\",\n    \"content_match\": \"The main elements (fox, dog, action of moving over) are present in both sentences, with slight variations in adjectives and verbs used.\",\n    \"key_differences\": [\n      \"Use of 'swift' instead of 'quick'\",\n      \"Use of 'leaps above' instead of 'jumps over'\",\n      \"Use of 'sleepy' instead of 'lazy'\",\n      \"Absence of articles 'The' and 'the' in the generated text\",\n      \"Use of 'canine' instead of 'dog'\"\n    ]\n  },\n  \"model_used\": \"anthropic.claude-3-5-sonnet-20240620-v1:0\",\n  \"reference_text\": \"The quick brown fox jumps over the lazy dog.\"\n}\n```\n\n### 2. Custom Evaluation Criteria\n\n```python\n# Initialize with custom criteria\ntester = LLMTestMate(\n    evaluation_criteria=\"\"\"\n    Evaluate the marketing effectiveness of the generated text compared to the reference.\n    Consider:\n    1. Feature Coverage: Are all key features mentioned?\n    2. Tone: Is it engaging and professional?\n    3. 
Clarity: Is the message clear and concise?\n\n    Return JSON with:\n    {\n        \"passed\": boolean,\n        \"effectiveness_score\": float (0-1),\n        \"analysis\": {\n            \"feature_coverage\": string,\n            \"tone_analysis\": string,\n            \"suggestions\": list[string]\n        }\n    }\n    \"\"\"\n)\n\nproduct_description = \"Our new smartphone features a 6.1-inch OLED display, 12MP camera, and all-day battery life.\"\ngenerated_description = generate_text(\"Write a short description of a smartphone's key features\")\n\neval_result = tester.llm_evaluate(\n    generated_description,\n    product_description\n)\n```\n\nSample result:\n```\n{\n  \"passed\": true,\n  \"effectiveness_score\": 0.8,\n  \"analysis\": {\n    \"feature_coverage\": \"The generated text provides a much more comprehensive coverage of smartphone features compared to the reference. It includes details on display, camera, performance, storage, battery, connectivity, operating system, and additional features, while the reference only mentions display, camera, and battery.\",\n    \"tone_analysis\": \"The generated text maintains a professional and informative tone throughout, providing technical details and specifications. It is more detailed and technical compared to the concise, marketing-oriented tone of the reference.\",\n    \"suggestions\": [\n      \"Consider condensing some of the technical details for a more concise marketing message\",\n      \"Add more engaging language or unique selling points to make the features stand out\",\n      \"Include specific model comparisons or standout features to differentiate from competitors\",\n      \"Consider adding a brief overview or summary statement at the beginning to capture attention quickly\"\n    ]\n  },\n  \"model_used\": \"...\"\n}\n```\n\n### 3. 
Using with Pytest\n\n```python\nimport pytest\nfrom llm_test_mate import LLMTestMate\n\n@pytest.fixture\ndef tester():\n    return LLMTestMate(\n        similarity_threshold=0.8,\n        temperature=0.7\n    )\n\ndef test_generated_content(tester):\n    # generate_text() is a placeholder for your own LLM call\n    generated = generate_text(\"Explain what is Python\")\n    expected = \"Python is a high-level programming language...\"\n    \n    # Check semantic similarity\n    sem_result = tester.semantic_similarity(\n        generated,\n        expected\n    )\n    \n    # Evaluate with LLM\n    llm_result = tester.llm_evaluate(\n        generated,\n        expected\n    )\n    \n    assert sem_result[\"passed\"], \"Failed similarity check\"\n    assert llm_result[\"passed\"], f\"Failed requirements: {llm_result['analysis']}\"\n```\n\n## 🛠️ Advanced Usage\n\n### String Similarity Testing\n\nLLM Test Mate provides comprehensive string similarity testing with multiple methods and configuration options:\n\n1. Basic Usage:\n```python\nresult = tester.string_similarity(\n    \"The quick brown fox jumps over the lazy dog!\",\n    \"The quikc brown fox jumps over the lasy dog\",  # Different punctuation and typos\n    threshold=0.9\n)\n```\n\n2. Available Methods:\n\n| Method | Best For | Description |\n|--------|----------|-------------|\n| damerau-levenshtein | General text | Handles transposed letters, good default choice |\n| levenshtein | Simple comparisons | Basic edit distance |\n| hamming | Equal length strings | Counts position differences |\n| jaro | Short strings | Good for typos in short text |\n| jaro-winkler | Names | Optimized for name comparisons |\n| indel | Subsequence matching | Based on longest common subsequence |\n\n3. 
Configuration Options:\n- `normalize_case`: Convert to lowercase (default: True)\n- `normalize_whitespace`: Standardize spaces (default: True)\n- `remove_punctuation`: Ignore punctuation marks (default: True)\n- `processor`: Custom function for text preprocessing\n- `threshold`: Similarity threshold for pass/fail (0-1)\n- `method`: Choice of similarity metric\n\n4. Example Usage:\n```python\n# Name comparison with Jaro-Winkler\nresult = tester.string_similarity(\n    \"John Smith\",\n    \"Jon Smyth\",\n    method=\"jaro-winkler\",\n    threshold=0.8\n)\n\n# Text with custom preprocessing\ndef remove_special_chars(text: str) -\u003e str:\n    return ''.join(c for c in text if c.isalnum() or c.isspace())\n\nresult = tester.string_similarity(\n    \"Hello! @#$ World\",\n    \"Hello World\",\n    processor=remove_special_chars,\n    threshold=0.9\n)\n\n# Combined options\nresult = tester.string_similarity(\n    \"Hello,  WORLD!\",\n    \"hello world\",\n    method=\"damerau-levenshtein\",\n    normalize_case=True,\n    normalize_whitespace=True,\n    remove_punctuation=True,\n    processor=remove_special_chars,\n    threshold=0.9\n)\n```\n\n5. 
Result Dictionary:\n```python\n{\n    \"similarity\": 0.95,        # Similarity score (0-1)\n    \"distance\": 0.05,         # Distance score (0-1)\n    \"method\": \"jaro-winkler\", # Method used\n    \"normalized\": {           # Applied normalizations\n        \"case\": True,\n        \"whitespace\": True,\n        \"punctuation\": True\n    },\n    \"options\": {              # Additional options\n        \"processor\": \"remove_special_chars\"\n    },\n    \"passed\": True,           # If threshold was met\n    \"threshold\": 0.9         # Threshold used\n}\n```\n\n### Combined Testing Approach 🔄\n\n```python\ndef test_comprehensive_check(tester):\n    generated = generate_text(\"Write a recipe\")\n    expected = \"\"\"\n    Recipe must include:\n    - Ingredients list\n    - Instructions\n    - Cooking time\n    \"\"\"\n    \n    # Check similarity\n    sem_result = tester.semantic_similarity(\n        generated,\n        expected\n    )\n    \n    # Detailed evaluation\n    llm_result = tester.llm_evaluate(\n        generated,\n        expected\n    )\n    \n    assert sem_result[\"passed\"], \"Failed similarity check\"\n    assert llm_result[\"passed\"], f\"Failed requirements: {llm_result['analysis']}\"\n```\n\n## 📊 Comprehensive Test Results\n\nWhen running tests with LLM Test Mate, you get comprehensive results from two types of evaluations:\n\n### Semantic Similarity Results\n```python\n{\n    \"similarity\": 0.85,        # Similarity score between 0-1\n    \"embedding_model\": \"all-MiniLM-L6-v2\",  # Model used for embeddings\n    \"passed\": True,           # Whether it passed the threshold\n    \"threshold\": 0.8          # The threshold used for this test\n}\n```\n\n### LLM Evaluation Results\n```python\n{\n    \"passed\": True,           # Overall pass/fail assessment\n    \"similarity\": 0.9,        # Semantic similarity assessment by LLM\n    \"analysis\": {\n        \"semantic_match\": \"The texts convey very similar meanings...\",\n  
      \"content_match\": \"Both texts cover the same key points...\",\n        \"key_differences\": [\n            \"Minor variation in word choice\",\n            \"Slightly different emphasis on...\"\n        ]\n    },\n    \"model_used\": \"anthropic.claude-3-5-sonnet-20240620-v1:0\"  # Model used for evaluation\n}\n```\n\nFor custom evaluation criteria, the results will match your specified JSON structure. For example, with marketing evaluation:\n```python\n{\n    \"passed\": True,\n    \"effectiveness_score\": 0.85,\n    \"analysis\": {\n        \"feature_coverage\": \"All key features mentioned...\",\n        \"tone_analysis\": \"Professional and engaging...\",\n        \"suggestions\": [\n            \"Consider emphasizing battery life more\",\n            \"Add specific camera capabilities\"\n        ]\n    },\n    \"model_used\": \"meta.llama3-2-90b-instruct-v1:0\"\n}\n```\n\n### Benefits of Combined Testing\nWhen using both approaches together, you get:\n- Quantitative similarity metrics from embedding comparison\n- Qualitative content evaluation from LLM analysis\n- Model-specific insights (can compare different LLM evaluations)\n- Clear pass/fail indicators for automated testing\n- Detailed feedback for manual review\n\nThis comprehensive approach helps ensure both semantic closeness to reference content and qualitative correctness of the generated output.\n\n## 🔧 Adding to Your Project\n\nThe simplest way to add LLM Test Mate to your project is to copy the `llm_test_mate.py` file:\n\n1. Copy `llm_test_mate.py` to your project's test directory\n2. Add the required dependencies to your `requirements.txt` file:\n- litellm\n- sentence-transformers\n- boto3\n- pytest\n- rapidfuzz\n\n3. 
Install the dependencies:\n```bash\npip install -r requirements.txt\n```\n\n### Project Structure\n\nTypical integration into an existing project:\n\n```\nyour_project/\n├── src/\n│   └── your_code.py\n├── tests/\n│   ├── llm_test_mate.py    # Copy the file here\n│   ├── your_test_file.py   # Your LLM tests\n│   └── conftest.py         # Pytest fixtures\n├── requirements.txt        # Add dependencies here\n└── pytest.ini              # Optional pytest configuration\n```\n\nExample `conftest.py`:\n```python\nimport pytest\nfrom llm_test_mate import LLMTestMate\n\n@pytest.fixture\ndef llm_tester():\n    return LLMTestMate(\n        similarity_threshold=0.8,\n        temperature=0.7\n    )\n\n@pytest.fixture\ndef strict_llm_tester():\n    return LLMTestMate(\n        similarity_threshold=0.9,\n        temperature=0.5\n    )\n```\n\nExample test file:\n```python\ndef test_product_description(llm_tester):\n    expected = \"Our product helps you test LLM outputs effectively.\"\n    generated = your_llm_function(\"Describe our product\")\n    \n    result = llm_tester.semantic_similarity(generated, expected)\n    assert result['passed'], f\"Generated text not similar enough: {result['similarity']}\"\n```\n\n## 🤝 Contributing\n\nContributions are welcome! Please feel free to submit a Pull Request.\n\n## 📜 License\n\nDistributed under the MIT License. See `LICENSE` for more information.\n\n## 🙏 Acknowledgments\n\n- Built with [LiteLLM](https://github.com/BerriAI/litellm)\n- Uses [sentence-transformers](https://github.com/UKPLab/sentence-transformers)\n- String similarity powered by [RapidFuzz](https://github.com/maxbachmann/RapidFuzz)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdanilop%2Fllm-test-mate","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdanilop%2Fllm-test-mate","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdanilop%2Fllm-test-mate/lists"}