{"id":38205791,"url":"https://github.com/thisguymartin/burro","last_synced_at":"2026-01-17T00:34:45.803Z","repository":{"id":271189241,"uuid":"909140915","full_name":"thisguymartin/burro","owner":"thisguymartin","description":"Burro is a command-line interface (CLI) tool built with Deno for evaluating Large Language Model (LLM) outputs. It provides a straightforward way to run different types of evaluations with secure API key management.","archived":false,"fork":false,"pushed_at":"2025-11-07T01:49:38.000Z","size":117,"stargazers_count":2,"open_issues_count":7,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-11-07T03:26:45.538Z","etag":null,"topics":["ai-testing","deno","evaluation","llm","quality-assurance"],"latest_commit_sha":null,"homepage":"","language":"TypeScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/thisguymartin.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2024-12-27T20:51:47.000Z","updated_at":"2025-06-23T10:28:14.000Z","dependencies_parsed_at":"2025-01-06T06:39:49.560Z","dependency_job_id":null,"html_url":"https://github.com/thisguymartin/burro","commit_stats":null,"previous_names":["thisguymartin/burro"],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/thisguymartin/burro","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/thisguymartin%2Fburro","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/thisguymar
tin%2Fburro/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/thisguymartin%2Fburro/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/thisguymartin%2Fburro/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/thisguymartin","download_url":"https://codeload.github.com/thisguymartin/burro/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/thisguymartin%2Fburro/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28490240,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-16T23:55:29.509Z","status":"ssl_error","status_checked_at":"2026-01-16T23:55:29.108Z","response_time":107,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai-testing","deno","evaluation","llm","quality-assurance"],"created_at":"2026-01-17T00:34:45.709Z","updated_at":"2026-01-17T00:34:45.782Z","avatar_url":"https://github.com/thisguymartin.png","language":"TypeScript","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Burro 🫏\n\n**Burro** is a powerful command-line interface (CLI) tool for evaluating Large Language Model (LLM) outputs. 
It provides both heuristic and LLM-based evaluation methods with secure API key management.\n\n[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)\n[![Deno](https://img.shields.io/badge/deno-%5E1.37-green)](https://deno.land/)\n\n## 🚀 Features\n\n\u003e **📖 For detailed feature documentation, see [FEATURES.md](FEATURES.md)**\n\n### Evaluation Methods\n\n**📊 Heuristic Evaluations** (No API key required)\n- **Levenshtein Distance** - Measure string similarity using edit distance\n- **Exact Match** - Perfect matching for IDs, codes, and specific formats\n- **Case Insensitive Match** - Flexible text matching\n- **Numeric Difference** - Compare numerical values with configurable tolerance\n- **JSON Diff** - Analyze structural differences in JSON outputs\n- **Jaccard Similarity** - Calculate similarity between sets of tokens\n- **Contains** - Verify if expected value appears in output\n\n**🤖 LLM-as-a-Judge Evaluations** (Requires OpenAI API key)\n- **Factuality** - Answer correctness with context validation\n- **Close QA** - Close-ended question matching\n- **Battle** - Compare outputs from different models head-to-head\n- **Summarization** - Evaluate summary quality and accuracy\n- **SQL** - Verify correctness of generated SQL queries\n- **Translation** - Assess translation quality across languages\n\n### Additional Features\n- 🔒 Secure OpenAI API key management with AES encryption\n- 📈 Progress indicators for long-running evaluations\n- 💾 Export results to JSON format\n- 🎯 Comprehensive error handling and validation\n- 🚀 Cross-platform support (Mac, Linux, Windows)\n- ⚡ Fast execution with Deno runtime\n\n## 📋 Prerequisites\n\n- **For Heuristic Evaluations**: None! 
Works out of the box\n- **For LLM-based Evaluations**: OpenAI API key\n\n## 🛠️ Installation\n\n### MacOS - Apple Silicon (M1/M2/M3)\n```bash\nsudo curl -L \"https://github.com/thisguymartin/burro/releases/download/latest/build-mac-silicon\" -o /usr/local/bin/burro \u0026\u0026 sudo chmod +x /usr/local/bin/burro\n```\n\n### MacOS - Intel\n```bash\nsudo curl -L \"https://github.com/thisguymartin/burro/releases/download/latest/build-mac-intel\" -o /usr/local/bin/burro \u0026\u0026 sudo chmod +x /usr/local/bin/burro\n```\n\n### Linux - ARM\n```bash\nsudo curl -L \"https://github.com/thisguymartin/burro/releases/download/latest/build-linux-arm\" -o /usr/local/bin/burro \u0026\u0026 sudo chmod +x /usr/local/bin/burro\n```\n\n### Linux - Intel\n```bash\nsudo curl -L \"https://github.com/thisguymartin/burro/releases/download/latest/build-linux-intel\" -o /usr/local/bin/burro \u0026\u0026 sudo chmod +x /usr/local/bin/burro\n```\n\n### Windows\n1. Download `build-windows.exe` from the [releases page](https://github.com/thisguymartin/burro/releases)\n2. Rename it to `burro.exe`\n3. Move it to your desired location (e.g., `C:\\Program Files\\burro\\burro.exe`)\n\n## 🔧 Quick Start\n\n### 1. Set up API Key (for LLM-based evaluations only)\n```bash\nburro set-openai-key\n```\n\n### 2. Run Your First Evaluation\n\n**Heuristic Evaluation (no API key needed):**\n```bash\nburro run-eval -t exact example/exact-match.json\n```\n\n**LLM-based Evaluation:**\n```bash\nburro run-eval -t factuality example/evals.json\n```\n\n**With progress indicators and result export:**\n```bash\nburro run-eval -t levenshtein example/levenshtein.json --progress -p\n```\n\n## 📊 Evaluation Types Guide\n\n### Heuristic Evaluations\n\n#### Levenshtein Distance\nMeasures string similarity using edit distance. 
Great for catching typos and minor variations.\n\n**Example: `example/levenshtein.json`**\n```json\n[\n  {\n    \"input\": \"Who wrote Hamlet?\",\n    \"output\": \"William Shakespear\",\n    \"expected\": \"William Shakespeare\"\n  }\n]\n```\n\n**Run:**\n```bash\nburro run-eval -t levenshtein example/levenshtein.json\n```\n\n---\n\n#### Exact Match\nPerfect matching for critical data like IDs, codes, or specific formats.\n\n**Example: `example/exact-match.json`**\n```json\n[\n  {\n    \"input\": \"What is the ISO code for United States?\",\n    \"output\": \"US\",\n    \"expected\": \"US\"\n  }\n]\n```\n\n**Run:**\n```bash\nburro run-eval -t exact example/exact-match.json\n```\n\n---\n\n#### Numeric Difference\nCompare numerical values with configurable tolerance.\n\n**Example: `example/numeric.json`**\n```json\n[\n  {\n    \"input\": \"What is the value of Pi to 2 decimal places?\",\n    \"output\": \"3.14\",\n    \"expected\": \"3.14159\",\n    \"tolerance\": 0.01\n  }\n]\n```\n\n**Run:**\n```bash\nburro run-eval -t numeric example/numeric.json\n```\n\n---\n\n#### JSON Diff\nAnalyze structural differences in JSON outputs.\n\n**Example: `example/json-diff.json`**\n```json\n[\n  {\n    \"input\": \"Convert user data to JSON\",\n    \"output\": \"{\\\"name\\\": \\\"John Doe\\\", \\\"age\\\": 30}\",\n    \"expected\": \"{\\\"name\\\": \\\"John Doe\\\", \\\"age\\\": 30}\"\n  }\n]\n```\n\n**Run:**\n```bash\nburro run-eval -t json example/json-diff.json\n```\n\n---\n\n#### Jaccard Similarity\nCalculate similarity between sets of tokens.\n\n**Example: `example/jaccard.json`**\n```json\n[\n  {\n    \"input\": \"List programming languages for web development\",\n    \"output\": \"JavaScript TypeScript Python Ruby PHP Java\",\n    \"expected\": \"JavaScript Python Ruby PHP Go\"\n  }\n]\n```\n\n**Run:**\n```bash\nburro run-eval -t jaccard example/jaccard.json\n```\n\n---\n\n### LLM-as-a-Judge Evaluations\n\n#### Factuality\nEvaluate answer correctness with context 
validation.\n\n**Example: `example/evals.json`**\n```json\n[\n  {\n    \"input\": \"What is the capital of France?\",\n    \"output\": \"The capital city of France is Paris\",\n    \"expected\": \"Paris\"\n  }\n]\n```\n\n**Run:**\n```bash\nburro run-eval -t factuality example/evals.json\n```\n\n---\n\n#### Close QA\nExact matching for close-ended questions.\n\n**Example: `example/closeqa.json`**\n```json\n[\n  {\n    \"input\": \"List the first three prime numbers\",\n    \"output\": \"2,3,5\",\n    \"criteria\": \"Numbers must be in correct order, separated by commas\"\n  }\n]\n```\n\n**Run:**\n```bash\nburro run-eval -t closeqa example/closeqa.json\n```\n\n---\n\n#### Battle\nCompare outputs from different models head-to-head.\n\n**Example: `example/battle.json`**\n```json\n[\n  {\n    \"input\": \"Write a haiku about technology\",\n    \"output\": \"Code flows like water\\nBits and bytes dance in rhythm\\nDigital zen speaks\",\n    \"expected\": \"Silicon pathways\\nData streams through endless night\\nMachines dream in code\"\n  }\n]\n```\n\n**Run:**\n```bash\nburro run-eval -t battle example/battle.json\n```\n\n---\n\n#### Summarization\nEvaluate the quality and accuracy of text summaries.\n\n**Example: `example/summarization.json`**\n```json\n[\n  {\n    \"input\": \"Summarize this text\",\n    \"output\": \"Climate change impacts polar regions; urgent global action needed.\",\n    \"context\": \"Long article about climate change effects...\"\n  }\n]\n```\n\n**Run:**\n```bash\nburro run-eval -t summarization example/summarization.json\n```\n\n---\n\n#### SQL\nVerify the correctness of generated SQL queries.\n\n**Example: `example/sql.json`**\n```json\n[\n  {\n    \"input\": \"Find all users over age 18\",\n    \"output\": \"SELECT * FROM users WHERE age \u003e 18;\",\n    \"expected\": \"SELECT * FROM users WHERE age \u003e 18;\",\n    \"context\": \"Database schema: users(id, name, email, age)\"\n  }\n]\n```\n\n**Run:**\n```bash\nburro run-eval -t sql 
example/sql.json\n```\n\n---\n\n#### Translation\nAssess translation quality across languages.\n\n**Example: `example/translation.json`**\n```json\n[\n  {\n    \"input\": \"Translate 'Hello' to Spanish\",\n    \"output\": \"Hola\",\n    \"expected\": \"Hola\"\n  }\n]\n```\n\n**Run:**\n```bash\nburro run-eval -t translation example/translation.json\n```\n\n---\n\n## 📖 Real-World Scenarios\n\nSee [SCENARIOS.md](SCENARIOS.md) for comprehensive real-world examples including:\n\n- 🎯 Customer Support Bot Evaluation\n- 💻 Code Generation Validation\n- 🌍 Translation Quality Assessment\n- ⚔️ Chatbot Response Comparison\n- 📊 Data Extraction Accuracy\n- 📚 Educational Content Assessment\n\n## 🔒 Security Features\n\n- **AES-256 Encryption** for API key storage\n- **Secure key generation** using Web Crypto API\n- **Encrypted SQLite storage** for settings\n- **No plaintext secrets** ever stored on disk\n\n## 📈 Advanced Usage\n\n### Progress Indicators\nFor long-running evaluations:\n```bash\nburro run-eval -t factuality large-dataset.json --progress\n```\n\n### Export Results\nSave evaluation results to JSON:\n```bash\nburro run-eval -t battle comparison.json -p\n# Results saved to ~/Downloads/comparison.json-result.json\n```\n\n### Batch Evaluation\nRun multiple evaluations sequentially:\n```bash\nburro run-eval -t exact tests/ids.json\nburro run-eval -t factuality tests/qa.json\nburro run-eval -t sql tests/queries.json\n```\n\n## 🎯 Choosing the Right Evaluation Type\n\n| Use Case | Recommended Type | Why? 
|\n|----------|------------------|------|\n| Order IDs, Product Codes | `exact` | Requires perfect match |\n| User Questions | `factuality` | Needs semantic understanding |\n| Price Calculations | `numeric` | Allows tolerance |\n| Model A/B Testing | `battle` | Direct comparison |\n| API Responses | `json` | Structure validation |\n| Spelling Variations | `levenshtein` | Fuzzy matching |\n| Keywords/Tags | `jaccard` | Set similarity |\n| Summaries | `summarization` | Quality assessment |\n| SQL Queries | `sql` | Syntax + logic validation |\n| Translations | `translation` | Language expertise |\n\n## 🏗️ System Architecture Check\n\nTo determine which version to download:\n\n### MacOS\n```bash\nuname -m\n```\n- `arm64` → Use Apple Silicon version (M1/M2/M3)\n- `x86_64` → Use Intel version\n\n### Linux\n```bash\nuname -m\n```\n- `aarch64` or `arm64` → Use ARM version\n- `x86_64` → Use Intel version\n\n## 🐛 Troubleshooting\n\n### Permission Denied\n```bash\nsudo chmod +x /usr/local/bin/burro\n```\n\n### Command Not Found\n1. Verify installation location is in your PATH\n2. Restart your terminal\n3. Check executable exists: `ls -l /usr/local/bin/burro`\n\n### API Key Issues\n```bash\nburro set-openai-key  # Re-enter your API key\n```\n\n### Low Evaluation Scores\n- **Exact match**: Check for extra spaces or case differences\n- **Numeric**: Adjust tolerance values\n- **JSON**: Ensure consistent formatting\n- **Factuality**: Make expected answers less specific\n\n## 🗑️ Uninstallation\n\n### MacOS \u0026 Linux\n```bash\nsudo rm /usr/local/bin/burro\nwhich burro  # Should return nothing\n```\n\n### Windows\n1. Delete `burro.exe` from installation location\n2. 
Remove from PATH if added\n\n## 🎓 Examples Directory\n\nAll evaluation types have examples in the `/example` directory:\n\n```\nexample/\n├── closeqa.json           # Close-ended QA\n├── evals.json            # Factuality evaluation\n├── levenshtein.json      # String similarity\n├── exact-match.json      # Exact matching\n├── numeric.json          # Numeric comparison\n├── json-diff.json        # JSON structure diff\n├── jaccard.json          # Token similarity\n├── battle.json           # Model comparison\n├── summarization.json    # Summary quality\n├── sql.json             # SQL validation\n└── translation.json      # Translation quality\n```\n\n## 🚀 Getting Started Tutorial\n\n1. **Install Burro** using the command for your platform\n2. **Try a heuristic evaluation** (no API key needed):\n   ```bash\n   burro run-eval -t exact example/exact-match.json\n   ```\n3. **Set up your API key** for LLM evaluations:\n   ```bash\n   burro set-openai-key\n   ```\n4. **Run an LLM evaluation**:\n   ```bash\n   burro run-eval -t factuality example/evals.json\n   ```\n5. **Explore scenarios** in [SCENARIOS.md](SCENARIOS.md)\n6. **Create your own** evaluation files based on examples\n\n## 💡 Tips for Success\n\n1. **Start with heuristics** - They're fast and free\n2. **Use the right tool** - Match evaluation type to your use case\n3. **Build incrementally** - Start with 5-10 test cases\n4. **Version your tests** - Track evaluation files in git\n5. **Automate regularly** - Run evaluations as part of your workflow\n6. **Compare methods** - Try multiple evaluation types on the same data\n7. 
**Check examples** - Learn from the provided examples in the `/example` directory\n\n## 📚 Documentation\n\n- **[FEATURES.md](FEATURES.md)** - Complete feature documentation and technical details\n- **[SCENARIOS.md](SCENARIOS.md)** - Real-world use cases and examples\n- **[/example](example/)** - Sample evaluation files for all types\n- **README.md** - This file: quick-start guide and overview\n\n## 🤝 Contributing\n\nContributions welcome! Please feel free to submit a Pull Request.\n\n## 📝 License\n\nMIT License - feel free to use Burro in your projects!\n\n## 🎯 Next Steps\n\nReady to evaluate your LLM outputs?\n\n1. Install Burro for your platform\n2. Pick an evaluation type that matches your needs\n3. Create or use an example evaluation file\n4. Run your first evaluation\n5. Iterate and improve!\n\nNeed help? Check the [issues page](https://github.com/thisguymartin/burro/issues) or review the examples!\n\n---\n\n**Made with ❤️ for LLM developers and evaluators**\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fthisguymartin%2Fburro","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fthisguymartin%2Fburro","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fthisguymartin%2Fburro/lists"}