{"id":30802123,"url":"https://github.com/valyriantech/valyriangamescodingchallenge","last_synced_at":"2026-02-17T11:31:26.708Z","repository":{"id":312592181,"uuid":"1048001177","full_name":"ValyrianTech/ValyrianGamesCodingChallenge","owner":"ValyrianTech","description":"Coding challenge for the Valyrian Games","archived":false,"fork":false,"pushed_at":"2025-09-08T09:53:09.000Z","size":10250,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-10-09T03:42:17.635Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ValyrianTech.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-08-31T17:44:04.000Z","updated_at":"2025-09-08T09:53:13.000Z","dependencies_parsed_at":"2025-08-31T19:28:04.490Z","dependency_job_id":"446c1e4d-7b74-4f13-ab78-dc8a7c85bdbc","html_url":"https://github.com/ValyrianTech/ValyrianGamesCodingChallenge","commit_stats":null,"previous_names":["valyriantech/valyriangamescodingchallenge"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/ValyrianTech/ValyrianGamesCodingChallenge","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ValyrianTech%2FValyrianGamesCodingChallenge","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ValyrianTech%2FValyrianGamesCodingChallenge/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ValyrianTech%2FValyrianGamesCodingChallenge/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ValyrianTech%2FValyrianGamesCodingChallenge/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ValyrianTech","download_url":"https://codeload.github.com/ValyrianTech/ValyrianGamesCodingChallenge/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ValyrianTech%2FValyrianGamesCodingChallenge/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29542522,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-02-17T08:11:05.436Z","status":"ssl_error","status_checked_at":"2026-02-17T08:09:38.860Z","response_time":100,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-09-05T21:46:56.370Z","updated_at":"2026-02-17T11:31:26.684Z","avatar_url":"https://github.com/ValyrianTech.png","language":null,"funding_links":[],"categories":[],"sub_categories":[],"readme":"# Valyrian Games: Olympics of AI\n\n**The Ultimate AI Model Benchmarking Platform**\n\n---\n\n### 🔗 **Connect with ValyrianTech**\n[![Follow ValyrianTech](https://img.shields.io/badge/Follow-ValyrianTech-blue?style=for-the-badge\u0026logo=linktree)](https://linktr.ee/ValyrianTech)\n\n**[🌟 Follow me on all platforms → linktr.ee/ValyrianTech](https://linktr.ee/ValyrianTech)**\n\n---\n\n### 📊 **Quick Access to Key Results**\n- **[📈 View Full Qualification Results](qualification_results.md)** - Comprehensive model performance rankings and analytics\n- **[🎯 Challenge Prompts](prompts/)** - Core prompt templates used for challenge creation and solving\n  - [CreateCodingChallenge.txt](prompts/CreateCodingChallenge.txt) - Template for LLMs to create challenges\n  - [SolveCodingChallenge.txt](prompts/SolveCodingChallenge.txt) - Template for LLMs to solve challenges\n\n---\n\nWelcome to the Valyrian Games, an advanced AI benchmarking system that evaluates Large Language Models (LLMs) through rigorous coding challenges. This platform serves as the \"Olympics of AI,\" providing comprehensive performance analytics, cost analysis, and qualification metrics for AI models across multiple providers.\n\n## 🎯 System Overview\n\nThe Valyrian Games platform is a sophisticated benchmarking ecosystem designed to:\n\n- **Automatically generate coding challenges** using AI models\n- **Validate challenge quality** through multi-attempt solving\n- **Track comprehensive performance metrics** including cost, speed, and accuracy\n- **Provide executive-grade analytics** with professional visualizations\n- **Support automated model qualification** based on performance thresholds\n- **Enable cost-aware model selection** for optimal resource utilization\n\n## 🏗️ Architecture\n\nThe system consists of four core components working in harmony:\n\n### 1. **Challenge Generation \u0026 Execution**\nThe foundation system that orchestrates the complete challenge workflow:\n\n- **Docker Environment Management**: Automatically restarts containers for clean execution\n- **Challenge Creation**: Generates unique coding problems using specified LLM models\n- **Multi-Attempt Validation**: Tests each challenge multiple times to ensure solvability\n- **Performance Tracking**: Records tokens, costs, timing, and accuracy metrics\n- **Automated Classification**: Sorts results into `accepted/` or `rejected/` directories\n- **Statistics Integration**: Automatically triggers analysis updates after completion\n\n**Key Features:**\n- Configurable validation attempts (default: 3)\n- Adjustable success thresholds (default: 50%)\n- Timeout handling with graceful failure recording\n- Comprehensive conversation metrics extraction\n- Real-time progress monitoring with detailed logging\n\n### 2. **Intelligent Model Selection**\nAdvanced orchestration system for automated testing across the entire model fleet:\n\n- **Cost-Aware Selection**: Prioritizes cheaper models and those with fewer existing challenges\n- **Dynamic Disqualification**: Two-tier system removes poorly performing models\n  - **Early Disqualification**: ≥3 rejected challenges with 0 accepted\n  - **Statistical Disqualification**: \u003c50% acceptance rate after 10+ challenges\n- **Weighted Random Selection**: Balances data collection across models using cost and challenge count\n- **Qualified Model Pool**: Automatically loads qualified models from cost analysis data\n- **Batch Execution**: Supports multiple runs with configurable delays\n- **Auto-Visualization**: Generates updated charts after successful runs\n\n**Supported Model Providers:**\n- **OpenAI**: GPT-4.1, GPT-4o, O1, O3, O4 series (62+ models)\n- **Anthropic**: Claude Opus, Sonnet, Haiku series\n- **Google**: Gemini 2.5, 2.0, 1.5 series\n- **Mistral**: Magistral, Codestral, Devstral, Ministral series\n- **DeepSeek**: Chat and reasoning models\n- **Together.ai**: DeepSeek-R1, Qwen, GLM, Llama variants\n- **Groq**: High-speed inference models\n\n### 3. **Comprehensive Analytics Engine**\nSophisticated statistical analysis system that processes all challenge results:\n\n- **Multi-Dimensional Metrics**: Acceptance rate, success rate, cost efficiency, token usage\n- **Nested Directory Support**: Handles complex provider structures (Together.ai, Groq)\n- **Qualification Determination**: Automatic model qualification based on performance criteria\n- **Cost Summary Generation**: Creates `model_cost_summary.json` for downstream systems\n- **Executive Reporting**: Generates detailed markdown reports with model rankings\n- **Auto-Visualization Trigger**: Launches chart generation after analysis completion\n\n**Generated Outputs:**\n- `qualification_results.md`: Comprehensive performance report\n- `model_cost_summary.json`: Structured data for visualizations and model selection\n- Console reports with sortable metrics and detailed breakdowns\n\n### 4. **Professional Visualization Suite**\nExecutive-grade analytics dashboard with four distinct visualization types:\n\n#### **Cost vs Performance Scatter Plot**\n![Cost Performance Chart](valyrian_games_cost_performance.png)\n- **X-Axis**: Acceptance Rate (%) - primary performance metric\n- **Y-Axis**: Average Cost per Challenge ($) - logarithmic scale for better distribution\n- **Bubble Size**: Total challenges completed\n- **Color Coding**: Green (qualified) vs Red (disqualified)\n- **Smart Labeling**: Collision-aware model name placement\n\n#### **Model Comparison Bar Charts**\n![Model Comparison Chart](valyrian_games_model_comparison.png)\n- **Dual Charts**: Acceptance rates and costs side-by-side\n- **Sorted Rankings**: Performance-based ordering for easy comparison\n- **Value Labels**: Precise metrics displayed on each bar\n- **Qualification Status**: Color-coded qualification indicators\n\n#### **Executive Dashboard**\n![Executive Dashboard](valyrian_games_dashboard.png)\n- **Qualification Overview**: Pie chart showing qualified vs disqualified models\n- **Top Performers**: Top 5 models by acceptance rate\n- **Cost Efficiency Analysis**: Acceptance rate divided by cost with negative values for disqualified models\n- **Key Metrics Summary**: Total models, qualification rates, averages, and top performers\n- **Adaptive Labeling**: Smart label reduction for crowded charts\n- **Professional Sizing**: 20×14 inches optimized for presentations\n\n#### **Performance Heatmap**\n![Performance Heatmap](valyrian_games_heatmap.png)\n- **Multi-Dimensional View**: Success rate, acceptance rate, cost, challenges, efficiency\n- **Dual Visualization**: Raw values and normalized scores (0-1 scale)\n- **Acceptance Rate Sorting**: Models ordered by performance for easy pattern recognition\n- **Color Mapping**: Red-Yellow-Green scale for intuitive interpretation\n\n## 📊 Key Metrics \u0026 Terminology\n\n### **Performance Metrics**\n- **Acceptance Rate**: Percentage of challenges that meet the success threshold (primary metric)\n- **Success Rate**: Overall percentage of correct solution attempts across all challenges\n- **Cost Efficiency**: Acceptance rate divided by average cost per challenge\n- **Qualification Status**: Models with ≥50% acceptance rate after sufficient testing\n\n### **Cost Analysis**\n- **Average Cost per Challenge**: Total cost divided by number of challenges\n- **Token Efficiency**: Tokens per second during challenge execution\n- **Cost-Performance Ratio**: Balances model capability with economic efficiency\n\n### **Challenge Classification**\n- **Accepted Challenges**: Meet or exceed the success threshold (default: 50%)\n- **Rejected Challenges**: Fall below the success threshold or timeout\n- **Validation Attempts**: Number of solution attempts per challenge (default: 3)\n\n## 🚀 Quick Start Guide\n\n### **Running Individual Challenges**\nThe system supports running individual challenges with specific models and configurable parameters including temperature, validation attempts, and success thresholds.\n\n### **Automated Fleet Testing**\nThe system provides automated testing across qualified models with configurable parameters including:\n- Number of runs and delays between runs\n- Disqualification thresholds\n- Option to include expensive models\n- Verbose logging capabilities\n\n### **Analytics Generation**\nThe system provides comprehensive analytics and visualization capabilities including:\n- Statistical analysis with markdown report generation\n- Sortable metrics by various criteria (acceptance rate, cost, etc.)\n- Professional visualizations in multiple formats (PNG, SVG, PDF)\n- Executive dashboards and performance heatmaps\n\n## 📁 Directory Structure\n\n```\n/volumes/Serendipity/ValyrianGames/CodingChallenge/\n├── README.md                           # This comprehensive guide\n├── model_cost_summary.json             # Structured performance data\n├── qualification_results.md            # Detailed analysis report\n├── valyrian_games_cost_performance.png # Cost vs performance chart\n├── valyrian_games_model_comparison.png # Model comparison bars\n├── valyrian_games_dashboard.png        # Executive dashboard\n├── valyrian_games_heatmap.png         # Performance heatmap\n├── OpenAI:gpt-4.1-2025-04-14/         # Model-specific results\n│   ├── accepted/                       # Successful challenges\n│   │   ├── conversation_001.json       # Challenge result with metrics\n│   │   └── conversation_002.json\n│   └── rejected/                       # Failed challenges\n│       ├── conversation_003.json       # Failed challenge with reason\n│       └── conversation_004.json\n├── Anthropic:claude-3-5-sonnet-20241022/\n│   ├── accepted/\n│   └── rejected/\n└── [Additional model directories...]\n```\n\n## 🎮 Challenge Result Format\n\nEach challenge result is stored as a comprehensive JSON file containing:\n\n```json\n{\n  \"conversation_id\": \"unique_identifier\",\n  \"timestamp\": \"2025-01-29T12:00:00\",\n  \"status\": \"ACCEPTED|REJECTED\",\n  \"parameters\": {\n    \"validation_attempts\": 3,\n    \"success_threshold\": 0.5,\n    \"agent\": \"Contender\"\n  },\n  \"challenge\": {\n    \"challenge_prompt\": \"Create a function that...\",\n    \"example_code\": \"def solution():\",\n    \"expected_answer\": 42\n  },\n  \"validation_results\": {\n    \"total_attempts\": 3,\n    \"correct_answers\": 2,\n    \"success_rate\": 0.67,\n    \"accepted\": true\n  },\n  \"performance_metrics\": {\n    \"model_name\": \"OpenAI:gpt-4.1-2025-04-14\",\n    \"temperature\": 0.7,\n    \"total_completion_tokens\": 1250,\n    \"total_cost\": 0.0125,\n    \"total_elapsed_time\": 45.2,\n    \"tokens_per_second\": 27.6\n  },\n  \"solution_attempts\": [\n    {\n      \"filename\": \"challenge_candidate_solution_1.json\",\n      \"answer\": 42,\n      \"python_code\": \"def solution(): return 42\",\n      \"is_correct\": true\n    }\n  ]\n}\n```\n\n## 🏆 Qualification System\n\n### **Qualification Criteria**\nModels are automatically qualified based on:\n1. **Minimum Challenges**: At least 1 completed challenge\n2. **Acceptance Threshold**: ≥50% acceptance rate\n3. **Statistical Significance**: Performance maintained over multiple challenges\n\n### **Disqualification Rules**\nModels are disqualified through a two-tier system:\n1. **Early Disqualification**: ≥3 rejected challenges with 0 accepted\n2. **Statistical Disqualification**: \u003c50% acceptance rate after 10+ challenges\n\n### **Re-qualification**\nDisqualified models can re-qualify by:\n- Achieving successful challenge completions\n- Improving acceptance rate above 50%\n- Demonstrating consistent performance over time\n\n## 💡 Advanced Features\n\n### **Cost-Aware Selection**\nThe system intelligently balances:\n- **Model Performance**: Prioritizes higher-performing models\n- **Cost Efficiency**: Favors economical models for budget optimization\n- **Data Balance**: Ensures comprehensive testing across all models\n- **Quality Control**: Automatically removes consistently poor performers\n\n### **Docker Integration**\n- **Clean Execution Environment**: Containers restart before each challenge\n- **Isolation**: Prevents cross-contamination between challenges\n- **Reliability**: Ensures consistent execution conditions\n- **Scalability**: Supports concurrent challenge execution\n\n### **Automated Workflows**\n- **End-to-End Automation**: From challenge generation to visualization\n- **Failure Handling**: Graceful timeout and error management\n- **Progress Tracking**: Real-time status updates and logging\n- **Integration**: Seamless data flow between all components\n\n\n## 🔧 Configuration Options\n\n### **Challenge Parameters**\n- `--validation-attempts`: Number of solution attempts (1-10)\n- `--success-threshold`: Minimum success rate (0.0-1.0)\n- `--temperature`: Model creativity parameter (0.0-2.0)\n- `--solution-timeout`: Maximum time per solution (seconds)\n\n### **Selection Parameters**\n- `--disqualification-threshold`: Rejection limit before disqualification\n- `--include-expensive`: Include high-cost models in selection\n- `--category`: Test specific model categories only\n- `--use-static-pool`: Use hardcoded model list instead of qualified models\n\n### **Output Parameters**\n- `--save-markdown`: Generate detailed markdown reports\n- `--sort-by`: Sort results by specific metrics\n- `--verbose`: Enable detailed logging and progress updates\n- `--format`: Chart output format (png, svg, pdf)\n\n## 🎯 Use Cases\n\n### **AI Research \u0026 Development**\n- **Model Comparison**: Objective performance benchmarking\n- **Cost Analysis**: Budget optimization for AI deployments\n- **Capability Assessment**: Understanding model strengths and limitations\n- **Trend Analysis**: Tracking performance improvements over time\n\n### **Enterprise AI Strategy**\n- **Vendor Selection**: Data-driven model provider decisions\n- **Budget Planning**: Cost forecasting for AI initiatives\n- **Performance Monitoring**: Ongoing model evaluation\n- **Risk Assessment**: Identifying reliable vs unreliable models\n\n### **Academic Research**\n- **Benchmarking Studies**: Standardized model evaluation\n- **Performance Analysis**: Statistical model comparison\n- **Cost-Benefit Research**: Economic efficiency studies\n- **Longitudinal Studies**: Model evolution tracking\n\n## 🔧 System Requirements\n\n- **Python 3.8+** with required dependencies\n- **Docker** for containerized execution environment\n- **Sufficient Storage** for challenge results and visualizations\n- **API Access** to supported LLM providers\n\n### **Dependencies**\n```bash\npip install matplotlib seaborn pandas numpy requests\n```\n\n## 🏁 Conclusion\n\nThe Valyrian Games represents the pinnacle of AI model benchmarking, providing unprecedented insights into LLM performance, cost efficiency, and reliability. Through rigorous testing, comprehensive analytics, and professional visualizations, this platform empowers organizations to make informed decisions about AI model selection and deployment.\n\nWhether you're conducting academic research, optimizing enterprise AI costs, or simply curious about the latest AI capabilities, the Valyrian Games provides the tools and insights needed to navigate the rapidly evolving landscape of artificial intelligence.\n\n**Welcome to the Olympics of AI – may the best models win! 🏆**\n\n---\n\n*Generated by the Valyrian Games Analytics System*  \n*Last Updated: 2025-01-29*\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvalyriantech%2Fvalyriangamescodingchallenge","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fvalyriantech%2Fvalyriangamescodingchallenge","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvalyriantech%2Fvalyriangamescodingchallenge/lists"}