{"id":28731852,"url":"https://github.com/lpalbou/llm-basic-benchmark","last_synced_at":"2025-06-15T19:11:26.032Z","repository":{"id":298403488,"uuid":"999774978","full_name":"lpalbou/llm-basic-benchmark","owner":"lpalbou","description":"Comprehensive benchmark of 44 open source language models across creative writing, logic puzzles, counterfactual reasoning, and programming tasks. Tested on Apple M4 Max with detailed performance analysis.","archived":false,"fork":false,"pushed_at":"2025-06-10T23:05:39.000Z","size":204,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-06-11T00:19:35.895Z","etag":null,"topics":["benchmark","cogito","counterfactual","gemma3","granite3","llama3","llama4","llm","mlx","ollama","open-source","phi4","programming","puzzle","qwen3","writing"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/lpalbou.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-06-10T19:08:51.000Z","updated_at":"2025-06-10T23:05:42.000Z","dependencies_parsed_at":"2025-06-11T00:29:48.791Z","dependency_job_id":null,"html_url":"https://github.com/lpalbou/llm-basic-benchmark","commit_stats":null,"previous_names":["lpalbou/llm-basic-benchmark"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/lpalbou/llm-basic-benchmark","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lpalbou%2Fllm-basic-benchmark","tags_url":"https://repos.ecosyste.ms/a
pi/v1/hosts/GitHub/repositories/lpalbou%2Fllm-basic-benchmark/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lpalbou%2Fllm-basic-benchmark/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lpalbou%2Fllm-basic-benchmark/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/lpalbou","download_url":"https://codeload.github.com/lpalbou/llm-basic-benchmark/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lpalbou%2Fllm-basic-benchmark/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":260036785,"owners_count":22949266,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["benchmark","cogito","counterfactual","gemma3","granite3","llama3","llama4","llm","mlx","ollama","open-source","phi4","programming","puzzle","qwen3","writing"],"created_at":"2025-06-15T19:11:25.183Z","updated_at":"2025-06-15T19:11:25.990Z","avatar_url":"https://github.com/lpalbou.png","language":"Python","readme":"# Open Source Language Model Benchmark\n\nAn evaluation of 44 open source language models across four distinct tasks: creative writing, logical reasoning, counterfactual causality, and programming. This benchmark aims to provide practical insights into current open source LLM capabilities and performance characteristics.\n\n## Overview\n\nThis benchmark evaluates where we currently stand with open source language models, examining not just raw generative speed (tokens/second) but actual task completion effectiveness. 
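This distinction can be illustrated with a minimal sketch (not part of the repository) that contrasts raw generation speed with time-to-answer, using the field names Ollama reports in its responses (`eval_count`, `eval_duration`, `total_duration`; durations in nanoseconds). The numbers below are made up for illustration, not measured results:

```python
# Hypothetical sketch: raw tok/s vs. wall-clock time to a finished answer.
# Field names mirror Ollama response metrics; all numbers are illustrative.

def tokens_per_second(eval_count, eval_duration_ns):
    # Raw generative speed: output tokens divided by generation time.
    return eval_count / (eval_duration_ns / 1e9)

def time_to_answer_s(total_duration_ns):
    # What a task actually costs: wall-clock seconds per completed answer.
    return total_duration_ns / 1e9

# A concise model vs. a verbose over-reasoner:
concise = {'eval_count': 120, 'eval_duration': 4e9, 'total_duration': 5e9}
verbose = {'eval_count': 2400, 'eval_duration': 30e9, 'total_duration': 32e9}

for name, m in (('concise', concise), ('verbose', verbose)):
    tps = tokens_per_second(m['eval_count'], m['eval_duration'])
    tta = time_to_answer_s(m['total_duration'])
    print(f'{name}: {tps:.0f} tok/s, {tta:.0f} s to answer')
```

In this sketch the verbose model generates tokens faster (80 vs. 30 tok/s) yet takes over six times longer to deliver an answer, because it emits far more tokens before concluding.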
A key insight from this work is that the fastest generative model is not necessarily the fastest at getting correct answers—models that over-reason or are overly verbose can be slower despite higher token generation rates.\n\n## Testing Environment\n\n- **Hardware**: Apple M4 Max with 128GB unified memory\n- **Software**: Ollama 0.9.0 (`ollama serve`)\n- **Quantization**: All models tested in q4_K_M quantization (4-bit) unless specified\n- **Context Length**: Extended context window (`OLLAMA_CONTEXT_LENGTH=100000`) to handle complex reasoning tasks\n- **Special Case**: Qwen3-235B-A22B tested with MLX 3-bit quantization due to memory constraints\n\n## Models Tested (44 total)\n\n### General Purpose Models\n- **Cogito**: 3B, 8B, 14B, 32B, 70B (5 models)\n- **Gemma3**: 1B, 4B, 12B, 27B (4 models)  \n- **Granite3.3**: 2B, 8B (2 models)\n- **Qwen3**: 0.6B, 1.7B, 8B, 32B, 30B-A3B, 235B-A22B (6 models)\n- **Llama3/4**: 3.2:1B, 3.2:3B, 3.1:8B, 3.3:70B, 4:17B-Scout (5 models)\n\n### Specialized Models\n- **Coding Models**: Qwen2.5-Coder (0.5B-32B), CodeGemma, Codestral, Devstral, DeepCoder (8 models)\n- **Reasoning Models**: DeepSeek-R1 (1.5B, 8B, 32B variants), Phi4 (Mini, Reasoning variants) (9 models)\n- **Other**: Mistral 7B (1 model)\n\n## Benchmark Tasks\n\n### 1. Creative Writing ([Part 1](benchmark-part1.md))\n**Task**: Generate a 5-sentence short story  \n**Evaluation**: Coherence, creativity, adherence to length requirement, narrative structure\n\n### 2. Logic Puzzle ([Part 2](benchmark-part2-light.md))\n**Task**: Solve a deceptive riddle requiring careful logical reasoning  \n**Evaluation**: Correct answer identification, reasoning quality, avoidance of common logical traps\n\n### 3. Counterfactual Causality ([Part 3](benchmark-part3.md))\n**Task**: Analyze a scenario involving counterfactual reasoning about causation  \n**Evaluation**: Understanding of causal relationships, ability to reason about hypothetical scenarios\n\n### 4. 
Python Programming ([Part 4](benchmark-part4-light.md))\n**Task**: Generate a complete 3D physics simulation (bouncing ball with gravity)  \n**Evaluation**: Code correctness, execution success, physics accuracy, code quality\n\n## Key Performance Metrics\n\n### Ollama Performance Data\nEach test captures detailed performance metrics:\n\n| Metric | Description |\n|--------|-------------|\n| **Total Duration** | Complete operation time (loading + processing + generation) |\n| **Load Duration** | Model initialization time |\n| **Prompt Eval Count/Rate** | Input processing tokens and speed |\n| **Eval Count/Rate** | Output generation tokens and speed |\n\n### Evaluation Criteria\n- **Task Completion**: Did the model fulfill the specific requirements?\n- **Accuracy**: Was the response factually/logically correct?\n- **Efficiency**: How quickly did the model arrive at a correct solution?\n- **Quality**: Overall response quality and coherence\n\n## Key Findings\n\n### Performance vs. Size\n- Larger models don't always perform better on specific tasks\n- Parameter count correlates weakly with task-specific performance\n- Specialized models often outperform general-purpose models in their domain\n\n### Speed vs. 
Effectiveness\n- **Token Generation Speed ≠ Task Completion Speed**\n- Models with extensive reasoning chains can be slower despite high tok/s rates\n- Verbose models may appear productive but take longer to reach conclusions\n- Concise, accurate responses often indicate better practical performance\n\n### Architecture Insights\n- Multi-modal models may have different performance characteristics due to diverse training data\n- Reasoning-specialized models show improved performance on logical tasks but may over-analyze simple problems\n- Code-specialized models excel at programming but may struggle with general reasoning\n\n## Practical Implications\n\n### Model Selection Guidelines\n- **Task-Specific Performance** matters more than general benchmarks\n- **Multi-Agent Systems** using different specialized models may be optimal\n- **Context Requirements** significantly impact performance and should be considered\n- **Resource Constraints** (memory, inference time) are practical limiting factors\n\n### Real-World Considerations\n- Quality doesn't always correlate with response length\n- Different architectures excel at different task types\n- Quantization impacts should be evaluated per use case\n- Local deployment considerations (hardware, memory) affect model choice\n\n## Repository Structure\n\n```\n├── README.md                      # This file\n├── LICENSE                        # MIT License\n├── benchmark.md                   # Summary and methodology\n├── benchmark-part1.md             # Creative writing results\n├── benchmark-part2-light.md       # Logic puzzle results  \n├── benchmark-part3.md             # Counterfactual reasoning results\n├── benchmark-part4-light.md       # Programming task results\n├── mlx_chat.py                    # MLX model testing utility\n└── codes/                         # Generated code samples\n    ├── [model-name].py            # Programming task outputs\n    └── performance_data.py        # Performance analysis\n```\n\n## 
Testing Tools\n\n### MLX Chat Utility (`mlx_chat.py`)\n\nFor models that couldn't run with Ollama due to memory constraints (specifically Qwen3-235B-A22B), I provide a specialized MLX testing utility. This Python script enables testing of MLX-optimized models on Apple Silicon:\n\n**Features:**\n- Interactive chat interface for MLX models\n- Configurable thinking mode and token budgets\n- Support for large models (tested with 235B parameters)\n- Real-time response streaming\n- Conversation history management\n\n**Usage:**\n```bash\n# Test with default Qwen3-30B-A3B model\npython mlx_chat.py\n\n# Test with larger model (requires ~128GB RAM)\npython mlx_chat.py --model mlx-community/Qwen3-235B-A22B-3bit\n\n# Disable thinking mode for direct responses\npython mlx_chat.py --thinking-budget 0\n```\n\n**Requirements:**\n- Apple Silicon Mac with sufficient memory\n- MLX framework: `pip install mlx-lm`\n- For 235B model: 128GB unified memory recommended\n\n**Note:** MLX models are optimized for Apple Silicon and run ~20% faster than equivalent Ollama models, but results should be interpreted within this context when comparing performance metrics.\n\n## Important Notes\n\n- **Context Window Size**: Critical for complex reasoning tasks—many models perform significantly better with extended context\n- **Quantization Effects**: 4-bit quantization used throughout for consistency, but performance may vary with different quantization levels\n- **Hardware Specificity**: Results obtained on Apple Silicon; performance may differ on other architectures\n- **Model Versions**: Specific model versions and quantizations tested—results may not generalize to other versions\n\n## Acknowledgments\n\nThis benchmark was made possible by the incredible work of the open source community:\n\n### Model Developers\nI extend my gratitude to all the organizations and researchers who have made their language models freely available:\n- **Alibaba** (Qwen series)\n- **Google** (Gemma series)\n- **Meta** 
(Llama series)\n- **Microsoft** (Phi series)\n- **IBM** (Granite series)\n- **Mistral AI** (Mistral, Codestral, Devstral)\n- **DeepSeek** (DeepSeek-R1 series)\n- **DeepCogito** (Cogito series)\n- **And all other contributors** to the open source LLM ecosystem\n\n### Infrastructure and Tools\n- **[Ollama](https://ollama.ai/)** - For providing an excellent local LLM serving platform that made testing 40+ models seamless\n- **[Apple MLX](https://github.com/ml-explore/mlx)** - For the MLX framework enabling efficient inference on Apple Silicon\n- **[MLX-LM](https://github.com/ml-explore/mlx-examples/tree/main/llms)** - For the high-level MLX interface used in our custom testing utility\n- **[Hugging Face](https://huggingface.co/)** - For hosting and distributing the quantized models\n\n### Hardware\n- **Apple** - For the M4 Max chip and unified memory architecture that enabled testing of large models locally\n\n### Analysis and Documentation\n- The open source AI community's collaborative spirit makes research like this possible. These benchmarks aim to contribute back to the community by providing practical performance insights.\n\n- Claude 4 Sonnet, for its valuable assistance in analyzing benchmark results, identifying data inconsistencies, and helping to structure comprehensive documentation\n\n## License\n\nThis project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.\n\n## Citation\n\nIf you use this benchmark in your research or analysis, please reference this repository and note the specific testing conditions (hardware, software versions, quantization levels) as they significantly impact results.\n\n---\n\n*This benchmark provides a snapshot of open source LLM capabilities as of the testing date. 
The rapidly evolving nature of this field means results should be interpreted within their temporal and technical context.* \n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flpalbou%2Fllm-basic-benchmark","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Flpalbou%2Fllm-basic-benchmark","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flpalbou%2Fllm-basic-benchmark/lists"}