{"id":24680269,"url":"https://github.com/herniqeu/sum_thread_benchmark","last_synced_at":"2026-02-09T13:32:20.174Z","repository":{"id":260768329,"uuid":"882291207","full_name":"herniqeu/sum_thread_benchmark","owner":"herniqeu","description":" comparing parallel sum implementations across c++, go, haskell and python","archived":false,"fork":false,"pushed_at":"2025-01-26T01:18:10.000Z","size":3946,"stargazers_count":2,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-10-08T12:40:24.299Z","etag":null,"topics":["benchmark","cpp","go","haskell","parallel-computing","python"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/herniqeu.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2024-11-02T12:30:08.000Z","updated_at":"2025-01-26T02:22:30.000Z","dependencies_parsed_at":null,"dependency_job_id":"360d1906-6948-48b4-8b63-5c17e5877a25","html_url":"https://github.com/herniqeu/sum_thread_benchmark","commit_stats":null,"previous_names":["herniqeu/sum_thread_benchmark"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/herniqeu/sum_thread_benchmark","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/herniqeu%2Fsum_thread_benchmark","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/herniqeu%2Fsum_thread_benchmark/tags","releases_url":"https://repos.ecosyste.m
s/api/v1/hosts/GitHub/repositories/herniqeu%2Fsum_thread_benchmark/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/herniqeu%2Fsum_thread_benchmark/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/herniqeu","download_url":"https://codeload.github.com/herniqeu/sum_thread_benchmark/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/herniqeu%2Fsum_thread_benchmark/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29266958,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-02-09T12:53:16.161Z","status":"ssl_error","status_checked_at":"2026-02-09T12:52:30.244Z","response_time":56,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["benchmark","cpp","go","haskell","parallel-computing","python"],"created_at":"2025-01-26T14:12:52.870Z","updated_at":"2026-02-09T13:32:20.131Z","avatar_url":"https://github.com/herniqeu.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Multi-Language Parallel Sum Benchmark Suite\n\n![Log Execution Time vs Input Size](/img/log_execution_time_vs_input_size.png)\n\n## Technical Overview\n\nThis repository implements a parallel sum algorithm across multiple programming languages (C++, Go, Haskell, and Python), demonstrating various parallel programming paradigms and their 
performance characteristics in a controlled environment.\n\n### System Configuration\n- Platform: Linux (WSL2) x86_64\n- Processor: 6 physical cores, 12 logical cores\n- Memory: 7.61 GB\n- Compiler/Runtime Versions:\n  - GCC (C++17)\n  - Go 1.11+\n  - GHC 8.0+\n  - Python 3.8.10\n\n## Implementation Architecture\n\n### 1. C++ Implementation (`sum_thread.cpp`)\n- Uses RAII principles with `std::thread` and modern C++ features\n- Leverages `std::accumulate` for vectorized operations\n- Key optimizations:\n  ```cpp\n  partial_sum = std::accumulate(numbers.begin() + start_index, \n                               numbers.begin() + end_index, 0LL);\n  ```\n- Thread-safe design through data partitioning\n- Performance: Best mean execution time (167.33ms ± 121.23ms)\n\n### 2. Go Implementation (`sum_thread.go`)\n- Utilizes goroutines and channels for communication\n- Efficient memory management through Go's runtime\n- Performance: Second-best mean execution time (203.35ms ± 179.97ms)\n\n### 3. Haskell Implementation (`sum_thread.hs`)\n- Employs Software Transactional Memory (STM)\n- Pure functional approach with MVars for synchronization\n- Performance: Third place (745.29ms ± 1129.14ms)\n\n### 4. Python Implementation (`sum_thread.py`)\n- GIL-constrained threading model\n- Numpy-optimized operations where possible\n- Performance: Baseline reference (2446.33ms ± 3862.51ms)\n\n## Statistical Analysis\n\n### Performance Metrics\n\n#### 1. Execution Time Distribution [BoxPlot_Reference]\n- C++: Lowest variability (CV: 72.45%)\n- Go: Moderate variability (CV: 88.50%)\n- Haskell: High variability (CV: 151.51%)\n- Python: Highest variability (CV: 157.89%)\n\n#### 2. 
Statistical Significance\n- ANOVA results: F(3, 596) = 42.04, p \u003c 1.18e-24\n- Tukey HSD Analysis:\n  - C++ vs Python: Significant (p \u003c 0.001)\n  - Go vs Python: Significant (p \u003c 0.001)\n  - Haskell vs Python: Significant (p \u003c 0.001)\n  - C++ vs Go: Non-significant (p = 0.9987)\n\n### Scaling Analysis\n\n#### 1. Small-Scale Performance (n ≤ 100,000)\n- All languages perform similarly\n- Mean execution times within 5% range\n- Low coefficient of variation (\u003c2%)\n\n#### 2. Medium-Scale Performance (100,000 \u003c n ≤ 1,000,000)\n- C++ and Go maintain consistent performance\n- Python shows linear degradation\n- Haskell begins showing increased variance\n\n#### 3. Large-Scale Performance (n \u003e 1,000,000)\n- C++: Best scaling (493.23ms at n=25M)\n- Go: Linear scaling (701.25ms at n=25M)\n- Haskell: Exponential growth (3885.12ms at n=25M)\n- Python: Significant degradation (12866.13ms at n=25M)\n\n## Performance Characteristics\n\n### 1. Memory Efficiency\n- C++: Lowest memory footprint (0.49% mean usage)\n- Go: Efficient garbage collection (0.67% mean usage)\n- Haskell: Higher memory overhead (3.28% mean usage)\n- Python: Significant memory usage (2.84% mean usage)\n\n### 2. Thread Scaling\n- Linear scaling up to physical core count (6)\n- Diminishing returns beyond logical core count (12)\n- False sharing effects visible in large datasets\n\n### 3. Cache Effects\n- Visible in performance jumps at:\n  - L1 cache boundary (~32KB)\n  - L2 cache boundary (~256KB)\n  - L3 cache boundary (~12MB)\n\n## Technical Insights\n\n1. **Algorithmic Complexity**\n   - Theoretical: O(n/p), where p = thread count\n   - Practical: Limited by memory bandwidth\n   - Cache coherency overhead significant at thread boundaries\n\n2. **Memory Access Patterns**\n   - Sequential access benefits from hardware prefetching\n   - Thread-local summation minimizes false sharing\n   - NUMA effects visible in large datasets\n\n3. 
**Synchronization Overhead**\n   - Minimal in C++ (final reduction only)\n   - Channel-based in Go (negligible impact)\n   - STM overhead in Haskell (significant at scale)\n   - GIL contention in Python (major bottleneck)\n\n## Benchmark Runner Architecture\n\n### Core Components (`run_benchmarks.py`)\n\n1. **Execution Pipeline**\n```python\ndef run_single_test(self, executable, size, num_threads, lang):\n    process = psutil.Popen(\n        cmd,\n        stdout=subprocess.PIPE,\n        stderr=subprocess.PIPE\n    )\n    metrics_samples.append(self._measure_process_metrics(process))\n```\n- Real-time metrics collection\n- Process isolation per test\n- Resource monitoring (CPU, memory, threads)\n\n2. **Compilation Strategy**\n- C++: `-O3 -pthread` optimizations\n- Go: Native build with race detector\n- Haskell: `-O2 -threaded` GHC flags (optimized build linked against the threaded RTS)\n- Python: No build step; CPython compiles the source to bytecode and interprets it (no JIT)\n\n### Statistical Methodology\n\n1. **Sampling Framework**\n- Sample size: 150 runs per language\n- Confidence level: 95%\n- Power analysis: statistical power (1 − β) \u003e 0.95\n\n2. **Variance Analysis**\n```text\nLanguage   Mean (ms)    Std Dev    CV (%)\n----------------------------------------\nC++        167.33      121.64     72.45\nGo         203.35      180.57     88.50\nHaskell    745.29     1132.93    151.51\nPython    2446.33     3875.45    157.89\n```\n\n3. **Distribution Characteristics**\n- Non-normal distributions (Shapiro-Wilk test)\n  - C++: W = 0.5886, p \u003c 8.92e-19\n  - Go: W = 0.6002, p \u003c 1.59e-18\n  - Haskell: W = 0.6073, p \u003c 2.27e-18\n  - Python: W = 0.6411, p \u003c 1.33e-17\n\n## Detailed Performance Analysis\n\n### 1. Small-Scale Efficiency (n ≤ 100,000)\n```text\nSize: 100,000\nLanguage   Mean (ms)    CV (%)\n----------------------------\nC++        103.77      1.29\nGo         104.60      1.12\nHaskell    104.08      1.15\nPython     104.62      1.25\n```\n- Negligible performance differences\n- Cache-resident data sets\n- Linear scaling with threads\n\n### 2. 
Medium-Scale Behavior (n = 1,000,000)\n```text\nLanguage   Mean (ms)    CV (%)\n----------------------------\nC++        103.40      0.48\nGo         105.70      1.11\nHaskell    145.27     34.25\nPython     406.15      0.49\n```\n- Memory hierarchy effects become visible\n- Thread synchronization overhead emerges\n- GC pressure in managed languages\n\n### 3. Large-Scale Performance (n = 25,000,000)\n```text\nLanguage   Mean (ms)    CV (%)\n----------------------------\nC++        493.23     16.59\nGo         701.25     12.38\nHaskell   3885.12     17.08\nPython   12866.13     22.14\n```\n- Memory bandwidth saturation\n- NUMA effects dominant\n- Garbage collection overhead significant\n\n## Technical Optimizations\n\n### 1. Memory Management\n- Thread-local accumulation\n- Cache-line alignment\n- NUMA-aware thread pinning\n- False sharing mitigation\n\n### 2. Synchronization Strategies\n```cpp\n// C++: Lock-free accumulation\nstd::vector\u003cstd::thread\u003e threads;\nstd::vector\u003cSumThread\u003e sum_threads;\n```\n\n```haskell\n-- Haskell: MVar-based synchronization\nresults \u003c- replicateM numWorkers newEmptyMVar\n```\n\n```python\n# Python: Thread pooling\nthread = SumThread(numbers, start_idx, end_idx)\nthreads.append(thread)\n```\n\n### 3. Compiler Optimizations\n- Loop unrolling\n- Vectorization\n- Constant folding\n- Dead code elimination\n\n## Performance Bottlenecks\n\n1. **Language-Specific Limitations**\n- Python: GIL contention\n- Haskell: Garbage collection pauses\n- Go: Channel communication overhead\n- C++: Cache coherency protocol\n\n2. **Hardware Constraints**\n- Memory bandwidth saturation\n- Cache line bouncing\n- NUMA access patterns\n- Thread scheduling overhead\n\n## Future Optimizations\n\n1. **Implementation Improvements**\n- SIMD vectorization\n- Cache-oblivious algorithms\n- Work-stealing schedulers\n- Dynamic thread pooling\n\n2. 
**Measurement Enhancements**\n- Hardware performance counters\n- Cache miss profiling\n- Branch prediction statistics\n- Memory bandwidth utilization\n\n## Conclusions\n\n1. **Performance Hierarchy**\n- C++ provides best raw performance\n- Go offers good balance of performance and safety\n- Haskell shows competitive small-scale performance\n- Python suitable for prototype development\n\n2. **Scaling Characteristics**\n- Linear scaling up to physical core count\n- Memory bandwidth becomes bottleneck at scale\n- Thread synchronization overhead increases with dataset size\n\n3. **Statistical Significance**\n- ANOVA confirms significant differences (p \u003c 1.18e-24)\n- Tukey HSD shows clear language performance tiers\n- Non-normal distribution suggests complex performance factors\n\n![Log Execution Time vs Input Size](/img/log_execution_time_vs_input_size.png)\n![CPU Usage](/img/cpu_usage.png)\n![Execution Time](/img/execution_time.png)\n![Execution Time vs Input Size](/img/execution_time_vs_input_size.png)\n![Memory Usage by Performance](/img/memory_usage_by_perfomance.png)\n![Performance Consistency](/img/perfomance_consistency.png)\n![Speedup Number of Threads](/img/speedup_number_of_threads.png)\n![Statistical Confidence](/img/statistical_confidence.png)\n![Thread Scaling Language](/img/thread_scaling_language.png)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fherniqeu%2Fsum_thread_benchmark","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fherniqeu%2Fsum_thread_benchmark","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fherniqeu%2Fsum_thread_benchmark/lists"}