{"id":31762316,"url":"https://github.com/foadsf/opencl-windows-examples","last_synced_at":"2026-05-19T05:43:13.934Z","repository":{"id":317744951,"uuid":"1068634414","full_name":"Foadsf/opencl-windows-examples","owner":"Foadsf","description":"Practical OpenCL examples for Windows with C/C++, demonstrating GPU computing across NVIDIA and Intel hardware","archived":false,"fork":false,"pushed_at":"2025-10-02T18:27:48.000Z","size":47,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"master","last_synced_at":"2025-10-02T20:30:39.911Z","etag":null,"topics":["cmake","cpp","gpu-computing","intel","nvidia","opencl","parallel-computing","windows"],"latest_commit_sha":null,"homepage":null,"language":"Batchfile","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Foadsf.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-10-02T17:16:38.000Z","updated_at":"2025-10-02T18:27:51.000Z","dependencies_parsed_at":"2025-10-02T20:30:41.561Z","dependency_job_id":"2969af68-082f-458b-9a79-72e3153796f3","html_url":"https://github.com/Foadsf/opencl-windows-examples","commit_stats":null,"previous_names":["foadsf/opencl-windows-examples"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/Foadsf/opencl-windows-examples","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Foadsf%2Fopencl-windows-examples","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/reposito
ries/Foadsf%2Fopencl-windows-examples/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Foadsf%2Fopencl-windows-examples/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Foadsf%2Fopencl-windows-examples/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Foadsf","download_url":"https://codeload.github.com/Foadsf/opencl-windows-examples/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Foadsf%2Fopencl-windows-examples/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":279002122,"owners_count":26083307,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-09T02:00:07.460Z","response_time":59,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cmake","cpp","gpu-computing","intel","nvidia","opencl","parallel-computing","windows"],"created_at":"2025-10-09T22:18:03.231Z","updated_at":"2025-10-09T22:18:09.096Z","avatar_url":"https://github.com/Foadsf.png","language":"Batchfile","funding_links":[],"categories":[],"sub_categories":[],"readme":"# OpenCL Examples for Windows (C/C++)\n\nA collection of practical OpenCL examples demonstrating GPU computing on Windows using Visual Studio 2019 and CMake.\n\n## System Requirements\n\n- **OS**: Windows 10/11\n- **Compiler**: Visual Studio 2019 Build Tools (MSVC 19.29+)\n- 
**CMake**: 3.15+ (bundled with VS Build Tools)\n- **OpenCL Runtimes**: At least one of:\n  - NVIDIA CUDA Toolkit 12.x (for NVIDIA GPUs)\n  - Intel Graphics Driver (for Intel integrated GPUs)\n  - Intel oneAPI Base Toolkit (for CPU execution)\n\n## Hardware Tested\n\n- **GPU**: NVIDIA RTX A2000 Laptop GPU\n- **iGPU**: Intel UHD Graphics (11th Gen)\n- **CPU**: Intel Core i7-11850H @ 2.50GHz\n\n## Repository Structure\n\n```\nopencl-windows-cpp/\n├── examples/\n│   ├── 000_device_enumeration/    # List all OpenCL platforms and devices\n│   ├── 001_hello_opencl/          # Simple \"Hello World\" kernel\n│   ├── 002_vector_addition/       # CPU vs GPU performance comparison\n│   ├── 003_breakeven_analysis/    # Find OpenCL performance crossover points\n│   ├── 004_async_multidevice/     # Concurrent execution across devices\n│   ├── 005_parallelization_comparison/ # Serial vs std::execution vs OpenMP vs OpenCL\n│   ├── 006_matrix_multiply/       # Tiled matrix multiplication\n│   └── 007_image_convolution/     # Separable image convolution\n├── setup/\n│   ├── check_opencl_installed.bat # Verify OpenCL installation\n│   └── detect_opencl_hardware.bat # Hardware detection script\n└── README.md\n```\n\n## Quick Start\n\n### 1. Verify OpenCL Installation\n\n```cmd\ncd setup\ncheck_opencl_installed.bat\n```\n\nExpected output: Lists installed OpenCL runtimes and detects your hardware.\n\n### 2. 
Build and Run an Example\n\n```cmd\ncd examples\\001_hello_opencl\nbuild.bat\n```\n\nEach example includes:\n- `main.cpp` - Host code\n- `*.cl` - OpenCL kernel(s)\n- `CMakeLists.txt` - CMake configuration\n- `build.bat` - Build and run script\n- `README.md` - Example documentation\n\n## Examples Overview\n\n### 000: Device Enumeration\n**Purpose**: Detect all OpenCL platforms and devices on your system.\n\n**Key Concepts**: Platform querying, device properties\n\n```cmd\ncd examples\\000_device_enumeration\nbuild.bat\n```\n\n**Output**: Lists all GPUs and CPUs with their capabilities.\n\n---\n\n### 001: Hello OpenCL\n**Purpose**: Simplest possible GPU kernel - write \"Hello from GPU!\" to a buffer.\n\n**Key Concepts**: Context creation, kernel compilation, buffer management\n\n```cmd\ncd examples\\001_hello_opencl\nbuild.bat\n```\n\n**Expected Output**:\n```\nUsing device: NVIDIA RTX A2000 Laptop GPU\nKernel output: Hello from GPU!\nSuccess!\n```\n\n---\n\n### 002: Vector Addition\n**Purpose**: Compare CPU vs GPU performance for vector addition.\n\n**Key Concepts**: Memory transfer overhead, parallel execution\n\n```cmd\ncd examples\\002_vector_addition\nbuild.bat\n```\n\n**Results** (10M elements):\n- Serial C++: 6.12ms (baseline)\n- OpenCL GPU: 24.23ms (SLOWER - memory-bound operation)\n\n**Lesson**: Simple operations don't benefit from GPUs due to transfer overhead.\n\n---\n\n### 003: Breakeven Analysis\n**Purpose**: Find the vector size where OpenCL becomes faster than serial C++.\n\n**Key Concepts**: Performance profiling, scaling analysis\n\n```cmd\ncd examples\\003_breakeven_analysis\nbuild.bat\n```\n\n**Key Findings**:\n- **NVIDIA RTX A2000**: Faster at 64K elements\n- **Intel UHD Graphics**: Faster at 256K elements\n\n**Lesson**: GPUs require sufficient workload to amortize overhead.\n\n---\n\n### 004: Async Multi-Device\n**Purpose**: Execute kernels simultaneously across multiple devices.\n\n**Key Concepts**: Multiple contexts, asynchronous execution, 
cross-platform limitations\n\n```cmd\ncd examples\\004_async_multidevice\nbuild.bat\n```\n\n**Important**: Timing analysis across platforms is unreliable - each vendor uses different time references.\n\n---\n\n### 005: Parallelization Technologies Comparison\n**Purpose**: Compare Serial C++, C++17 std::execution, OpenMP, and OpenCL for matrix-vector multiplication.\n\n**Key Concepts**: Technology trade-offs, memory bandwidth limits\n\n```cmd\ncd examples\\005_parallelization_comparison\nbuild.bat\n```\n\n**Results** (4096×4096 matrix):\n- Serial: 16.4ms\n- OpenMP: 1.5ms (11x speedup) - **Winner**\n- OpenCL GPU: 23.4ms (SLOWER - still memory-bound)\n\n**Lesson**: OpenMP dominates moderate-intensity operations.\n\n---\n\n### 006: Matrix Multiplication - Where GPUs Shine\n**Purpose**: Demonstrate compute-intensive operations where GPUs provide massive speedups.\n\n**Key Concepts**: O(n³) complexity, tiling optimization, GFLOPS\n\n```cmd\ncd examples\\006_matrix_multiply\nbuild.bat\n```\n\n**Results** (2048×2048 matrices):\n- Serial: 16,306ms\n- OpenMP: 5,242ms (3.1x)\n- **NVIDIA GPU (tiled): 105ms (155x speedup)** - **Winner**\n\n**Lesson**: Matrix multiply is the canonical GPU-suitable computation.\n\n---\n\n### 007: Image Convolution - GPU Performance Sweet Spot\n**Purpose**: Show when/where/why OpenCL becomes the optimal choice for image processing.\n\n**Key Concepts**: Arithmetic intensity, separable filters, local memory optimization\n\n```cmd\ncd examples\\007_image_convolution\nbuild.bat\n```\n\n**Results** (4096×4096 image, 15×15 kernel):\n- Serial: 3,370ms\n- OpenMP: 526ms (6.4x)\n- **Intel CPU OpenCL (separable): 22.5ms (150x speedup)** - **Winner**\n- NVIDIA GPU (separable): 28.7ms (117x)\n\n**Lesson**: Image processing with large kernels is OpenCL's sweet spot. 
Separable decomposition is critical.\n\n---\n\n## Performance Summary Across All Examples\n\n| Operation Type | Arithmetic Intensity | Winner | Best Speedup |\n|----------------|---------------------|--------|--------------|\n| Vector addition | Very low (1 op/access) | CPU (Serial) | 1x |\n| Matrix-vector | Low (10 ops/access) | OpenMP | 11x |\n| Matrix multiply | High (2000 ops/access) | OpenCL GPU | 155x |\n| Convolution (small kernel) | Low (9 ops/access) | OpenMP | 1.4x |\n| Convolution (large kernel) | Very high (225 ops/access) | OpenCL | 150x |\n\n**Key Insight**: OpenCL's advantage grows sharply with arithmetic intensity, from no gain at 1 op/access to 150x or more above 200 ops/access.\n\n## When to Use Each Technology\n\n**Use Serial C++ when:**\n- Dataset is tiny (\u003c 1K elements)\n- Algorithm has poor parallelism\n- Prototyping and validation\n\n**Use OpenMP when:**\n- Arithmetic intensity is moderate (5-50 ops per memory access)\n- Quick parallelization needed\n- Expect 6-12x speedup across diverse workloads\n\n**Use OpenCL GPU when:**\n- Arithmetic intensity is high (\u003e 50 ops per memory access)\n- Large datasets (\u003e 10M elements)\n- 100x+ speedup justifies development complexity\n\n**Use OpenCL CPU when:**\n- Data must stay in CPU memory\n- Cache-friendly with data reuse\n- Can outperform discrete GPUs for specific patterns\n\n## Building from Source\n\nAll examples use the same build pattern:\n\n```cmd\ncd examples\\\u003cexample_name\u003e\nbuild.bat\n```\n\nThe `build.bat` script:\n1. Creates a `build/` directory\n2. Runs CMake to generate Visual Studio projects\n3. Builds in Release configuration\n4. Copies kernel files (`.cl`) to output directory\n5. Runs the executable\n\n### Manual Build (Advanced)\n\n```cmd\nmkdir build \u0026\u0026 cd build\ncmake .. -G \"Visual Studio 16 2019\" -A x64\ncmake --build . 
--config Release\ncd Release\n\u003cexecutable\u003e.exe\n```\n\n## Troubleshooting\n\n### \"No OpenCL platforms found\"\n- Install at least one OpenCL runtime (NVIDIA CUDA, Intel Graphics Driver, or Intel oneAPI)\n- Run `setup\\check_opencl_installed.bat` to verify\n\n### \"Failed to open kernel file\"\n- Ensure `.cl` files are in the same directory as the executable\n- Check that `build.bat` copies kernel files correctly\n\n### Device enumeration hangs\n- Update Intel Graphics drivers to latest version\n- Remove legacy Intel OpenCL CPU Runtime if installed alongside Intel oneAPI\n- See: https://github.com/intel/compute-runtime\n\n### CMake not found\n- Install Visual Studio 2019 Build Tools with \"Desktop development with C++\"\n- Or install standalone CMake from https://cmake.org\n\n## Performance Tips\n\n1. **Memory-bound operations** (like vector addition) don't benefit much from GPUs due to transfer overhead\n2. **Compute-intensive operations** (matrix multiplication, image processing) show significant GPU speedup\n3. **Breakeven point** varies by hardware - test with realistic data sizes\n4. **Multi-device execution** works best when devices have independent work chunks\n\n## Learning Path\n\nRecommended order:\n1. **000_device_enumeration** - Understand your hardware\n2. **001_hello_opencl** - Learn basic OpenCL workflow\n3. **002_vector_addition** - See why simple operations are slow\n4. **003_breakeven_analysis** - Find when GPU acceleration helps\n5. 
**004_async_multidevice** - Advanced: use all devices simultaneously\n6. **005_parallelization_comparison** - Compare serial C++, std::execution, OpenMP, and OpenCL\n7. **006_matrix_multiply** - See where GPUs deliver large speedups\n8. **007_image_convolution** - Find OpenCL's sweet spot with separable filters\n\n## Common Issues \u0026 Solutions\n\n### Issue: OpenCL hangs during platform enumeration\n**Cause**: Conflicting Intel OpenCL runtimes  \n**Solution**: Uninstall legacy \"Intel OpenCL CPU Runtime 16.x\", keep only Intel oneAPI\n\n### Issue: Build errors about `std::to_string`\n**Cause**: Missing `\u003cstring\u003e` header  \n**Solution**: Add `#include \u003cstring\u003e` at top of file\n\n### Issue: \"Cannot create context with devices from multiple platforms\"\n**Cause**: Trying to use devices from NVIDIA + Intel in single context  \n**Solution**: Create separate contexts per platform (see example 004)\n\n## Resources\n\n- **OpenCL Programming Guide**: https://www.khronos.org/opencl/\n- **NVIDIA OpenCL Best Practices**: https://docs.nvidia.com/cuda/opencl-best-practices-guide/\n- **Intel OpenCL Documentation**: https://www.intel.com/content/www/us/en/developer/tools/opencl-sdk/overview.html\n\n## License\n\nMIT License - See LICENSE file for details\n\n## Contributing\n\nContributions welcome! Please ensure:\n- Code follows existing style\n- Examples build cleanly on Windows + MSVC 2019\n- Include README.md for new examples\n- Test on at least one GPU platform\n\n## Acknowledgments\n\nExamples developed and tested on Windows 10 with:\n- Visual Studio 2019 Build Tools\n- NVIDIA CUDA Toolkit 12.9\n- Intel oneAPI 2025.1\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffoadsf%2Fopencl-windows-examples","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ffoadsf%2Fopencl-windows-examples","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffoadsf%2Fopencl-windows-examples/lists"}