https://github.com/foadsf/opencl-windows-examples

Practical OpenCL examples for Windows with C/C++, demonstrating GPU computing across NVIDIA and Intel hardware

cmake cpp gpu-computing intel nvidia opencl parallel-computing windows


Host: GitHub
URL: https://github.com/foadsf/opencl-windows-examples
Owner: Foadsf
Created: 2025-10-02T17:16:38.000Z (3 months ago)
Default Branch: master
Last Pushed: 2025-10-02T18:27:48.000Z (3 months ago)
Last Synced: 2025-10-02T20:30:39.911Z (3 months ago)
Topics: cmake, cpp, gpu-computing, intel, nvidia, opencl, parallel-computing, windows
Language: Batchfile
Size: 45.9 KB
Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md


# OpenCL Examples for Windows (C/C++)

A collection of practical OpenCL examples demonstrating GPU computing on Windows using Visual Studio 2019 and CMake.

## System Requirements

- **OS**: Windows 10/11
- **Compiler**: Visual Studio 2019 Build Tools (MSVC 19.29+)
- **CMake**: 3.15+ (bundled with VS Build Tools)
- **OpenCL Runtimes**: At least one of:
  - NVIDIA CUDA Toolkit 12.x (for NVIDIA GPUs)
  - Intel Graphics Driver (for Intel integrated GPUs)
  - Intel oneAPI Base Toolkit (for CPU execution)

## Hardware Tested

- **GPU**: NVIDIA RTX A2000 Laptop GPU
- **iGPU**: Intel UHD Graphics (11th Gen)
- **CPU**: Intel Core i7-11850H @ 2.50GHz

## Repository Structure

```
opencl-windows-cpp/
├── examples/
│ ├── 000_device_enumeration/ # List all OpenCL platforms and devices
│ ├── 001_hello_opencl/ # Simple "Hello World" kernel
│ ├── 002_vector_addition/ # CPU vs GPU performance comparison
│ ├── 003_breakeven_analysis/ # Find OpenCL performance crossover points
│ └── 004_async_multidevice/ # Concurrent execution across devices
├── setup/
│ ├── check_opencl_installed.bat # Verify OpenCL installation
│ └── detect_opencl_hardware.bat # Hardware detection script
└── README.md
```

## Quick Start

### 1. Verify OpenCL Installation

```cmd
cd setup
check_opencl_installed.bat
```

Expected output: Lists installed OpenCL runtimes and detects your hardware.

### 2. Build and Run an Example

```cmd
cd examples\001_hello_opencl
build.bat
```

Each example includes:
- `main.cpp` - Host code
- `*.cl` - OpenCL kernel(s)
- `CMakeLists.txt` - CMake configuration
- `build.bat` - Build and run script
- `README.md` - Example documentation

## Examples Overview

### 000: Device Enumeration
**Purpose**: Detect all OpenCL platforms and devices on your system.

**Key Concepts**: Platform querying, device properties

```cmd
cd examples\000_device_enumeration
build.bat
```

**Output**: Lists all GPUs and CPUs with their capabilities.

---

### 001: Hello OpenCL
**Purpose**: Simplest possible GPU kernel - write "Hello from GPU!" to a buffer.

**Key Concepts**: Context creation, kernel compilation, buffer management

```cmd
cd examples\001_hello_opencl
build.bat
```

**Expected Output**:
```
Using device: NVIDIA RTX A2000 Laptop GPU
Kernel output: Hello from GPU!
Success!
```

---

### 002: Vector Addition
**Purpose**: Compare CPU vs GPU performance for vector addition.

**Key Concepts**: Memory transfer overhead, parallel execution

```cmd
cd examples\002_vector_addition
build.bat
```

**Results** (10M elements):
- Serial C++: 6.12ms (baseline)
- OpenCL GPU: 24.23ms (SLOWER - memory-bound operation)

**Lesson**: Memory-bound operations such as vector addition gain little from a GPU because transfer overhead outweighs the trivial compute.
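
As a point of reference, a serial baseline of this kind might be measured as in the sketch below. The function names and the `float` element type are assumptions made for illustration; the repository's own `main.cpp` may differ.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// Serial reference: c[i] = a[i] + b[i].
// One addition per three memory accesses, so the loop is memory-bound.
// Assumes a, b, and c all have the same length.
void vector_add_serial(const std::vector<float>& a,
                       const std::vector<float>& b,
                       std::vector<float>& c) {
    for (std::size_t i = 0; i < a.size(); ++i) {
        c[i] = a[i] + b[i];
    }
}

// Returns the wall-clock time of one serial pass in milliseconds.
// steady_clock is monotonic, so the result is never negative.
double time_vector_add_ms(const std::vector<float>& a,
                          const std::vector<float>& b,
                          std::vector<float>& c) {
    const auto start = std::chrono::steady_clock::now();
    vector_add_serial(a, b, c);
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

For a like-for-like comparison, the OpenCL measurement would need to cover the host-to-device and device-to-host copies as well as the kernel itself, since those copies are the overhead the lesson refers to.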

---

### 003: Breakeven Analysis
**Purpose**: Find the vector size where OpenCL becomes faster than serial C++.

**Key Concepts**: Performance profiling, scaling analysis

```cmd
cd examples\003_breakeven_analysis
build.bat
```

**Key Findings**:
- **NVIDIA RTX A2000**: Faster at 64K elements
- **Intel UHD Graphics**: Faster at 256K elements

**Lesson**: GPUs require sufficient workload to amortize overhead.
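
Given a table of timings, the crossover can be located with a simple scan. The sketch below is illustrative only: `Sample` and `find_breakeven` are hypothetical names, and it assumes the samples are sorted by increasing size and that the OpenCL time includes transfers.

```cpp
#include <cstddef>
#include <vector>

// One measurement at a given vector length (hypothetical layout).
struct Sample {
    std::size_t elements;  // vector length
    double serial_ms;      // serial C++ time
    double opencl_ms;      // OpenCL time, assumed to include transfers
};

// Returns the smallest measured size at which OpenCL was faster than
// serial C++, or 0 if OpenCL never won. Assumes `samples` is sorted by
// ascending `elements`.
std::size_t find_breakeven(const std::vector<Sample>& samples) {
    for (const Sample& s : samples) {
        if (s.opencl_ms < s.serial_ms) {
            return s.elements;
        }
    }
    return 0;
}
```

Because timings near the crossover tend to be noisy, averaging several runs per size before calling a function like this is probably worthwhile.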

---

### 004: Async Multi-Device
**Purpose**: Execute kernels simultaneously across multiple devices.

**Key Concepts**: Multiple contexts, asynchronous execution, cross-platform limitations

```cmd
cd examples\004_async_multidevice
build.bat
```

**Important**: Timing analysis across platforms is unreliable - each vendor uses different time references.
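
One possible workaround, not necessarily what example 004 does, is to record start and end times for each device on a single host-side monotonic clock and compare those intervals instead of vendor timestamps. In this sketch, `host_ms_since`, `Interval`, and `overlap_ms` are hypothetical helper names.

```cpp
#include <algorithm>
#include <chrono>

// Milliseconds elapsed since `epoch` on one host-side monotonic clock.
// Using the same clock for every device gives a common time reference.
double host_ms_since(std::chrono::steady_clock::time_point epoch) {
    return std::chrono::duration<double, std::milli>(
               std::chrono::steady_clock::now() - epoch)
        .count();
}

// Host-observed execution window of one device's work.
struct Interval {
    double start_ms;
    double end_ms;
};

// Length of time during which both intervals were active; 0 if disjoint.
double overlap_ms(const Interval& a, const Interval& b) {
    const double start = std::max(a.start_ms, b.start_ms);
    const double end = std::min(a.end_ms, b.end_ms);
    return std::max(0.0, end - start);
}
```

A non-zero overlap suggests the devices really ran concurrently. Host-side intervals include enqueue and synchronization latency, so they are coarser than device profiling counters.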

---

### 005: Parallelization Technologies Comparison
**Purpose**: Compare Serial C++, C++17 std::execution, OpenMP, and OpenCL for matrix-vector multiplication.

**Key Concepts**: Technology trade-offs, memory bandwidth limits

```cmd
cd examples\005_parallelization_comparison
build.bat
```

**Results** (4096×4096 matrix):
- Serial: 16.4ms
- OpenMP: 1.5ms (11x speedup) - **Winner**
- OpenCL GPU: 23.4ms (SLOWER - still memory-bound)

**Lesson**: OpenMP dominates moderate-intensity operations.
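
To illustrate why OpenMP is cheap to adopt here, a matrix-vector product can be parallelized with a single pragma. This is a minimal sketch, not the repository's code; the `matvec` name, the `double` element type, and the row-major layout are assumptions.

```cpp
#include <cstddef>
#include <vector>

// Computes y = A * x, where A is a rows x cols matrix in row-major order.
std::vector<double> matvec(const std::vector<double>& A,
                           const std::vector<double>& x,
                           int rows, int cols) {
    std::vector<double> y(rows, 0.0);
    // Each row is independent, so the outer loop parallelizes directly.
    // The loop index is a signed int because MSVC's OpenMP 2.0 support
    // requires a signed loop variable.
    #pragma omp parallel for
    for (int r = 0; r < rows; ++r) {
        double sum = 0.0;
        for (int c = 0; c < cols; ++c) {
            sum += A[static_cast<std::size_t>(r) * cols + c] * x[c];
        }
        y[r] = sum;
    }
    return y;
}
```

The pragma only takes effect when OpenMP is enabled (`/openmp` with MSVC, `-fopenmp` with GCC or Clang); without the flag the same code compiles and runs serially.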

---

### 006: Matrix Multiplication - Where GPUs Shine
**Purpose**: Demonstrate compute-intensive operations where GPUs provide massive speedups.

**Key Concepts**: O(n³) complexity, tiling optimization, GFLOPS

```cmd
cd examples\006_matrix_multiply
build.bat
```

**Results** (2048×2048 matrices):
- Serial: 16,306ms
- OpenMP: 5,242ms (3.1x)
- **NVIDIA GPU (tiled): 105ms (155x speedup)** - **Winner**

**Lesson**: Matrix multiply is the canonical GPU-suitable computation.
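
The tiling idea and the GFLOPS metric can be sketched on the CPU. The code below is a loop-tiled (blocked) multiply in plain C++, shown only to illustrate the blocking pattern; it is not the repository's OpenCL kernel, and `matmul_tiled` and `gflops` are hypothetical names.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// C = A * B for n x n row-major matrices, processed in tile x tile blocks
// so that each block of A and B is reused while it is still in cache.
void matmul_tiled(const std::vector<float>& A, const std::vector<float>& B,
                  std::vector<float>& C, std::size_t n, std::size_t tile) {
    std::fill(C.begin(), C.end(), 0.0f);
    for (std::size_t i0 = 0; i0 < n; i0 += tile) {
        for (std::size_t k0 = 0; k0 < n; k0 += tile) {
            for (std::size_t j0 = 0; j0 < n; j0 += tile) {
                // Clamp block ends so n need not be a multiple of tile.
                const std::size_t i1 = std::min(i0 + tile, n);
                const std::size_t k1 = std::min(k0 + tile, n);
                const std::size_t j1 = std::min(j0 + tile, n);
                for (std::size_t i = i0; i < i1; ++i) {
                    for (std::size_t k = k0; k < k1; ++k) {
                        const float a = A[i * n + k];
                        for (std::size_t j = j0; j < j1; ++j) {
                            C[i * n + j] += a * B[k * n + j];
                        }
                    }
                }
            }
        }
    }
}

// Throughput of an n x n multiply: 2 * n^3 floating-point operations
// (one multiply and one add per inner iteration) over the elapsed time.
double gflops(std::size_t n, double elapsed_ms) {
    const double flops = 2.0 * static_cast<double>(n) * n * n;
    return flops / (elapsed_ms * 1e-3) / 1e9;
}
```

By this formula, 105 ms at n = 2048 corresponds to roughly 164 GFLOPS, and 16,306 ms to roughly 1 GFLOPS.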

---

### 007: Image Convolution - GPU Performance Sweet Spot
**Purpose**: Show when/where/why OpenCL becomes the optimal choice for image processing.

**Key Concepts**: Arithmetic intensity, separable filters, local memory optimization

```cmd
cd examples\007_image_convolution
build.bat
```

**Results** (4096×4096 image, 15×15 kernel):
- Serial: 3,370ms
- OpenMP: 526ms (6.4x)
- **Intel CPU OpenCL (separable): 22.5ms (150x speedup)** - **Winner**
- NVIDIA GPU (separable): 28.7ms (117x)

**Lesson**: Image processing with large kernels is OpenCL's sweet spot. Separable decomposition is critical.

---

## Performance Summary Across All Examples

| Operation Type | Arithmetic Intensity | Winner | Best Speedup |
|----------------|---------------------|--------|--------------|
| Vector addition | Very low (1 op/access) | CPU (Serial) | 1x |
| Matrix-vector | Low (10 ops/access) | OpenMP | 11x |
| Matrix multiply | High (2000 ops/access) | OpenCL GPU | 155x |
| Convolution (small kernel) | Low (9 ops/access) | OpenMP | 1.4x |
| Convolution (large kernel) | Very high (225 ops/access) | OpenCL | 150x |

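**Key Insight**: In these measurements the GPU advantage grows sharply with arithmetic intensity: at low intensity the CPU wins, while at high intensity OpenCL reaches speedups of 150x or more.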

## When to Use Each Technology

**Use Serial C++ when:**
- Dataset is tiny (< 1K elements)
- Algorithm has poor parallelism
- Prototyping and validation

**Use OpenMP when:**
- Arithmetic intensity is moderate (5-50 ops per memory access)
- Quick parallelization needed
- Expect 6-12x speedup across diverse workloads

**Use OpenCL GPU when:**
- Arithmetic intensity is high (> 50 ops per memory access)
- Large datasets (> 10M elements)
- 100x+ speedup justifies development complexity

**Use OpenCL CPU when:**
- Data must stay in CPU memory
- Cache-friendly with data reuse
- Can outperform discrete GPUs for specific patterns

## Building from Source

All examples use the same build pattern:

```cmd
cd examples\<example_name>
build.bat
```

The `build.bat` script:
1. Creates a `build/` directory
2. Runs CMake to generate Visual Studio projects
3. Builds in Release configuration
4. Copies kernel files (`.cl`) to output directory
5. Runs the executable

### Manual Build (Advanced)

```cmd
mkdir build && cd build
cmake .. -G "Visual Studio 16 2019" -A x64
cmake --build . --config Release
cd Release
<executable_name>.exe
```

## Troubleshooting

### "No OpenCL platforms found"
- Install at least one OpenCL runtime (NVIDIA CUDA, Intel Graphics Driver, or Intel oneAPI)
- Run `setup\check_opencl_installed.bat` to verify

### "Failed to open kernel file"
- Ensure `.cl` files are in the same directory as the executable
- Check that `build.bat` copies kernel files correctly
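
A host-side loader that reports the failing path makes this error easier to diagnose. The following is a minimal sketch; `load_kernel_source` is a hypothetical name and the examples' own loading code may differ.

```cpp
#include <fstream>
#include <sstream>
#include <stdexcept>
#include <string>

// Reads an entire .cl file into a string.
// Relative paths resolve against the current working directory, which is
// why the kernel file needs to sit where the executable is run from.
std::string load_kernel_source(const std::string& path) {
    std::ifstream file(path, std::ios::in | std::ios::binary);
    if (!file) {
        throw std::runtime_error("Failed to open kernel file: " + path);
    }
    std::ostringstream buffer;
    buffer << file.rdbuf();
    return buffer.str();
}
```

Including the path in the message shows at a glance whether the program is looking in the build output directory or somewhere else.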

### Device enumeration hangs
- Update Intel Graphics drivers to latest version
- Remove legacy Intel OpenCL CPU Runtime if installed alongside Intel oneAPI
- See: https://github.com/intel/compute-runtime

### CMake not found
- Install Visual Studio 2019 Build Tools with "Desktop development with C++"
- Or install standalone CMake from https://cmake.org

## Performance Tips

1. **Memory-bound operations** (like vector addition) don't benefit much from GPUs due to transfer overhead
2. **Compute-intensive operations** (matrix multiplication, image processing) show significant GPU speedup
3. **Breakeven point** varies by hardware - test with realistic data sizes
4. **Multi-device execution** works best when devices have independent work chunks

## Learning Path

Recommended order:
1. **000_device_enumeration** - Understand your hardware
2. **001_hello_opencl** - Learn basic OpenCL workflow
3. **002_vector_addition** - See why simple operations are slow
4. **003_breakeven_analysis** - Find when GPU acceleration helps
5. **004_async_multidevice** - Advanced: use all devices simultaneously

## Common Issues & Solutions

### Issue: OpenCL hangs during platform enumeration
**Cause**: Conflicting Intel OpenCL runtimes
**Solution**: Uninstall legacy "Intel OpenCL CPU Runtime 16.x", keep only Intel oneAPI

### Issue: Build errors about `std::to_string`
**Cause**: Missing `<string>` header
**Solution**: Add `#include <string>` at top of file

### Issue: "Cannot create context with devices from multiple platforms"
**Cause**: Trying to use devices from NVIDIA + Intel in single context
**Solution**: Create separate contexts per platform (see example 004)

## Resources

- **OpenCL Programming Guide**: https://www.khronos.org/opencl/
- **NVIDIA OpenCL Best Practices**: https://docs.nvidia.com/cuda/opencl-best-practices-guide/
- **Intel OpenCL Documentation**: https://www.intel.com/content/www/us/en/developer/tools/opencl-sdk/overview.html

## License

MIT License - See LICENSE file for details

## Contributing

Contributions welcome! Please ensure:
- Code follows existing style
- Examples build cleanly on Windows + MSVC 2019
- Include README.md for new examples
- Test on at least one GPU platform

## Acknowledgments

Examples developed and tested on Windows 10 with:
- Visual Studio 2019 Build Tools
- NVIDIA CUDA Toolkit 12.9
- Intel oneAPI 2025.1
