{"id":17278321,"url":"https://github.com/projectphysx/opencl-wrapper","last_synced_at":"2025-05-16T09:04:41.642Z","repository":{"id":53265506,"uuid":"450825784","full_name":"ProjectPhysX/OpenCL-Wrapper","owner":"ProjectPhysX","description":"OpenCL is the most powerful programming language ever created. Yet the OpenCL C++ bindings are cumbersome and the code overhead prevents many people from getting started. I created this lightweight OpenCL-Wrapper to greatly simplify OpenCL software development with C++ while keeping functionality and performance.","archived":false,"fork":false,"pushed_at":"2025-04-18T16:04:25.000Z","size":307,"stargazers_count":390,"open_issues_count":5,"forks_count":40,"subscribers_count":9,"default_branch":"master","last_synced_at":"2025-04-19T05:17:25.065Z","etag":null,"topics":["gpgpu","gpgpu-computing","gpu","gpu-acceleration","gpu-computing","gpu-programming","opencl","vector-processor","vectorization"],"latest_commit_sha":null,"homepage":"","language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ProjectPhysX.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.md","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2022-01-22T13:30:44.000Z","updated_at":"2025-04-19T01:41:01.000Z","dependencies_parsed_at":"2024-05-08T18:00:41.783Z","dependency_job_id":"952bd24f-a8fb-480f-8418-083866fb0871","html_url":"https://github.com/ProjectPhysX/OpenCL-Wrapper","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ProjectPhysX%2FOpe
nCL-Wrapper","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ProjectPhysX%2FOpenCL-Wrapper/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ProjectPhysX%2FOpenCL-Wrapper/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ProjectPhysX%2FOpenCL-Wrapper/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ProjectPhysX","download_url":"https://codeload.github.com/ProjectPhysX/OpenCL-Wrapper/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254501557,"owners_count":22081528,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["gpgpu","gpgpu-computing","gpu","gpu-acceleration","gpu-computing","gpu-programming","opencl","vector-processor","vectorization"],"created_at":"2024-10-15T09:11:24.832Z","updated_at":"2025-05-16T09:04:41.624Z","avatar_url":"https://github.com/ProjectPhysX.png","language":"C++","readme":"# OpenCL-Wrapper\nOpenCL is the most powerful programming language ever created. 
Yet the OpenCL C++ bindings are cumbersome and the code overhead prevents many people from getting started.\nI created this lightweight OpenCL-Wrapper to greatly simplify OpenCL software development with C++ while keeping functionality and performance.\n\nWorks in Windows, Linux and Android with C++17.\n\nUse-case example: [FluidX3D](https://github.com/ProjectPhysX/FluidX3D) builds entirely on top of this OpenCL-Wrapper.\n\n## Getting started:\n\u003cdetails\u003e\u003csummary\u003eInstall GPU Drivers and OpenCL Runtime (click to expand section)\u003c/summary\u003e\n\n- **Windows**\n  \u003cdetails\u003e\u003csummary\u003eGPUs\u003c/summary\u003e\n\n  - Download and install the [AMD](https://www.amd.com/en/support/download/drivers.html)/[Intel](https://www.intel.com/content/www/us/en/download/785597/intel-arc-iris-xe-graphics-windows.html)/[Nvidia](https://www.nvidia.com/Download/index.aspx) GPU Drivers, which contain the OpenCL Runtime.\n  - Reboot.\n\n  \u003c/details\u003e\n  \u003cdetails\u003e\u003csummary\u003eCPUs\u003c/summary\u003e\n\n  - Download and install the [Intel CPU Runtime for OpenCL](https://www.intel.com/content/www/us/en/developer/articles/technical/intel-cpu-runtime-for-opencl-applications-with-sycl-support.html) (works for both AMD/Intel CPUs).\n  - Reboot.\n\n  \u003c/details\u003e\n- **Linux**\n  \u003cdetails\u003e\u003csummary\u003eAMD GPUs\u003c/summary\u003e\n\n  - Download and install [AMD GPU Drivers](https://www.amd.com/en/support/download/linux-drivers.html), which contain the OpenCL Runtime, with:\n    ```bash\n    sudo apt update \u0026\u0026 sudo apt upgrade -y\n    sudo apt install -y g++ git make ocl-icd-libopencl1 ocl-icd-opencl-dev\n    mkdir -p ~/amdgpu\n    wget -P ~/amdgpu https://repo.radeon.com/amdgpu-install/6.3.4/ubuntu/noble/amdgpu-install_6.3.60304-1_all.deb\n    sudo apt install -y ~/amdgpu/amdgpu-install*.deb\n    sudo amdgpu-install -y --usecase=graphics,rocm,opencl --opencl=rocr\n    sudo usermod -a -G 
render,video $(whoami)\n    rm -r ~/amdgpu\n    sudo shutdown -r now\n    ```\n\n  \u003c/details\u003e\n  \u003cdetails\u003e\u003csummary\u003eIntel GPUs\u003c/summary\u003e\n\n  - Intel GPU Drivers come preinstalled since Linux kernel 6.2, but they don't contain the OpenCL Runtime.\n  - The [OpenCL Runtime](https://github.com/intel/compute-runtime/releases) has to be installed separately with:\n    ```bash\n    sudo apt update \u0026\u0026 sudo apt upgrade -y\n    sudo apt install -y g++ git make ocl-icd-libopencl1 ocl-icd-opencl-dev intel-opencl-icd\n    sudo usermod -a -G render $(whoami)\n    sudo shutdown -r now\n    ```\n\n  \u003c/details\u003e\n  \u003cdetails\u003e\u003csummary\u003eNvidia GPUs\u003c/summary\u003e\n\n  - Download and install [Nvidia GPU Drivers](https://www.nvidia.com/Download/index.aspx), which contain the OpenCL Runtime, with:\n    ```bash\n    sudo apt update \u0026\u0026 sudo apt upgrade -y\n    sudo apt install -y g++ git make ocl-icd-libopencl1 ocl-icd-opencl-dev nvidia-driver-570\n    sudo shutdown -r now\n    ```\n\n  \u003c/details\u003e\n  \u003cdetails\u003e\u003csummary\u003eCPUs\u003c/summary\u003e\n\n  - Option 1: Download and install the [oneAPI DPC++ Compiler](https://github.com/intel/llvm/releases?q=%22oneAPI+DPC%2B%2B+Compiler+dependencies%22) and [oneTBB](https://github.com/uxlfoundation/oneTBB/releases) with:\n    ```bash\n    export OCLV=\"oclcpuexp-2025.19.3.0.17_230222_rel\"\n    export TBBV=\"oneapi-tbb-2022.1.0\"\n    sudo apt update \u0026\u0026 sudo apt upgrade -y\n    sudo apt install -y g++ git make ocl-icd-libopencl1 ocl-icd-opencl-dev\n    sudo mkdir -p ~/cpurt /opt/intel/${OCLV} /etc/OpenCL/vendors /etc/ld.so.conf.d\n    sudo wget -P ~/cpurt https://github.com/intel/llvm/releases/download/2025-WW13/${OCLV}.tar.gz\n    sudo wget -P ~/cpurt https://github.com/uxlfoundation/oneTBB/releases/download/v2022.1.0/${TBBV}-lin.tgz\n    sudo tar -zxvf ~/cpurt/${OCLV}.tar.gz -C /opt/intel/${OCLV}\n    sudo 
tar -zxvf ~/cpurt/${TBBV}-lin.tgz -C /opt/intel\n    echo /opt/intel/${OCLV}/x64/libintelocl.so | sudo tee /etc/OpenCL/vendors/intel_expcpu.icd\n    echo /opt/intel/${OCLV}/x64 | sudo tee /etc/ld.so.conf.d/libintelopenclexp.conf\n    sudo ln -sf /opt/intel/${TBBV}/lib/intel64/gcc4.8/libtbb.so /opt/intel/${OCLV}/x64\n    sudo ln -sf /opt/intel/${TBBV}/lib/intel64/gcc4.8/libtbbmalloc.so /opt/intel/${OCLV}/x64\n    sudo ln -sf /opt/intel/${TBBV}/lib/intel64/gcc4.8/libtbb.so.12 /opt/intel/${OCLV}/x64\n    sudo ln -sf /opt/intel/${TBBV}/lib/intel64/gcc4.8/libtbbmalloc.so.2 /opt/intel/${OCLV}/x64\n    sudo ldconfig -f /etc/ld.so.conf.d/libintelopenclexp.conf\n    sudo rm -r ~/cpurt\n    ```\n  - Option 2: Download and install [PoCL](https://portablecl.org/) with:\n    ```bash\n    sudo apt update \u0026\u0026 sudo apt upgrade -y\n    sudo apt install -y g++ git make ocl-icd-libopencl1 ocl-icd-opencl-dev pocl-opencl-icd\n    ```\n  \u003c/details\u003e\n\n- **Android**\n  \u003cdetails\u003e\u003csummary\u003eARM GPUs\u003c/summary\u003e\n\n  - Download the [Termux `.apk`](https://github.com/termux/termux-app/releases) and install it.\n  - In the Termux app, run:\n    ```bash\n    apt update \u0026\u0026 apt upgrade -y\n    apt install -y clang git make\n    ```\n\n  \u003c/details\u003e\n\n\u003c/details\u003e\n\n\u0026#9656;[Download](https://github.com/ProjectPhysX/OpenCL-Wrapper/archive/refs/heads/master.zip)+unzip the source code or `git clone https://github.com/ProjectPhysX/OpenCL-Wrapper.git`\n\n\u003cdetails\u003e\u003csummary\u003eCompiling on Windows (click to expand section)\u003c/summary\u003e\n\n- Download and install [Visual Studio Community](https://visualstudio.microsoft.com/de/vs/community/). 
In Visual Studio Installer, add:\n  - Desktop development with C++\n  - MSVC v142\n  - Windows 10 SDK\n- Open [`OpenCL-Wrapper.sln`](OpenCL-Wrapper.sln) in [Visual Studio Community](https://visualstudio.microsoft.com/de/vs/community/).\n- Compile and run by clicking the \u003ckbd\u003e► Local Windows Debugger\u003c/kbd\u003e button.\n\n\u003c/details\u003e\n\u003cdetails\u003e\u003csummary\u003eCompiling on Linux / macOS / Android (click to expand section)\u003c/summary\u003e\n\n- Compile and run with:\n  ```bash\n  chmod +x make.sh\n  ./make.sh\n  ```\n- Compiling requires [`g++`](https://gcc.gnu.org/) with `C++17`, which is supported since version `8` (check with `g++ --version`).\n- Operating system (Linux/macOS/Android) is detected automatically.\n\n\u003c/details\u003e\n\n## Key simplifications:\n1. select a `Device` with 1 line\n   - automatically select fastest device / device with most memory / device with specified ID from a list of all devices\n   - easily get device information (performance in TFLOPs/s, amount of memory and cache, FP64/FP16 capabilities, etc.)\n   - automatic OpenCL C code compilation when creating the Device object\n     - automatically enable FP64/FP16 capabilities in OpenCL C code\n     - automatically print log to console if there are compile errors\n     - easy option to generate PTX assembly for Nvidia GPUs and save that in a `.ptx` file\n   - \u003cdetails\u003e\u003csummary\u003econtains all device-specific workarounds/patches to make OpenCL fully cross-compatible\u003c/summary\u003e\n\n     - AMD\n       - fix for wrong device name reporting on AMD GPUs\n       - fix for maximum buffer allocation size limit for AMD GPUs\n     - Intel\n       - enable \u003e4GB single buffer VRAM allocations on Intel Arc GPUs\n       - fix for wrong VRAM capacity reporting on Intel Arc GPUs\n       - fix for maximum buffer allocation size limit in Intel CPU Runtime for OpenCL\n       - fix for false dp4a reporting on Intel\n     - Nvidia\n       
- enable basic FP16 support on Nvidia Pascal and newer GPUs with driver 520 or newer\n     - other\n       - enable FP64, FP16 and INT64 atomics support on supported devices\n       - fix for unreliable OpenCL C version reporting\n       - always compile for latest supported OpenCL C standard\n       - fix for terrible `fma` performance on ARM GPUs\n\n     \u003c/details\u003e\n2. create a `Memory` object with 1 line\n   - one object for both host and device memory\n   - easy host \u003c-\u003e device memory transfer (also for 1D/2D/3D grid domains)\n   - easy handling of multi-dimensional vectors\n   - can also be used to only allocate memory on host or only allocate memory on device\n   - automatically tracks total global memory usage of device when allocating/deleting memory\n   - automatically uses zero-copy buffers on CPUs/iGPUs\n3. create a `Kernel` with 1 line\n   - Memory objects and constants are linked to OpenCL C kernel parameters during Kernel creation\n   - a list of Memory objects and constants can be added to Kernel parameters in one line (`add_parameters(...)`)\n   - Kernel parameters can be edited (`set_parameters(...)`)\n   - easy Kernel execution: `kernel.run();`\n   - Kernel function calls can be daisy chained, for example: `kernel.set_parameters(3u, time).run();`\n   - failsafe: you'll get an error message if kernel parameters mismatch between C++ and OpenCL code\n4. OpenCL C code is embedded into C++\n   - syntax highlighting in the code editor is retained\n   - notes / peculiarities of this workaround:\n     - the `#define R(...) 
string(\" \"#__VA_ARGS__\" \")` stringification macro converts its arguments to string literals; `'\\n'` is converted to `' '` in the process\n     - these string literals cannot be arbitrarily long, so interrupt them periodically with `)+R(`\n     - to use unbalanced round brackets `'('`/`')'`, exit the `R(...)` macro and insert a string literal manually: `)+\"void function(\"+R(` and `)+\") {\"+R(`\n     - to use preprocessor switch macros, exit the `R(...)` macro and insert a string literal manually: `)+\"#define TEST\"+R(` and `)+\"#endif\"+R( // TEST`\n     - preprocessor replacement macros (for example `#define VARIABLE 42`) don't work; hand these to the `Device` constructor directly instead\n\n## No need to:\n- have code overhead for selecting a platform/device, passing the OpenCL C code, etc.\n- keep track of length and data type for buffers\n- have duplicate code for host and device buffers\n- keep track of total global memory usage\n- keep track of global/local range for kernels\n- bother with Queue, Context, Source, Program\n- load a `.cl` file at runtime\n- bother with device-specific workarounds/patches\n\n## Example (OpenCL vector addition)\n### main.cpp\n```c\n#include \"opencl.hpp\"\n\nint main() {\n\tDevice device(select_device_with_most_flops()); // compile OpenCL C code for the fastest available device\n\n\tconst uint N = 1024u; // size of vectors\n\tMemory\u003cfloat\u003e A(device, N); // allocate memory on both host and device\n\tMemory\u003cfloat\u003e B(device, N);\n\tMemory\u003cfloat\u003e C(device, N);\n\n\tKernel add_kernel(device, N, \"add_kernel\", A, B, C); // kernel that runs on the device\n\n\tfor(uint n=0u; n\u003cN; n++) {\n\t\tA[n] = 3.0f; // initialize memory\n\t\tB[n] = 2.0f;\n\t\tC[n] = 1.0f;\n\t}\n\n\tprint_info(\"Value before kernel execution: C[0] = \"+to_string(C[0]));\n\n\tA.write_to_device(); // copy data from host memory to device memory\n\tB.write_to_device();\n\tadd_kernel.run(); // run add_kernel on the 
device\n\tC.read_from_device(); // copy data from device memory to host memory\n\n\tprint_info(\"Value after kernel execution: C[0] = \"+to_string(C[0]));\n\n\twait();\n\treturn 0;\n}\n```\n\n### kernel.cpp\n```c\n#include \"kernel.hpp\" // note: unbalanced round brackets () are not allowed and string literals can't be arbitrarily long, so periodically interrupt with )+R(\nstring opencl_c_container() { return R( // ########################## begin of OpenCL C code ####################################################################\n\n\n\nkernel void add_kernel(global float* A, global float* B, global float* C) { // equivalent to \"for(uint n=0u; n\u003cN; n++) {\", but executed in parallel\n\tconst uint n = get_global_id(0);\n\tC[n] = A[n]+B[n];\n}\n\n\n\n);} // ############################################################### end of OpenCL C code #####################################################################\n```\n\n### For comparison, the very same OpenCL vector addition example looks like this when directly using the OpenCL C++ bindings:\n```c\n#define CL_HPP_MINIMUM_OPENCL_VERSION 100\n#define CL_HPP_TARGET_OPENCL_VERSION 300\n#include \u003cCL/opencl.hpp\u003e\n#include \"utilities.hpp\"\n\n#define WORKGROUP_SIZE 64\n\nint main() {\n\n\t// 1. 
select device\n\n\tvector\u003ccl::Device\u003e cl_devices; // get all devices of all platforms\n\t{\n\t\tvector\u003ccl::Platform\u003e cl_platforms; // get all platforms (drivers)\n\t\tcl::Platform::get(\u0026cl_platforms);\n\t\tfor(uint i=0u; i\u003c(uint)cl_platforms.size(); i++) {\n\t\t\tvector\u003ccl::Device\u003e cl_devices_available;\n\t\t\tcl_platforms[i].getDevices(CL_DEVICE_TYPE_ALL, \u0026cl_devices_available);\n\t\t\tfor(uint j=0u; j\u003c(uint)cl_devices_available.size(); j++) {\n\t\t\t\tcl_devices.push_back(cl_devices_available[j]);\n\t\t\t}\n\t\t}\n\t}\n\tcl::Device cl_device; // select fastest available device\n\t{\n\t\tfloat best_value = 0.0f;\n\t\tuint best_i = 0u; // index of fastest device\n\t\tfor(uint i=0u; i\u003c(uint)cl_devices.size(); i++) { // find device with highest (estimated) floating point performance\n\t\t\tconst string name = trim(cl_devices[i].getInfo\u003cCL_DEVICE_NAME\u003e()); // device name\n\t\t\tconst string vendor = trim(cl_devices[i].getInfo\u003cCL_DEVICE_VENDOR\u003e()); // device vendor\n\t\t\tconst uint compute_units = (uint)cl_devices[i].getInfo\u003cCL_DEVICE_MAX_COMPUTE_UNITS\u003e(); // compute units (CUs) can contain multiple cores depending on the microarchitecture\n\t\t\tconst uint clock_frequency = (uint)cl_devices[i].getInfo\u003cCL_DEVICE_MAX_CLOCK_FREQUENCY\u003e(); // in MHz\n\t\t\tconst bool is_gpu = cl_devices[i].getInfo\u003cCL_DEVICE_TYPE\u003e()==CL_DEVICE_TYPE_GPU;\n\t\t\tconst int vendor_id = (int)cl_devices[i].getInfo\u003cCL_DEVICE_VENDOR_ID\u003e(); // AMD=0x1002, Intel=0x8086, Nvidia=0x10DE, Apple=0x1027F00\n\t\t\tfloat cores_per_cu = 1.0f;\n\t\t\tif(vendor_id==0x1002) { // AMD GPU/CPU\n\t\t\t\tconst bool amd_128_cores_per_dualcu = contains(to_lower(name), \"gfx10\"); // identify RDNA/RDNA2 GPUs where dual CUs are reported\n\t\t\t\tconst bool amd_256_cores_per_dualcu = contains(to_lower(name), \"gfx11\"); // identify RDNA3 GPUs where dual CUs are reported\n\t\t\t\tcores_per_cu = is_gpu ? (amd_256_cores_per_dualcu ? 256.0f : amd_128_cores_per_dualcu ? 128.0f : 64.0f) : 0.5f; // 64 cores/CU (GCN, CDNA), 128 cores/dualCU (RDNA, RDNA2), 256 cores/dualCU (RDNA3), 1/2 core/CU (CPUs)\n\t\t\t} else if(vendor_id==0x8086) { // Intel GPU/CPU\n\t\t\t\tconst bool intel_16_cores_per_cu = contains_any(to_lower(name), {\"gpu max\", \"140v\", \"130v\", \"b580\", \"b570\"}); // identify PVC/Xe2 GPUs\n\t\t\t\tcores_per_cu = is_gpu ? (intel_16_cores_per_cu ? 16.0f : 8.0f) : 0.5f; // Intel GPUs have 16 cores/CU (PVC) or 8 cores/CU (integrated/Arc), Intel CPUs (with HT) have 1/2 core/CU\n\t\t\t} else if(vendor_id==0x10DE||vendor_id==0x13B5) { // Nvidia GPU/CPU\n\t\t\t\tconst uint nvidia_compute_capability = 10u*(uint)cl_devices[i].getInfo\u003cCL_DEVICE_COMPUTE_CAPABILITY_MAJOR_NV\u003e()+(uint)cl_devices[i].getInfo\u003cCL_DEVICE_COMPUTE_CAPABILITY_MINOR_NV\u003e();\n\t\t\t\tconst bool nvidia__32_cores_per_cu = (nvidia_compute_capability\u003c30); // identify Fermi GPUs\n\t\t\t\tconst bool nvidia_192_cores_per_cu = (nvidia_compute_capability\u003e=30\u0026\u0026nvidia_compute_capability\u003c50); // identify Kepler GPUs\n\t\t\t\tconst bool nvidia__64_cores_per_cu = (nvidia_compute_capability\u003e=70\u0026\u0026nvidia_compute_capability\u003c80)||contains_any(to_lower(name), {\"p100\", \"a100\", \"a30\"}); // identify Volta, Turing, P100, A100, A30\n\t\t\t\tcores_per_cu = is_gpu ? (nvidia__32_cores_per_cu ? 32.0f : nvidia_192_cores_per_cu ? 192.0f : nvidia__64_cores_per_cu ? 
64.0f : 128.0f) : 1.0f; // 32 (Fermi), 192 (Kepler), 64 (Volta, Turing, P100, A100, A30), 128 (Maxwell, Pascal, Ampere, Hopper, Ada, Blackwell) or 1 (CPUs)\n\t\t\t} else if(vendor_id==0x1027F00) { // Apple iGPU\n\t\t\t\tcores_per_cu = 128.0f; // Apple ARM GPUs usually have 128 cores/CU\n\t\t\t} else if(vendor_id==0x1022||vendor_id==0x10006||vendor_id==0x6C636F70) { // x86 CPUs with PoCL runtime\n\t\t\t\tcores_per_cu = 0.5f; // CPUs typically have 1/2 cores/CU due to SMT/hyperthreading\n\t\t\t} else if(contains(to_lower(vendor), \"arm\")) { // ARM\n\t\t\t\tcores_per_cu = is_gpu ? 8.0f : 1.0f; // ARM GPUs usually have 8 cores/CU, ARM CPUs have 1 core/CU\n\t\t\t}\n\t\t\tconst uint ipc = is_gpu ? 2u : 32u; // IPC (instructions per cycle) is 2 for GPUs and 32 for most modern CPUs\n\t\t\tconst uint cores = to_uint((float)compute_units*cores_per_cu); // for CPUs, compute_units is the number of threads (twice the number of cores with hyperthreading)\n\t\t\tconst float tflops = 1E-6f*(float)cores*(float)ipc*(float)clock_frequency; // estimated device floating point performance in TeraFLOPs/s\n\t\t\tif(tflops\u003ebest_value) {\n\t\t\t\tbest_value = tflops;\n\t\t\t\tbest_i = i;\n\t\t\t}\n\t\t}\n\t\tconst string name = trim(cl_devices[best_i].getInfo\u003cCL_DEVICE_NAME\u003e()); // device name\n\t\tcl_device = cl_devices[best_i];\n\t\tprint_info(name); // print device name\n\t}\n\n\t// 2. embed OpenCL C code (raw string literal breaks syntax highlighting)\n\n\tstring opencl_c_code = R\"(\n\t\tkernel void add_kernel(global float* A, global float* B, global float* C) { // equivalent to \"for(uint n=0u; n\u003cN; n++) {\", but executed in parallel\n\t\t\tconst uint n = get_global_id(0);\n\t\t\tC[n] = A[n]+B[n];\n\t\t}\n\t)\";\n\n\t// 3. 
compile OpenCL C code\n\n\tcl::Context cl_context;\n\tcl::Program cl_program;\n\tcl::CommandQueue cl_queue;\n\t{\n\t\tcl_context = cl::Context(cl_device);\n\t\tcl_queue = cl::CommandQueue(cl_context, cl_device); // queue to push commands for the device\n\t\tcl::Program::Sources cl_source;\n\t\tcl_source.push_back({ opencl_c_code.c_str(), opencl_c_code.length() });\n\t\tcl_program = cl::Program(cl_context, cl_source);\n\t\tint error = cl_program.build({ cl_device }, \"-cl-finite-math-only -cl-no-signed-zeros -cl-mad-enable -w\"); // compile OpenCL C code, disable warnings\n\t\tif(error) print_warning(cl_program.getBuildInfo\u003cCL_PROGRAM_BUILD_LOG\u003e(cl_device)); // print build log\n\t\tif(error) print_error(\"OpenCL C code compilation failed.\");\n\t\telse print_info(\"OpenCL C code successfully compiled.\");\n\t}\n\n\t// 4. allocate memory on host and device\n\n\tconst uint N = 1024u;\n\tfloat* host_A;\n\tfloat* host_B;\n\tfloat* host_C;\n\tcl::Buffer device_A;\n\tcl::Buffer device_B;\n\tcl::Buffer device_C;\n\t{\n\t\thost_A = new float[N];\n\t\thost_B = new float[N];\n\t\thost_C = new float[N];\n\t\tfor(uint i=0u; i\u003cN; i++) {\n\t\t\thost_A[i] = 0.0f; // zero all buffers\n\t\t\thost_B[i] = 0.0f;\n\t\t\thost_C[i] = 0.0f;\n\t\t}\n\t\tint error = 0;\n\t\tdevice_A = cl::Buffer(cl_context, CL_MEM_READ_WRITE, N*sizeof(float), nullptr, \u0026error);\n\t\tif(error) print_error(\"OpenCL Buffer allocation failed with error code \"+to_string(error)+\".\");\n\t\tdevice_B = cl::Buffer(cl_context, CL_MEM_READ_WRITE, N*sizeof(float), nullptr, \u0026error);\n\t\tif(error) print_error(\"OpenCL Buffer allocation failed with error code \"+to_string(error)+\".\");\n\t\tdevice_C = cl::Buffer(cl_context, CL_MEM_READ_WRITE, N*sizeof(float), nullptr, \u0026error);\n\t\tif(error) print_error(\"OpenCL Buffer allocation failed with error code \"+to_string(error)+\".\");\n\t\tcl_queue.enqueueWriteBuffer(device_A, true, 0u, 
N*sizeof(float), (void*)host_A); // have to keep track of buffer range and buffer data type\n\t\tcl_queue.enqueueWriteBuffer(device_B, true, 0u, N*sizeof(float), (void*)host_B);\n\t\tcl_queue.enqueueWriteBuffer(device_C, true, 0u, N*sizeof(float), (void*)host_C);\n\t}\n\n\t// 5. create Kernel object and link input parameters\n\n\tcl::NDRange cl_range_global, cl_range_local;\n\tcl::Kernel cl_kernel;\n\t{\n\t\tcl_kernel = cl::Kernel(cl_program, \"add_kernel\");\n\t\tcl_kernel.setArg(0, device_A);\n\t\tcl_kernel.setArg(1, device_B);\n\t\tcl_kernel.setArg(2, device_C);\n\t\tcl_range_local = cl::NDRange(WORKGROUP_SIZE);\n\t\tcl_range_global = cl::NDRange(((N+WORKGROUP_SIZE-1)/WORKGROUP_SIZE)*WORKGROUP_SIZE); // make global range a multiple of local range\n\t}\n\n\t// 6. finally run the actual program\n\n\t{\n\t\tfor(uint i=0u; i\u003cN; i++) {\n\t\t\thost_A[i] = 3.0f; // initialize buffers on host\n\t\t\thost_B[i] = 2.0f;\n\t\t\thost_C[i] = 1.0f;\n\t\t}\n\n\t\tprint_info(\"Value before kernel execution: C[0] = \"+to_string(host_C[0]));\n\n\t\tcl_queue.enqueueWriteBuffer(device_A, true, 0u, N*sizeof(float), (void*)host_A); // copy A and B to device\n\t\tcl_queue.enqueueWriteBuffer(device_B, true, 0u, N*sizeof(float), (void*)host_B); // have to keep track of buffer range and buffer data type\n\t\tcl_queue.enqueueNDRangeKernel(cl_kernel, cl::NullRange, cl_range_global, cl_range_local); // have to keep track of kernel ranges\n\t\tcl_queue.finish(); // don't forget to finish the queue\n\t\tcl_queue.enqueueReadBuffer(device_C, true, 0u, N*sizeof(float), (void*)host_C);\n\n\t\tprint_info(\"Value after kernel execution: C[0] = \"+to_string(host_C[0]));\n\t}\n\n\twait();\n\treturn 
0;\n}\n```","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fprojectphysx%2Fopencl-wrapper","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fprojectphysx%2Fopencl-wrapper","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fprojectphysx%2Fopencl-wrapper/lists"}