{"id":40864870,"url":"https://github.com/eyalroz/gpu-kernel-runner","last_synced_at":"2026-01-22T00:16:39.223Z","repository":{"id":41511749,"uuid":"485848035","full_name":"eyalroz/gpu-kernel-runner","owner":"eyalroz","description":"Runs a single CUDA/OpenCL kernel, taking its source from a file and arguments from the command-line","archived":false,"fork":false,"pushed_at":"2025-11-25T12:24:17.000Z","size":561,"stargazers_count":25,"open_issues_count":17,"forks_count":5,"subscribers_count":3,"default_branch":"main","last_synced_at":"2025-11-28T17:58:42.449Z","etag":null,"topics":["cuda","debugging-tool","gpgpu","gpu","gpu-kernel-performance","gpu-kernels","multi-language","opencl","performance-analysis","performance-testing","profiling","runner"],"latest_commit_sha":null,"homepage":"","language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"bsd-3-clause","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/eyalroz.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2022-04-26T15:38:09.000Z","updated_at":"2025-11-25T12:24:21.000Z","dependencies_parsed_at":"2023-02-17T21:31:13.807Z","dependency_job_id":"35f027fa-c020-40cc-ad86-ddd73001dd38","html_url":"https://github.com/eyalroz/gpu-kernel-runner","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/eyalroz/gpu-kernel-runner","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/eyalroz%2Fgpu-kernel-runner","tags_url":"https://repos.ecosyste.ms/api/v1/ho
sts/GitHub/repositories/eyalroz%2Fgpu-kernel-runner/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/eyalroz%2Fgpu-kernel-runner/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/eyalroz%2Fgpu-kernel-runner/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/eyalroz","download_url":"https://codeload.github.com/eyalroz/gpu-kernel-runner/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/eyalroz%2Fgpu-kernel-runner/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28647922,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-21T21:29:11.980Z","status":"ssl_error","status_checked_at":"2026-01-21T21:24:31.872Z","response_time":86,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cuda","debugging-tool","gpgpu","gpu","gpu-kernel-performance","gpu-kernels","multi-language","opencl","performance-analysis","performance-testing","profiling","runner"],"created_at":"2026-01-22T00:16:39.120Z","updated_at":"2026-01-22T00:16:39.211Z","avatar_url":"https://github.com/eyalroz.png","language":"C++","readme":"# GPU Kernel Runner\n\nA harness for stand-alone execution of single GPU kernels, for timing, debugging and profiling.\n\n\u003cbr\u003e\n\n| Table of 
contents|\n|:----------------|\n| \u003csub\u003e[Example: Executing a simple kernel to get its output](#example) \u003cbr\u003e [Motivation](#motivation)\u003cbr\u003e[Command-line interface](#cmdline) \u003cbr\u003e[How do I get the runner to run my own kernels?](#running-your-own-kernel) \u003cbr\u003e [Feedback, bugs, questions etc.](#feedback) \u003c/sub\u003e|\n\n----\n\n## \u003ca name=\"example\"\u003eExample: Executing a simple kernel to get its output\u003c/a\u003e\n\nConsider the following kernel (bundled with this repository):\n\n```\n__global__ void vectorAdd(\n        unsigned char       * __restrict  C,\n        unsigned char const * __restrict  A,\n        unsigned char const * __restrict  B,\n        size_t length)\n{\n    size_t i = blockDim.x * blockIdx.x + threadIdx.x;\n    if (i \u003c length) {\n        C[i] = A[i] + B[i] + A_LITTLE_EXTRA;\n    }\n}\n```\nand suppose that you've also created two files:\n\n* `input_a`, containing the three characters `abc`;\n* `input_b`, containing three octets, each with the value `03`.\n\nNow, if you run:\n```\nkernel-runner \\\n    --execution-ecosystem cuda \\\n    --kernel-key bundled_with_runner/vector_add \\\n    --kernel-source vector_add.cu \\\n    --block-dimensions 256,1,1 \\\n    --grid-dimensions 1,1,1 \\\n    --arg A=input_a --arg length=3 --arg B=input_b \\\n    --arg-size C=3 \\\n    -DA_LITTLE_EXTRA=2\n```\nthen you'll get a file named `C.out`, containing `fgh`... which is indeed the correct output of the kernel: the sequence `abc`, plus 3 for each character due to the values in `input_b`, plus 2 for each character from the preprocessor definition of `A_LITTLE_EXTRA`. 
\n\nYou can do the same with an equivalent OpenCL kernel, also bundled with this repository; just specify `opencl` instead of `cuda` as the execution ecosystem, and use the `vector_add.cl` kernel source file.\n\nThere is a bit of \"cheating\" here: The kernel runner doesn't magically parse your kernel source to determine what arguments are required. You need to have added some boilerplate code for your kernel into the runner: listing the kernel name, parameter names, whether they're input or output, etc.\n\n## \u003ca name=\"motivation\"\u003eMotivation\u003c/a\u003e\n\nWhen we develop GPU kernels, or try to optimize existing ones, they are often intended to run in the middle of a large application:\n\n* A lot of work (and time) is expended before our kernel of interest gets run\n* The kernel is fed inputs - scalars and buffers - which are created as intermediate data of the larger program, and are neither saved to disk nor printed to logs.\n* ... alternatively, the kernel may be invoked so many times that it would not make sense to save or print all of that data.\n* The kernel may be compiled dynamically, and the compilation parameters may also not be saved for later scrutiny.\n\nThis makes isolating our kernel, and invoking it repeatedly and consistently, quite unwieldy - if not outright impossible. If we want to avoid this repeated frustration, we sometimes find ourselves writing a small separate program which only runs our kernel; which is a decent enough idea, except that you have to rewrite this program again and again and again, for each and every kernel.\n\nThis repository is intended to take all that hassle away: It contains the machinery you need for a small program which will run _any_ kernel - CUDA or OpenCL - independently and with minimal overhead. 
You just need to provide it with the kernel's direct inputs and outputs (scalars or buffers); launch grid parameters; and dynamic compilation options for JIT'ing.\n\n## \u003ca name=\"cmdline\"\u003eCommand-line interface\u003c/a\u003e\n\nThe kernel runner executable supports the following command-line options:\n```\nUsage:\n  kernel-runner [OPTION...]\n\n  -l, --log arg                 Set logging level (default: warning)\n      --log-flush-threshold arg\n                                Set the threshold level at and above which\n                                the log is flushed on each message\n                                (default: info)\n  -w, --save-output             Write output buffers to files (default:\n                                true)\n  -n, --repetitions arg         Number of times to run the compiled kernel\n                                (default: 1)\n  -e, --execution-ecosystem arg\n      --opencl                  Use OpenCL\n      --cuda                    Use CUDA\n                                Execution ecosystem (CUDA or Opencl)\n  -p, --platform-id arg         Use the OpenCL platform with the specified\n                                index\n  -a, --argument arg            Set one of the kernel's argument, keyed by\n                                name, with a serialized value for a scalar\n                                (e.g. foo=123) or a path to the contents of\n                                a buffer (e.g. bar=/path/to/data.bin)\n  -A, --no-default-compilation-options\n                                Avoid setting any compilation options not\n                                explicitly requested by the user\n      --output-buffer-size arg  Set one of the output buffers' sizes, keyed\n                                by name, in bytes (e.g. 
myresult=1048576)\n  -d, --device arg              Device index (default: 0)\n  -D, --define arg              Set a preprocessor definition for NVRTC\n                                (can be used repeatedly; specify either\n                                DEFINITION or DEFINITION=VALUE)\n  -c, --compile-only            Compile the kernel, but don't actually run\n                                it\n  -G, --device-debug-mode       Have the NVRTC compile the kernel in debug\n                                mode (no optimizations)\n  -P, --write-ptx               Write the intermediate representation code\n                                (PTX) resulting from the kernel\n                                compilation, to a file\n      --ptx-output-file arg     File to which to write the kernel's\n                                intermediate representation\n      --print-compilation-log   Print the compilation log to the standard\n                                output\n      --write-compilation-log arg\n                                Path of a file into which to write the\n                                compilation log (regardless of whether it's\n                                printed to standard output) (default: \"\")\n      --print-execution-durations\n                                Print the execution duration, in\n                                nanoseconds, of each kernel invocation to\n                                the standard output\n      --write-execution-durations arg\n                                Path of a file into which to write the\n                                execution durations, in nanoseconds, for\n                                each kernel invocation (regardless of\n                                whether they're printed to standard output)\n                                (default: \"\")\n      --generate-line-info      Add source line information to the\n                                intermediate representation code (PTX)\n  -b, 
--block-dimensions arg    Set grid block dimensions in threads\n                                (OpenCL: local work size); a\n                                comma-separated list\n  -g, --grid-dimensions arg     Set grid dimensions in blocks; a\n                                comma-separated list\n  -o, --overall-grid-dimensions arg\n                                Set grid dimensions in threads (OpenCL:\n                                global work size); a comma-separated list\n  -O, --append-compilation-option arg\n                                Append an arbitrary extra compilation\n                                option\n  -S, --dynamic-shared-memory-size arg\n                                Force specific amount of dynamic shared\n                                memory\n  -W, --overwrite-output-files  Overwrite the files for buffer and/or PTX\n                                output if they already exists\n  -i, --include arg             Include a specific file into the kernels'\n                                translation unit\n  -I, --include-path arg        Add a directory to the search paths for\n                                header files included by the kernel (can be\n                                used repeatedly)\n  -s, --source-file arg         Path to CUDA source file with the kernel\n                                function to compile; may be absolute or\n                                relative to the sources dir\n  -k, --kernel-function arg     Name of function within the source file to\n                                compile and run as a kernel (if different\n                                than the key)\n  -K, --kernel-key arg          The key identifying the kernel among all\n                                registered runnable kernels\n  -L, --list-adapters           List the (keys of the) kernels which may be\n                                run with this program\n  -z, --zero-outputs            Set the contents of output(-only) buffers\n 
                               to all-zeros\n      --language-standard arg   Set the language standard to use for CUDA\n                                compilation (options: c++11, c++14, c++17)\n      --input-buffer-directory arg\n                                Base location for locating input buffers\n                                (default:\n                                /home/lh156516/src/gpu-kernel-runner)\n      --output-buffer-directory arg\n                                Base location for writing output buffers\n                                (default:\n                                /home/lh156516/src/gpu-kernel-runner)\n      --kernel-sources-dir arg  Base location for locating kernel source\n                                files (default:\n                                /home/lh156516/src/gpu-kernel-runner)\n  -h, --help                    Print usage information\n```\n\n## \u003ca name=\"running-your-own-kernel\"\u003eHow do I get the runner to run my own kernels?\u003c/a\u003e\n\nSo, you've written a kernel. In order for the GPU kernel runner to run it, the runner needs to know about it. Internally, the runner knows kernels through \"kernel adapter\" classes, instantiated into a factory. Luckily, you don't have to be familiar with this mechanism in order to use it. What you _do_ need is:\n\n1. A kernel adapter class definition, in a header file\n2. A build of the kernel runner configured to recognize and use your kernel adapter header file\n\n### \u003ca name=\"build-with-extra-adapters\"\u003eTelling the build about your adapters\u003c/a\u003e\n\nThe CMake configuration for this repository has a variable named `EXTRA_ADAPTER_SOURCE_DIRS`; you can set it when invoking CMake to configure your build, e.g.:\n```\ncmake \\\n    -D CMAKE_BUILD_TYPE=Release \\\n    -DEXTRA_ADAPTER_SOURCE_DIRS=\"/path/to/my_adapters/;/another/path/to/more_adapters\" \\\n    -B /path/to/build_dir/\n```\n... 
so that the build configuration can find your adapters and ensure they are instantiated.\n\n### \u003ca name=\"kernel-adapter-template\"\u003eA kernel adapter template\u003c/a\u003e\n\nTo create a kernel adapter for your kernel, it's easiest to start with the following empty template and replace the `[[[ ... ]]]` parts with what's relevant for your own kernel:\n```\n#include \u003ckernel_adapter.hpp\u003e\n\nclass [[[ UNIQUE CLASS NAME HERE ]]] : public kernel_adapter {\npublic:\n    KA_KERNEL_FUNCTION_NAME(\"[[[ (POSSIBLY-NON-UNIQUE) FUNCTION NAME HERE ]]]\")\n    KA_KERNEL_KEY(\"[[[ UNIQUE KERNEL KEY STRING HERE ]]]\")\n\n    const parameter_details_type\u0026 parameter_details() const override\n    {\n        static const parameter_details_type pd = {\n            [[[ DETAIL LINES FOR EACH KERNEL PARAMETER ]]]\n            \n            // Example detail lines:\n            //\n            //  scalar_details\u003cint\u003e(\"x\"),\n            //  buffer_details(\"my_results\", output),\n            //  buffer_details(\"my_data\", input),\n        };\n        return pd;\n    }\n};\n```\nFor a concrete example, see the adapter file [`vector_add.hpp`](https://github.com/eyalroz/gpu-kernel-runner/blob/main/src/kernel_adapters/vector_add.hpp).\n\nThe `kernel_adapter` class also has other methods one could override in order to make the kernel easier to invoke, for:\n\n* Deducing the launch grid configuration\n* Specifying required preprocessor definitions\n* Generating some arguments from other arguments (e.g. a length from a buffer size)\n\nbut none of these are essential.\n\n## \u003ca name=\"feedback\"\u003eFeedback, bugs, questions etc.\u003c/a\u003e\n\n* If you use this kernel runner in an interesting project, consider [dropping me a line](mailto:eyalroz1@gmx.com) and telling me about it - both the positive and any negative parts of your experience.\n* Found a bug? A function/feature that's missing? 
A poor choice of design or of wording? Please file an [issue](https://github.com/eyalroz/gpu-kernel-runner/issues/).\n* Have a question? If you believe it's generally relevant, also [file an issue](https://github.com/eyalroz/gpu-kernel-runner/issues/), and clearly state that it's a question.\n* Want to suggest significant additional functionality, which you believe would be of general interest? Either file an [issue](https://github.com/eyalroz/gpu-kernel-runner/issues/) or [write me](mailto:eyalroz1@gmx.com).\n\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Feyalroz%2Fgpu-kernel-runner","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Feyalroz%2Fgpu-kernel-runner","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Feyalroz%2Fgpu-kernel-runner/lists"}