https://github.com/projectphysx/ptxprofiler
A simple profiler to count Nvidia PTX assembly instructions of OpenCL/SYCL/CUDA kernels for roofline model analysis.
cuda gpu gpu-acceleration gpu-computing gpu-programming hpc nvidia nvidia-cuda nvidia-gpu opencl profiler ptx ptx-utils roofline-model sycl
- Host: GitHub
- URL: https://github.com/projectphysx/ptxprofiler
- Owner: ProjectPhysX
- License: other
- Created: 2023-01-11T19:24:04.000Z (almost 2 years ago)
- Default Branch: master
- Last Pushed: 2023-12-31T11:45:40.000Z (12 months ago)
- Last Synced: 2024-11-01T02:51:49.539Z (about 2 months ago)
- Topics: cuda, gpu, gpu-acceleration, gpu-computing, gpu-programming, hpc, nvidia, nvidia-cuda, nvidia-gpu, opencl, profiler, ptx, ptx-utils, roofline-model, sycl
- Language: C++
- Homepage:
- Size: 10.7 KB
- Stars: 42
- Watchers: 4
- Forks: 6
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE.md
# PTXprofiler
A simple profiler to count Nvidia [PTX assembly](https://docs.nvidia.com/cuda/parallel-thread-execution/) instructions of [OpenCL](https://github.com/ProjectPhysX/OpenCL-Wrapper)/SYCL/CUDA kernels for [roofline model](https://en.wikipedia.org/wiki/Roofline_model) analysis.

## How to compile?
- on Windows: compile with Visual Studio Community
- on Linux: run `chmod +x make.sh` and `./make.sh path/to/kernel.ptx`

## How to use?
1. Generate a `.ptx` file from your application; this works only with an Nvidia GPU. With the [OpenCL-Wrapper](https://github.com/ProjectPhysX/OpenCL-Wrapper), you can simply uncomment `#define PTX` in [`src/opencl.hpp`](https://github.com/ProjectPhysX/OpenCL-Wrapper/blob/master/src/opencl.hpp#L4) and compile and run. A file `kernel.ptx` is created, containing the [PTX assembly](https://docs.nvidia.com/cuda/parallel-thread-execution/) code.
2. Run `bin/PTXprofiler.exe path/to/kernel.ptx`. For [FluidX3D](https://github.com/ProjectPhysX/FluidX3D), for example, the following table is generated:
```
kernel name                     |flops  (float    int   bit)|copy  |branch|cache   (load store)|memory  (load cached store)
--------------------------------|---------------------------|------|------|--------------------|---------------------------
initialize                      |   283    129     61     93|    33|     6|     0      0      0|   135     35      0    100
stream_collide                  |   363    261     35     67|    23|     2|     0      0      0|   153     77      0     76
update_fields                   |   160     56     37     67|    21|     2|     0      0      0|    93     77      0     16
voxelize_mesh                   |   170     91     34     45|    40|    11|    84     48     36|    37     36      0      1
transfer_extract_fi             |   460      0    221    239|   122|    63|     0      0      0|   180     80     20     80
transfer__insert_fi             |   483      0    247    236|   115|    47|     0      0      0|   180     80     20     80
transfer_extract_rho_u_flags    |    47      0     39      8|    23|     1|     0      0      0|    68     34      0     34
transfer__insert_rho_u_flags    |    47      0     39      8|    23|     1|     0      0      0|    68     34      0     34
```
3. For each [OpenCL](https://github.com/ProjectPhysX/OpenCL-Wrapper)/CUDA kernel, instructions are counted and listed:
- GPUs compute floating-point, integer and bit manipulation operations on the same ALUs, so they are counted combined as `flops`, but also listed separately as `float`, `int` and `bit`.
- Data movement operations are listed under `copy`.
- Branches are listed under `branch`.
- Total shared/local memory (L1 cache) accesses in bytes are listed under `cache`, with separate counters for `load` and `store`.
- Total global memory (VRAM) accesses in bytes are listed under `memory`, with separate counters for `load`, `cached` (load from VRAM or L2 cache) and `store`.
4. You can use the counted `flops` and `memory` accesses, together with the measured execution time of the kernel, to place it in a [roofline model](https://en.wikipedia.org/wiki/Roofline_model) diagram.

## Limitations
- Matrix/tensor operations are not yet supported.
- Non-unrolled loops are counted for only one iteration. If such a loop runs multiple times, the number of instructions actually executed inside it is a multiple of the counted value, so the profiler underestimates it.
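
With these limitations in mind, the roofline placement described in step 4 of "How to use?" might be sketched in C++ as follows. This is only an illustration and not part of PTXprofiler: the names `KernelCounts`, `DeviceLimits`, `arithmetic_intensity`, `achieved_flops` and `roofline_limit` are hypothetical, the counts are assumed to be per thread, and the peak values have to come from the data sheet of your GPU. For kernels with non-unrolled loops, the counts would first have to be corrected by hand.

```cpp
#include <algorithm>
#include <cassert>

// Counts read from one row of the PTXprofiler table (assumed to be per thread).
struct KernelCounts {
    double flops;        // `flops` column: float + int + bit operations
    double memory_bytes; // `memory` column: global memory accesses in bytes
};

// Hypothetical hardware limits, taken from the GPU data sheet.
struct DeviceLimits {
    double peak_flops;     // peak compute throughput in operations per second
    double peak_bandwidth; // peak VRAM bandwidth in bytes per second
};

// x-coordinate in the roofline diagram: operations per byte of memory traffic.
double arithmetic_intensity(const KernelCounts& k) {
    assert(k.memory_bytes > 0.0);
    return k.flops / k.memory_bytes;
}

// y-coordinate in the roofline diagram: achieved operations per second,
// given the number of launched threads and the measured kernel time.
double achieved_flops(const KernelCounts& k, double threads, double seconds) {
    assert(seconds > 0.0);
    return k.flops * threads / seconds;
}

// Roofline ceiling at this kernel's arithmetic intensity:
// the lower of the compute limit and the bandwidth limit.
double roofline_limit(const KernelCounts& k, const DeviceLimits& d) {
    return std::min(d.peak_flops, arithmetic_intensity(k) * d.peak_bandwidth);
}
```

For the `stream_collide` row above (363 flops, 153 bytes), `arithmetic_intensity` gives about 2.37 operations per byte. Comparing `achieved_flops` against `roofline_limit` then shows how close the kernel runs to the hardware limit.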