https://github.com/projectphysx/ptxprofiler

A simple profiler to count Nvidia PTX assembly instructions of OpenCL/SYCL/CUDA kernels for roofline model analysis.

cuda gpu gpu-acceleration gpu-computing gpu-programming hpc nvidia nvidia-cuda nvidia-gpu opencl profiler ptx ptx-utils roofline-model sycl

Last synced: 3 months ago

Host: GitHub
URL: https://github.com/projectphysx/ptxprofiler
Owner: ProjectPhysX
License: other
Created: 2023-01-11T19:24:04.000Z (over 2 years ago)
Default Branch: master
Last Pushed: 2025-03-20T06:14:32.000Z (4 months ago)
Last Synced: 2025-04-04T19:32:16.515Z (3 months ago)
Topics: cuda, gpu, gpu-acceleration, gpu-computing, gpu-programming, hpc, nvidia, nvidia-cuda, nvidia-gpu, opencl, profiler, ptx, ptx-utils, roofline-model, sycl
Language: C++
Homepage:
Size: 11.7 KB
Stars: 50
Watchers: 4
Forks: 6
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE.md

# PTXprofiler

A simple profiler to count Nvidia [PTX assembly](https://docs.nvidia.com/cuda/parallel-thread-execution/) instructions of [OpenCL](https://github.com/ProjectPhysX/OpenCL-Wrapper)/SYCL/CUDA kernels for [roofline model](https://en.wikipedia.org/wiki/Roofline_model) analysis.

## How to compile?

- on Windows: compile with Visual Studio Community

- on Linux: run `chmod +x make.sh` and `./make.sh path/to/kernel.ptx`

## How to use?

1. Generate a `.ptx` file from your application; this works only with an Nvidia GPU. With the [OpenCL-Wrapper](https://github.com/ProjectPhysX/OpenCL-Wrapper), you can simply uncomment `#define PTX` in [`src/opencl.hpp`](https://github.com/ProjectPhysX/OpenCL-Wrapper/blob/master/src/opencl.hpp#L4) and compile and run. A file `kernel.ptx` is created, containing the [PTX assembly](https://docs.nvidia.com/cuda/parallel-thread-execution/) code.

2. Run `bin/PTXprofiler.exe path/to/kernel.ptx`. For [FluidX3D](https://github.com/ProjectPhysX/FluidX3D), for example, the following table is generated:

```
kernel name                     |flops  (float int    bit  )|copy  |branch|cache  (load  store)|memory (load  cached store)
--------------------------------|---------------------------|------|------|--------------------|---------------------------
initialize                      |   283    129     61     93|    33|     6|     0      0      0|   135     35      0    100
stream_collide                  |   363    261     35     67|    23|     2|     0      0      0|   153     77      0     76
update_fields                   |   160     56     37     67|    21|     2|     0      0      0|    93     77      0     16
voxelize_mesh                   |   170     91     34     45|    40|    11|    84     48     36|    37     36      0      1
transfer_extract_fi             |   460      0    221    239|   122|    63|     0      0      0|   180     80     20     80
transfer__insert_fi             |   483      0    247    236|   115|    47|     0      0      0|   180     80     20     80
transfer_extract_rho_u_flags    |    47      0     39      8|    23|     1|     0      0      0|    68     34      0     34
transfer__insert_rho_u_flags    |    47      0     39      8|    23|     1|     0      0      0|    68     34      0     34
```

3. For each [OpenCL](https://github.com/ProjectPhysX/OpenCL-Wrapper)/CUDA kernel, instructions are counted and listed:

   - GPUs compute floating-point, integer and bit manipulation operations on the same ALUs, so they are counted together as `flops`, but also listed separately as `float`, `int` and `bit`.

   - Data movement operations are listed under `copy`.

   - Branches are listed under `branch`.

   - Total shared/local memory (L1 cache) accesses in bytes are listed under `cache`, with separate counters for `load` and `store`.

   - Total global memory (VRAM) accesses in bytes are listed under `memory`, with separate counters for `load`, `cached` (load from VRAM or L2 cache) and `store`.

4. You can use the counted `flops` and `memory` accesses, together with the measured execution time of the kernel, to place it in a [roofline model](https://en.wikipedia.org/wiki/Roofline_model) diagram.
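The roofline placement from the last step can be sketched in a few lines of C++. This is only an illustration, not part of PTXprofiler: the function names are hypothetical, and it assumes the counted `flops` and `memory` values apply once to every thread of the kernel (no non-unrolled loops). Peak compute and peak bandwidth must be taken from your GPU's data sheet or from your own measurements.

```cpp
// Roofline placement sketch (hypothetical helper names, not part of PTXprofiler).
// Inputs are the `flops` and `memory` columns of the table, taken as per-thread values.

// Arithmetic intensity in flops per byte: the x coordinate in the roofline diagram.
constexpr double arithmetic_intensity(double flops_per_thread, double bytes_per_thread) {
    return flops_per_thread / bytes_per_thread;
}

// Achieved performance in flops/s: the y coordinate in the roofline diagram.
// `threads` is the global range of the kernel, `seconds` its measured execution time.
constexpr double achieved_flops(double flops_per_thread, double threads, double seconds) {
    return flops_per_thread * threads / seconds;
}

// Roofline ceiling in flops/s at a given intensity:
// the lower of peak compute and intensity times peak memory bandwidth.
constexpr double roofline_limit(double intensity, double peak_flops, double peak_bandwidth) {
    return intensity * peak_bandwidth < peak_flops ? intensity * peak_bandwidth : peak_flops;
}
```

For the `stream_collide` row above, `arithmetic_intensity(363.0, 153.0)` gives about 2.37 flops per byte. If `roofline_limit` at that intensity is below the peak compute value you pass in, the kernel sits under the sloped, bandwidth-limited part of the roof.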

## Limitations

- Matrix/tensor operations are not yet supported.

- Non-unrolled loops are counted for one iteration only. Since such a loop may execute many times, the number of instructions actually executed inside it can be a multiple of the counted value.
