# PTXprofiler

A simple profiler to count Nvidia [PTX assembly](https://docs.nvidia.com/cuda/parallel-thread-execution/) instructions of [OpenCL](https://github.com/ProjectPhysX/OpenCL-Wrapper)/SYCL/CUDA kernels for [roofline model](https://en.wikipedia.org/wiki/Roofline_model) analysis.

## How to compile?
- on Windows: compile with Visual Studio Community
- on Linux: run `chmod +x make.sh` and `./make.sh path/to/kernel.ptx`

## How to use?
1. Generate a `.ptx` file from your application; this works only with an Nvidia GPU. With the [OpenCL-Wrapper](https://github.com/ProjectPhysX/OpenCL-Wrapper), simply uncomment `#define PTX` in [`src/opencl.hpp`](https://github.com/ProjectPhysX/OpenCL-Wrapper/blob/master/src/opencl.hpp#L4), then compile and run.
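   For CUDA kernels, the PTX can alternatively be emitted directly by the CUDA compiler; this is a standard `nvcc` invocation, with `kernel.cu` as a placeholder name for your source file:
```sh
# Emit PTX assembly instead of building a binary;
# kernel.cu is a placeholder for your own source file.
nvcc -ptx kernel.cu -o kernel.ptx
```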
   A file `kernel.ptx` is created, containing the [PTX assembly](https://docs.nvidia.com/cuda/parallel-thread-execution/) code.
2. Run `bin/PTXprofiler.exe path/to/kernel.ptx`. For [FluidX3D](https://github.com/ProjectPhysX/FluidX3D), for example, this table is generated:
```
kernel name                     |flops  (float int    bit  )|copy  |branch|cache  (load  store)|memory (load  cached store)
--------------------------------|---------------------------|------|------|--------------------|---------------------------
initialize                      |   283    129     61     93|    33|     6|     0      0      0|   135     35      0    100
stream_collide                  |   363    261     35     67|    23|     2|     0      0      0|   153     77      0     76
update_fields                   |   160     56     37     67|    21|     2|     0      0      0|    93     77      0     16
voxelize_mesh                   |   170     91     34     45|    40|    11|    84     48     36|    37     36      0      1
transfer_extract_fi             |   460      0    221    239|   122|    63|     0      0      0|   180     80     20     80
transfer__insert_fi             |   483      0    247    236|   115|    47|     0      0      0|   180     80     20     80
transfer_extract_rho_u_flags    |    47      0     39      8|    23|     1|     0      0      0|    68     34      0     34
transfer__insert_rho_u_flags    |    47      0     39      8|    23|     1|     0      0      0|    68     34      0     34
```
3. For each [OpenCL](https://github.com/ProjectPhysX/OpenCL-Wrapper)/CUDA kernel, instructions are counted and listed:
   - GPUs compute floating-point, integer and bit-manipulation operations on the same ALUs, so these are counted combined as `flops`, but also listed separately as `float`, `int` and `bit`.
   - Data movement operations are listed under `copy`.
   - Branches are listed under `branch`.
   - Total shared/local memory (L1 cache) accesses in Bytes are listed under `cache`, with separate counters for `load` and `store`.
   - Total global memory (VRAM) accesses in Bytes are listed under `memory`, with separate counters for `load`, `cached` (load from VRAM or L2 cache) and `store`.
4. You can use the counted `flops` and `memory` accesses, together with the measured execution time of the kernel, to place it in a [roofline model](https://en.wikipedia.org/wiki/Roofline_model) diagram.

## Limitations
- Matrix/tensor operations are not yet supported.
- Non-unrolled loops are only counted for one iteration, but may be executed multiple times, so the actual number of executed instructions inside a loop body can be a multiple of the reported count.