CUDA Scheduling Viewer
======================

About
-----

This project provides a tool for examining block-level scheduling behavior
and coscheduling performance on CUDA devices. The tool can run any benchmark
that is self-contained in a shared library file exporting specific functions.
Currently, this tool only runs under Linux, and is unlikely to support other
systems in the future.

To cite this work in academic use, either link to this repository or cite the
[original paper for which it was created](https://cs.unc.edu/~anderson/papers/ospert17.pdf).

```
@inproceedings{otterness2017inferring,
  title={Inferring the Scheduling Policies of an Embedded {CUDA} {GPU}},
  author={Otterness, Nathan and Yang, Ming and Amert, Tanya and Anderson, James H. and Smith, F. D.},
  booktitle={Workshop on Operating Systems Platforms for Embedded Real-Time Applications (OSPERT)},
  year={2017}
}
```

If using SM/TPC partitioning, please cite the
[paper for which it was created](https://cs.unc.edu/~jbakita/rtas23.pdf).

```
@inproceedings{bakita2023hardware,
  title={Hardware Compute Partitioning on {NVIDIA} {GPUs}},
  author={Bakita, Joshua and Anderson, James H.},
  booktitle={Proceedings of the 29th IEEE Real-Time and Embedded Technology and Applications Symposium (RTAS)},
  year={2023}
}
```

For Users of AMD GPUs
---------------------

For users of AMD GPUs, or those willing to give up some useful CUDA-specific
features, we developed a port of this project in
the [HIP](https://github.com/ROCm-Developer-Tools/HIP) language. This project
can be found at [https://github.com/yalue/hip_plugin_framework](https://github.com/yalue/hip_plugin_framework).
`hip_plugin_framework` remains nearly identical to `cuda_scheduling_examiner`,
with some cleaned-up code and more consistent naming conventions, but it
unfortunately lacks the ability to detect the SMs to which blocks are
assigned, as that feature is not portable to HIP.

Compilation
-----------

This tool can only be run on a computer with a CUDA-capable GPU and with CUDA
installed. The `nvcc` command must be available on your PATH. The tool has not
been tested with devices earlier than compute capability 5.0 or CUDA versions
earlier than 9.0. GCC version 4.9 or later is required.

Earlier versions of the tool, developed for devices with compute capability
3.0 or CUDA versions 8.0 and earlier, are available by checking out the
`older_cuda` git tag.

To build, clone the repository, `cd` into it, and run `make`.
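
For example:

```bash
git clone https://github.com/yalue/cuda_scheduling_examiner_mirror.git
cd cuda_scheduling_examiner_mirror
make
```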

In order to use SM/TPC partitioning (the `sm_mask` field documented below),
please install [libsmctrl](http://rtsrv.cs.unc.edu/cgit/cgit.cgi/libsmctrl.git/)
and set `LIBSMCTRL_PATH` to the library's location in this project's Makefile.
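
If the Makefile reads `LIBSMCTRL_PATH` as an ordinary make variable (an
assumption; check the Makefile itself), the path could also be supplied on
the command line instead of editing the file:

```bash
# Hypothetical invocation; adjust the path to your libsmctrl checkout.
make LIBSMCTRL_PATH=/path/to/libsmctrl
```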

Usage
-----

The tool must be provided a JSON configuration file, which will contain
information about which benchmark libraries to run, how to run them, and what
parameters to provide. The file `configs/simple.json` has been provided as a
minimal example, running one instance of the `mandelbrot.so` benchmark. To run
it:

```bash
./bin/runner ./configs/simple.json
```

Additionally, the character `-` may be used in place of a config file name, in
which case the tool will attempt to read a JSON configuration object from
stdin. The file will be read completely before any benchmarks begin execution.
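
For example, the following is equivalent to passing the config file's name
directly:

```bash
cat ./configs/simple.json | ./bin/runner -
```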

Some scripts have been included to visualize results. They require python,
numpy, and matplotlib. All such scripts are located in the `scripts/`
directory. For example:

```bash
# Run all known configurations
find configs/*.json -exec ./bin/runner {} \;

# Visualize the scheduling timelines for each scenario
python scripts/view_timelines.py

# View the execution timeline of each block
python scripts/view_blocksbysm.py
```

To only plot a subset of the results, many of the aforementioned scripts
support explicitly specifying which output files to plot. For example:

```bash
# Plot all results of the memset_doesnt_block.json configuration
python scripts/view_blocksbysm.py ./results/test_blocking_memset*
```

Configuration Files
-------------------

The configuration files specify parameters passed to each benchmark along with
some global settings for the entire program. The layout of each configuration
file is as follows:
```
{
  "name": ,
  "max_iterations": ,
  "max_time": ,
  "use_processes": ,
  "cuda_device": ,
  "base_result_directory": ,
  "pin_cpus": ,
  "do_warmup": ,
  "sync_every_iteration": ,
  "benchmarks": [
    {
      "filename": ,
      "log_name": ,
      "mps_thread_percentage": ,
      "label": ,
      "thread_count": ,
      "block_count": ,
      "data_size": ,
      "sm_mask": ,
      "additional_info": ,
      "max_iterations": ,
      "max_time": ,
      "release_time": ,
      "cpu_core": ,
      "stream_priority":
    }
  ]
}
```

Additionally, benchmark configurations support the insertion of comments via
"comment" keys, which are ignored at runtime.
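
As an illustration, a minimal configuration in the spirit of
`configs/simple.json` might look like the following. The specific values and
the `.so` path are invented for this example rather than documented defaults:

```
{
  "name": "Simple mandelbrot scenario",
  "max_iterations": 100,
  "max_time": 0,
  "cuda_device": 0,
  "benchmarks": [
    {
      "comment": "This key is ignored at runtime.",
      "filename": "./bin/mandelbrot.so",
      "log_name": "simple_mandelbrot.json",
      "thread_count": 256,
      "block_count": 32,
      "data_size": 0
    }
  ]
}
```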

Automatic Benchmark Generation
------------------------------

The script located in `scripts/multikernel_generator.py` illustrates how
config generation can be scripted. To run a scenario automatically generated
by this script, run the following command (after running `make`):

```bash
python scripts/multikernel_generator.py | ./bin/runner -
```

Output File Format
------------------

Each benchmark, when run, will generate a JSON log file at the location
specified in the configuration. If the benchmark did not complete
successfully, the JSON file may be in an invalid state. Times are recorded as
floating-point numbers of seconds. The format of the log file is:

```
{
  "scenario_name": "",
  "benchmark_name": "",
  "label": "",
  "max_resident_threads": ,
  "data_size": ,
  "release_time": ,
  "PID": ,
  "TID": ,
  "times": [
    {},
    {
      "cpu_times": [ , ],
      "copy_in_times": [ , ],
      "execute_times": [ , ],
      "copy_out_times": [ , ]
    },
    {
      "kernel_name": ,
      "block_count": ,
      "thread_count": ,
      "shared_memory": ,
      "cuda_launch_times": [ , , ],
      "block_times": [ , , ...],
      "block_smids": [ , , ...],
      "cpu_core":
    },
    ...
  ]
}
```

Notice that the first entry in the "times" array will be blank and should be
ignored. The "times" array contains two types of objects: one containing CPU
times and one containing kernel times. An object containing CPU times is
identified by its `"cpu_times"` key. A single CPU times object encompasses
all kernel times following it, up until the next CPU times object.
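
For instance, a purely illustrative excerpt of a "times" array, with invented
numbers, might look like the following (note the blank first entry, and that
`block_times` holds a start/end pair for each block):

```
"times": [
  {},
  {
    "cpu_times": [0.00213, 0.00495],
    "copy_in_times": [0.00214, 0.00228],
    "execute_times": [0.00230, 0.00470],
    "copy_out_times": [0.00472, 0.00494]
  },
  {
    "kernel_name": "mandelbrot",
    "block_count": 2,
    "thread_count": 256,
    "shared_memory": 0,
    "cuda_launch_times": [0.00230, 0.00232, 0.00470],
    "block_times": [0.00233, 0.00350, 0.00234, 0.00468],
    "block_smids": [0, 1],
    "cpu_core": 0
  }
]
```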

Creating New Benchmarks
-----------------------

Each benchmark must be contained in a shared library and abide by the
interface specified in `src/library_interface.h`. In particular, the library
must export a `RegisterFunctions` function, which provides the addresses of
further functions to the calling program. Benchmarks should preferably never
use global state, and should instead use the `user_data` pointer returned by
the initialize function to track all state. Global state may work when only
one instance of each benchmark runs at a time, but none of the default
benchmarks included in this project rely on it. All benchmarks must use a
user-created CUDA stream in order to avoid unnecessarily blocking each other.

The most important piece of information that each benchmark provides is the
`TimingInformation` struct, filled in during the `copy_out` function of each
benchmark. This struct contains a list of `KernelTimes` structs, one for each
kernel invocation made during `execute`. Each `KernelTimes` struct contains
the kernel's start and end times, individual block start and end times, and a
list of the SM IDs to which blocks were assigned. The benchmark is
responsible for ensuring that the buffers provided in the `TimingInformation`
struct remain valid at least until another benchmark function is called; they
will not be freed by the caller.

In general, the comments in `library_interface.h` explain the actions that
every library-provided function is expected to carry out. The existing
libraries in `src/mandelbrot.cu` and `src/timer_spin.cu` provide examples of
working library implementations. In addition to `library_interface.h`,
`benchmark_library_functions.h/cu` define a library of utility functions that
may be shared between benchmarks.

Benchmark libraries are invoked by the master process as follows (a minimal
skeleton of such a library is sketched after this list):

1. The shared library file is loaded using the `dlopen()` function, and the
   `RegisterFunctions` function is located using `dlsym()`.

2. Depending on the configuration, either a new process or a new thread will
   be created for each benchmark.

3. In its own thread or process, the benchmark's `initialize` function will
   be called, which should allocate and initialize all of the local state
   necessary for one instance of the benchmark.

4. When the benchmark begins running, a single iteration will consist of the
   benchmark's `copy_in`, `execute`, and `copy_out` functions being called,
   in that order.

5. When enough time has elapsed or the maximum number of iterations has been
   reached, the benchmark's `cleanup` function will be called, allowing the
   benchmark to clean up and free its local state.
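
The following is a minimal sketch of what such a library might look like. It
is illustrative only: the authoritative types and signatures are defined in
`src/library_interface.h`, and the struct names, fields, and return
conventions shown here are assumptions rather than the project's actual
definitions.

```c
// Illustrative sketch; see src/library_interface.h for the real interface.
#include <stdlib.h>
#include <cuda_runtime.h>
#include "library_interface.h"

// Per-instance state, tracked via the user_data pointer so that no global
// state is needed.
typedef struct {
  // A user-created stream, so that instances don't block one another.
  cudaStream_t stream;
} BenchmarkState;  // Hypothetical name.

static void* Initialize(InitializationParameters *params) {
  // params (thread counts, data size, etc.) would configure the instance.
  BenchmarkState *state = (BenchmarkState *) calloc(1, sizeof(*state));
  if (!state) return NULL;
  if (cudaStreamCreate(&state->stream) != cudaSuccess) {
    free(state);
    return NULL;
  }
  return state;
}

// Called once per iteration, in this order. A real benchmark would copy
// inputs to the GPU, launch kernels on state->stream, and copy results back;
// this sketch simply reports success.
static int CopyIn(void *user_data) { return 1; }
static int Execute(void *user_data) { return 1; }

static int CopyOut(void *user_data, TimingInformation *times) {
  // Fill in one KernelTimes entry per kernel launched during Execute. The
  // buffers must remain valid until the next call into this library; they
  // are not freed by the caller.
  return 1;
}

static void Cleanup(void *user_data) {
  BenchmarkState *state = (BenchmarkState *) user_data;
  cudaStreamDestroy(state->stream);
  free(state);
}

// The function the master process locates with dlsym() after dlopen().
int RegisterFunctions(BenchmarkLibraryFunctions *functions) {
  functions->initialize = Initialize;
  functions->copy_in = CopyIn;
  functions->execute = Execute;
  functions->copy_out = CopyOut;
  functions->cleanup = Cleanup;
  return 1;
}
```

A library along these lines would be compiled into a `.so` file and
referenced through the `filename` field of a configuration file.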

Coding Style
------------

Even though CUDA supports C++, contributions to this project should use the C
programming language when possible. C or CUDA source code should adhere to the
parts of the [Google C++ Style Guide](https://google.github.io/styleguide/cppguide.html)
that apply to the C language.

Scripts should remain in the `scripts/` directory and should be written in
python when possible. For now, there is no explicit style guide for python
scripts apart from trying to maintain a consistent style within each file.