https://github.com/southernmethodistuniversity/profiling_applications

Profiling Applications on M2
Host: GitHub
URL: https://github.com/southernmethodistuniversity/profiling_applications
Owner: SouthernMethodistUniversity
Created: 2020-04-20T12:59:26.000Z (about 6 years ago)
Default Branch: master
Last Pushed: 2020-04-21T19:07:31.000Z (about 6 years ago)
Last Synced: 2025-03-03T03:13:22.255Z (over 1 year ago)
Size: 73.2 KB
Stars: 0
Watchers: 4
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

# Profiling Applications on M2

## Center for Research Computing (CRC)

* Maintains our primary shared resource for research computing, ManeFrame II (M2),
  in collaboration with OIT
* Provides research computing tools, support, and training to all faculty, staff,
  and students using research computing resources
* [www.smu.edu/csc](https://www.smu.edu/csc) has documentation and news
* [help@smu.edu](mailto:help@smu.edu) or
  [rkalescky@smu.edu](mailto:rkalescky@smu.edu) for help

## CSC Workshop Series

|Date         |Workshop                                                     |
|-------------|-------------------------------------------------------------|
|January 21   |M2 Introduction                                              |
|January 28   |Introduction to LAPACK and BLAS                              |
|February 4   |Text Mining with Python on M2 (Led by Dr. Eric Godat)        |
|February 11  |Using the New HPC Portal                                     |
|February 18  |Using GitHub                                                 |
|February 25  |Writing Portable Accelerator Code with KOKKOS, RAJA, and OCCA|
|March 3      |M2 Introduction                                              |
|March 10     |Introduction to Parallelization Using MPI                    |
|March 17     |No Workshop (Spring Break)                                   |
|March 24     |Writing High Performance Python Code                         |
|March 31     |Creating Portable Environments with Docker and Singularity   |
|April 7      |M2 Introduction                                              |
|April 14     |Introduction to Parallelization Using OpenMP and OpenACC     |
|April 21     |Profiling Applications on M2                                 |
|April 28     |Improving Code Vectorization                                 |

## Accessing ManeFrame II (M2) for this Workshop

* Via Terminal or PuTTY as usual (see [here](http://faculty.smu.edu/csc/documentation/access.html) for details)

* Via the HPC Portal (Note that this doesn't support X11 forwarding)

    1. Go to [hpc.smu.edu](https://hpc.smu.edu/).

    2. Sign in using your SMU ID and SMU password.

    3. Select "ManeFrame II Shell Access" from the "Clusters" drop-down menu.

## Profiling and Performance Analysis with GCC

There are two primary mechanisms for profiling code: determining which

routines take the most time, and determining which specific lines of

code would be best to optimize. Thankfully, the [GNU compiler

collection](http://gcc.gnu.org/) includes utilities for both of these

tasks, as will be illustrated below. Utilities with similar

functionality are included with some other compilers.

### Generating a profile

In the GNU compilers (and many others), you can enable profiling
information by adding the `-p` compiler flag. Add this compiler
flag to the commands in the `CMakeLists.txt` for the target `mmm`.
Profiling information is generated by running the executable once to
completion.

```

$ module load spack gcc-9.2 armadillo cmake

$ cmake .

$ cmake --build .

$ srun -p development,htc,standard-mem-s -c 1 --mem=6G -t 5 ./mmm 2000 2000 2000

```

Write down the total runtime required for the program (you will use this

information later on).
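If the program does not report its own runtime, one option is to wrap the work in a `std::chrono` timer. The following is only a minimal sketch: `busy_work` and `time_seconds` are hypothetical names chosen for illustration and are not part of the workshop's `mmm.cpp`.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>

// Hypothetical stand-in for the real workload (not the workshop's mmm code).
double busy_work(std::size_t n) {
    double sum = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        sum += static_cast<double>(i) * 0.5;
    }
    return sum;
}

// Run `work` once and return the elapsed wall-clock time in seconds.
template <typename Work>
double time_seconds(Work&& work) {
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    const double elapsed = std::chrono::duration<double>(stop - start).count();
    assert(elapsed >= 0.0);  // steady_clock is monotonic
    return elapsed;
}
```

Calling `time_seconds([&] { result = busy_work(n); })` from `main` returns the elapsed seconds, which can then be printed and recorded. A timer inside the program measures only the wrapped work, not job launch time.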

When the program has finished, you should see a new file in the

directory called `gmon.out`. This contains the relevant profiling data,

and was written during the execution of the code.

Examine the profiling information by using the program `gprof`. You use

this by calling `gprof`, followed by the executable name. It will

automatically look in the `gmon.out` file in that directory for the

profiling data that relates to the executable. Run the command

```

$ gprof mmm

```

When you run `gprof`, it outputs all of the profiling information to the

screen. To enable easier examination of these results, you should

instead send this data to a file. You can redirect this information to

the file `profiling_data.txt` with the command

```

$ gprof mmm > profiling_data.txt

```

You will then have the readable file `profiling_data.txt` with the

relevant profiling information.

### Identifying bottlenecks

Read through the first table of profiling information in this file. The

first column of this table shows the percentage of time spent in each

function called by the executable. Identify which one takes the vast

majority of the time. This bottleneck should be the first routine that

you investigate for optimization.

Look through the routine identified in the previous step. The
function may be contained in a file with a different name, so you can
use `grep` to find which file contains the routine:

```
$ grep -i "function_name" *
```

where `function_name` is a placeholder for the name of the function that
you identified in the previous step.

Once you have determined the file that contains the culprit function,

you can use the second utility routine `gcov` to determine which lines

in the file are executed the most. To use `gcov`, you must modify the

compile line once more, to use the compilation flags

`-fprofile-arcs -ftest-coverage`.

Add these compiler flags to the commands in the `CMakeLists.txt` for the

target `mmm`, recompile, and re-run the executable,

```

$ srun -p development,htc,standard-mem-s -c 1 --mem=6G -t 5 ./mmm 2000 2000 2000

$ mv ./CMakeFiles/mmm.dir/mmm.cpp.gcno mmm.gcno

$ mv ./CMakeFiles/mmm.dir/mmm.cpp.gcda mmm.gcda

```

You should now see additional files in the directory with extensions

`.gcda` and `.gcno`. If you do not see these files, revisit the above

instructions to ensure that you haven't missed any steps.

You should now run `gcov` on the source file that held the function you
identified from the steps above. In this example the source file is
`mmm.cpp`, so you would run

```

$ gcov mmm.cpp

```

This will output some information to the screen, including the name of a

`.gcov` file that it creates with information on the program. Open this

new file using `nano`, and you will see lines like the following:

```

      2001:   10:    for (unsigned long int k = 0; k < p; ++k) {

   4002000:   11:        for (unsigned long int j = 0; j < n; ++j) {

8004000000:   12:            for (unsigned long int i = 0; i < m; ++i) {

8000000000:   13:                C.at(i, k) += A.at(i, j) * B.at(j, k);

```

The first column of numbers on the left signifies the number of times each
line of code was executed within the program. The second column of
numbers corresponds to the line number within the source code file. The
remainder of each line shows the source code itself. From the above
snippet, we see that lines 12 and 13 were executed roughly 8.004 billion
and 8 billion times, respectively, indicating that these would be prime
locations for code optimization.
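The counts in the snippet can be cross-checked by hand. The sketch below assumes a simple model that is consistent with the output shown: a `for` header line is counted once per iteration plus once for the final test that exits the loop. The struct and function names are hypothetical, and `p`, `n`, and `m` are the trip counts of the `k`, `j`, and `i` loops.

```cpp
#include <cassert>
#include <cstdint>

// Expected execution counts for a triple loop nest ordered k, j, i.
struct LoopCounts {
    std::uint64_t k_header;  // outer loop line
    std::uint64_t j_header;  // middle loop line
    std::uint64_t i_header;  // inner loop line
    std::uint64_t body;      // innermost statement
};

// Each header runs once per iteration plus one final (failing) test,
// multiplied by the number of times the enclosing loops reach it.
LoopCounts expected_counts(std::uint64_t p, std::uint64_t n, std::uint64_t m) {
    LoopCounts c{};
    c.k_header = p + 1;
    c.j_header = p * (n + 1);
    c.i_header = p * n * (m + 1);
    c.body = p * n * m;
    return c;
}

// Compare the model with the gcov output shown above (p = n = m = 2000).
void check_against_gcov_output() {
    const LoopCounts c = expected_counts(2000, 2000, 2000);
    assert(c.k_header == 2001ULL);
    assert(c.j_header == 4002000ULL);
    assert(c.i_header == 8004000000ULL);
    assert(c.body == 8000000000ULL);
}
```

With the run arguments `2000 2000 2000`, the model reproduces all four counts in the `.gcov` excerpt, which confirms that the innermost statement dominates the work.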

Find the corresponding lines of code in the function that you identified

from the preceding step. It is here where you should focus your

optimization efforts.

### Optimizing code

Save a copy of the source code file you plan to modify using the `cp`

command, e.g.

```

$ cp file.cpp file_old.cpp

```

where `file` is the file that you have identified as containing the

bottleneck routine (use the appropriate extension for your coding

language). We will use this original file again later in the session.

Now that you know which lines are executed, and how often, you should

remove the `gcov` compiler options, but keep the `-p` in your

`CMakeLists.txt`.

Determine what, if anything, can be optimized in this routine. The topic

of code optimization is bigger than we can cover in a single workshop

session, but here are some standard techniques.

#### Code optimization techniques

1.  Is there a simpler way that the arithmetic could be accomplished?

    Sometimes the most natural way of writing down a problem does not

    result in the least amount of effort. For example, we may implement

    a line of code to evaluate the polynomial $p(x) =

    2x^4-3x^3+5x^2-8x+7$ using either

    ```

    p = 2.0*x*x*x*x - 3.0*x*x*x + 5.0*x*x - 8.0*x + 7.0;

    ```

    or

    ```

    p = (((2.0*x - 3.0)*x + 5.0)*x - 8.0)*x + 7.0;

    ```

    The first line requires 10 multiplications and 4 additions/subtractions,
    while the second requires only 4 multiplications and 4
    additions/subtractions.

2.  Is the code accessing memory in an optimal manner? Computers move
    data from RAM into cache one "cache line" at a time (commonly 64
    bytes), meaning that if you retrieve a single number, the numbers
    stored next to it are also brought into fast-access cache memory.
    So, if each iteration of a loop uses values that are stored in
    disparate portions of RAM, each value could require retrieval of a
    separate cache line. Alternatively, if each loop iteration uses
    values that are stored near one another, many numbers in a row can
    be retrieved using a single RAM access. Since RAM access is
    significantly slower than cache access, something as small as a
    difference in loop ordering can make a huge difference in speed.

3.  Is the code doing redundant computations? While modern computers can
    perform many calculations in the time it takes to fetch data from
    RAM, some calculations are costly enough to warrant computing them
    only once and storing the result for later reuse. This is especially

    pertinent for things that are performed a large number of times. For

    example, consider the following two algorithms:

    ```
    for (i=1; i<10000; i++) {
        d[i] = u[i-1]/h/h - 2.0*u[i]/h/h + u[i+1]/h/h;
    }
    ```

    and

    ```
    double hinv2 = 1.0/h/h;
    for (i=1; i<10000; i++) {
        d[i] = (u[i-1] - 2.0*u[i] + u[i+1])*hinv2;
    }
    ```

    Since floating-point division is significantly more costly than

    multiplication (roughly $10\times$), and the division by $h^2$ is

    done redundantly both within and between loop iterations, the second

    of these algorithms is typically much faster than the first.

4.  Is the code doing unnecessary data copies? In many programming

    languages, a function can be written to use either *call-by-value*

    or *call-by-reference*.

    In call-by-value, all arguments to a function are copied from the

    calling routine into a new set of variables that are local to the

    called function. This allows the called function to modify the input

    variables without concern about corrupting data in the calling

    routine.

    In call-by-reference, the called function only receives memory

    references to the actual data held by the calling routine. This

    allows the called function to directly modify the data held by the

    calling routine.

    While call-by-reference is obviously more "dangerous," it avoids

    unnecessary (and costly) memory allocation/copying/deallocation in

    the executing code. As such, highly efficient code typically uses

    call-by-reference, with the programmer responsible for ensuring that

    data requiring protection in the calling program is manually copied

    before function calls, or that the functions themselves are

    constructed to avoid modifying the underlying data.

    In C and C++, call-by-value is the default, whereas Fortran uses

    call-by-reference. However in C, pointers may be passed through

    function calls to emulate call-by-reference. In C++, either pointers

    can be sent through function calls, or arguments may be specified as

    being passed by reference (using the `&` symbol).
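The calling conventions described in technique 4 can be sketched in C++ as follows. All function names here are hypothetical and chosen only for illustration. The `const` reference variant is one common way to avoid the copy while still protecting the caller's data.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Call-by-value: `v` is a private copy, so the caller's vector is untouched,
// but the whole vector is copied on every call.
double scale_and_sum_by_value(std::vector<double> v, double factor) {
    double sum = 0.0;
    for (double& x : v) {
        x *= factor;
        sum += x;
    }
    return sum;
}

// Call-by-reference to const: no copy is made, and the compiler rejects
// any attempt to modify the caller's data through `v`.
double scaled_sum_by_const_ref(const std::vector<double>& v, double factor) {
    double sum = 0.0;
    for (double x : v) {
        sum += x * factor;
    }
    return sum;
}

// Call-by-reference: no copy is made, and the caller's data is modified.
void scale_in_place(std::vector<double>& v, double factor) {
    for (double& x : v) {
        x *= factor;
    }
}

// C-style emulation of call-by-reference using a pointer and a length.
void scale_in_place_ptr(double* data, std::size_t n, double factor) {
    assert(data != nullptr || n == 0);
    for (std::size_t i = 0; i < n; ++i) {
        data[i] *= factor;
    }
}
```

For a large vector called inside a hot loop, the by-value version pays for an allocation and a copy on every call, which is the kind of cost a profile can expose.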

Find what you can fix, so long as you do not change the mathematical
result. Delete and re-compile the executable, then run it again to
generate a fresh `gmon.out`.

Re-examine the results using `gprof`, and repeat the optimization
process until you are satisfied that the code has been sufficiently
optimized. You should be able to achieve a significant performance
improvement (at least 40% faster than the original).
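One way to confirm that a loop reordering (technique 2 above) leaves the mathematical result unchanged is to keep both versions and compare their output. The sketch below is an illustration only, not the workshop's `mmm.cpp`: it assumes matrices stored column-major in a flat `std::vector<double>`, and every function name is hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Column-major index of element (row, col) in a matrix with `rows` rows.
inline std::size_t idx(std::size_t row, std::size_t col, std::size_t rows) {
    return row + col * rows;
}

// C (m x p) = A (m x n) * B (n x p) with the innermost loop over k.
// In column-major storage, C and B are then accessed with large strides.
std::vector<double> multiply_strided(const std::vector<double>& A,
                                     const std::vector<double>& B,
                                     std::size_t m, std::size_t n,
                                     std::size_t p) {
    assert(A.size() == m * n && B.size() == n * p);
    std::vector<double> C(m * p, 0.0);
    for (std::size_t i = 0; i < m; ++i)
        for (std::size_t j = 0; j < n; ++j)
            for (std::size_t k = 0; k < p; ++k)
                C[idx(i, k, m)] += A[idx(i, j, m)] * B[idx(j, k, n)];
    return C;
}

// Same product with the innermost loop over i, so that consecutive
// iterations touch neighboring entries of C and A.
std::vector<double> multiply_contiguous(const std::vector<double>& A,
                                        const std::vector<double>& B,
                                        std::size_t m, std::size_t n,
                                        std::size_t p) {
    assert(A.size() == m * n && B.size() == n * p);
    std::vector<double> C(m * p, 0.0);
    for (std::size_t k = 0; k < p; ++k)
        for (std::size_t j = 0; j < n; ++j) {
            const double b = B[idx(j, k, n)];  // reused for the whole column
            for (std::size_t i = 0; i < m; ++i)
                C[idx(i, k, m)] += A[idx(i, j, m)] * b;
        }
    return C;
}

// Largest absolute difference between two equally sized results.
double max_abs_diff(const std::vector<double>& X,
                    const std::vector<double>& Y) {
    assert(X.size() == Y.size());
    double worst = 0.0;
    for (std::size_t t = 0; t < X.size(); ++t) {
        const double d = std::fabs(X[t] - Y[t]);
        if (d > worst) worst = d;
    }
    return worst;
}
```

Checking `max_abs_diff` on a small input before timing the large case guards against an "optimization" that silently changes the answer. How much faster the contiguous ordering runs depends on the hardware, matrix sizes, and compiler flags, so measure it with `gprof` and the total runtime rather than assuming a figure.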

Write down the total runtime required for your hand-optimized program.

Copy your updated code to the file `file_new.cpp` (again, use the

appropriate extension for your coding language).

## Profiling Python Scripts

Like GCC, Python has the ability to profile code. The built-in `cProfile` module used below collects statistics at the function level.

```

$ module purge

$ module load python

$ srun -p development,htc,standard-mem-s -c 1 --mem=6G -t 5 python3 mmm.py

```

This demonstrates the performance difference between an optimized matrix-matrix
multiplication implementation and a simple implementation.

```

$ srun -p development,htc,standard-mem-s -c 1 --mem=6G -t 5 python3 -m cProfile -o profile_data.txt mmm.py

$ python3 view_profile.py | grep gemm

```

The `view_profile.py` script extracts the function-level profile information
stored in the `profile_data.txt` file.

## Profiling R Scripts

Like GCC and Python, R has the ability to profile at the function and line levels.

```

$ module purge

$ module load r/3.6.2

$ srun -p development,htc,standard-mem-s -c 1 --mem=6G -t 5 Rscript mmm.R 1

```

This demonstrates the performance difference between an optimized matrix-matrix
multiplication implementation and a simple implementation.

```

$ srun -p development,htc,standard-mem-s -c 1 --mem=6G -t 5 Rscript profile.R

```

The `profile.R` script runs the profiler twice, once for function-level

information and again for line-level information.